Data Related Objects¶

Datasources¶

class driftai.data.datasource.Datasource(data_uri)[source]¶

Bases: abc.ABC

Abstract datasource

get_data()[source]¶

Get all datasource data

get_info()[source]¶

Datasource summary

Returns:	Dictionary used to serialize a DriftAI Datasource instance
Return type:	dict

get_infolist()[source]¶

Get list of labeled indices

Returns:	First element of the tuple is the index and the second element is the label
Return type:	list of tuples

get_path()[source]¶

Get the location of the datasource

Returns:	File system datasource location
Return type:	str

get_uri()[source]¶

Get the datasource location formatted as a URI

Returns:	Datasource location
Return type:	str

static load_from_data(data)[source]¶

Create datasource from serialized data

Parameters:	data (dict) – Dictionary containing serialized datasource data
Returns:
Return type:	Datasource
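The contract described above (an abstract base, a `get_info()` dictionary used for serialization, and a static `load_from_data()` that rebuilds an instance from that dictionary) can be illustrated with a minimal sketch. This is not DriftAI code: `SketchDatasource`, `ListDatasource`, the `memory://toy` URI, and the `"uri"`/`"rows"` keys are hypothetical names chosen only for the illustration.

```python
from abc import ABC, abstractmethod


class SketchDatasource(ABC):
    """Hypothetical stand-in for an abstract datasource; not DriftAI code."""

    def __init__(self, data_uri):
        self.data_uri = data_uri

    @abstractmethod
    def get_data(self):
        """Return all the data behind the datasource."""

    @abstractmethod
    def get_info(self):
        """Return a dict from which the instance can be rebuilt."""


class ListDatasource(SketchDatasource):
    """Toy concrete datasource that keeps labeled rows in memory."""

    def __init__(self, data_uri, rows):
        super().__init__(data_uri)
        self.rows = rows

    def get_data(self):
        return list(self.rows)

    def get_infolist(self):
        # (index, label) pairs; here the label is assumed to be the last item.
        return [(i, row[-1]) for i, row in enumerate(self.rows)]

    def get_info(self):
        # Keys are made up for this sketch.
        return {"uri": self.data_uri, "rows": self.rows}

    @staticmethod
    def load_from_data(data):
        return ListDatasource(data["uri"], data["rows"])


original = ListDatasource("memory://toy", [[1.0, "a"], [2.0, "b"]])
restored = ListDatasource.load_from_data(original.get_info())
```

The round trip through `get_info()` and `load_from_data()` is the point of the sketch: whatever the summary dictionary contains must be enough to rebuild an equivalent instance.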

class driftai.data.datasource.DirectoryDatasource(path, parsing_pattern)[source]¶

Bases: driftai.data.datasource.Datasource

Parameters:

path (str) – Location of the dataset. Accepted formats are: filesystem path, file URI
parsing_pattern – Pattern used to get the label and data from each file. Example: {testset}/{class}/{filename}.[txt|tsv]
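To make the parsing pattern concrete, the sketch below shows one way such a pattern could be translated into a regular expression and matched against a relative path. It is an assumption about the pattern semantics (named `{placeholders}` capture one path component, `{}` matches without capturing, and `[a|b]` lists alternatives), not the library's actual implementation; `pattern_to_regex` is a hypothetical helper.

```python
import re


def pattern_to_regex(parsing_pattern):
    """Translate a parsing pattern into a compiled regular expression."""
    # Tokens: {name} or {} placeholders, [a|b] alternatives, or literal text.
    token = re.compile(r"\{(\w*)\}|\[([^\]]+)\]|([^{\[]+)")
    parts = []
    for match in token.finditer(parsing_pattern):
        name, alternatives, literal = match.groups()
        if literal is not None:
            parts.append(re.escape(literal))
        elif alternatives is not None:
            options = "|".join(re.escape(a) for a in alternatives.split("|"))
            parts.append("(?:" + options + ")")
        elif name:
            # Named placeholder: capture a single path component.
            parts.append("(?P<" + name + ">[^/]+)")
        else:
            # Anonymous placeholder: match but do not capture.
            parts.append("[^/]+")
    return re.compile("".join(parts) + "$")


text_rx = pattern_to_regex("{testset}/{class}/{filename}.[txt|tsv]")
fields = text_rx.match("train/spam/msg001.txt").groupdict()
```

Under these assumptions `fields` holds the `testset`, `class`, and `filename` components of the path, and a path with an extension outside the bracketed list does not match.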

get_data()[source]¶

Get all data under the datasource path

Returns:	First element of the tuple is the index and the second element is the label
Return type:	list of tuples

get_info()[source]¶

Directory datasource summary

Returns:	Dictionary used to serialize a DriftAI DirectoryDatasource instance
Return type:	dict

get_infolist()[source]¶

Get list of labeled indices

Returns:	First element of the tuple is the index and the second element is the label
Return type:	list of tuples

loader¶

class driftai.data.datasource.FileDatasource(path, label=None, first_line_heading=True)[source]¶

Bases: driftai.data.datasource.Datasource

Datasource subclass responsible for handling datasets that come from a local file, such as a CSV file

Parameters:

path (str) – Location of the dataset. Accepted formats are: filesystem path, file URI
label (str, optional) – Name of the label. If label is left as None, the label is assumed to be the last column
first_line_heading (bool, optional) – If True, the first line is treated as the header

get_data()[source]¶

Get the content of csv file

Returns:	DataFrame wrapping the csv content
Return type:	pandas.DataFrame

get_info()[source]¶

Datasource summary

Returns:	Dictionary used to serialize a DriftAI Datasource instance
Return type:	dict

get_infolist()[source]¶

Get list of labeled indices

Returns:	First element of the tuple is the index and the second element is the label
Return type:	list of tuples
Raises:	`OptAppFileDatasourceNotCompatibeException` – If file extension is not compatible with DriftAI
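The interaction between `label`, `first_line_heading`, and the list of (index, label) tuples can be sketched with the standard library alone. The helper `labeled_indices` and the sample CSV text are hypothetical and only mirror the behavior described above (the last column is the label when `label` is None); the real class works on a file path and returns a pandas.DataFrame from `get_data()`.

```python
import csv
import io


def labeled_indices(csv_text, label=None, first_line_heading=True):
    """Return (index, label) tuples for CSV content held in a string."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if first_line_heading:
        header, rows = rows[0], rows[1:]
        # Default to the last column when no label name is given.
        column = header.index(label) if label is not None else len(header) - 1
    else:
        if label is not None:
            raise ValueError("a label name needs a header line")
        column = len(rows[0]) - 1
    return [(i, row[column]) for i, row in enumerate(rows)]


sample = "sepal,petal,species\n5.1,1.4,setosa\n6.3,4.9,versicolor\n"
by_default = labeled_indices(sample)
by_name = labeled_indices(sample, label="petal")
```

Indices in this sketch count data rows only, so the header line does not shift them.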

label¶

class driftai.data.datasource.ImageDatasource(path, parsing_pattern='{testset}/{class}_{}.[png|jpg|jpeg]')[source]¶

Bases: driftai.data.datasource.DirectoryDatasource

loader(idx)[source]¶

Dataset¶

class driftai.data.dataset.Dataset[source]¶

Indexed dataset over a datasource

Parameters:

datasource (Datasource) – Datasource of the dataset
problem_type (str, optional) – Objective of the algorithm. If the problem type is not set manually, driftai will infer it automatically. Possible values are: binary_clf, clf or regression
creation_date (datetime) – Creation date of the dataset. Should not be set manually
id (str) – Unique identifier for the Dataset

static collection()[source]¶

Get table containing datasets

Returns:
Return type:	TinyDB instance

static from_dir(path, path_pattern=None, datatype='img')[source]¶

Create a Dataset from a directory

Parameters:

path (str) – Datasource location path
path_pattern (str, optional) – Pattern used to generate metadata. If path_pattern is left as None, the default path_pattern is used
datatype (str, optional) – Directory datatype
Returns:
Return type:	DirectoryDatasource

generate_subdataset(method, by)[source]¶

Creates a subdataset of the current Dataset

Parameters:

method (str) – Evaluation sets split approach. Can be: `train_test`, `k_fold`
by (float, int) – If the train_test method is specified, by is the training set size, for example .85. If the k_fold method is specified, by is the number of folds
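The two split approaches can be illustrated with a small standalone sketch. It assumes that `by` is a fraction of the instances for `train_test` and a fold count for `k_fold`; the shuffling, the fixed seed, and the round-robin fold assignment are choices made for this example and are not taken from DriftAI. Both helper names are hypothetical.

```python
import random


def train_test_indices(n, by, seed=0):
    """Split range(n) so that roughly a fraction `by` lands in the training set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * by))
    return {"train": sorted(indices[:cut]), "test": sorted(indices[cut:])}


def k_fold_indices(n, by):
    """Build `by` train/test splits in which each fold is the test set once."""
    folds = [list(range(start, n, by)) for start in range(by)]
    splits = []
    for k in range(by):
        others = folds[:k] + folds[k + 1:]
        train = sorted(i for fold in others for i in fold)
        splits.append({"train": train, "test": folds[k]})
    return splits


holdout = train_test_indices(20, 0.85)
folds = k_fold_indices(10, 5)
```

In both cases every index appears in exactly one of the train or test lists of a given split, which is the property any split approach needs to preserve.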

get_data()[source]¶

Get datasource data

get_info()[source]¶

Get info to serialize a Dataset instance

Returns:

Dictionary containing a Dataset object summary:

{
    "datasource": dict containing path, first_line_heading and label of the datasource,
    "infolist": <TODO>,
    "problem_type": <multiclass clf, regression, binary clf>,
    "creation_date": <creation date of the dataset>,
    "id": <unique identifier>
}

Return type: dict

get_labels()[source]¶

Get all the labels

Returns:	List with all labels
Return type:	list

id¶

Get the unique identifier of the Persistent instance

Returns:	Unique identifier
Return type:	str

classmethod load_from_data(data)[source]¶

Creates a Dataset object from serialized JSON data coming from TinyDB

Parameters:	data (dict) – JSON data from TinyDB
Raises:	`OptAppInvalidStructureException` – In case file keys are incorrect
Returns:	New Dataset instance
Return type:	driftai.Dataset
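A loader of this kind typically validates the keys of the serialized dictionary before building the object. The sketch below checks for the five keys listed in the `get_info()` summary; `InvalidStructureError` and `check_dataset_keys` are hypothetical stand-ins, and the library's own exception and checks may differ.

```python
REQUIRED_KEYS = {"datasource", "infolist", "problem_type", "creation_date", "id"}


class InvalidStructureError(ValueError):
    """Hypothetical stand-in for the library's own exception."""


def check_dataset_keys(data):
    """Return data unchanged, or raise if an expected key is missing."""
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise InvalidStructureError("missing keys: " + ", ".join(sorted(missing)))
    return data
```

Failing early with the list of missing keys makes a corrupted or outdated record easier to diagnose than a later KeyError would.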

static read_file(path, label=None, first_line_heading=True)[source]¶

Create a Dataset from a file

Parameters:

path (str) – Datasource location path
label (str, optional) – Name of the label. If label is left as None, the label is assumed to be the last column
first_line_heading (bool, optional) – If True, the first line is treated as the header

SubDataset¶

class driftai.data.dataset.SubDataset[source]¶

Parameters:

dataset (Dataset) – DriftAI dataset which the current subdataset inherits from
method (str) – Evaluation sets split approach. Can be: train_test, k_fold
by (float, int, optional) – If the train_test method is specified, by is the training set size, for example .85. If the k_fold method is specified, by is the number of folds

indices (dict) –

Contains the number of sets and the indices of each set:

{
    "method": str
    "indices": {
        "train": list of int
        "test": list of int
    }
}

Should not be set by the developer

id (str, optional) – Unique identifier
creation_date (str, datetime, optional) – Creation date of the subdataset. Should not be set manually

static collection()[source]¶

Get table containing subdatasets

Returns:
Return type:	TinyDB instance

get_info()[source]¶

Get info to serialize a SubDataset instance

Returns:

Contains subdataset essential information:

{
    "dataset": str, parent dataset path,
    "creation_date": str, Subdataset creation date,
    "id": str,
    "indices": dict, structure specified in the constructor parameters documentation,
    "path": str, subdataset path
}

Return type: dict

get_test_data(subset)[source]¶

Get the test data of a subset

Parameters:	subset (str) – subset identifier
Returns:	Dictionary containing all instances that belong to the test set, with their labels: { "X": list, "y": list }
Return type:	dict

get_test_labels(subset)[source]¶

Get the labels of the test set of a specific subset

Parameters:	subset (str) – subset identifier
Returns:	Ground truths of subset’s test data
Return type:	list

get_train_data(subset)[source]¶

Get the training data of a subset

Parameters:	subset (str) – subset identifier
Returns:	Dictionary containing each training set instance with its label: { "X": list, "y": list }
Return type:	dict

get_train_labels(subset)[source]¶

Get the labels of the training set of a specific subset

Parameters:	subset (str) – subset identifier
Returns:	Ground truths of subset’s training data
Return type:	list
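The relationship between the stored train and test indices and the `{ "X": list, "y": list }` dictionaries returned by the getters above can be sketched as follows. The helper `select_subset` and the toy rows, labels, and indices are invented for the illustration; how DriftAI resolves a subset identifier to its indices is not shown here.

```python
def select_subset(rows, labels, indices):
    """Pick the rows and labels found at the given positions."""
    return {
        "X": [rows[i] for i in indices],
        "y": [labels[i] for i in indices],
    }


# Toy data, made up for this example.
rows = [[5.1, 1.4], [6.3, 4.9], [5.8, 5.1], [4.9, 1.5]]
labels = ["setosa", "versicolor", "virginica", "setosa"]
split = {"train": [0, 2, 3], "test": [1]}

train = select_subset(rows, labels, split["train"])
test = select_subset(rows, labels, split["test"])
```

In this sketch the label getters correspond to the `"y"` list of the matching data dictionary.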

id¶

Get the unique identifier of the Persistent instance

Returns:	Unique identifier
Return type:	str

classmethod load_from_data(data)[source]¶

Loads a subdataset from data coming from TinyDB

Parameters:	data (dict) – JSON data
Raises:	`OptAppSubDatasetInfoFileWrongStructureException` – If data has wrong keys
Returns:	New SubDataset instance
Return type:	driftai.SubDataset
