Data Related Objects
Datasources
-
class driftai.data.datasource.Datasource(data_uri)
Bases: abc.ABC
Abstract datasource
-
get_info()
Datasource summary
Returns: Dictionary used to serialize a DriftAI Datasource instance
Return type: dict
-
get_infolist()
Get list of labeled indices
Returns: The first element of each tuple is the index and the second element is the label
Return type: list of tuples
-
get_path()
Get the location of the datasource
Returns: File system datasource location
Return type: str
-
static load_from_data(data)
Create datasource from serialized data
Parameters: data (dict) – Dictionary containing serialized datasource data
Return type: Datasource
-
-
class driftai.data.datasource.DirectoryDatasource(path, parsing_pattern)
Bases: driftai.data.datasource.Datasource
Parameters:
- path (str) – Location of the dataset. Accepted formats are:
  - Filesystem path
  - File URI
- parsing_pattern – Pattern used to get the label and data from a file path. Example: {testset}/{class}/{filename}.[txt|tsv]
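To make the parsing_pattern idea concrete, here is a minimal sketch of how a pattern such as {testset}/{class}/{filename} could be turned into a regular expression that extracts the label from a relative file path. This is not DriftAI's implementation: the helper names pattern_to_regex and parse_path are hypothetical, and the [txt|tsv] extension alternative from the documented example is simplified to a literal .txt.

```python
import re


def pattern_to_regex(parsing_pattern):
    """Hypothetical helper: turn '{name}' placeholders into named regex groups."""
    # re.split with a capturing group alternates literal text and placeholder names
    parts = re.split(r"\{(\w+)\}", parsing_pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)  # literal text such as "/" or ".txt"
        else:
            regex += "(?P<{}>[^/]+)".format(part)  # one path component
    return re.compile("^" + regex + "$")


def parse_path(relative_path, parsing_pattern):
    """Hypothetical helper: return the placeholder values, or None if no match."""
    match = pattern_to_regex(parsing_pattern).match(relative_path)
    return match.groupdict() if match else None


info = parse_path("train/cat/001.txt", "{testset}/{class}/{filename}.txt")
print(info["class"])
```

In this sketch the value captured by the {class} placeholder plays the role of the label, and a path that does not fit the pattern yields None.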
-
get_data()
Get all data under the datasource path
Returns: The first element of each tuple is the index and the second element is the label
Return type: list of tuples
-
get_info()
Directory datasource summary
Returns: Dictionary used to serialize a DriftAI DirectoryDatasource instance
Return type: dict
-
get_infolist()
Get list of labeled indices
Returns: The first element of each tuple is the index and the second element is the label
Return type: list of tuples
-
loader
-
class driftai.data.datasource.FileDatasource(path, label=None, first_line_heading=True)
Bases: driftai.data.datasource.Datasource
Datasource subclass responsible for handling datasets that come from a local file, such as a CSV file
Parameters:
- path_to_data (str) – Location of the dataset. Accepted formats are:
  - Filesystem path
  - File URI
- label (str, optional) – Name of the label. If label is left as None, the label is assumed to be the last column
- first_line_heading (bool, optional) – If True, the first line is treated as the header
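The label and first_line_heading parameters can be illustrated with a small stand-alone sketch. It uses only the standard csv module and does not call DriftAI; the function name split_features_and_label is hypothetical, and the positional column names used for header-less files are an assumption.

```python
import csv
import io


def split_features_and_label(text, label=None, first_line_heading=True):
    """Hypothetical helper: separate the label column from the feature columns."""
    rows = list(csv.reader(io.StringIO(text)))
    if first_line_heading:
        header, rows = rows[0], rows[1:]
    else:
        # Assumption: header-less files get positional column names "0", "1", ...
        header = [str(i) for i in range(len(rows[0]))]
    # Documented default: the label is the last column when none is given
    label_idx = header.index(label) if label is not None else len(header) - 1
    X = [[v for i, v in enumerate(row) if i != label_idx] for row in rows]
    y = [row[label_idx] for row in rows]
    return X, y


csv_text = "sepal,petal,species\n5.1,1.4,setosa\n6.3,4.9,versicolor\n"
X, y = split_features_and_label(csv_text)
print(y)
```

Passing label="sepal" instead would select the first column as the label and leave the other two as features.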
-
get_data()
Get the content of the CSV file
Returns: DataFrame wrapping the CSV content
Return type: pandas.DataFrame
-
get_info()
Datasource summary
Returns: Dictionary used to serialize a DriftAI Datasource instance
Return type: dict
-
get_infolist()
Get list of labeled indices
Returns: The first element of each tuple is the index and the second element is the label
Return type: list of tuples
Raises: OptAppFileDatasourceNotCompatibeException – If the file extension is not compatible with DriftAI
-
label
Dataset
-
class driftai.data.dataset.Dataset
Indexed dataset over a datasource
Parameters:
- datasource (Datasource) – Datasource of the dataset
- problem_type (str, optional) – Objective of the algorithm. If the problem type is not set manually, DriftAI infers it automatically. Possible values are: binary_clf, clf or regression
- creation_date (datetime) – Creation date of the dataset. Should not be set manually
- id (str) – Unique identifier for the Dataset
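The reference does not say how problem_type is inferred. Purely as an illustration of what such an inference could look like, the sketch below maps a list of labels to one of the three documented values. The function name infer_problem_type and the rule itself (float labels mean regression, two distinct labels mean binary classification) are assumptions, not DriftAI's actual logic.

```python
def infer_problem_type(labels):
    """Hypothetical heuristic: guess the problem type from the label values."""
    distinct = set(labels)
    # Assumption: continuous (float) targets indicate a regression problem
    if all(isinstance(value, float) for value in distinct):
        return "regression"
    # Assumption: exactly two distinct labels indicate binary classification
    if len(distinct) == 2:
        return "binary_clf"
    return "clf"


print(infer_problem_type(["spam", "ham", "spam"]))
```

Setting problem_type explicitly avoids relying on any such guess.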
-
static from_dir(path, path_pattern=None, datatype='img')
Create a Dataset from a directory
Parameters:
- path (str) – Datasource location path
- path_pattern (str, optional) – Pattern used to generate metadata. If path_pattern is left as None, the default path_pattern is used
- datatype (str, optional) – Directory datatype
-
generate_subdataset(method, by)
Creates a subdataset of the current Dataset
Parameters:
- method (str) – Evaluation sets split approach. Can be:
  - train_test
  - k_fold
- by (float, int) – If the train_test method is specified, by represents the training set size, for example .85. If the k_fold method is specified, by is the number of folds
get_info()
Get info to serialize a Dataset instance
Returns: Dictionary containing a Dataset object summary:
{
  "datasource": dict containing path, first_line_heading and label of the datasource,
  "infolist": <TODO>,
  "problem_type": <multiclass clf, regression, binary clf>,
  "creation_date": <creation date of the dataset>,
  "id": <unique identifier>
}
Return type: dict
-
id
Get the unique identifier of the Persistent instance
Returns: Unique identifier
Return type: str
-
classmethod load_from_data(data)
Creates a Dataset object from serialized JSON data coming from TinyDB
Parameters: data (dict) – JSON data from TinyDB
Raises: OptAppInvalidStructureException – In case file keys are incorrect
Returns: New Dataset instance
Return type: driftai.Dataset
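The key check that load_from_data performs is not spelled out here. A minimal sketch of that kind of validation, assuming the required keys are the ones listed in the get_info() summary, could look as follows. The names REQUIRED_KEYS, InvalidStructureError and check_dataset_record are hypothetical stand-ins, not part of DriftAI.

```python
# Assumption: a serialized Dataset carries the keys listed by get_info()
REQUIRED_KEYS = {"datasource", "infolist", "problem_type", "creation_date", "id"}


class InvalidStructureError(Exception):
    """Stand-in for the library's own exception type."""


def check_dataset_record(data):
    """Hypothetical helper: fail early when a serialized record lacks keys."""
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise InvalidStructureError("missing keys: " + ", ".join(sorted(missing)))
    return True


record = {
    "datasource": {"path": "data.csv", "first_line_heading": True, "label": None},
    "infolist": [],
    "problem_type": "clf",
    "creation_date": "2019-01-01",
    "id": "example",
}
print(check_dataset_record(record))
```

Validating the record before building the object keeps a malformed database entry from producing a half-initialized Dataset.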
-
static read_file(path, label=None, first_line_heading=True)
Create a Dataset from a file
Parameters:
- path (str) – Datasource location path
- label (str, optional) – Name of the label. If label is left as None, the label is assumed to be the last column
- first_line_heading (bool, optional) – If True, the first line is treated as the header
SubDataset
-
class driftai.data.dataset.SubDataset
Parameters:
- dataset (Dataset) – DriftAI dataset which the current subdataset inherits from
- method (str) – Evaluation sets split approach. Can be: train_test, k_fold
- by (float, int, optional) – If the train_test method is specified, by represents the training set size, for example .85. If the k_fold method is specified, by is the number of folds
- indices (dict) – Contains the number of sets and the indices of each set:
  { "method": str, "indices": { "train": list of int, "test": list of int } }
  Should not be set by the developer
- id (str, optional) – Unique identifier
- creation_date (str, datetime, optional) – Creation date of the subdataset. Should not be set manually
-
get_info()
Get info to serialize a SubDataset instance
Returns: Contains subdataset essential information:
{
  "dataset": str, parent dataset path,
  "creation_date": str, subdataset creation date,
  "id": str,
  "indices": dict, structure specified in the constructor parameters documentation,
  "path": str, subdataset path
}
Return type: dict
-
get_test_data(subset)
Get the test data of a subset
Parameters: subset (str) – Subset identifier
Returns: All instances which belong to the test set, with their labels: { "X": list, "y": list }
Return type: dict
-
get_test_labels(subset)
Get the labels of the test set of a specific subset
Parameters: subset (str) – Subset identifier
Returns: Ground truths of the subset's test data
Return type: list
-
get_train_data(subset)
Get the training data of a subset
Parameters: subset (str) – Subset identifier
Returns: Each training set instance with its label: { "X": list, "y": list }
Return type: dict
-
get_train_labels(subset)
Get the labels of the training set of a specific subset
Parameters: subset (str) – Subset identifier
Returns: Ground truths of the subset's training data
Return type: list
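The four accessors above all select rows by the stored indices of one subset. The sketch below shows that selection on plain lists, producing the documented { "X": list, "y": list } shape. It does not use DriftAI; the function name subset_data and the layout of the indices dictionary (keyed by subset identifier) are assumptions.

```python
def subset_data(rows, labels, indices, subset, part):
    """Hypothetical helper: pick the rows and labels of one part of a subset."""
    chosen = indices[subset][part]  # part is "train" or "test"
    return {
        "X": [rows[i] for i in chosen],
        "y": [labels[i] for i in chosen],
    }


rows = [[0.1], [0.2], [0.3], [0.4]]
labels = ["a", "b", "a", "b"]
# Assumption: one entry per subset identifier, each holding train/test indices
indices = {"0": {"train": [0, 1, 2], "test": [3]}}

train = subset_data(rows, labels, indices, "0", "train")
test = subset_data(rows, labels, indices, "0", "test")
print(train["y"], test["y"])
```

In this picture the label accessors correspond to the "y" list alone, for the training and test parts respectively.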
-
id
Get the unique identifier of the Persistent instance
Returns: Unique identifier
Return type: str