Data Generator
The Data Generator tool generates artificial datasets with configurable size and complexity. It currently supports classification, regression, and time series problem types.
Click Data > Data generator, then fill in the configuration for your problem type. The options for each problem type are described in the following sections:
Configuration of Classification data
number_samples | Number of samples (records) in dummy dataset (default is 100) |
number_numerical_features | Number of numerical features in dummy dataset (default is 25) |
number_categorical_features | Number of categorical features in dummy dataset (default is 2) |
number_text_features | Number of text features in dummy dataset (default is 2) |
missing_proportion | Proportion of records that are created with missing values. For example, if the dataset has 100 records and the missing proportion is 0.1, the number of records with missing values will be 100*0.1=10 (default is 0.1). |
number_informative | Number of informative features. (Default is 20, i.e., out of 25 features in total, only 20 are useful to the model, and the other 5 are uninformative or redundant.) |
number_class | Number of classes for classification problem (default is 2) |
weights | The proportions of samples assigned to each class. If None, then classes are balanced. (Default is [0.4, 0.6] with number_class=2). |
shift | Shift features by the specified value (default is 0.0). This value can be a float or an array whose size equals the number of numerical features. |
value_range_dict | Set the value range of individual features by feature id (default is {"0": [1, 2]}, meaning that the feature with id 0 ranges between 1 and 2). Users can set ranges for multiple feature ids if needed. |
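As a rough illustration, the classification options above can be gathered into a single configuration using the documented defaults. The dictionary layout, the `check_classification_config` and `expected_missing_records` helpers, and the validation rules are assumptions made for this sketch; only the option names and default values come from the table.

```python
# Hypothetical sketch: the dict layout and the checks below are assumptions.
# Only the option names and default values come from the table above.
classification_config = {
    "number_samples": 100,
    "number_numerical_features": 25,
    "number_categorical_features": 2,
    "number_text_features": 2,
    "missing_proportion": 0.1,
    "number_informative": 20,
    "number_class": 2,
    "weights": [0.4, 0.6],
    "shift": 0.0,
    "value_range_dict": {"0": [1, 2]},
}


def check_classification_config(cfg):
    """Return a list of problems found in cfg (empty if none)."""
    problems = []
    if not 0.0 <= cfg["missing_proportion"] <= 1.0:
        problems.append("missing_proportion must be between 0 and 1")
    # Assumed rule: informative features are a subset of the numerical ones.
    if cfg["number_informative"] > cfg["number_numerical_features"]:
        problems.append("number_informative exceeds number_numerical_features")
    weights = cfg["weights"]
    if weights is not None:  # None means balanced classes
        if len(weights) != cfg["number_class"]:
            problems.append("weights needs one entry per class")
        if abs(sum(weights) - 1.0) > 1e-9:
            problems.append("weights should sum to 1")
    shift = cfg["shift"]
    if isinstance(shift, (list, tuple)):
        if len(shift) != cfg["number_numerical_features"]:
            problems.append("shift array needs one value per numerical feature")
    for low, high in cfg["value_range_dict"].values():
        if low >= high:
            problems.append("each value range must be [low, high] with low < high")
    return problems


def expected_missing_records(cfg):
    """Records with missing values, as in the 100*0.1=10 example."""
    return round(cfg["number_samples"] * cfg["missing_proportion"])


print(check_classification_config(classification_config))  # []
print(expected_missing_records(classification_config))     # 10
```

Running a check like this before submitting the form can catch mismatches such as a `weights` list whose length differs from `number_class`.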
Configuration of Regression data
number_samples | Number of samples (records) in dummy dataset (default is 100) |
number_numerical_features | Number of numerical features in dummy dataset (default is 25) |
number_categorical_features | Number of categorical features in dummy dataset (default is 2) |
number_text_feature | Number of text features in dummy dataset (default is 2) |
missing_proportion | Proportion of records that are created with missing values. (Default is 0.1, meaning that if the dataset has 100 records, the number of records with missing values will be 100*0.1=10.) |
number_informative | Number of informative features. (Default is 20, i.e., out of 25 features in total, only 20 are useful to the model, and the other 5 are uninformative or redundant.) |
number_target | Number of target features |
bias | The bias term (offset or y-intercept) in the underlying linear model (default is 0.0, meaning that no bias term is added). |
noise | The standard deviation of the Gaussian noise applied to the dataset (default is 0.0, meaning that no noise is applied). |
value_range_dict | Set the value range of individual features by feature id (default is {"0": [1, 2]}, meaning that the feature with id 0 ranges between 1 and 2). Users can set ranges for multiple feature ids if needed. |
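The `bias` and `noise` options refer to an underlying linear model. The sketch below shows what those two terms mean in general, using only the Python standard library. The `make_targets` helper, the fixed coefficients, and the seed are assumptions for illustration and do not describe how the tool itself generates data.

```python
import random


def make_targets(rows, coefficients, bias=0.0, noise=0.0, seed=0):
    """Compute y = sum(coef * x) + bias + Gaussian noise for each row.

    Hypothetical helper: `noise` is treated as the standard deviation of
    zero-mean Gaussian noise, matching the description in the table.
    """
    rng = random.Random(seed)
    targets = []
    for row in rows:
        linear_part = sum(c * x for c, x in zip(coefficients, row))
        noise_term = rng.gauss(0.0, noise) if noise > 0 else 0.0
        targets.append(linear_part + bias + noise_term)
    return targets


rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
coefficients = [2.0, -1.0]

# Defaults (bias=0.0, noise=0.0): targets are exactly the linear combination.
clean = make_targets(rows, coefficients)
print(clean)  # [0.0, 2.0, 4.0]

# A non-zero bias shifts every target by the same offset.
shifted = make_targets(rows, coefficients, bias=10.0)
print(shifted)  # [10.0, 12.0, 14.0]

# A non-zero noise perturbs each target by a random amount.
noisy = make_targets(rows, coefficients, noise=0.5)
print(noisy)
```

With both options left at 0.0 the targets are a deterministic function of the features, which is convenient for sanity checks; raising `noise` makes the problem harder to fit.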
Configuration of Time series data
number_samples | Number of samples (records) in dummy dataset (default is 100) |
number_numerical_features | Number of numerical features in dummy dataset (default is 25) |
number_categorical_features | Number of categorical features in dummy dataset (default is 2) |
number_text_feature | Number of text features in dummy dataset (default is 2) |
missing_proportion | Proportion of records that are created with missing values. (Default is 0.1, meaning that if the dataset has 100 records, the number of records with missing values will be 100*0.1=10.) |
number_informative | Number of informative features. (Default is 20, i.e., out of 25 features in total, only 20 are useful to the model, and the other 5 are uninformative or redundant.) |
number_target | Number of target features |
bias | The bias term (offset or y-intercept) in the underlying linear model (default is 0.0, meaning that no bias term is added). |
noise | The standard deviation of the Gaussian noise applied to the dataset (default is 0.0, meaning that no noise is applied). |
value_range_dict | Set the value range of individual features by feature id (default is {"0": [1, 2]}, meaning that the feature with id 0 ranges between 1 and 2). Users can set ranges for multiple feature ids if needed. |
Univariate | Set to True if the user wants to generate univariate time series data. Otherwise, set to False to generate multivariate time series data. |
Start Time | Left bound for generating dates. |
End Time | Right bound for generating dates. |
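Start Time and End Time bound the dates of the generated series. As a minimal sketch of one way such bounds could map to `number_samples` timestamps, the example below spaces them evenly between the two bounds. Even spacing, the `make_timestamps` helper, and the example dates are assumptions; the table does not say how the tool distributes dates within the range.

```python
from datetime import datetime


def make_timestamps(start_time, end_time, number_samples):
    """Return number_samples evenly spaced datetimes in [start_time, end_time].

    Hypothetical helper: even spacing is an assumption of this sketch.
    """
    if end_time <= start_time:
        raise ValueError("End Time must be later than Start Time")
    if number_samples < 2:
        return [start_time]
    step = (end_time - start_time) / (number_samples - 1)
    return [start_time + i * step for i in range(number_samples)]


start = datetime(2024, 1, 1)  # left bound (Start Time)
end = datetime(2024, 1, 5)    # right bound (End Time)
stamps = make_timestamps(start, end, number_samples=5)
for stamp in stamps:
    print(stamp.date())
```

In this sketch the first and last timestamps coincide with the two bounds, so a wider range with the same `number_samples` simply yields a coarser time step.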