Data Generator

The data generation tool generates artificial datasets with configurable size and complexity. It currently supports the classification, regression, and time series problem types.

Click Data > Data generator, then fill in the configuration for your problem type as described in the following sections:

Configuration of Classification data

number_samples Number of samples (records) in dummy dataset (default is 100)
number_numerical_features Number of numerical features in dummy dataset (default is 25)
number_categorical_features Number of categorical features in dummy dataset (default is 2)
number_text_features Number of text features in dummy dataset (default is 2)
missing_proportion Proportion of records that contain missing values (default is 0.1). For example, if there are 100 records in the dataset and the missing proportion is 0.1, the number of records with missing values will be 100*0.1=10.
number_informative Number of informative features (default is 20, i.e., out of 25 numerical features, only 20 are useful to the model, and the other 5 are uninformative or redundant).
number_class Number of classes for classification problem (default is 2)
weights The proportion of samples assigned to each class. If None, the classes are balanced (default is [0.4, 0.6] with number_class=2).
shift Shift features by the specified value (default is 0.0). The value can be a float or an array whose size equals the number of numerical features.
value_range_dict Set the value range of individual features by feature id (default is {"0": [1,2]}, meaning that the feature with id 0 ranges between 1 and 2). Users can specify multiple feature ids if needed.
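
As an illustration, the classification settings above can be written down and sanity-checked as in the following minimal Python sketch. Only the parameter names and default values come from the list above; the dictionary layout and the check_classification_config helper are assumptions made for this example and are not part of the tool's interface.

```python
import math

# Default classification settings as listed above. Holding them in a dict is
# an assumption for illustration; the tool itself is configured in its UI.
classification_config = {
    "number_samples": 100,
    "number_numerical_features": 25,
    "number_categorical_features": 2,
    "number_text_features": 2,
    "missing_proportion": 0.1,
    "number_informative": 20,
    "number_class": 2,
    "weights": [0.4, 0.6],
    "shift": 0.0,
    "value_range_dict": {"0": [1, 2]},
}


def check_classification_config(cfg):
    """Hypothetical sanity checks; returns a list of problems found."""
    problems = []
    if not 0.0 <= cfg["missing_proportion"] <= 1.0:
        problems.append("missing_proportion must be between 0 and 1")
    if cfg["number_informative"] > cfg["number_numerical_features"]:
        problems.append("number_informative exceeds number_numerical_features")
    weights = cfg["weights"]
    if weights is not None:  # None means balanced classes
        if len(weights) != cfg["number_class"]:
            problems.append("weights needs one entry per class")
        elif not math.isclose(sum(weights), 1.0):
            problems.append("weights should sum to 1")
    shift = cfg["shift"]
    if isinstance(shift, (list, tuple)) and len(shift) != cfg["number_numerical_features"]:
        problems.append("shift array needs one value per numerical feature")
    for feature_id, (low, high) in cfg["value_range_dict"].items():
        if low > high:
            problems.append(f"range for feature {feature_id} is reversed")
    return problems


problems = check_classification_config(classification_config)

# Records with missing values: number_samples * missing_proportion.
records_with_missing = round(
    classification_config["number_samples"] * classification_config["missing_proportion"]
)
```

With the defaults, the list of problems is empty and records_with_missing matches the 100*0.1=10 example given for missing_proportion.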

Configuration of Regression data

number_samples Number of samples (records) in dummy dataset (default is 100)
number_numerical_features Number of numerical features in dummy dataset (default is 25)
number_categorical_features Number of categorical features in dummy dataset (default is 2)
number_text_feature Number of text features in dummy dataset (default is 2)
missing_proportion Proportion of records that contain missing values (default is 0.1). For example, if there are 100 records in the dataset and the missing proportion is 0.1, the number of records with missing values will be 100*0.1=10.
number_informative Number of informative features (default is 20, i.e., out of 25 numerical features, only 20 are useful to the model, and the other 5 are uninformative or redundant).
number_target Number of target features
bias The bias term (offset or y-intercept) in the underlying linear model (default is 0.0, meaning that no bias term is added).
noise The standard deviation of the Gaussian noise applied to the dataset (default is 0.0, meaning that no noise is applied).
value_range_dict Set the value range of individual features by feature id (default is {"0": [1,2]}, meaning that the feature with id 0 ranges between 1 and 2). Users can specify multiple feature ids if needed.
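
To make the roles of number_informative, bias, and noise concrete, the sketch below builds a small regression dataset from a linear model using only the Python standard library. The function name, the coefficient range, the seed, and the choice to make the first features the informative ones are all assumptions for illustration; the tool's own generator may work differently.

```python
import random


def make_linear_regression_data(number_samples=100, number_numerical_features=25,
                                number_informative=20, bias=0.0, noise=0.0, seed=0):
    """Hypothetical sketch: target = features . coefficients + bias + noise.

    Only the first `number_informative` features get a non-zero coefficient,
    so the remaining features have no influence on the target.
    """
    rng = random.Random(seed)
    coefficients = [
        rng.uniform(1.0, 10.0) if j < number_informative else 0.0
        for j in range(number_numerical_features)
    ]
    features, targets = [], []
    for _ in range(number_samples):
        row = [rng.gauss(0.0, 1.0) for _ in range(number_numerical_features)]
        target = sum(w * x for w, x in zip(coefficients, row)) + bias
        if noise > 0.0:
            # `noise` is the standard deviation of the Gaussian noise.
            target += rng.gauss(0.0, noise)
        features.append(row)
        targets.append(target)
    return features, targets, coefficients


X, y, w = make_linear_regression_data(number_samples=5, number_numerical_features=4,
                                      number_informative=2, bias=3.0)
```

In this sketch, a noise of 0.0 makes every target an exact linear function of the informative features plus the bias, which is why the default settings describe a noise-free dataset.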

Configuration of Time series data

number_samples Number of samples (records) in dummy dataset (default is 100)
number_numerical_features Number of numerical features in dummy dataset (default is 25)
number_categorical_features Number of categorical features in dummy dataset (default is 2)
number_text_feature Number of text features in dummy dataset (default is 2)
missing_proportion Proportion of records that contain missing values (default is 0.1). For example, if there are 100 records in the dataset and the missing proportion is 0.1, the number of records with missing values will be 100*0.1=10.
number_informative Number of informative features (default is 20, i.e., out of 25 numerical features, only 20 are useful to the model, and the other 5 are uninformative or redundant).
number_target Number of target features
bias The bias term (offset or y-intercept) in the underlying linear model (default is 0.0, meaning that no bias term is added).
noise The standard deviation of the Gaussian noise applied to the dataset (default is 0.0, meaning that no noise is applied).
value_range_dict Set the value range of individual features by feature id (default is {"0": [1,2]}, meaning that the feature with id 0 ranges between 1 and 2). Users can specify multiple feature ids if needed.
Univariate Set to True to generate univariate time series data, or to False to generate multivariate time series data.
Start Time Left bound for generating dates.
End Time Right bound for generating dates.
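
The following Python sketch shows one way Start Time, End Time, and number_samples could interact when dates are generated. It assumes evenly spaced timestamps with both bounds included; the generate_timestamps helper is hypothetical, and the tool may space or sample dates differently.

```python
from datetime import datetime


def generate_timestamps(start_time, end_time, number_samples):
    """Hypothetical helper: evenly spaced timestamps, both bounds included."""
    if number_samples < 2:
        return [start_time]
    # Dividing a timedelta by an int gives the spacing between samples.
    step = (end_time - start_time) / (number_samples - 1)
    return [start_time + i * step for i in range(number_samples)]


timestamps = generate_timestamps(
    start_time=datetime(2023, 1, 1),   # left bound
    end_time=datetime(2023, 1, 5),     # right bound
    number_samples=5,
)
```

Here the five timestamps fall one day apart, starting at the left bound and ending at the right bound.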