Data Generator

The data generation tool generates artificial datasets with configurable size and complexity. It currently supports classification, regression, and time series problem types.

Click Data > Data generator, then fill in the configuration for your problem type as described in the following sections:

Configuration of Classification data


number_samples	Number of samples (records) in dummy dataset (default is 100)
number_numerical_features	Number of numerical features in dummy dataset (default is 25)
number_categorical_features	Number of categorical features in dummy dataset (default is 2)
number_text_features	Number of text features in dummy dataset (default is 2)
missing_proportion	Proportion of records that contain missing values. For example, if the dataset has 100 records and the missing proportion is 0.1, the number of records with missing values is 100*0.1=10 (default is 0.1).
number_informative	Number of informative features. (Default is 20, i.e., out of 25 total features, only 20 are useful to the model, and the other 5 are uninformative or redundant.)
number_class	Number of classes for classification problem (default is 2)
weights	Proportion of samples assigned to each class. If None, classes are balanced (default is [0.4, 0.6] with number_class=2).
shift	Shift features by the specified value (default is 0.0). The value can be a float or an array whose size equals the number of numerical features.
value_range_dict	Range of values for individual features, keyed by feature id (default is {"0": [1,2]}, meaning that the feature with id 0 ranges between 1 and 2). Users can set ranges for multiple feature ids if needed.
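
Several of the classification parameters above resemble the arguments of scikit-learn's make_classification function. The sketch below is an illustration under that assumption, not the tool's confirmed implementation: the config dictionary and the mapping of its keys to function arguments are hypothetical. It generates the numerical features with the documented defaults and then blanks one value in 100*0.1=10 randomly chosen records to mimic missing_proportion.

```python
import numpy as np
from sklearn.datasets import make_classification

# Hypothetical configuration using the documented defaults. How the tool
# maps these keys internally is an assumption made for illustration.
config = {
    "number_samples": 100,
    "number_numerical_features": 25,
    "number_informative": 20,
    "number_class": 2,
    "weights": [0.4, 0.6],
    "shift": 0.0,
    "missing_proportion": 0.1,
}

# Generate numerical features and class labels. The features that are not
# informative are filled by scikit-learn with redundant and noise columns.
X, y = make_classification(
    n_samples=config["number_samples"],
    n_features=config["number_numerical_features"],
    n_informative=config["number_informative"],
    n_classes=config["number_class"],
    weights=config["weights"],
    shift=config["shift"],
    random_state=0,
)

# Number of records that should contain a missing value: 100 * 0.1 = 10.
n_missing = round(config["number_samples"] * config["missing_proportion"])

# Pick distinct records and set one randomly chosen feature in each to NaN.
rng = np.random.default_rng(0)
missing_rows = rng.choice(config["number_samples"], size=n_missing, replace=False)
missing_cols = rng.integers(0, X.shape[1], size=n_missing)
X[missing_rows, missing_cols] = np.nan
```

Categorical and text features are left out of this sketch because scikit-learn's generator produces numerical columns only.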

Configuration of Regression data


number_samples	Number of samples (records) in dummy dataset (default=100)
number_numerical_features	Number of numerical features in dummy dataset (default=25)
number_categorical_features	Number of categorical features in dummy dataset (default=2)
number_text_feature	Number of text features in dummy dataset (default is 2)
missing_proportion	Proportion of records that contain missing values. (Default is 0.1, meaning that if the dataset has 100 records, the number of records with missing values is 100*0.1=10.)
number_informative	Number of informative features. (Default is 20, i.e., out of 25 total features, only 20 are useful to the model, and the other 5 are uninformative or redundant.)
number_target	Number of target features
bias	The bias term (offset or y-intercept) in the underlying linear model (default is 0.0, meaning that no bias term is added).
noise	The standard deviation of the Gaussian noise applied to the dataset (default is 0.0, meaning that no noise is applied).
value_range_dict	Range of values for individual features, keyed by feature id (default is {"0": [1,2]}, meaning that the feature with id 0 ranges between 1 and 2). Users can set ranges for multiple feature ids if needed.
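
The regression parameters above resemble the arguments of scikit-learn's make_regression function. The sketch below is an illustration under that assumption and does not describe the tool's actual implementation. The value of 1 for number_target is assumed, since no default is documented, and rescale_features is a hypothetical helper showing one plausible reading of value_range_dict: min-max rescaling of the listed features into the requested range.

```python
import numpy as np
from sklearn.datasets import make_regression

# Generate features and a target from a linear model with the documented
# defaults. n_targets=1 is an assumed value for number_target.
X, y = make_regression(
    n_samples=100,      # number_samples
    n_features=25,      # number_numerical_features
    n_informative=20,   # number_informative
    n_targets=1,        # number_target (assumed)
    bias=0.0,           # bias
    noise=0.0,          # noise
    random_state=0,
)


def rescale_features(X, value_range_dict):
    """Min-max rescale each listed feature into its [low, high] range.

    Hypothetical helper: the tool may apply value_range_dict differently.
    """
    X = X.copy()
    for feature_id, (low, high) in value_range_dict.items():
        idx = int(feature_id)
        col = X[:, idx]
        span = col.max() - col.min()
        if span == 0:
            # A constant column has no spread, so map it to the lower bound.
            X[:, idx] = low
        else:
            X[:, idx] = low + (col - col.min()) / span * (high - low)
    return X


# Documented default: the feature with id 0 ranges between 1 and 2.
X_scaled = rescale_features(X, {"0": [1, 2]})
```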

Configuration of Time series data


number_samples	Number of samples (records) in dummy dataset (default=100)
number_numerical_features	Number of numerical features in dummy dataset (default=25)
number_categorical_features	Number of categorical features in dummy dataset (default=2)
number_text_feature	Number of text features in dummy dataset (default is 2)
missing_proportion	Proportion of records that contain missing values. (Default is 0.1, meaning that if the dataset has 100 records, the number of records with missing values is 100*0.1=10.)
number_informative	Number of informative features. (Default is 20, i.e., out of 25 total features, only 20 are useful to the model, and the other 5 are uninformative or redundant.)
number_target	Number of target features
bias	The bias term (offset or y-intercept) in the underlying linear model (default is 0.0, meaning that no bias term is added).
noise	The standard deviation of the Gaussian noise applied to the dataset (default is 0.0, meaning that no noise is applied).
value_range_dict	Range of values for individual features, keyed by feature id (default is {"0": [1,2]}, meaning that the feature with id 0 ranges between 1 and 2). Users can set ranges for multiple feature ids if needed.
Univariate	Set to True to generate univariate time series data, or to False to generate multivariate time series data.
Start Time	Left bound for generating dates.
End Time	Right bound for generating dates.
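
To illustrate how Univariate, Start Time, and End Time might interact with number_samples, here is a minimal sketch using pandas. It is not the tool's implementation: make_time_series, its n_series argument, and the random-walk values are all hypothetical placeholders. The sketch spreads number_samples timestamps evenly between the two bounds and returns one column for univariate data or several for multivariate data.

```python
import numpy as np
import pandas as pd


def make_time_series(start_time, end_time, number_samples=100,
                     univariate=True, n_series=3, seed=0):
    """Build a dummy time series indexed by evenly spaced timestamps.

    Hypothetical helper: n_series and the random-walk values are
    placeholders chosen for illustration only.
    """
    # Start Time and End Time act as the left and right bounds of the index.
    index = pd.date_range(start=start_time, end=end_time, periods=number_samples)

    # One column for univariate data, several for multivariate data.
    n_cols = 1 if univariate else n_series
    rng = np.random.default_rng(seed)
    values = rng.normal(size=(number_samples, n_cols)).cumsum(axis=0)

    columns = [f"series_{i}" for i in range(n_cols)]
    return pd.DataFrame(values, index=index, columns=columns)


univariate_df = make_time_series("2023-01-01", "2023-04-10", univariate=True)
multivariate_df = make_time_series("2023-01-01", "2023-04-10", univariate=False)
```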