Using EDA

EDA stands for Exploratory Data Analysis. It lets you analyze and investigate data sets and summarize their main characteristics, often by using data visualization methods.

About this task

Procedure

  1. After uploading the data file, you can view the first 10 records of the dataset.
  2. To see the remaining details, click the EDA button.
  3. When you click the EDA button, a window opens in which you configure the features and the data sample to use for EDA.
  4. Select features and data subsampling size for exploratory data analysis.
  5. Click Next. The EDA results are shown under the following tabs:
    • Data Overview: The summary statistics of the features.
    • Data Distribution: The distribution type of each feature in the dataset.

      For numerical data, select the numerical feature and click Show Data Distribution to see the distribution of the data.

      For text data, select the text feature and click Show Data Distribution. A word cloud is generated.

      More details can be seen by clicking the ? icon on the right of the Data Distribution tab.

    • Feature Importance: A bar chart that ranks the features by variance ratio. The higher the variance ratio, the more relevant the feature.
    • Correlation Analysis: The strength of the relationship between each pair of features, on a scale from 0 to 1. A value close to 1 shows that the two features are highly correlated, and a value close to 0 shows that they are weakly correlated.

      Example: If the correlation value of feature row record_dateTime and feature column id is 1, these two features are highly correlated, so including both of them is redundant.

    • Unsupervised Clustering: The unsupervised clustering of the data can be viewed.
    • Data Deep Dive: An interface for inspecting the correlation between data points across all the features of a dataset. Each item in the visualization represents one data point of the ingested dataset. Clicking an individual item shows the key-value pairs that represent the features of that record; the values may be strings or numbers.
      Note: If the ingested dataset is very large, Data Deep Dive may become unresponsive, or the data may fail to load.
    • Pair Graph: The variation of two selected features can be compared.
    • Fairness Metrics: Identifies bias in the data. Three metrics check for unwanted bias in the dataset: Statistical Parity Difference, Disparate Impact, and Theil Index. All three are measures of discrimination (that is, deviation from fairness).

      • Statistical Parity Difference: Measures the difference between the rates at which the majority and protected classes receive a favorable outcome. This measure must be equal to 0 to be fair.
      • Disparate Impact: Compares the proportion of individuals that receive a favorable outcome for two groups: a majority group and a minority group. This measure must be equal to 1 to be fair.
      • Theil Index: Measures inequality among individuals in a group for classification problems. A value of 0 indicates perfect fairness, and bias increases as the value rises above 0.

      Select the categorical feature and the target feature, and then click Show Metrics to analyze the potential bias between the selected feature and the target variable.

      Note: The acceptable threshold is highlighted in green. A class is considered biased if the metric score is outside the threshold range.
  6. Click Download Reports to download a summarized report of the EDA. The downloaded report is in Excel format.
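
Example

To make the fairness metrics from step 5 more concrete, the following Python sketch computes them on a small made-up sample. It is an illustration only and is not the product's implementation: the function names, the sample data, the encoding of a favorable outcome as 1, and the benefit values passed to the Theil index are all assumptions. The Theil index is written in its standard form, the mean of (b/mean) * ln(b/mean) over per-individual benefit values b.

```python
import math


def favorable_rate(outcomes):
    """Share of records with the favorable outcome (assumed to be encoded as 1)."""
    return sum(outcomes) / len(outcomes)


def split_by_group(groups, outcomes, protected_value):
    """Split outcomes into (protected class, majority class) by a categorical feature."""
    protected = [y for g, y in zip(groups, outcomes) if g == protected_value]
    majority = [y for g, y in zip(groups, outcomes) if g != protected_value]
    return protected, majority


def statistical_parity_difference(groups, outcomes, protected_value):
    """Difference in favorable-outcome rates; 0 means both classes fare equally."""
    protected, majority = split_by_group(groups, outcomes, protected_value)
    return favorable_rate(protected) - favorable_rate(majority)


def disparate_impact(groups, outcomes, protected_value):
    """Ratio of favorable-outcome rates; 1 means both classes fare equally."""
    protected, majority = split_by_group(groups, outcomes, protected_value)
    return favorable_rate(protected) / favorable_rate(majority)


def theil_index(benefits):
    """Standard Theil index of non-negative benefit values; 0 means perfect equality."""
    mean = sum(benefits) / len(benefits)
    total = 0.0
    for b in benefits:
        if b > 0:  # a zero benefit contributes nothing (limit of x * ln(x) as x -> 0)
            ratio = b / mean
            total += ratio * math.log(ratio)
    return total / len(benefits)


# Hypothetical sample: one categorical feature and a binary target.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]

spd = statistical_parity_difference(groups, outcomes, protected_value="B")
di = disparate_impact(groups, outcomes, protected_value="B")

# Hypothetical benefit values; how the product derives them is not stated in this topic.
equal_benefits = theil_index([1, 1, 1, 1])
unequal_benefits = theil_index([1, 1, 0, 0])

print(f"Statistical Parity Difference: {spd:.3f}")
print(f"Disparate Impact: {di:.3f}")
print(f"Theil Index (equal benefits): {equal_benefits:.3f}")
print(f"Theil Index (unequal benefits): {unequal_benefits:.3f}")
```

In this sample, class B receives the favorable outcome less often than class A, so the Statistical Parity Difference falls below 0 and the Disparate Impact falls below 1. The Theil index is 0 only when every individual has the same benefit value.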