6

I'm curious to see what questions other people ask themselves when faced with new data. What are some common questions you ask yourself when trying to understand the data or performing exploratory data analysis (EDA)? What labels do you deem necessary? Is there a 'correct' response? How would you differentiate which columns are helpful or unhelpful?

Any resources/links/books would be much appreciated!

zampoan
  • 85
  • 1
  • 5

3 Answers

4

Good question - everyone talks about understanding your data, but no one actually explains how. Keeping one question in mind will serve you well here: "How is the real world encoded in my data?"

If I have a new dataset that I need to deeply understand, I normally take the following approach (I have never formalised it before):

  • Split the analysis into 3 sections:
    1. Target analysis
    2. Univariate feature analysis
    3. Multivariate feature analysis
  • For each section:
    1. Domain: What data source has the feature come from (e.g. which SQL tables)? What does it mean in the real world?
    2. Basic stats: What type/scale of data is it (ordinal, continuous, categorical, cyclical)? How is the information stored (object, categorical, datetime, float, int)? What are the min, max, skew and kurtosis? Is it stationary?
    3. Distribution: Plot the distribution(s). Use the simplest plot possible to understand its unique characteristics. How does it relate to the target? How does it change over time?
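The "basic stats" step above can be sketched with pandas roughly as follows. This is only an illustration, not the answerer's own code: the function name `basic_stats` and the sample columns are made up for the example.

```python
import pandas as pd


def basic_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview: storage dtype, cardinality and simple moments.

    Hypothetical helper for illustration; non-numeric columns get NaN
    for the numeric statistics.
    """
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),   # how the information is stored
            "n_unique": df.nunique(),         # hints at categorical vs continuous
            "min": numeric.min(),
            "max": numeric.max(),
            "skew": numeric.skew(),
            "kurtosis": numeric.kurt(),
        }
    )


# Made-up sample data, purely to show the call
sample = pd.DataFrame({"age": [22, 35, 58, 41], "city": ["a", "b", "a", "c"]})
summary = basic_stats(sample)
```

Each row of `summary` describes one column of the input, which makes it easy to scan for features whose storage type does not match what they mean in the real world.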

There is a much bigger list of what to look for, but following these steps should help. Put your modelling hat on when doing this: if you see something interesting that could be leveraged by your model, note it down. For example, if you see an interesting relationship between two features, think about what it would mean to combine the two. This will help with feature engineering down the line.

GooJ
  • 435
  • 2
  • 11
4

That's a very good but also very general question, because many of the questions you ask depend on the domain of the data and the problem at hand. Still, I have tried to put together a general list that applies in most cases.

When I am handed a new dataset, the first thing I do is plot the data to get a good sense of its basic properties and dynamics. For numeric data I usually start with simple scatter plots, box plots or line plots (if the data contain time series); for categorical data I use bar plots. These basic plots help me understand the type and scale of the different variables, and can give a first sense of the relationships within the dataset.
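As a rough sketch of that first plotting pass, the rule of thumb above (line plots for time series, box plots for other numeric data, bar plots for categorical data) could be written down like this. The helper name `suggest_plot` is hypothetical, and treating "the DataFrame has a datetime index" as the sign of a time series is an assumption of the example.

```python
import pandas as pd
from pandas.api import types as ptypes


def suggest_plot(df: pd.DataFrame, column: str) -> str:
    """Return a simple first plot kind for one column (illustrative only)."""
    series = df[column]
    if ptypes.is_numeric_dtype(series):
        # Assumption: a datetime index means the column is a time series
        if ptypes.is_datetime64_any_dtype(df.index):
            return "line"
        return "box"
    # Anything non-numeric is treated as categorical here
    return "bar"
```

With matplotlib installed, a numeric column can then be drawn with `df[col].plot(kind=suggest_plot(df, col))`, and a categorical one with `df[col].value_counts().plot(kind="bar")`.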

After that, I believe the questions that I ask myself could be divided into pre-processing-related and analysis-related ones.

Pre-processing-related questions can concern:

  • Existence and handling of outliers
  • Presence and handling of missing values
  • Identification (and handling) of multicollinearity
  • Distribution of the different variables: many machine learning methods, and even simple statistical models, make assumptions about the data distributions, so we need to understand the data well enough to check that those assumptions are met
  • Suspicious data that do not make sense given the domain (e.g. a dataset of temperature time series for different regions that contains values in the hundreds, or a dataset of electricity consumption measurements that contains negative values) and ways (if any) to correct them (e.g. there might have been a mix of Celsius and Kelvin measurements, or some measurements might come from a meter that was reversed)
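A few of the checks above (missing values, outliers, and values that make no sense for the domain) can be sketched as small pandas helpers. The function names, the 1.5 × IQR rule and the valid ranges are all assumptions chosen for illustration; the right thresholds depend on your data and domain.

```python
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column."""
    return df.isna().mean()


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is one common heuristic, not the only option.
    Missing values are never flagged.
    """
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


def out_of_range(series: pd.Series, low: float, high: float) -> pd.Series:
    """Boolean mask of values outside a domain-defined valid range."""
    return (series < low) | (series > high)


# Made-up readings: one temperature looks like Kelvin, one consumption is negative
readings = pd.DataFrame(
    {
        "temp_c": [21.0, 19.5, 295.0, 22.3, None, 20.1],
        "kwh": [1.2, 0.9, 1.1, -0.4, 1.0, 1.3],
    }
)
suspect_temps = readings[iqr_outliers(readings["temp_c"])]
negative_kwh = readings[out_of_range(readings["kwh"], low=0.0, high=float("inf"))]
```

Flagging rows rather than dropping them right away leaves room to decide on a correction afterwards, such as converting a Kelvin reading instead of discarding it.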

Analysis related questions:

  • The first and most important question is what problem we are being asked to solve with the data. Is it a regression problem, a classification problem, time-series prediction, etc., and what are the state-of-the-art tools that can be used for this purpose?

  • Data limitations, in relation to the state-of-the-art methods to solve the problem, considering the potential specificities of the data at hand e.g. lots of categorical variables, class imbalances, very long or very short time-series etc.

  • Dimensionality reduction/Feature extraction: Can the data be used to extract other, more concrete/informative features?

  • Feature selection: We want to identify and use only the most informative features for the analysis (this can be related to the multicollinearity mentioned above)
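One simple way to connect feature selection with multicollinearity is a correlation filter: for every pair of numeric features whose absolute Pearson correlation exceeds a threshold, keep the first and drop the second. The sketch below is just one possible heuristic, with a hypothetical function name and an arbitrary threshold of 0.9.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.9):
    """Drop the later column of each highly correlated numeric pair.

    Illustrative heuristic only; returns (reduced frame, dropped names).
    """
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is considered once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Made-up features: "b" is an exact multiple of "a"
features = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 1, 4, 2, 3]}
)
reduced, dropped = drop_correlated(features)
```

Which column of a correlated pair gets dropped here depends only on column order, so in practice you may prefer to keep the one that is easier to interpret or more strongly related to the target.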

missrg
  • 578
  • 2
  • 12
2

There is a lot to explore when you see a new dataset.

  • Check if there are missing values. If yes then think of ways to deal with them.

  • Which is the target variable (the one you want to predict or classify)?

  • Is the dataset imbalanced?

  • For continuous variables plot a histogram to see the distribution of values and look for:
    Outliers, Median, Min, Max, Mean, Skewness etc.

  • For categorical variables, a bar chart or pie chart can be used.

  • Try grouping the data by different categories and try different plots to surface new insights.

  • You can use a correlation plot with a heatmap to see the correlation of all features with the target variable.

  • Also think creatively to make new features from existing ones.
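Two items from this checklist, class imbalance and correlation with the target, are quick to check in pandas. The helper names and the column names below are made up for the example, and Pearson correlation is an assumption; it only captures linear relationships between numeric columns.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, target: str) -> pd.Series:
    """Share of rows in each target class."""
    return df[target].value_counts(normalize=True)


def target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of each numeric feature with the target, strongest first."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up data with a 4:1 class split
data = pd.DataFrame(
    {
        "x1": [1, 2, 3, 4, 10],
        "x2": [3, 1, 2, 3, 2],
        "label": [0, 0, 0, 0, 1],
    }
)
balance = class_balance(data, "label")
ranking = target_correlations(data, "label")
```

The same numbers are what a correlation heatmap would display; the ranked series is just a compact way to read off the target row.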

missrg
  • 578
  • 2
  • 12