That's a very good but also very general question, because many of the questions you would ask naturally depend on the domain of the data and the problem at hand. Still, I have tried to put together a general list that applies in most cases.
When I am handed a new dataset, the first thing I do is plot the data to get a good sense of its basic properties and dynamics. I usually start with simple scatter plots, box plots or line plots (if the data contain time series) for numeric variables, and bar plots for categorical ones. These basic plots help me understand the type and scale of the different variables, and can even give a first sense of the relationships within the dataset.
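As a rough illustration of this first look, the sketch below builds a small synthetic table and draws a scatter plot, a box plot and a bar plot with pandas and matplotlib. The column names (`temperature`, `consumption`, `region`) and the output file name are made up for the example; this is only one possible way to set it up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data; all column names here are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "temperature": rng.normal(15, 8, 200),
    "consumption": rng.gamma(2.0, 50.0, 200),
    "region": rng.choice(["north", "south", "east"], 200),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Scatter plot: relationship between two numeric variables.
axes[0].scatter(df["temperature"], df["consumption"], s=8)
axes[0].set_xlabel("temperature")
axes[0].set_ylabel("consumption")

# Box plot: scale and spread of a single numeric variable.
axes[1].boxplot(df["consumption"])
axes[1].set_title("consumption")

# Bar plot: number of rows per category.
counts = df["region"].value_counts()
axes[2].bar(counts.index, counts.values)
axes[2].set_title("region")

fig.tight_layout()
fig.savefig("first_look.png")
```

For time series, `axes[i].plot(...)` against a datetime index would take the place of the scatter plot.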
After that, I believe the questions that I ask myself could be divided into pre-processing-related and analysis-related ones.
Pre-processing-related questions can concern:
- Existence and handling of outliers
- Existence and handling of missing values
- Identification (and handling) of multicollinearity
- Distribution of the different variables: many machine learning methods, and even simple statistical models, make assumptions about the distribution of the data, so we need to understand the data well enough to check that those assumptions hold
- Suspicious data that do not make sense given the domain (e.g. time series of temperatures in different regions containing values in the hundreds, or electricity consumption measurements containing negative values) and ways, if any, to correct them (e.g. Celsius and Kelvin measurements may have been mixed, or some measurements may have been taken while the meter was reversed)
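The checks in the list above can be sketched with pandas roughly as follows. This is a minimal example on toy data, not a complete pre-processing routine: the column names, the injected problems, the 1.5 × IQR outlier rule, the 0.9 correlation threshold and the "above 200 means Kelvin" correction are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; column names and injected problems are made up for illustration.
rng = np.random.default_rng(1)
n = 300
temp = rng.normal(15, 5, n)
temp[:5] += 273.15                 # five readings accidentally recorded in Kelvin
consumption = rng.gamma(2.0, 50.0, n)
consumption[5:8] = np.nan          # three missing values
consumption[8] = -20.0             # one negative meter reading
df = pd.DataFrame({
    "temp": temp,
    "consumption_kwh": consumption,
    # Same quantity in another unit, so nearly collinear with consumption_kwh.
    "consumption_wh": consumption * 1000 + rng.normal(0, 10, n),
})

# Missing values per column.
missing = df.isna().sum()

# Outliers by the 1.5 * IQR rule (one common heuristic among several).
q1, q3 = df["temp"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["temp"] < q1 - 1.5 * iqr) | (df["temp"] > q3 + 1.5 * iqr)]

# Multicollinearity: pairs of columns with a high absolute Pearson correlation.
corr = df.corr().abs()
high_corr = [(a, b) for i, a in enumerate(corr.columns)
             for b in corr.columns[i + 1:] if corr.loc[a, b] > 0.9]

# Distribution shape: skewness far from 0 hints at a non-normal variable.
skew = df.skew()

# Domain check: consumption should never be negative.
negative = df[df["consumption_kwh"] < 0]

# One possible correction, assuming readings above 200 were recorded in Kelvin.
temp_fixed = df["temp"].where(df["temp"] < 200, df["temp"] - 273.15)
```

Whether a flagged value should be corrected, dropped or kept is a domain decision; the code only surfaces the candidates.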
Analysis-related questions:
- The first and most important question is what problem we are asked to solve with the data: is it a regression problem, a classification problem, time-series prediction, etc., and what are the state-of-the-art tools that can be used for this purpose?
- Data limitations in relation to the state-of-the-art methods for the problem, considering the specificities of the data at hand, e.g. lots of categorical variables, class imbalance, very long or very short time series
- Dimensionality reduction/feature extraction: can the data be used to extract other, more concrete/informative features?
- Feature selection: we want to identify and use only the most informative features for the analysis (this can be related to the multicollinearity mentioned above)
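To make the last two points a bit more concrete, here is a small NumPy-only sketch on synthetic data. It uses a simple correlation filter for feature selection and PCA (via SVD) for dimensionality reduction; both are just one option among many, and the 0.95 thresholds are arbitrary assumptions.

```python
import numpy as np

# Synthetic feature matrix: feature 2 is nearly a scaled copy of feature 0.
rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([
    x1,
    x2,
    2.0 * x1 + rng.normal(scale=0.05, size=n),
    rng.normal(size=n),
])

# Feature selection: greedily keep a feature only if it is not highly
# correlated with a feature that was already kept.
corr = np.abs(np.corrcoef(X, rowvar=False))
kept = []
for j in range(X.shape[1]):
    if all(corr[j, k] < 0.95 for k in kept):
        kept.append(j)

# Dimensionality reduction: PCA via SVD of the standardized data.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, s, vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)      # share of variance per component
# Smallest number of components explaining at least 95% of the variance.
n_comp = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
scores = Z @ vt[:n_comp].T           # data projected onto the leading components
```

Here the filter drops the redundant feature, and PCA reaches the same conclusion from a different angle: three components carry almost all of the variance of the four columns.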