Questions tagged [data-cleaning]

Data cleaning is a preliminary step to statistical analysis in which the data-set is edited to correct errors and to put it into a form suitable for processing by statistical software. Exploratory data analysis techniques are often used to identify problems.

762 questions
6
votes
7 answers

Good practices for manual modifications of data

More often than not, data I am working with is not 100% clean. Even if it is reasonably clean, still there are portions that need to be fixed. When a fraction of data needs it, I write a script and incorporate it in data processing. But what to do…
Piotr Migdal
  • 756
  • 5
  • 15
3
votes
4 answers

Is it good practice to convert columns with a number to a range between 0 and 1?

Relatively new to data science. I heard something about converting columns which contain integers into a range between 0 and 1. I think the reasoning was so that all the columns will be more similar in their range. I think along with that there…
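The rescaling this question describes is commonly called min-max scaling. A minimal sketch in plain Python follows, assuming a single numeric column held in a list; the function name `min_max_scale` and the sample `ages` column are hypothetical and not taken from the question.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical integer column, for illustration only.
ages = [18, 30, 42, 66]
scaled = min_max_scale(ages)
print(scaled)
```

Whether this is good practice depends on the model: distance-based and gradient-based methods tend to be sensitive to feature ranges, while tree-based methods generally are not.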
3
votes
1 answer

How do I remove outliers from my data? Should I use RobustScaler? I am aware I can use DecisionTree but I want to use XGBoost

How do I remove outliers from my data? Should I use RobustScaler? I am aware I can use DecisionTree but I want to use XGBoost... Please can you help me? This is a bit urgent and I am not sure how to do it. I have researched and seen previous question…
omkaartg
  • 155
  • 8
3
votes
1 answer

Simple Excel Question: VLookup Error

My data looks like this: Why is this error showing up?
Minu
  • 805
  • 2
  • 9
  • 18
2
votes
1 answer

Remove indifferent respondents in survey data

I have data for a product rating survey, which requires the respondents to rate a product on five levels: Very Bad, Bad, Regular, Good and Very Good. This survey was applied to several communities of clients. After taking a look at the data, I…
2
votes
0 answers

What tools are available for semi-automated matching of dirty columnar data

Are there any automated or semi-automated tools for finding matching or "similar" data in two columnar data sets? The data I'm working with was collected (and handled) by different organizations. Some rows describe the same events and even carry…
D. Woods
  • 121
  • 4
1
vote
1 answer

Handling Missing Inconsistent Educational Data

I'm an educational researcher learning about machine learning so I can further explore my data beyond the usual statistics. I currently have some assessment data but I am not sure how to appropriately handle features with 'no data'. For example, a…
VBNub
  • 11
  • 1
1
vote
0 answers

Is there a pandoc for data manipulation?

You know how pandoc converts Markdown to HTML, PDF, etc.? Is there a pandoc for data manipulation, for example from SQL to pandas or from SQL to dplyr?
xiaodai
  • 630
  • 1
  • 5
  • 13
1
vote
0 answers

Term for an identifier that has been superseded

Is there a 'proper' term for an ID (or IDs) that has been superseded by (or merged into) another ID? My use case: rsIDs are used by geneticists to refer to a SNP. These take the form of a string 'rs#####'. Over time, some of these rsIDs are merged…
pufferfish
  • 141
  • 4
1
vote
1 answer

Normalization and outlier analysis on a continuous target variable

Should I perform outlier analysis and normalization on the target variable as well, given that it is continuous?
Navneeth
  • 11
  • 1
  • 2
1
vote
0 answers

Extracting the time duration for events from an event log

I have an event file (very similar to a log file) that logs event information that includes user data and time stamps. The events have an ordering for each user. So event E1 occurs before E2 that occurs before E3 and so on. For each user, I need to…
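One way to approach the kind of task this question describes is sketched below, under the assumption that each log row carries a user ID, an event name, and an ISO 8601 timestamp, and that the wanted duration is the gap between consecutive events of the same user. The row layout, the function name `event_durations`, and the sample `log` are all hypothetical, since the excerpt does not show the actual file format.

```python
from collections import defaultdict
from datetime import datetime


def event_durations(rows):
    """Return, per user, the seconds elapsed between consecutive events."""
    by_user = defaultdict(list)
    for user, event, ts in rows:
        by_user[user].append((datetime.fromisoformat(ts), event))

    durations = {}
    for user, events in by_user.items():
        events.sort()  # order each user's events by timestamp
        durations[user] = [
            (a_name, b_name, (b_ts - a_ts).total_seconds())
            for (a_ts, a_name), (b_ts, b_name) in zip(events, events[1:])
        ]
    return durations


# Hypothetical log rows: (user, event, timestamp), deliberately interleaved.
log = [
    ("u1", "E1", "2020-01-01T10:00:00"),
    ("u1", "E2", "2020-01-01T10:05:00"),
    ("u2", "E1", "2020-01-01T09:00:00"),
    ("u1", "E3", "2020-01-01T10:07:30"),
    ("u2", "E2", "2020-01-01T09:00:45"),
]
result = event_durations(log)
print(result)
```

Grouping by user before sorting keeps interleaved rows from different users from being paired with each other.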
1
vote
1 answer

Symlink or Rename files with spaces?

Premise: I often get files from colleagues that I need to work on. Often times, these files have spaces in the names. Working with these files at the command line or in scripts can be tedious. Possible solutions: With the rename program (on nix…
John
  • 111
  • 4
1
vote
3 answers

When to clean data?

I am very new to data science / ML and I have what I think is a very basic question - when to 'clean' the data? Do I clean data before using it to train a classifier (a binary classifier in my experiments)? Do I clean data that I try to classify…
Jimmy Collins
  • 253
  • 2
  • 4
1
vote
5 answers

Cleaning time series data

I have time series data on the daily usage of a computer program. Here is an example: 2017-11-10: 0, 2017-11-09: 14, 2017-11-08: 0, 2017-11-07: 6, 2017-11-06: 102, 2017-11-05: 0, 2017-11-04: 0. As you can see, 11-06 has a spike at 102. Due to our way of…
melih
  • 133
  • 1
  • 6
1
vote
1 answer

Describing the data cleaning process

The term "Data Cleaning" is used to describe everything from outlier checking, date parsing, and missing value imputation to structuring datasets (organizing data values within a dataset) to facilitate analysis. The latter is commonly referred to as "Data Tidying"…
grldsndrs
  • 567
  • 4
  • 11