
I have a huge text file, so I'm reading it line by line, applying some basic cleaning, and writing X and Y separately to two different CSV files. I then prepare three directories for each CSV (train, val and test) and write each line as a separate CSV file to the appropriate directory. This makes it convenient to use the fit_generator() method, reading these files one at a time to train the model.

The concern is that I have pre-processing steps to run before training, and performing them on this many files, one file at a time, doesn't seem practical: it won't be time-efficient because the operations aren't vectorized, and there would be a lot of disk I/O since each processed file also has to be stored. Are there other approaches for dealing with such scenarios? What are the best practices? Are custom generator functions the only way? I'd appreciate any help.
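For reference, this is roughly what I mean by a custom generator: a minimal sketch (the directory names, the preprocess() helper, and the batch size are made up) that reads the per-sample CSV files in batches and applies the pre-processing on the fly, so no intermediate processed files have to be written to disk.

```python
import os
import numpy as np
import pandas as pd

def preprocess(batch_df):
    # placeholder for whatever cleaning / feature steps are needed,
    # applied to a whole batch at once so the work stays vectorized
    return batch_df.to_numpy(dtype="float32")

def csv_batch_generator(x_dir, y_dir, batch_size=32):
    # assumes X and Y directories contain files with matching names
    file_names = sorted(os.listdir(x_dir))
    while True:                                 # Keras generators loop forever
        for start in range(0, len(file_names), batch_size):
            names = file_names[start:start + batch_size]
            x = pd.concat(pd.read_csv(os.path.join(x_dir, n)) for n in names)
            y = pd.concat(pd.read_csv(os.path.join(y_dir, n)) for n in names)
            yield preprocess(x), y.to_numpy(dtype="float32")

# model.fit_generator(csv_batch_generator("train/X", "train/Y"),
#                     steps_per_epoch=num_train_files // 32)
```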

Update: Also, what if my processed data set is a COO matrix? Is there a viable way to store it other than converting it to dense before writing? Moreover, my concern is neither optimal resource utilization nor time efficiency; it is more about the different ways of handling such scenarios. An example would help.
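To make the sparse part of the question concrete, here is a minimal sketch, assuming SciPy (file names are made up): a COO matrix can be written and read back in sparse form with save_npz / load_npz, so densifying before writing would not be needed.

```python
import numpy as np
from scipy import sparse

coo = sparse.coo_matrix(np.eye(4))            # example sparse matrix
sparse.save_npz("x_train_sparse.npz", coo)    # stored without densifying

x = sparse.load_npz("x_train_sparse.npz")     # loaded back as a sparse matrix
dense_batch = x.tocsr()[:2].toarray()         # densify only one batch at a time
```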

3 Answers

  • Reduce data types where possible, for example int32 to int16, but be careful: make sure you don't lose important information by downcasting.
  • Iteratively read the CSV and dump its lines into a SQLite table; working with a database is faster than working with the CSV file directly (see the sketch after this list).
  • Use a library for parallel computing in Python, such as Dask or Pandarallel.
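A minimal sketch of the first two points (the file, table and chunk size are assumptions): read the big CSV in chunks, downcast the numeric columns, and append each chunk to a SQLite table.

```python
import sqlite3
import numpy as np
import pandas as pd

conn = sqlite3.connect("data.db")
for chunk in pd.read_csv("huge_file.csv", chunksize=100_000):
    # downcast numeric columns, e.g. int64 -> int16, float64 -> float32
    for col in chunk.select_dtypes(include=np.number).columns:
        if pd.api.types.is_integer_dtype(chunk[col]):
            chunk[col] = pd.to_numeric(chunk[col], downcast="integer")
        else:
            chunk[col] = pd.to_numeric(chunk[col], downcast="float")
    chunk.to_sql("samples", conn, if_exists="append", index=False)
conn.close()
```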
fuwiak
  • Try PySpark and its optimizations for huge amounts of data (parallelization, batch reading, etc.); see the sketch after this list.
  • Do your pre-processing step by step and save the result of each step in a file/table.
  • Alternatively, do your pre-processing in a single DataFrame and only write it out to files at the end, slicing it to get your X, Y and test sets.
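A minimal PySpark sketch (the path, column name and cleaning step are assumptions): Spark reads the file in parallel partitions, applies the transformation lazily, and writes the result back out without loading everything into memory at once.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preprocess").getOrCreate()

# read and clean in parallel across partitions
df = spark.read.csv("huge_file.csv", header=True, inferSchema=True)
df = df.withColumn("text", F.lower(F.trim(F.col("text"))))   # example cleaning

# split and write out the three sets
train, val, test = df.randomSplit([0.8, 0.1, 0.1], seed=42)
train.write.mode("overwrite").csv("train_csv", header=True)
val.write.mode("overwrite").csv("val_csv", header=True)
test.write.mode("overwrite").csv("test_csv", header=True)
```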
Catalina Chircu

One option is to move to a cloud computing service and rent a larger, faster computer that is not memory constrained.

Brian Spiering