
I have a huge text file, so I'm reading it line by line, applying some basic cleaning, and writing X and Y separately to two different CSV files. I then prepare three directories for each CSV (train, val and test) and write each line as a separate CSV file to the appropriate directory. This makes it convenient to use the fit_generator() method, reading these files one at a time to train the model.

The concern is that I have pre-processing steps to run before training, and performing them on this many files, one file at a time, doesn't seem practical: it won't be time-efficient because the operations aren't vectorized, and there would be a lot of disk I/O since each processed file also has to be stored. Are there other approaches for dealing with such scenarios? What are the best practices? Are custom generator functions the only way? I'd appreciate any help.
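For reference, this is roughly what I mean by a custom generator: a minimal sketch (the directory names, the preprocess() helper, and the batch size are made up) that reads the per-sample CSV files in batches and applies the pre-processing on the fly, so no intermediate processed files have to be written to disk.

```python
import os
import numpy as np
import pandas as pd

def preprocess(batch_df):
    # placeholder for whatever cleaning / feature steps are needed,
    # applied to a whole batch at once so the work stays vectorized
    return batch_df.to_numpy(dtype="float32")

def csv_batch_generator(x_dir, y_dir, batch_size=32):
    # assumes X and Y directories contain files with matching names
    file_names = sorted(os.listdir(x_dir))
    while True:                                 # Keras generators loop forever
        for start in range(0, len(file_names), batch_size):
            names = file_names[start:start + batch_size]
            x = pd.concat(pd.read_csv(os.path.join(x_dir, n)) for n in names)
            y = pd.concat(pd.read_csv(os.path.join(y_dir, n)) for n in names)
            yield preprocess(x), y.to_numpy(dtype="float32")

# model.fit_generator(csv_batch_generator("train/X", "train/Y"),
#                     steps_per_epoch=num_train_files // 32)
```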

Update: Also, what if my processed data set is a COO matrix? Is there a viable way to store it other than converting it to dense before writing? Moreover, my concern is neither optimal resource utilization nor time efficiency; it is more about the different ways of handling such scenarios. An example would help.
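To make the sparse part of the question concrete, here is a minimal sketch, assuming SciPy (file names are made up): a COO matrix can be written and read back in sparse form with save_npz / load_npz, so densifying before writing would not be needed.

```python
import numpy as np
from scipy import sparse

coo = sparse.coo_matrix(np.eye(4))            # example sparse matrix
sparse.save_npz("x_train_sparse.npz", coo)    # stored without densifying

x = sparse.load_npz("x_train_sparse.npz")     # loaded back as a sparse matrix
dense_batch = x.tocsr()[:2].toarray()         # densify only one batch at a time
```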

3 Answers

  • Reduce data types where possible, for example int32 to int16, but be careful: make sure you don't lose important information by downcasting.
  • Iteratively read the CSV and dump its lines into a SQLite table; working with a database is faster than working with the CSV file directly (see the sketch after this list).
  • Use a library for parallel computing in Python, such as Dask or Pandarallel.
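A minimal sketch of the first two points (the file, table and chunk size are assumptions): read the big CSV in chunks, downcast the numeric columns, and append each chunk to a SQLite table.

```python
import sqlite3
import numpy as np
import pandas as pd

conn = sqlite3.connect("data.db")
for chunk in pd.read_csv("huge_file.csv", chunksize=100_000):
    # downcast numeric columns, e.g. int64 -> int16, float64 -> float32
    for col in chunk.select_dtypes(include=np.number).columns:
        if pd.api.types.is_integer_dtype(chunk[col]):
            chunk[col] = pd.to_numeric(chunk[col], downcast="integer")
        else:
            chunk[col] = pd.to_numeric(chunk[col], downcast="float")
    chunk.to_sql("samples", conn, if_exists="append", index=False)
conn.close()
```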
fuwiak
  • Try PySpark and its optimizations for huge amounts of data (parallelization, batch reading, etc.); see the sketch after this list.
  • Do your pre-processing step by step and save the result of each step in a file/table.
  • Alternatively, do your pre-processing in a single DataFrame and only write it out to files at the end, slicing it to get your X, Y and test sets.
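A minimal PySpark sketch (the path, column name and cleaning step are assumptions): Spark reads the file in parallel partitions, applies the transformation lazily, and writes the result back out without loading everything into memory at once.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preprocess").getOrCreate()

# read and clean in parallel across partitions
df = spark.read.csv("huge_file.csv", header=True, inferSchema=True)
df = df.withColumn("text", F.lower(F.trim(F.col("text"))))   # example cleaning

# split and write out the three sets
train, val, test = df.randomSplit([0.8, 0.1, 0.1], seed=42)
train.write.mode("overwrite").csv("train_csv", header=True)
val.write.mode("overwrite").csv("val_csv", header=True)
test.write.mode("overwrite").csv("test_csv", header=True)
```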
Catalina Chircu

One option is to move to a cloud computing service and rent a larger, faster computer that is not memory constrained.

Brian Spiering