I have two CSV files (each several GB in size) that I am trying to merge, but every time I do, my computer hangs. Is there a way to merge them in chunks in pandas itself?
- By merge, do you mean performing JOIN operations or appending one file to another? – Rohan Jul 29 '16 at 16:51
- JOIN operation. Appending isn't that costly. – enterML Jul 29 '16 at 17:21
- Can you hold at least one of them in RAM? If so, you can iterate over the second frame in chunks to do your join, and append the results to a file in a loop. – user666 Jul 29 '16 at 18:08
- AFAIK it is not possible in Python. You could use Spark with Hive. You can load the data and run SQL-like queries on it. – Rohan Jul 29 '16 at 18:20
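For illustration, a minimal PySpark sketch of that idea (Hive is not strictly required for a plain join; the file names and the join key id here are placeholders, not from the original thread):

from pyspark.sql import SparkSession

# Spark spills to disk as needed, so the join is not limited by RAM.
spark = SparkSession.builder.appName("csv-join").getOrCreate()
df1 = spark.read.csv("DataSet1.csv", header=True, inferSchema=True)
df2 = spark.read.csv("DataSet2.csv", header=True, inferSchema=True)
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")
# 'id' is an assumed shared key column.
result = spark.sql("SELECT * FROM t1 JOIN t2 USING (id)")
result.write.mode("overwrite").csv("merged_out", header=True)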
2 Answers
When faced with such situations (loading and joining multi-GB CSV files), I found @user666's option of loading one data set (e.g. DataSet1) as a pandas DataFrame and processing the other (e.g. DataSet2) against it in chunks to be quite feasible.
Here is the code I implemented to load the first file; the chunked join against it is sketched below:
import pandas as pd

# Read DataSet1 in chunks and concatenate once at the end
# (pd.concat inside the loop re-copies the growing frame on every pass).
chunks = []
for chunk in pd.read_csv(path1 + 'DataSet1.csv', chunksize=100000, low_memory=False):
    chunks.append(chunk)
amgPd = pd.concat(chunks)
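The snippet above only builds the in-memory frame. The join half of @user666's suggestion (stream the second file in chunks, merge each chunk against the frame in RAM, and append the results to disk) could look like the following sketch; the file name DataSet2.csv, the join key id, and the output file merged.csv are placeholders:

# Sketch of the chunked join ('id' is an assumed key column,
# 'merged.csv' an assumed output file).
first = True
for chunk in pd.read_csv(path1 + 'DataSet2.csv', chunksize=100000, low_memory=False):
    merged = amgPd.merge(chunk, on='id', how='inner')
    # Write the header for the first chunk only, then append.
    merged.to_csv(path1 + 'merged.csv', mode='w' if first else 'a', header=first, index=False)
    first = False

This keeps only one chunk of DataSet2 in memory at a time, so the peak footprint is roughly DataSet1 plus a single chunk.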

– vsdaking
- But pandas holds its DataFrames in memory; would you really have enough RAM for large data sets? – NoName Feb 02 '20 at 06:57