Merging large CSV files in pandas

Question

I have two CSV files (each of the file size is in GBs) which I am trying to merge, but every time I do that, my computer hangs. Is there no way to merge them in chunks in pandas itself?

By merge, do you mean performing JOIN operations or appending one file to another? — Rohan, Jul 29 '16 at 16:51
Can you hold at least one of them in RAM? If so, you can use iterate over the second frame in chunks to do your join, and append the results to a file in a loop. — user666, Jul 29 '16 at 18:08
AFAIK it is not possible in Python. You could use Spark with Hive. You can load data and run SQL like queries on it. — Rohan, Jul 29 '16 at 18:20

score 4 · Accepted Answer · edited Dec 13 '18 at 07:33

4

No, there is not. You will have to use an alternative tool like dask, drill, spark, or a good old fashioned relational database.

edited Dec 13 '18 at 07:33

coldspeed

103
4

answered Jul 28 '16 at 17:32

Emre

10,491
1
29
39

score 3 · Answer 2 · answered Aug 06 '17 at 09:58

3

When faced with such situations (loading & appending multi-GB csv files), I found @user666's option of loading one data set (e.g. DataSet1) as a Pandas DF and appending the other (e.g. DataSet2) in chunks to the existing DF to be quite feasible.

Here is the code I implement:

import pandas as pd

amgPd = pd.DataFrame()
for chunk in pd.read_csv(path1+'DataSet1.csv', chunksize = 100000, low_memory=False):
    amgPd = pd.concat([amgPd,chunk])

answered Aug 06 '17 at 09:58

vsdaking

236
1
6

But pandas holds its DataFrames in memory, would you really have enough RAM for large data sets? – NoName Feb 02 '20 at 06:57

Merging large CSV files in pandas

2 Answers2

Linked