1

I used this solution to read in all CSV files from a Google Drive into a data frame in a Colab notebook using this solution (reading csv file with specific name in Python). Each file has the same naming convention and I want to split the file name into two new columns and append those to the dataframe.

The file name are structured like this: Platform_Company.csv (example Instagram_Microsoft.csv) and I want the columns to be appended at the beginning of the dataframe.

platform company employee id employee email
Instagram Microsoft person 1 [email protected]

So far, I've used this to read in the files. I'm not sure what the layer number is or whether I need it.

from pathlib import Path
import pandas as pd

ls_data = []

csv_directory = '/content/drive/MyDrive/Colab Notebooks/'

for idx, filename in enumerate(Path(csv_directory).glob('*Instagram_*.csv')):
    df_temp = pd.read_csv(filename)
    df_temp.insert(0, 'layer_number', idx)
    ls_data.append(df_temp) 

df = pd.concat(ls_data, axis=0)

I tried incorporating the following script (Read multiple csv files and Add filename as new column in pandas), but it isn't working and I'm not sure how to add it into the current version.

import glob
import os
import pandas as pd

path = r'\OUTPUT'
all_files = glob.glob(os.path.join(path, "*.csv"))     

df_from_each_file = (pd.read_csv(f, delimiter='|') for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
concatenated_df['filename'] =(all_files[f] for f in all_files)

Thanks for any guidance and/or suggestions!

1 Answers1

0

You can use (Platform, Company) as the key of a dict then use pd.concat to get the expected output:

import pandas as pd
import pathlib

csv_directory = pathlib.Path('/content/drive/MyDrive/Colab Notebooks/')

data = {}
for filename in csv_directory.glob('*Instagram_*.csv'):
    df = pd.read_csv(filename)
    platform, company = filename.stem.split('_')
    data[platform, company] = df

df = (pd.concat(data, axis=0).droplevel(-1)
        .rename_axis(['platform', 'company']).reset_index())

Output:

>>> df
    platform    company employee id             employee email
0  Instagram  Microsoft    person 1  [email protected]
1  Instagram   Facebook    person 2   [email protected]
2  Instagram   Facebook    person 3   [email protected]

Input files

# Instagram_Microsoft.csv
employee id,employee email
person 1,[email protected]

# Instagram_Facebook.csv
employee id,employee email
person 2,[email protected]
person 3,[email protected]
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • I want to try just grabbing the text before the underscore and tried removing company, but keep getting errors. It doesn't seem to find all of the files. How can I return just the part of the filename before the underscore to the list? I tried this - s.str.rsplit("/", n=1, expand=True) - but got this error AttributeError: 'PosixPath' object has no attribute 'split' –  Mar 18 '23 at 02:53
  • I also tried modifying it like this: https://pastebin.com/QeLGiyV3 and got this error: 'PosixPath' object has no attribute 'split' –  Mar 18 '23 at 03:17