How to use train_test_split with existing dataset?

Question

I am looking for an example of how to use train_test_split with an existing dataset. I have a CSV that can be brought into a DataFrame with:

data = pd.read_csv(r'c:\MyData.csv')

My aim is to use this data with a One-Class SVM. When I look at examples of train_test_split, though, they all seem to generate a random dataset and then use that. This is usually done with an X, y = assignment followed by the parameters for the generated data.

Looking at the sklearn webpage you then use:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

This would give you a test size of 33%. How do you make X and y come from data instead, or am I barking up completely the wrong tree and not thinking about this correctly?

Any help as always is greatly appreciated.

score 1 · Answer 1 · answered Oct 18 '21 at 18:19

You would first need to split your data variable into an X variable containing the features you want to use and a y variable that contains the value you are trying to predict. After this you can use train_test_split to split the data into a training and test dataset. Combining these two steps would look something like this:

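import pandas as pd
from sklearn.model_selection import train_test_split

# Raw string so the backslash in the Windows path is not treated as an escape
data = pd.read_csv(r'C:\MyData.csv')
# Features go in X, the column to predict goes in y (replace with your own column names)
X, y = data[["feature_1", "feature_2"]], data["target_column"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)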
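
Since the question mentions a One-Class SVM, which is fitted without labels, it may be enough to split only the features: train_test_split accepts a single array and then returns just two pieces. The following is a minimal sketch under stated assumptions, not a definitive setup. The column names feature_1 and feature_2 are hypothetical, the DataFrame is random stand-in data in place of the CSV, and the nu and gamma values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM

# Stand-in for pd.read_csv(...): random data with hypothetical column names.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "feature_1": rng.normal(size=100),
    "feature_2": rng.normal(size=100),
})

# No target column is needed, so only the feature matrix is split.
X = data[["feature_1", "feature_2"]]
X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

# nu and gamma are illustrative values, not tuned recommendations.
model = OneClassSVM(nu=0.1, gamma="scale")
model.fit(X_train)

# predict returns +1 for points judged inliers and -1 for outliers.
predictions = model.predict(X_test)
print(X_train.shape, X_test.shape)
```

If you do have a label column that you want to keep for evaluation, you can still pass both X and y as in the answer above and simply fit the model on X_train alone.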
