Query data dimension

Question

import numpy as np
from sklearn import preprocessing, cross_validation, neighbors
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('Downloads/breast-cancer-wisconsin.data.txt',skiprows=1)
df.replace('?', -99999, inplace=True)
df.drop('id', 1, inplace=True )

X= np.array(df.drop(['class'],1))
y= np.array(df['class'])

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.2)

#clf = neighbors.KNeighborsClassifier()
clf = LinearRegression(normalize=True)
clf.fit(X_train, y_train)

accuracy= clf.score(X_test, y_test)
print(accuracy)

example_measures = np.array([[4,2,1,1,1,2,3,2,1],[4,2,1,2,2,2,3,2,1]])
example_measures = example_measures.reshape(1,-1)

prediction = clf.predict(example_measures)     ##(example_measures)

print(prediction)

The problem arises when I run the code above on Ubuntu or in Anaconda:

ValueError: query data dimension must match training data dimension

How can I solve this problem? By running the lines one at a time, I found that the error occurs at:

prediction = clf.predict(example_measures)

If I use this instead:

prediction = clf.predict(X_test)

it works. But I really want to predict the examples I created myself. How should I change the code?

Can you please format your code correctly so that we can read it. — JahKnows, May 18 '18 at 06:17

ignoring_gravity · Answer 1 · 2018-05-18T12:38:36.853

How many columns do X_train and X_test have?

I'd imagine (though I can't confirm, as I don't have access to your data) that they don't have exactly 18 columns.

This is because your code

example_measures = np.array([[4,2,1,1,1,2,3,2,1],[4,2,1,2,2,2,3,2,1]])
example_measures = example_measures.reshape(1,-1)

produces an array of shape (1, 18).
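
To make the shape issue concrete, here is a minimal sketch using only NumPy and the two example rows from the question. The variable names other than example_measures are mine, added for illustration. It shows that reshape(1, -1) merges both rows into a single 18-value row, while leaving the array alone (or reshaping by the number of rows) keeps one row per sample:

```python
import numpy as np

# The two hand-made samples from the question, nine values each
example_measures = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1],
                             [4, 2, 1, 2, 2, 2, 3, 2, 1]])

# Already 2-D: two samples (rows) with nine features (columns)
shape_before = example_measures.shape

# reshape(1, -1) flattens everything into ONE row of 18 values
flattened = example_measures.reshape(1, -1)
shape_flattened = flattened.shape

# Reshaping by the number of rows keeps one row per sample
per_sample = example_measures.reshape(len(example_measures), -1)
shape_per_sample = per_sample.shape

print(shape_before, shape_flattened, shape_per_sample)
```

So if the model was trained on nine feature columns (which the nine values per example row suggest, though I can't check your file), dropping the reshape(1, -1) line, or using reshape(len(example_measures), -1), should give predict an array with the right number of columns.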

EDIT: I've tried matching your dataset, and got the following:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

# Load the features and the target in a single call
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# The normalize argument was removed in scikit-learn 1.2, so it is omitted here
clf = LinearRegression()
clf.fit(X_train, y_train)

# For LinearRegression, score() returns R^2 rather than classification accuracy
accuracy = clf.score(X_test, y_test)
print(accuracy)

If I call X.shape, I get (569, 30). So, if you want to make your own array to pass to clf, it needs to have 30 columns (one for each feature).
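
As a rough sketch of the "same number of columns" rule, the snippet below checks a query array against the training data before calling predict. It assumes a recent scikit-learn (with sklearn.model_selection) and uses the bundled load_breast_cancer data as a stand-in, so the 30 columns here need not match your CSV. The names query and n_features are just for illustration, and the query rows are borrowed from the test split rather than being real new measurements:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in data: the dataset bundled with scikit-learn, not the asker's CSV
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LinearRegression()
clf.fit(X_train, y_train)

# Number of feature columns the model was trained on
n_features = X_train.shape[1]

# Two query samples; borrowed from the test split purely for illustration
query = X_test[:2]

# Every query row must have exactly n_features values
if query.shape[1] != n_features:
    raise ValueError(
        f"query has {query.shape[1]} columns, expected {n_features}")

prediction = clf.predict(query)  # one prediction per query row
print(query.shape, prediction.shape)
```

The same check works with your own data: compare example_measures.shape[1] with X_train.shape[1] before predicting.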
