Object detection/recognition, pre-processing error

Question

My Imports:

# Importing modules 
import numpy as np 
import pandas as pd 
import os
import matplotlib.pyplot as plt
import cv2
from keras.utils import to_categorical
from keras.layers import Dense,Conv2D,Flatten,MaxPool2D,Dropout
from keras.models import Sequential
from sklearn.model_selection import train_test_split

My Code:

np.random.seed(1)
train_images = []
train_labels = []
shape = (108,108)
label_path = 'Beer/ModelSet/'
train_labels.append('miller_lite')
train_labels.append('stella_artois')
train_labels.append('michelob_ultra')
train_labels.append('belgian_blue')
for folder in os.listdir(label_path):
    for files in os.listdir(label_path+folder):
        img = cv2.imread(os.path.join(label_path,folder,files))
        train_images.append(img)
train_labels = np.asarray(pd.get_dummies(train_labels).values)
train_images = np.asarray(train_images)
x_train, x_val, y_train, y_val = train_test_split(train_images, train_labels, random_state=1)

The part that fails

    x_train, x_val, y_train, y_val = train_test_split(train_images, train_labels, random_state=1)

The reason it fails

ValueError: Found input variables with inconsistent numbers of samples: [20000, 4]

My Error:

2020-08-03 23:47:11.117431: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "/Attempt-2.py", line 40, in <module>
    x_train, x_val, y_train, y_val = train_test_split(train_images, train_labels, random_state=1)
  File "python3.8/site-packages/sklearn/model_selection/_split.py", line 2127, in train_test_split
    arrays = indexable(*arrays)
  File "python3.8/site-packages/sklearn/utils/validation.py", line 293, in indexable
    check_consistent_length(*result)
  File "python3.8/site-packages/sklearn/utils/validation.py", line 256, in check_consistent_length
    raise ValueError("Found input variables with inconsistent numbers of"
ValueError: Found input variables with inconsistent numbers of samples: [20000, 4]

The Goal:

Create a classification model that distinguishes between 4 different brands of beer bottles: Miller Lite, Stella Artois, Michelob Ultra, and Belgian Blue.

Background Note:

I have never had any practical/work experience with software engineering, data science, software development, machine learning, etc. I am just a student messing around for my own amusement/fun.

The Question:

How exactly do I fix this? I understand the problem is that x and y do not have the same number of samples. X is 20000 images of size 108x108 with 3 RGB channels. Y is the label matrix [[1 0 0 0][0 1 0 0][0 0 1 0][0 0 0 1]], which has only 4 rows. In order to split into train/test images, the error says I need the same length for both x and y.

You need 20K values for train_labels i.e. one for each data point (image). Then create the dummies and do the split — 10xAI, Aug 04 '20 at 07:51

score 0 · Answer 1 · answered Aug 04 '20 at 05:26

You have to create a label for each of the images and then split the data into train and test sets. I believe you have 20,000 images, so you also need one label per image, not just an array holding the 4 categories. Creating labels (as in your case) or annotations (for other problem statements) is one of the most important steps in training a DNN model for image classification or any other image-related task. I think you have only created an array of what your labels are and haven't associated them with the actual images.
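
As a rough sketch of what "one label per image" could look like: the snippet below assumes each class has its own sub-folder under the dataset root and that the folder name is the label (the question does not confirm this layout, so treat it as an assumption). The helper names collect_labeled_paths and one_hot are made up for illustration and use only the standard library.

```python
import os


def collect_labeled_paths(root):
    """Return (paths, labels) with one label per image file.

    Assumption: every sub-folder of `root` is one class, and the
    folder name is used as the label for each file inside it.
    """
    paths, labels = [], []
    # sorted() keeps the order deterministic; os.listdir() order is arbitrary
    for folder in sorted(os.listdir(root)):
        folder_path = os.path.join(root, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files sitting next to the class folders
        for name in sorted(os.listdir(folder_path)):
            paths.append(os.path.join(folder_path, name))
            labels.append(folder)  # appended once per image, not once per class
    return paths, labels


def one_hot(labels):
    """Encode string labels as one row per sample, one column per class."""
    classes = sorted(set(labels))
    column = {name: i for i, name in enumerate(classes)}
    rows = [
        [1 if column[label] == j else 0 for j in range(len(classes))]
        for label in labels
    ]
    return rows, classes
```

With paths, labels = collect_labeled_paths('Beer/ModelSet/'), the images can be loaded with cv2.imread(p) for each p in paths, and pd.get_dummies(labels).values should then produce a label matrix with one row per image instead of 4 rows. Because both arrays have the same length, train_test_split no longer has a reason to raise the inconsistent-samples error.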