Analyzing survey data for predictions

Question

I've got survey data that resembles:

|             | Q1a | Q1b | Q1c | Q2a | Q2b | Q2c | Classification |
| Respondent  | 1   | 0   | 0   | 1   | 0   | 0   | Red            |
| Respondent  | 0   | 0   | 1   | 1   | 0   | 0   | Green          |
| Respondent  | 0   | 1   | 0   | 0   | 0   | 1   | Yellow         |

I am trying to predict the classification for new respondents. Currently I'm using a Naive Bayes classifier and getting pretty bad accuracy (~20%). I don't have much training data, and the training data is hand-scraped from non-standard sources (internal company procedures are a mess here).

I'm looking for other ways to predict the classification.

I'm thinking about assigning weights to each question, and magically predicting the result based on those, somehow. Although I don't really know where to start learning about how to do that, and whether it's appropriate for this data. I have very little background in this :(

Any ideas or tips on predicting the classification column with so little training data?

score 2 · Accepted Answer · answered May 01 '15 at 20:57

Can you give a bit more information on the size of the data you're training on (and whether you're really basing the predictions on 6 parameters)? If it's really 6 questions with binary answers (1, 0 as you suggest), then there are 2^6 (i.e. 64) unique answer combinations, and to estimate a probability for each of them you'll want multiple entries per combination. Standard error scales like 1/sqrt(n), so a 10% standard error takes about 100 entries per combination, or roughly 6,400 inputs in total, which, given your description, sounds like more data than you may have. You may want to invest time in automating data collection.
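To make that arithmetic explicit, here is a minimal sketch of the estimate. It assumes the 1/sqrt(n) rule is applied separately to each answer combination and that responses are spread evenly across all combinations, which real survey data rarely does; the function name `required_samples` is hypothetical and only for illustration.

```python
import math

def required_samples(n_binary_features, target_std_err):
    """Rough sample-size estimate: combinations times entries per combination.

    Assumption: standard error ~ 1/sqrt(n) within each answer combination,
    and respondents are spread evenly over the combinations.
    """
    combinations = 2 ** n_binary_features                  # 2^6 = 64 for six 0/1 answers
    per_combination = math.ceil(1 / target_std_err ** 2)   # n such that 1/sqrt(n) <= target
    return combinations * per_combination

print(required_samples(6, 0.10))  # 64 combinations * 100 entries each
```

Treat the result as an order-of-magnitude guide; if some combinations never occur in practice, fewer inputs may be enough.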

If, on the other hand, you have a reasonably large data set and are looking for alternative models, both boosted decision trees and random forests sound like good candidates for this problem.

j.a.gartner


My survey has roughly 25 questions, but for any particular classification I only look at two or three specific questions that are designed to indicate a classification. Each of those questions is multiple choice. They could be either select-one or select-any. So I indicate a selected answer with 1 and a non-selected answer with 0. This approach may be incorrect from the get-go. I certainly don't have as much data as you indicate. – Ryan Day May 04 '15 at 00:11
Also, I'm trying to avoid hard-coded rules (If question 1 == choice A), because we tend to rewrite the questions from time to time, and even change a select-one to a select-any. We add questions and answer choices, and drop questions and answer choices. So I think that any hard-coded rules will change somewhat often. Thanks for the std err calc, more information, and other options as well! – Ryan Day May 04 '15 at 00:14
Without knowing the survey content it's hard to say if the 1-0 scale is appropriate. You could consider a binary encoding, i.e. a->1, b->2, c->4, then take their sum, and feed it to a random forest regression model. – j.a.gartner May 04 '15 at 04:36
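
The a->1, b->2, c->4 mapping from the last comment could be sketched roughly as follows. This is only an illustration: the helper name `encode_answers` and the choice labels are hypothetical, and it assumes each question's choices are known in a fixed order.

```python
def encode_answers(selected, choices):
    """Collapse one question's selected choices into a single integer.

    The i-th choice contributes 2**i (a->1, b->2, c->4), so the sum is a
    distinct code for every combination of selected choices.
    """
    weights = {choice: 2 ** i for i, choice in enumerate(choices)}
    # set() guards against the same choice being listed twice
    return sum(weights[c] for c in set(selected))

choices = ["a", "b", "c"]
print(encode_answers(["a"], choices))       # select-one: only a chosen
print(encode_answers(["a", "c"], choices))  # select-any: a and c chosen
```

Because each choice maps to its own bit, the same encoding covers both select-one and select-any questions. The resulting number is a label for a combination, not a magnitude, so it is worth checking how a given model treats it.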
