I have about 120 users with a total of 4500 data points. The user with the least data has about 5 data points and the user with the most has about 100. I would like to build a model that will make predictions for each user.

What is the optimal approach? Do I create a separate model for each user, or a single model with a categorical variable that identifies the user?

I would imagine the single-model approach would leverage the correlation between users, whereas the model-per-user approach might not have enough data per user to generalize well.

I have considered the pooling method described here: possible duplicate. However, the independent variables do not contain enough information to distinguish the users from one another, which is why I would have to add a categorical variable that identifies the user. That is, many users have the same input values but systematically different outputs.

The input variables include: arrival time, day of week, and temperature. The output variables include departure time and miles charged.
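To make the single-model option concrete, here is a minimal sketch of a linear model with one shared slope and a separate intercept per user, which is what a least-squares regression with one dummy variable per user reduces to for a single numeric input. The data, the function name `fit_user_intercepts`, and the choice of arrival hour as the only input are hypothetical and for illustration only.

```python
from collections import defaultdict


def fit_user_intercepts(rows):
    """Least-squares fit of y = slope * x + intercept[user].

    Equivalent to linear regression with one dummy variable per user,
    solved by centring x and y within each user. Assumes at least one
    user has more than one distinct x value.
    """
    by_user = defaultdict(list)
    for user, x, y in rows:
        by_user[user].append((x, y))

    # Per-user means of the input and the output.
    means = {
        user: (
            sum(x for x, _ in pts) / len(pts),
            sum(y for _, y in pts) / len(pts),
        )
        for user, pts in by_user.items()
    }

    # Shared slope from the within-user deviations.
    num = sum((x - means[u][0]) * (y - means[u][1]) for u, x, y in rows)
    den = sum((x - means[u][0]) ** 2 for u, x, y in rows)
    slope = num / den

    # Each user's intercept absorbs their systematic offset.
    intercepts = {u: my - slope * mx for u, (mx, my) in means.items()}
    return slope, intercepts


def predict(slope, intercepts, user, x):
    return slope * x + intercepts[user]


# Hypothetical rows: (user, arrival hour, departure hour).
# Both users arrive at the same times but leave at different times.
rows = [
    ("a", 8.0, 16.5), ("a", 9.0, 17.0), ("a", 10.0, 17.5),
    ("b", 8.0, 12.5), ("b", 9.0, 13.0), ("b", 10.0, 13.5),
]
slope, intercepts = fit_user_intercepts(rows)
```

The slope is estimated from all users at once, so a user with only 5 data points still benefits from the others, while the per-user intercept captures the systematic difference in outputs.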

Leonard Strnad
  • Have you considered making a joint model (i.e. one that applies to all users), or does it have to be personalized (i.e. separate output for each user)? – Djib2011 Aug 28 '18 at 10:23
  • Well, I think there is useful information in the user name as a variable. Otherwise, multiple users will have the same arrival time but systematically different departure times, which is what I am trying to predict. – Leonard Strnad Aug 29 '18 at 12:57
  • If you want to extract some information from the name (e.g. sex, nationality), I'd suggest doing it at a pre-processing step (e.g. through some rule or regular expressions) and storing that information in a separate variable. Otherwise, if you think your model is sophisticated enough to extract any useful information by itself you can pass the name as a variable. This seems to me like a typical structured ML problem. I don't see why you need every person to have its own variable. – Djib2011 Aug 29 '18 at 17:55
  • Is there any other data you can use to characterize the user? I ask because such user-level data could partly explain why the output distribution is conditional on the user. – Daniel Aug 29 '18 at 20:40

1 Answer

One option is to train a single neural network on all the data, then take that global model and fine-tune a separate copy of it for each individual customer.

This mitigates the cold-start problem for new customers, who can be served by the global model, while still tailoring predictions to the unique properties of each customer.
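As a rough illustration of the train-globally-then-fine-tune idea, here is a minimal sketch in plain Python. A one-input linear model trained by gradient descent stands in for the neural network, and the data, learning rates, and step counts are all hypothetical assumptions chosen only to keep the example small.

```python
def mse(params, data):
    """Mean squared error of y = w * x + b on a list of (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def gradient_steps(params, data, lr, steps):
    """Run plain gradient descent on the MSE, starting from params."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data: x = arrival hour minus 9 (centred), y = departure hour.
user_data = {
    "a": [(-1.0, 16.5), (0.0, 17.0), (1.0, 17.5)],
    "b": [(-1.0, 12.5), (0.0, 13.0), (1.0, 13.5)],
}

# Step 1: train one global model on the pooled data.
all_data = [pt for pts in user_data.values() for pt in pts]
global_params = gradient_steps((0.0, 0.0), all_data, lr=0.1, steps=500)

# Step 2: fine-tune a copy per user, starting from the global weights
# and taking only a few small steps so sparse users do not overfit.
user_params = {
    user: gradient_steps(global_params, pts, lr=0.05, steps=20)
    for user, pts in user_data.items()
}

# A user with no history yet falls back to the global model.
new_user_params = user_params.get("new_user", global_params)
```

The number of fine-tuning steps and the learning rate control how far each per-user model can drift from the global one; with very few data points per user, keeping both small is one way to limit overfitting.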

Brian Spiering