Regression model for a count process

Question

In R I have data where head(data) gives

day   count   promotion
1        33        20.8
2        23        17.1  
3        19         1.6  
4        37        20.8

Now day is simply the day (and is in order). promotion is the promotion-value for the day. It is simply the number of times an advertisement has been on television. count is the number of new users we got that day.

I want to investigate the impact the promotion-value has on new users (count). Since we have a count process, I thought it would be best to fit a Poisson regression model.

model=glm(formula= data$count ~ data$promotion, data=data)

When we type summary(model) we get

Coefficients:
           (Intercept)              good_users$promotion  
              13.40216                   0.24342

Degrees of Freedom: 793 Total (i.e. Null);  792 Residual
Null Deviance:      9484 
Residual Deviance: 9325     AIC: 12680

Here is a plot of the data.

But when I plot the fitted values for the model

points(model$promotion, model$fitted, col="blue")

we get this

Here is another plot that shows the same but where days with 0 promotion are removed.

How should I choose my regression model (should I use lm instead of glm), or is there a better approach to solve this? The data is not pretty but rather noisy, as the plots show, so what should one do?

Updated

Finding the sweet spot

I have done the following for finding a sweet spot. I divide data into 10 groups. group1 is simply a subset where the promotion-value is within 1:10. group2 is data where the promotion-value is between 11:20, and so on for the other groups. So in R we have

group1 <- subset(data, data$promotion %in% 1:10)
group2 <- subset(data, data$promotion %in% 11:20)
group3 <- subset(data, data$promotion %in% 21:30)
...
group10 <- subset(data, data$promotion %in% 91:100)

Now I can use wilcox.test to test whether there is a significant difference between the groups by typing

wilcox.test(group2, group1, alternative="greater")

which gives a low p-value, i.e. group2 has significantly higher new_good_users than group1. The same goes for

wilcox.test(group3, group2, alternative="greater")

but for wilcox.test(group4, group3, alternative="greater") I get a p-value of 0.20, i.e. there is no significant difference in new_good_users between group4 and group3. The same goes for the rest of the group pairs up to group10.

So this must mean that increasing promotion in the first groups gives an increase in new_good_users, but in the last groups we do not get that increase. This would mean that we have a sweet spot at group3, where the promotion-value is 21:30. Is this not correct?

Your data is not at all Poisson distributed, so it makes sense that this is not giving a good fit. I do not know a better approach, however. — Jan van der Vegt, Jun 02 '16 at 11:28
It seems like possibly the design of the analysis is out of whack in some sense. For a start, have you plotted the new users against either the days or against the cumulative number of promotions? I also note that apparently you expect a non-linear relationship (or at least a relationship that implies a square term - you want to find the 'sweet spot') but you don't appear to have attempted to fit a model that will account for that effect. — Robert de Graaf, Jun 02 '16 at 11:39
Yes I have plotted new users against promotion here: http://datascience.stackexchange.com/questions/11915/chose-the-right-regression-analysis/11926?noredirect=1#comment12079_11926 . So by square you mean I should make a regression model where I square the independent variable ? — Ole Petersen, Jun 02 '16 at 12:04
Having a quick look at the other question, I note that your plots suggest 'new users' has a relationship with promotion, but 'new good users' does not. Yes, squaring the promotion term is broadly what I meant, as a quadratic has a maximum, which you say is what you are trying to find. But again, while the 'new users' vs 'promotion' plot looks like a maximum could plausibly be found in those data, the plot of 'new good users' doesn't look all that likely to yield such a relationship. I say 'broadly what I meant' because it may really be non-linear, as said previously in the other question by XR SC. — Robert de Graaf, Jun 03 '16 at 01:36
Your plot is wrong. Plotting points(model$fitted) will plot the points with 1:N on the X-axis, yet the fitted values are those for the corresponding values of promotion in the data. Try plotting points(data$promotion, model$fitted). It should look a bit better, but there's still no obvious linear trend in your data... — Spacedman, Jun 04 '16 at 12:18
Also, your model summary doesn't come from the glm call you gave, because the text isn't right. If you can make a reproducible example we can probably help you. — Spacedman, Jun 04 '16 at 12:20
Your attempt at grouping the data for the wilcox test doesn't work with the data frame you gave. data$promotion %in% 1:10 matches where promotion is one of the integer values from 1 to 10, not any value between 1 and 10. So when I try this with the data you say you have, I get empty data frames and wilcox.test fails with an error message. So we have no idea what you have done and we can't help you. The business with the wilcox test is also so completely different from your original post that you should probably make a new question, and it should probably be on the statistics stackexchange site. — Spacedman, Jun 18 '16 at 08:41
I'm voting to close this question as off-topic because this is a statistics question and needs migrating to the stats site — Spacedman, Jun 18 '16 at 08:42
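
A minimal sketch of two points raised in the comments above, run on simulated data because the original data set is not available. The data frame `sim`, its 794 rows (chosen only to match the 793 total degrees of freedom shown in the question), and the coefficients used to generate the counts are all assumptions made for illustration. The sketch passes `family = poisson` explicitly, since `glm` without a `family` argument fits an ordinary Gaussian linear model, and it plots the fitted values against promotion rather than against the row index.

```r
# Simulated stand-in for the real data; every number here is an assumption.
set.seed(1)
n <- 794
sim <- data.frame(day = seq_len(n),
                  promotion = round(runif(n, 0, 100), 1))
# Counts generated from a log-linear mean with arbitrarily chosen coefficients.
sim$count <- rpois(n, lambda = exp(2.5 + 0.01 * sim$promotion))

# Use bare column names together with data =, and request the Poisson family.
model <- glm(count ~ promotion, family = poisson, data = sim)
print(coef(model))  # coefficients are on the log scale

# Plot the data, then overlay the fitted means at the matching promotion values.
plot(sim$promotion, sim$count, xlab = "promotion", ylab = "count")
points(sim$promotion, fitted(model), col = "blue", pch = 16)
```

Because the Poisson family uses a log link by default, `exp(coef(model))` gives the multiplicative change in the expected count per unit of promotion.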

score 6 · Accepted Answer · answered Jun 09 '16 at 17:06

I have to quote Tukey, perhaps the grandfather of data science:

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

I see nothing wrong with your Poisson model. In fact it's a pretty good fit to the data. The data is noisy. There is nothing you can do about it. Perhaps the noise is due to whatever else is on TV at the time, or the weather, or the phase of the moon. Whatever it is, it's not in your data.

If you reasonably think the weather might be affecting your data, get the weather data and add it. If it increases the log-likelihood (equivalently, reduces the deviance) enough for each degree of freedom it uses, then it's doing a good job and you leave it in. This is regression modelling 101.
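
As a hedged sketch of that check, the following compares two nested Poisson models on simulated data. The `temperature` covariate, the sample size, and the generating coefficients are hypothetical and stand in for whatever weather data might actually be collected.

```r
# Simulated data; 'temperature' is a hypothetical weather covariate.
set.seed(2)
n <- 500
sim <- data.frame(promotion = runif(n, 0, 100),
                  temperature = rnorm(n, mean = 15, sd = 8))
sim$count <- rpois(n, exp(2.5 + 0.01 * sim$promotion + 0.02 * sim$temperature))

# Nested models: without and with the extra covariate.
base_fit <- glm(count ~ promotion, family = poisson, data = sim)
full_fit <- glm(count ~ promotion + temperature, family = poisson, data = sim)

# Likelihood-ratio (deviance) test: is the drop in deviance large
# relative to the one extra degree of freedom spent?
cmp <- anova(base_fit, full_fit, test = "Chisq")
print(cmp)

# AIC penalises each extra parameter; the lower value is preferred.
print(AIC(base_fit, full_fit))
```

On real data the extra term stays in only if the test or the AIC comparison supports it.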

Of course there's a zillion other things you can do. Scale the data by any old transformation you want. Fit a quadratic. A quartic. A quintic. A spline. You could include the date and possible temporal correlation effects. But always bear in mind what Tukey was saying - if your data is noisy, you won't get anything much out of it. So it goes.
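
To illustrate two of those options, here is a small sketch that fits a linear, a quadratic, and a natural-spline Poisson model to simulated data and compares them by AIC. The saturating shape used to generate the counts and the choice of `df = 4` for the spline are assumptions picked only for the example.

```r
library(splines)  # ships with base R; provides ns()

# Simulated data whose mean rises and then flattens (shape is an assumption).
set.seed(3)
n <- 500
sim <- data.frame(promotion = runif(n, 0, 100))
sim$count <- rpois(n, exp(2 + 1.2 * (1 - exp(-sim$promotion / 20))))

# Three candidate mean structures on the log scale.
linear_fit <- glm(count ~ promotion, family = poisson, data = sim)
quad_fit   <- glm(count ~ poly(promotion, 2), family = poisson, data = sim)
spline_fit <- glm(count ~ ns(promotion, df = 4), family = poisson, data = sim)

# Compare the candidates; the lower AIC is preferred.
aic_table <- AIC(linear_fit, quad_fit, spline_fit)
print(aic_table)

# Predicted mean counts on a grid, useful for seeing where the curve flattens.
grid <- data.frame(promotion = seq(0, 100, by = 5))
grid$spline_mean <- predict(spline_fit, newdata = grid, type = "response")
print(head(grid))
```

Plotting `grid$spline_mean` against `grid$promotion` shows the fitted curve, though with noisy data the shape may be poorly determined.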

I had thought about the weather as well. One idea I had was to split the data into groups, for example 2 groups where one group 'low' has promotion-value less than 50 and one group 'high' has promotion-value higher than 50. Then I would test if there is a significant difference between the two groups. And there actually is (I used Wilcox). Now one could simply make 10 new groups the same way and see if there are groups where there is no significant difference. This would mean we have found the sweet spot. Do you think this approach is useful? Thanks. — Ole Petersen, Jun 10 '16 at 07:18
The data is noisy, but the lower bound of the data shows some trend. I would split the data into quartiles per day and look at the trends per quartile: what's the trend in minimum expected payoff per promotional spend, median, maximum? Then try to identify explanatory differences between these groupings. — AnserGIS, Jun 13 '16 at 12:17
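
The grouping idea discussed in this thread could be sketched as follows, again on simulated data (the data frame `sim` and its generating curve are assumptions). It uses `cut()` to assign each promotion value to a range, since `%in% 1:10` only matches exact integers, and it passes the `count` vectors to `wilcox.test` rather than whole data frames.

```r
# Simulated data with a mean that rises and then flattens (an assumption).
set.seed(4)
n <- 600
sim <- data.frame(promotion = round(runif(n, 0, 100), 1))
sim$count <- rpois(n, exp(2 + 1.2 * (1 - exp(-sim$promotion / 20))))

# cut() bins every value into an interval of width 10.
sim$group <- cut(sim$promotion, breaks = seq(0, 100, by = 10),
                 include.lowest = TRUE)
print(table(sim$group))

# One-sided Wilcoxon test of each bin against the bin just below it.
groups <- levels(sim$group)
p_values <- sapply(seq_len(length(groups) - 1), function(i) {
  lower <- sim$count[sim$group == groups[i]]
  upper <- sim$count[sim$group == groups[i + 1]]
  # Counts contain ties, so R warns that the p-value is approximate.
  suppressWarnings(wilcox.test(upper, lower, alternative = "greater")$p.value)
})
names(p_values) <- paste(groups[-1], "vs", groups[-length(groups)])
print(round(p_values, 3))
```

A non-significant p-value between two adjacent bins is not by itself evidence that the effect has levelled off, and running nine tests raises a multiple-comparison concern, so this is better treated as an exploratory look than as a way of locating a sweet spot.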
