
Let's say I want to predict the lifespan of an ad in a listing.

I know a bunch of things from the ad, like:

  • the title
  • the price
  • the location
  • etc

The target value is how long the ad stays in the listing before it is removed (i.e., the item has been sold).

What would be the best approach for engineering the target?

I've tried categorizing the log of the duration, but it's not leveraging the cyclic pattern you can see in the histogram of the lifespan (in hours):

[Figure: histogram of ad lifespan]

x-axis : lifespan in hours

  • What's your x-axis? The peaks and valleys could be a sign of a lurking variable, and I wonder if there is a way your model could take advantage of that (i.e., hour of the day, etc.?). – The Lyrist Jul 19 '18 at 16:16
  • edited: "x-axis : lifespan in hours" I have the information of the hour of the day when the ad was removed. What's a lurking variable? – Benjamin Toueg Jul 19 '18 at 17:00
  • I was wondering if there are some other potential features impacting your prediction result that are not immediately evident. For instance, at a quick glance it seems the peaks are at 1 + n x 24 hr and the valleys are at 13 + n x 24 hr. Potentially n and x mod 24 could help explain the frequency, etc. – The Lyrist Jul 19 '18 at 17:24
  • re: lurking variable: https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/regression-models/what-is-a-lurking-variable/ – The Lyrist Jul 19 '18 at 17:25

2 Answers


I think you need to come up with a way to treat the data such that you're thinking in days, not hours, right? The peaks look like they're just at 24, 48, 72, 96 hours (1 day, 2 days, 3 days, 4 days), etc., and the lifespans look roughly normally distributed around those peaks.

I think a good test might be to try a categorical approach to start, to see how well you can predict which 'peak' the ad belongs to (is this an ad in the 24-hour normal distribution? the 48-hour?). If you can figure out which peak that ad belongs to then you might be able to identify features that indicate whether it's more likely to be on the short- or long-side of the hump. If you have bad results putting the ads into categories that might tell you something too.
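As a rough sketch of that idea, the lifespan could be split into two targets: the index of the nearest peak (a class label) and the signed offset from that peak (short side or long side of the hump). This assumes the peaks sit at multiples of 24 hours, as described above; the function name `peak_and_offset` and the choice to clamp very short lifespans to the first peak are illustrative assumptions, not something taken from the question.

```python
def peak_and_offset(lifespan_hours, period=24.0):
    """Split a lifespan into (nearest peak index, signed offset in hours).

    Assumes peaks at 1 * period, 2 * period, ... (hypothetical helper).
    """
    # Nearest multiple of the period; lifespans shorter than half a
    # period are clamped to the first peak (an assumption).
    peak = max(1, round(lifespan_hours / period))
    # Negative offset: removed before the peak; positive: after it.
    offset = lifespan_hours - peak * period
    return peak, offset


for hours in (5, 26, 70):
    print(hours, peak_and_offset(hours))
```

The peak index can then serve as the categorical target, and the offset as a second, regression-style target once the peak is known.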

If you do try this, be careful of how you measure performance as your dataset will be unbalanced by the higher occurrence rate of quickly-pulled ads.
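To illustrate why the metric matters, here is a minimal sketch comparing plain accuracy with balanced accuracy (the mean of per-class recall) on made-up labels where the first peak dominates. The data and the helper name are purely illustrative; scikit-learn's `balanced_accuracy_score` computes the same quantity if that library is available.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Made-up example: 8 ads in the 1-day peak, 2 ads in the 2-day peak.
y_true = [1] * 8 + [2] * 2
# A lazy model that always predicts the most common peak.
y_pred = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal = balanced_accuracy(y_true, y_pred)
print(accuracy, bal)
```

Plain accuracy rewards the model for always guessing the majority peak, while balanced accuracy exposes that it never identifies the rarer one.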

Matthew

If I understand your chart correctly, the y-axis is the # of ads and x-axis is the duration of the ads.

What sort of products are these, who and where are the buyers from, are the purchases time-bound and when can the ads be posted? These might explain the cyclical patterns.

Srikrishna
  • The y-axis is the # of ads and the x-axis is the duration of the ads. In my business, "ad" refers to a job or a task made available on a mobile app (mobile crowd-tasking). Tasks consist of store checks for retailer clients. As a consequence, those tasks can only be accomplished during opening hours, which explains the cyclical patterns. – Benjamin Toueg Jul 19 '18 at 18:23