
Concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways.

With reference to the classic house price prediction use case:

House prices change over time, so a model fitted today may no longer make sense in the future.

What is the best approach to address concept drift?

  • Do we keep refreshing the training data, replacing the older house prices of yesteryear with recent ones?
  • Do we add an extra feature for date of sale, so that larger data sets carry a temporal aspect?
  • Do we change the model hyperparameters during retraining to build a model that better fits the new data?
Ben
  • What you are talking about is no longer a cross-section regression that looks at the features that determine prices at a single point in time. Obviously you can build a time series model that includes e.g. lag_house_price. However, to measure the impact of different features on house prices over time you would have to build a panel data model. You can have a look at some introductory econometric analysis resources. By the way, this is more of a Cross Validated question. – An economist Apr 17 '18 at 15:50
  • Appreciated, but in all honesty if this is not mentioned in such a course, then that is remiss. I recognize time series from stock markets and such, but did not quite equate the two. What I think you are saying is that, at the very least, we should refresh datasets. On a final note, not sure why it would not be part of Data Science. Thx. – thebluephantom Apr 17 '18 at 17:34
  • I do not think it is time series in any event. – thebluephantom Apr 17 '18 at 23:23
  • OK, if you write "It strikes me that house prices changes over time" or "replacing older house prices of yesteryear", that clearly indicates you want to add an extra dimension to the problem. However, your question is horribly unclear since you do not state which model you are talking about. Are you trying to predict house price with $Y = X \beta + \varepsilon$, or do your variables have not only the $i = 1, \dots, n$ dimension but also $t$, so that you aim to build a panel data model? It seems to me you cannot distinguish between a cross-section and a panel problem. – An economist Apr 18 '18 at 07:44
  • Don't I state that I am talking about Linear Regression? Which may have more than one feature. I think it is a valid question, but thanks anyway. It's interesting that you mentioned time series. I thought about it, we don't know when next a house will be sold. – thebluephantom Apr 18 '18 at 08:17
  • That is exactly my point: saying Multiple Regression does not say whether you are looking at cross section or time series, it just specifies the "model" you are using. I also assume you are using OLS estimation. In the case of cross section you can try to model date of sale using day-of-week or month dummies. – An economist Apr 18 '18 at 08:27
  • I am seeking guidance as I think it is an interesting point glossed over. age vs height is of course possible, which would imply it can be a feature in this way, so would year of sale be relevant or rather age of house? – thebluephantom Apr 18 '18 at 08:30
  • Now that question is about variable selection, and there are various ways to decide which model to choose. This should be a good start. You have to experiment with your model: try including both, or just one of them, and test. – An economist Apr 18 '18 at 08:39
  • Yes I could do that, but I think there was a bigger point to my question. Anyway. – thebluephantom Apr 18 '18 at 08:43
  • What you are looking at is drift in the data. ML builds a model from past data to try to predict the future (supervised) or describe the past (unsupervised). The fact is that data evolves with time, sometimes slowly, sometimes faster, along every combination of dimensions you can imagine. So a model can work well for some time, but if it doesn't follow the drift when it happens, it won't be able to capture reality. To conclude, I would say there is no permanent best model, unless the data follows the same pattern again and again, only different versions of one over time. – KyBe Apr 21 '18 at 06:52
  • Interesting how such a question can be put on hold or considered opinion based. I think there is some merit to the question, as I would not have asked it otherwise. Why is it on hold then? How would it need to be rephrased? When I look at the last comment by KyBe, it is actually how I perceive the answer to be, but as I am a learner and such a topic was not covered in the Coursera course on linear regression, I posed it here. Therefore I would ask that the hold be lifted. – thebluephantom Apr 21 '18 at 09:41
  • @KyBe This was indeed one of the conclusions I had reached, but then I thought that adding year of sale or age (via year built), albeit not the same things, could be an option. In fact I am going to experiment with this now. – thebluephantom Apr 21 '18 at 09:42
  • @thebluephantom I don't fully understand your last comment, but I tried to rewrite your question in the hope that it will be reopened. I'm curious about what you want to achieve, and I would be happy to discuss it with you in a more appropriate place. You are welcome in the Clustering4Ever Gitter room, where I hope we can take our intuitions further. – KyBe Apr 21 '18 at 15:24
  • I re-framed the question as an explanation about concept drift which is a valid machine learning concept. Please check the edit and let me know your thoughts. A valid answer should contain information included on https://en.wikipedia.org/wiki/Concept_drift#Possible_remedies plus model updates common in time series analysis/time series prediction/autoregressive models and or training weight of older data points. – wacax Apr 21 '18 at 20:59
  • OK, tomorrow or Monday we can connect. @KyBe, I will explain, is that a chatroom? – thebluephantom Apr 21 '18 at 21:58
  • @wacax As I am a novice I could not have expressed that, but the edit looks good. Thanks – thebluephantom Apr 21 '18 at 22:00
  • @thebluephantom, yes, Gitter rooms are chatrooms on specific topics; this one comes from a recent GitHub repo we created about scalable machine learning. Drift being one aspect of ML, I thought it could be a good place to exchange ideas. If you prefer another place, please tell me. – KyBe Apr 21 '18 at 23:56
  • @KyBe, Your timezone is? I am CET and would think tomorrow evening may be a good idea. – thebluephantom Apr 22 '18 at 12:30
  • I'm coming back to CET tomorrow; I presume I will be on the plane this evening CET. We can have a discussion any time from tomorrow onward. – KyBe Apr 23 '18 at 00:32
  • Tuesday is fine. – thebluephantom Apr 23 '18 at 05:44
  • @Stephen Rauch The post has been edited by someone more learned than myself. Curious as to how you folks judge a post on a topic like this to be opinion based. In the quest for knowledge there will always be the notion of intermediate opinions until a final conclusion is reached. – thebluephantom Apr 23 '18 at 05:47
  • @KyBe will have to be next week - I will look into the chat thing before hand – thebluephantom Apr 24 '18 at 09:34
  • @KyBe: Not sure why the question was closed after the edit. I am preparing some data and other material in GraphLab to discuss next week in chat. – thebluephantom Apr 26 '18 at 14:10

1 Answer


It's not really possible to address concept drift in general, but I can offer two similar answers for drift in house prices:

  • As with other prices, the drift is usually well measured and studied. Just as one would correct prices for inflation, one can correct past house prices with a housing index (typically this index for the US). This helps your model by making prices comparable across years.

  • Another way to tackle drift is to consider a ratio with a relevant variable that has a similar drift. For housing prices, that might be the median income of the neighborhood. This gives you a variable that is less sensitive to the overall drift.
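
A minimal sketch of both corrections, using made-up numbers. The index values, incomes, and helper names (`HOUSING_INDEX`, `to_reference_year`, `price_to_income`) are hypothetical and exist only for illustration; in practice the index would come from a published housing index and the incomes from census-style data.

```python
# Hypothetical housing index by year (base 100 in 2010); illustrative values only.
HOUSING_INDEX = {2010: 100.0, 2014: 120.0, 2018: 150.0}


def to_reference_year(price, sale_year, ref_year=2018, index=HOUSING_INDEX):
    """Express a past sale price at the price level of ref_year."""
    return price * index[ref_year] / index[sale_year]


def price_to_income(price, median_income):
    """Ratio target: sale price relative to the neighborhood's median income."""
    return price / median_income


# (sale_year, nominal_price, neighborhood_median_income), all made up
sales = [
    (2010, 200_000.0, 50_000.0),
    (2014, 240_000.0, 60_000.0),
    (2018, 300_000.0, 75_000.0),
]

# Method 1: index-adjusted prices, comparable across years.
adjusted = [to_reference_year(price, year) for year, price, _ in sales]

# Method 2: price-to-income ratios, less sensitive to the overall drift.
ratios = [price_to_income(price, income) for _, price, income in sales]

print(adjusted)
print(ratios)
```

In this toy data the nominal prices rise only because of the overall drift, so both corrected targets come out constant across years. Either one could then be used as the regression target in place of the raw price.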

As you can see, those two methods are pretty much equivalent here in practice, as both mainly consist of correcting features and, possibly, targets. The main difference is that the first works directly in dollars, which is often more business-oriented. Applying these methods can get difficult if you use the model to predict the future, because you then need to project the housing index or median wage.
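
To illustrate the projection problem, here is a rough sketch assuming the model was trained on index-adjusted prices. The function names (`project_index`, `to_nominal`), the constant-growth assumption, and all numbers are hypothetical; any real forecast of the index would need its own model.

```python
def project_index(last_value, annual_growth, years_ahead):
    """Naive constant-growth projection of the housing index (an assumption)."""
    return last_value * (1.0 + annual_growth) ** years_ahead


def to_nominal(pred_ref_price, index_ref, index_target):
    """Convert a prediction made at the reference-year price level into
    dollars at the price level of the target year."""
    return pred_ref_price * index_target / index_ref


# Made-up inputs: index of 150 in the reference year, 4% assumed yearly growth.
index_ref = 150.0
projected = project_index(index_ref, annual_growth=0.04, years_ahead=2)

# Model output expressed at the reference-year price level (made up).
pred_ref_price = 300_000.0
nominal = to_nominal(pred_ref_price, index_ref, projected)

print(projected, nominal)
```

Any error in the projected index carries straight through to the dollar prediction, which is why forecasting with a drift-corrected model is harder than fitting it.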

Lucas Morin