I have a dataset of financial stock data. Some features are shared across stocks: for example, the daily gold price is the same for every stock on a given day, while each stock's own price is of course different.
When I split 80/10/10 randomly, the model is effectively "cheating": rows from the same day end up in training, validation, and testing, so shared features like the gold price leak across the splits. Validation/test accuracy looks great, but real-world live performance is bad.
When I split sequentially (i.e., first 8 years of data for training, the next year for validation, the last year for testing), the accuracy is bad, and live performance is also bad.
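For concreteness, here is roughly how I'm doing the two splits. This is a minimal sketch: `df`, the `date` column, and the cutoff dates are placeholders for my actual data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# df is a placeholder for my actual DataFrame: one row per (stock, day),
# sorted by a "date" column, with shared features like "gold_price".

# Split 1: random 80/10/10 -- rows from the same day can land in train,
# validation, AND test, so shared features leak across the splits.
train, rest = train_test_split(df, test_size=0.2, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)

# Split 2: sequential -- first 8 years train, next year validation,
# last year test (cutoff dates are made up for illustration).
train = df[df["date"] < "2016-01-01"]
val = df[(df["date"] >= "2016-01-01") & (df["date"] < "2017-01-01")]
test = df[df["date"] >= "2017-01-01"]
```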
What I want to ask is: should I split just the first 9 years randomly into training and validation, and then test separately on the last year?
Or is the sequential split as good as it's going to get, and I simply can't predict the future?
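To make the first option concrete, this is what I have in mind, again as a sketch with the same placeholder `df` and made-up cutoff date:

```python
from sklearn.model_selection import train_test_split

# Hold out the last year chronologically; it is never touched until the end.
past = df[df["date"] < "2017-01-01"]        # first 9 years
final_test = df[df["date"] >= "2017-01-01"] # last year, testing only

# Random train/validation split within the first 9 years only.
train, val = train_test_split(past, test_size=0.1, random_state=42)
```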