No, we typically don't use a validation/test data set in Reinforcement Learning (RL). This is because of how data is used in RL, which is very different from the classic supervised/unsupervised paradigms. Some RL algorithms don't even have a data set as such. For instance, vanilla tabular Q-learning does not use a data set -- it sees an experience tuple $(s, a, r, s')$, makes an update based on it, and then discards it; that transition is only revisited if the agent happens to generate it again during training.
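To make that concrete, here is a minimal sketch of tabular Q-learning in Python; the environment, hyperparameter values and episode count are illustrative placeholders, but the update rule itself is the standard one.

```python
import random
from collections import defaultdict

import gymnasium as gym  # placeholder env; any discrete-state/action env works the same way

alpha, gamma, epsilon = 0.1, 0.99, 0.1        # illustrative hyperparameters
env = gym.make("CliffWalking-v0")
Q = defaultdict(lambda: [0.0] * env.action_space.n)

for episode in range(300):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = max(range(env.action_space.n), key=lambda i: Q[s][i])
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # one update from the tuple (s, a, r, s'); the tuple is then discarded
        target = r + gamma * max(Q[s_next]) * (not terminated)
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
```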
I have not looked at the PPO and DQN code you mention, but I would wager that the data loader is used either (a) in PPO, to iterate over minibatches of the most recently collected trajectories during optimisation, or (b) in DQN, to feed batches of experience sampled from the replay buffer.
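For case (a), a hedged sketch of what such a data loader might look like is below -- the tensor names and shapes are purely hypothetical, not taken from any particular codebase; the point is just that the "data set" is the most recent rollout, rebuilt after every collection phase.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical rollout tensors collected by a PPO-style implementation;
# names and shapes are illustrative only.
states = torch.randn(2048, 4)
actions = torch.randint(0, 2, (2048,))
advantages = torch.randn(2048)
old_log_probs = torch.randn(2048)

rollout = TensorDataset(states, actions, advantages, old_log_probs)
loader = DataLoader(rollout, batch_size=64, shuffle=True)

for epoch in range(4):                      # a few passes over the *same* fresh rollout
    for s, a, adv, logp_old in loader:
        pass  # compute the clipped PPO loss on this minibatch and take a gradient step
```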
Note that the replay buffer is technically a data set, but it is not a traditional data set as in the other paradigms. That is essentially because:
- the data set is non-stationary: experience is added as it is collected, and old experience is typically deleted to make room for new experience once the buffer's size limit has been reached;
- we don't necessarily use every data point in the buffer before it is removed -- consider a large buffer but a small batch size. As a simple example, take a replay buffer of size 10,000 and a batch size of 1, i.e. for every update we sample only 1 data point from the buffer. Assuming we sample uniformly at random, as in vanilla DQN, and that one update is performed per transition added (FIFO), the probability that a given point is never sampled before it is evicted is $(1 - 1/10000)^{10000} \approx e^{-1} \approx 0.368$ (see the sketch below).
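A minimal sketch of such a buffer, and the calculation behind the 0.368 figure; the capacity and batch size are just those from the example above.

```python
import random
from collections import deque

# Minimal FIFO replay buffer with uniform sampling.
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=1):
        return random.sample(self.buffer, batch_size)

# Probability that a given transition is never sampled before eviction,
# assuming one uniform sample of size 1 per transition added to a full buffer:
capacity, batch_size = 10_000, 1
p_never_seen = (1 - batch_size / capacity) ** capacity
print(f"{p_never_seen:.3f}")                   # ~0.368, i.e. roughly e^{-1}
```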
To validate RL agents, we typically assess how the trained agent performs on its intended task.
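For example, a common way to do this is to roll out the trained policy greedily (no exploration) for a number of evaluation episodes and report the average return. A minimal sketch, where the environment id and the `policy` callable are placeholders:

```python
import gymnasium as gym

def evaluate(policy, env_id="CartPole-v1", n_episodes=20, seed=0):
    """Roll out a trained policy and return the mean episodic return."""
    env = gym.make(env_id)
    returns = []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, ep_return = False, 0.0
        while not done:
            action = policy(obs)                    # greedy action from the trained agent
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    return sum(returns) / len(returns)

# usage (hypothetical): evaluate(trained_policy), where trained_policy maps observation -> action
```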
There are some instances where keeping a train/test split is useful. One such instance is procedurally generated environments, such as the Procgen Benchmark introduced by OpenAI. The idea is that, rather than playing the same game over and over -- in which case the agent, given enough experience, can simply memorise how to play that particular game -- the agent should learn more general skills that transfer to unseen instances of the same type of task. To do this, they introduce environments that, upon each reset, procedurally generate a new instance of the task -- for example, if the agent is learning to navigate a maze, then upon every reset the maze changes. In this setup, one might store some (large) number of seeds to train on and withhold some seeds to validate on, to see how well the agent really performs when deployed on unseen instances of the task.
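Assuming the environment's level generation is controlled by the reset seed (as in procedurally generated benchmarks), one possible way to implement such a split is simply to partition the seed range -- the numbers and the environment below are arbitrary placeholders:

```python
import random

import gymnasium as gym

# Arbitrary seed split; any environment whose level generation is driven by
# the reset seed can be partitioned this way.
train_seeds = list(range(0, 10_000))       # levels the agent is allowed to train on
val_seeds = list(range(10_000, 10_100))    # withheld levels, only used for evaluation

env = gym.make("CartPole-v1")              # placeholder; imagine a maze/Procgen-style env here

# During training: only ever reset with a training seed.
obs, _ = env.reset(seed=random.choice(train_seeds))

# During validation: roll out the trained policy on held-out seeds and compare
# the average return with that obtained on the training seeds.
obs, _ = env.reset(seed=random.choice(val_seeds))
```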