No, we typically don't use a validation/test data set in Reinforcement Learning (RL). This is because of how data is used in RL, which is very different from the classic supervised/unsupervised paradigms. Some RL algorithms don't even have a data set as such. For instance, vanilla tabular Q-learning does not use a data set -- it sees an experience tuple $(s, a, r, s')$, makes an update based on it, and then discards it; that transition is only revisited if the agent happens to generate it again during training.
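To make that concrete, here is a minimal sketch of tabular Q-learning in Python; the environment, hyperparameter values and episode count are illustrative placeholders, but the update rule itself is the standard one.

```python
import random
from collections import defaultdict

import gymnasium as gym  # placeholder env; any discrete-state/action env works the same way

alpha, gamma, epsilon = 0.1, 0.99, 0.1        # illustrative hyperparameters
env = gym.make("CliffWalking-v0")
Q = defaultdict(lambda: [0.0] * env.action_space.n)

for episode in range(300):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = max(range(env.action_space.n), key=lambda i: Q[s][i])
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # one update from the tuple (s, a, r, s'); the tuple is then discarded
        target = r + gamma * max(Q[s_next]) * (not terminated)
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
```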
I have not looked at the PPO and DQN code you mention, but I would wager that the data loader is used either (a) in PPO, to iterate over minibatches of the most recently collected trajectories during optimisation, or (b) in DQN, to feed batches of experience sampled from the replay buffer.
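For case (a), a hedged sketch of what such a data loader might look like is below -- the tensor names and shapes are purely hypothetical, not taken from any particular codebase; the point is just that the "data set" is the most recent rollout, rebuilt after every collection phase.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical rollout tensors collected by a PPO-style implementation;
# names and shapes are illustrative only.
states = torch.randn(2048, 4)
actions = torch.randint(0, 2, (2048,))
advantages = torch.randn(2048)
old_log_probs = torch.randn(2048)

rollout = TensorDataset(states, actions, advantages, old_log_probs)
loader = DataLoader(rollout, batch_size=64, shuffle=True)

for epoch in range(4):                      # a few passes over the *same* fresh rollout
    for s, a, adv, logp_old in loader:
        pass  # compute the clipped PPO loss on this minibatch and take a gradient step
```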
Note that the replay buffer is technically a data set, but it is not a traditional data set as in the other paradigms. That is essentially because:
- the data set is non-stationary: experience is added as it is collected, and old experience is typically deleted to make room for new experience once the buffer's size limit has been reached;
- we don't necessarily use every data point in the buffer before it is removed -- consider a large buffer but a small batch size. As a simple example, take a replay buffer of size 10,000 and a batch size of 1, i.e. for every update we sample only 1 data point from the buffer. Assuming we sample uniformly at random, as in vanilla DQN, and that one update is performed per transition added (FIFO), the probability that a given point is never sampled before it is evicted is $(1 - 1/10000)^{10000} \approx e^{-1} \approx 0.368$ (see the sketch below).
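A minimal sketch of such a buffer, and the calculation behind the 0.368 figure; the capacity and batch size are just those from the example above.

```python
import random
from collections import deque

# Minimal FIFO replay buffer with uniform sampling.
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=1):
        return random.sample(self.buffer, batch_size)

# Probability that a given transition is never sampled before eviction,
# assuming one uniform sample of size 1 per transition added to a full buffer:
capacity, batch_size = 10_000, 1
p_never_seen = (1 - batch_size / capacity) ** capacity
print(f"{p_never_seen:.3f}")                   # ~0.368, i.e. roughly e^{-1}
```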
To validate RL agents, we typically assess how the trained agent performs on its intended task.
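For example, a common way to do this is to roll out the trained policy greedily (no exploration) for a number of evaluation episodes and report the average return. A minimal sketch, where the environment id and the `policy` callable are placeholders:

```python
import gymnasium as gym

def evaluate(policy, env_id="CartPole-v1", n_episodes=20, seed=0):
    """Roll out a trained policy and return the mean episodic return."""
    env = gym.make(env_id)
    returns = []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, ep_return = False, 0.0
        while not done:
            action = policy(obs)                    # greedy action from the trained agent
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    return sum(returns) / len(returns)

# usage (hypothetical): evaluate(trained_policy), where trained_policy maps observation -> action
```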
There are some instances where keeping a train/test split is useful. One such instance is procedurally generated environments, such as the Procgen Benchmark introduced by OpenAI. The idea is that, rather than playing the same game over and over -- in which case the agent, given enough experience, can simply memorise how to play that particular game -- the agent should learn more general skills that transfer to unseen instances of the same type of task. To do this, they introduce environments that, upon each reset, procedurally generate a new instance of the task -- for example, if the agent is learning to navigate a maze, then upon every reset the maze changes. In this setup, one might store some (large) number of seeds to train on and withhold some seeds to validate on, to see how well the agent really performs when deployed on unseen instances of the task.
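Assuming the environment's level generation is controlled by the reset seed (as in procedurally generated benchmarks), one possible way to implement such a split is simply to partition the seed range -- the numbers and the environment below are arbitrary placeholders:

```python
import random

import gymnasium as gym

# Arbitrary seed split; any environment whose level generation is driven by
# the reset seed can be partitioned this way.
train_seeds = list(range(0, 10_000))       # levels the agent is allowed to train on
val_seeds = list(range(10_000, 10_100))    # withheld levels, only used for evaluation

env = gym.make("CartPole-v1")              # placeholder; imagine a maze/Procgen-style env here

# During training: only ever reset with a training seed.
obs, _ = env.reset(seed=random.choice(train_seeds))

# During validation: roll out the trained policy on held-out seeds and compare
# the average return with that obtained on the training seeds.
obs, _ = env.reset(seed=random.choice(val_seeds))
```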