Timesteps in an MDP do not need to be evenly spaced in real time, and they have no units; they are just an index. It is fine for the real-world clock time between timesteps to vary as needed.
The MDP formulation assumes a "turn-based" process: the agent picks an action, the environment processes that action along with its own inherent rules, and the agent is consulted again when it is time to make the next decision.
In this most common scenario, the agent always has adequate time to make a decision: whenever a decision is needed, the timestep is incremented and the environment (or the code that interfaces with it) presents the agent with the reward and the next state.
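As a rough sketch of that loop (using the Gymnasium API and a random policy purely as placeholders; any turn-based environment works the same way), note that the environment only advances when the agent calls `step()`:

```python
import gymnasium as gym  # assumed environment library, used here only for illustration

# Minimal turn-based agent-environment loop. The "timestep" is just the
# loop index t; nothing ties it to wall-clock time.
env = gym.make("CartPole-v1")            # placeholder environment
obs, info = env.reset(seed=0)

for t in range(200):
    action = env.action_space.sample()   # stand-in for the agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
    # The environment waits here until the agent calls step() again,
    # however long the decision (or a learning update) takes in real time.
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```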
It looks like this matches your case, so there is nothing to be concerned about.
There are real-time systems where actions must be produced reactively, as fast and as accurately as possible, from sensor data collected at a high frequency compared to the time it takes to analyse the state and make a decision. The standard MDP formulation does not capture this, so it is an area of active research with several different approaches. For instance, the paper Real-Time Reinforcement Learning uses a variation of the MDP with timesteps based on a real-time metric and allowances for decisions that take longer than one timestep.
In practice, even video-game-playing systems such as agents for classic Atari games are often treated as turn-based rather than real-time: during training, the emulator can be paused whilst the agent processes the state and learns.
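To make that concrete, here is a small sketch (the Atari environment id and the `ale-py`/Atari extras for Gymnasium are assumptions) showing that arbitrary wall-clock delay between steps leaves the emulator untouched:

```python
import time
import gymnasium as gym  # assumes gymnasium with the Atari extras (ale-py) installed

env = gym.make("ALE/Breakout-v5")   # example Atari environment id
obs, info = env.reset(seed=0)

for t in range(5):
    time.sleep(2.0)  # stand-in for slow decision-making or a learning update
    # The emulator does not advance during the sleep; frames are only
    # generated when step() is called, so the agent's "thinking time"
    # has no effect on the MDP the agent experiences.
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())

env.close()
```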