Peter Keller
Imagine a robot moving through a maze of equally sized rooms, where adjacent rooms are connected by doors. The robot acts by choosing the next door according to a given policy (a discrete probability distribution over the available doors). Once a new room is entered, the next room depends only on the current position (the Markov property). By penalizing the robot's movement with a negative reward (for example, -1 for each room change), the robot can learn an optimal policy that leads it through the maze as quickly as possible. We give an introduction to the solution methods for this type of problem via Markov reward and decision processes, optimizing a discrete Bellman-type equation, and show some basic simulations/implementations of Q-Learning.
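To give a flavour of such an implementation, the following is a minimal tabular Q-Learning sketch in Python on a small gridworld maze. The 4x4 grid, the step reward of -1, and the learning parameters (alpha, gamma, epsilon) are illustrative assumptions and not the simulations presented in the talk.

import random

# Hypothetical 4x4 gridworld maze: states are (row, col) cells ("rooms"),
# actions are moves through the four possible "doors".
N = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
START, GOAL = (0, 0), (N - 1, N - 1)

def step(state, action):
    """Move through a door if one exists; otherwise stay put.
    Every step costs a reward of -1, penalizing movement as in the abstract."""
    r, c = state
    nr, nc = r + action[0], c + action[1]
    if 0 <= nr < N and 0 <= nc < N:
        state = (nr, nc)
    return state, -1, state == GOAL  # next state, reward, episode done?

def q_learning(episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Learn Q(s, a) with the standard one-step Q-Learning update."""
    q = {((r, c), a): 0.0 for r in range(N) for c in range(N)
         for a in range(len(ACTIONS))}
    for _ in range(episodes):
        s, done = START, False
        while not done:
            # epsilon-greedy behaviour policy
            if random.random() < eps:
                a = random.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda x: q[(s, x)])
            s2, reward, done = step(s, ACTIONS[a])
            # discrete Bellman-type update:
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            future = 0.0 if done else gamma * max(q[(s2, x)] for x in range(len(ACTIONS)))
            q[(s, a)] += alpha * (reward + future - q[(s, a)])
            s = s2
    return q

if __name__ == "__main__":
    q = q_learning()
    # Follow the greedy policy from the start room to read off the learned path.
    s, path = START, [START]
    while s != GOAL and len(path) < 2 * N * N:
        a = max(range(len(ACTIONS)), key=lambda x: q[(s, x)])
        s, _, _ = step(s, ACTIONS[a])
        path.append(s)
    print("greedy path:", path)

With the -1 reward per step, the optimal value of a room is simply the negative of its door-distance to the goal, so the greedy path above should converge to a shortest route through the maze.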
The Zoom access details are available on the programme of the Research Seminar.