Martino Bernasconi, Federico Cacciamani, Simone Fioravanti, Nicola Gatti, Francesco Trovò
In this paper, we study the mean dynamics of the softmax policy gradient algorithm in multi-agent settings by resorting to evolutionary game theory and dynamical-system tools. Such a study is crucial to understand the algorithm's weaknesses when employed in multi-agent settings. Unlike most multi-agent reinforcement learning algorithms, whose mean dynamics are minor variants of the replicator dynamics that do not affect the properties of the original dynamics, the softmax policy gradient dynamics present a structure significantly different from that of the replicator. Indeed, the dynamics are equivalent to the replicator dynamics of a different game, obtained by applying a non-convex transformation to the payoffs of the original game. First, we recover the properties—already known for the discrete-time softmax policy gradient—for the continuous-time mean dynamics in the case of learning a best response. As commonly happens, the continuous-time dynamics allow for a simpler analysis and a deeper understanding of the algorithm, which we exploit to fully characterize the dynamics and improve on their theoretical understanding. Then, we resort to models based on single- and multi-population games, showing that the dynamics preserve volume and proving that, in arbitrary instances, last-iterate convergence cannot be obtained when the equilibrium of the game is fully mixed. Furthermore, we give empirical evidence that dynamics starting from close initial points may expand over time, thus showing that the behaviour of the dynamics in games with fully-mixed equilibria is chaotic.
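The non-convergence phenomenon described above can be observed numerically. The following is a minimal sketch, not taken from the paper: it runs simultaneous softmax policy gradient ascent (Euler-discretized mean dynamics) in matching pennies, a zero-sum game whose unique equilibrium is fully mixed at (0.5, 0.5). The game, step size, and starting point are illustrative assumptions; the exact gradient of the expected payoff through the softmax is used in place of sampled policy gradients.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# Matching pennies: row player's payoff matrix; the column player receives -A.
# The unique Nash equilibrium is fully mixed: both players play (0.5, 0.5).
A = np.array([[1.0, -1.0], [-1.0, 1.0]])

theta = np.array([0.2, 0.0])  # row player's logits, started slightly off-equilibrium
phi = np.array([0.0, 0.0])    # column player's logits

lr, steps = 0.05, 2000
for _ in range(steps):
    x, y = softmax(theta), softmax(phi)
    # Exact softmax policy gradient: d u1 / d theta_i = x_i * ((A y)_i - x.(A y)),
    # and symmetrically for the column player with payoffs -A^T x.
    gx = x * (A @ y - x @ (A @ y))
    b = -A.T @ x
    gy = y * (b - y @ b)
    theta, phi = theta + lr * gx, phi + lr * gy

x, y = softmax(theta), softmax(phi)
print(x, y)  # both remain valid distributions, but do not approach (0.5, 0.5)
```

Tracking the distance of (x, y) from the mixed equilibrium along the trajectory shows it does not shrink to zero, consistent with the lack of last-iterate convergence at fully mixed equilibria.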