Our Recent Works

AAAI 2021, ML cube
Policy Optimization as Online Learning with Mediator Feedback
Authors:

Alberto Maria Metelli, Matteo Papini, Pierluca D’Oro, Marcello Restelli

Conference:

AAAI 2021

Abstract:

Policy Optimization (PO) is a widely used approach to address continuous control tasks. In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space. The additional available information, compared to the standard bandit feedback, allows reusing samples generated by one policy to estimate the performance of other policies. Based on this observation, we propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization in PO, which employs a randomized exploration strategy, unlike the existing optimistic approaches. When the policy space is finite, we show that, under certain circumstances, it is possible to achieve constant regret, while always enjoying logarithmic regret. We also derive problem-dependent regret lower bounds. Then, we extend RANDOMIST to compact policy spaces. Finally, we provide numerical simulations on finite and compact policy spaces, in comparison with PO and bandit baselines.
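
The sample-reuse idea behind mediator feedback can be illustrated with a multiple importance sampling estimator. Below is a minimal sketch of a balance-heuristic estimator with weight truncation, in the spirit of the paper's title; the Gaussian policies, toy performance signal, and truncation threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal sketch: evaluate a target policy by reusing samples drawn from several
# behavioural policies via truncated multiple importance sampling (balance heuristic).

rng = np.random.default_rng(0)

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Behavioural policies (mean, std) and the number of samples drawn from each.
behaviour = [(-1.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
n_per_policy = 200

actions, returns = [], []
for mean, std in behaviour:
    a = rng.normal(mean, std, n_per_policy)
    actions.append(a)
    returns.append(-(a - 0.5) ** 2)          # toy performance signal
actions = np.concatenate(actions)
returns = np.concatenate(returns)

def mis_estimate(target_mean, target_std, truncation=10.0):
    """Balance-heuristic MIS estimate of the target policy's expected return."""
    target_pdf = gaussian_pdf(actions, target_mean, target_std)
    mixture_pdf = np.mean(
        [gaussian_pdf(actions, m, s) for m, s in behaviour], axis=0
    )
    weights = np.minimum(target_pdf / mixture_pdf, truncation)  # truncation step
    return np.mean(weights * returns)

# Importance-weighted estimate of a policy that was never executed directly.
print(mis_estimate(target_mean=0.5, target_std=1.0))
```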

Full paper

AAAI 2021, ML cube
Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate
Authors:

Mirco Mutti, Lorenzo Pratissoli, Marcello Restelli

Conference:

AAAI 2021

Abstract:

In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target. In particular, we present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, k-nearest-neighbors estimate of the state distribution entropy. In contrast to known methods, MEPOL is completely model-free as it requires neither to estimate the state distribution of any policy nor to model transition dynamics. Then, we empirically show that MEPOL allows learning a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and how this policy facilitates learning meaningful reward-based tasks downstream.
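
The objective MEPOL maximizes can be sketched with a standard non-parametric entropy estimator. The snippet below computes a Kozachenko-Leonenko-style k-nearest-neighbors entropy estimate over a batch of sampled states; the constants, the toy state samples, and the use of SciPy's KD-tree are illustrative assumptions rather than the paper's exact estimator.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(states, k=4):
    """k-NN (Kozachenko-Leonenko-style) estimate of the differential entropy
    of the empirical state distribution. `states` has shape (n, d)."""
    n, d = states.shape
    tree = cKDTree(states)
    # Distance to the k-th neighbour (the query returns the point itself first).
    dist, _ = tree.query(states, k=k + 1)
    radius = dist[:, -1]
    log_unit_ball_volume = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return (
        digamma(n) - digamma(k) + log_unit_ball_volume
        + d * np.mean(np.log(radius + 1e-12))
    )

# Toy check: states spread uniformly have higher entropy than concentrated ones.
rng = np.random.default_rng(0)
print(knn_entropy(rng.uniform(-1, 1, size=(1000, 2))))   # higher entropy
print(knn_entropy(rng.normal(0, 0.05, size=(1000, 2))))  # lower entropy
```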

Full paper

AAAI 2021, ML cube
Newton Optimization on Helmholtz Decomposition for Continuous Games
Authors:

Giorgia Ramponi, Marcello Restelli

Conference:

AAAI 2021

Abstract:

Many learning problems involve multiple agents optimizing different interactive functions. In these problems, the standard policy gradient algorithms fail due to the non-stationarity of the setting and the different interests of each agent. In fact, algorithms must take into account the complex dynamics of these systems to guarantee rapid convergence towards a (local) Nash equilibrium. In this paper, we propose NOHD (Newton Optimization on Helmholtz Decomposition), a Newton-like algorithm for multi-agent learning problems based on the decomposition of the dynamics of the system into its irrotational (potential) and solenoidal (Hamiltonian) components. This method ensures quadratic convergence in purely irrotational and purely solenoidal systems. Furthermore, we show that NOHD is attracted to stable fixed points in general multi-agent systems and repelled by strict saddle ones. Finally, we empirically compare NOHD’s performance with that of state-of-the-art algorithms on some bimatrix games and continuous Gridworld environments.
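
The decomposition NOHD builds on can be illustrated on a toy two-player game: the Jacobian of the joint gradient dynamics splits into a symmetric part (the potential, irrotational component) and an antisymmetric part (the Hamiltonian, solenoidal component). The quadratic game below is an illustrative assumption; the sketch shows only the decomposition, not the paper's Newton-like update.

```python
import numpy as np

# Players control x and y with costs
#   f1(x, y) = 0.5*a*x**2 + b*x*y   and   f2(x, y) = 0.5*c*y**2 - b*x*y.
a, b, c = 1.0, 2.0, 1.0

def joint_gradient(z):
    x, y = z
    return np.array([a * x + b * y,      # d f1 / d x
                     c * y - b * x])     # d f2 / d y

def jacobian(_z):
    return np.array([[a,  b],
                     [-b, c]])

z = np.array([1.0, -0.5])
J = jacobian(z)
S = 0.5 * (J + J.T)   # potential (irrotational) component
A = 0.5 * (J - J.T)   # Hamiltonian (solenoidal) component

print("joint gradient at z:", joint_gradient(z))
print("symmetric part:\n", S)
print("antisymmetric part:\n", A)
# A purely irrotational game has A == 0; a purely solenoidal one has S == 0.
```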

Full paper

AAAI 2021, ML cube
Learning Probably Approximately Correct Maximin Strategies in Games with Infinite Strategy Spaces
Authors:

Alberto Marchesi, Francesco Trovò, Nicola Gatti

Conference:

AAAI 2021

Abstract:

We tackle the problem of learning equilibria in simulation-based games. In such games, the players’ utility functions cannot be described analytically, as they are given through a black-box simulator that can be queried to obtain noisy estimates of the utilities. This is the case in many real-world games in which a complete description of the elements involved is not available upfront, such as complex military settings and online auctions. In these situations, one usually needs to run costly simulation processes to get an accurate estimate of the game outcome. As a result, solving these games begets the challenge of designing learning algorithms that can find (approximate) equilibria with high confidence, using as few simulator queries as possible. We focus on two-player zero-sum games with infinite strategy spaces. Drawing from the best arm identification literature, we design two algorithms with theoretical guarantees to learn maximin strategies in these games. The first one works in the fixed-confidence setting, guaranteeing the desired confidence level while minimizing the number of queries. Instead, the second algorithm fits the fixed-budget setting, maximizing the confidence without exceeding the given maximum number of queries. First, we formally prove δ-PAC theoretical guarantees for our algorithms under some regularity assumptions, which are encoded by letting the utility functions be drawn from a Gaussian process. Then, we experimentally evaluate our techniques on a testbed made of randomly generated games and instances representing simple real-world security settings.

Full paper

AAAI 2021, ML cube
Online Learning in Non-Cooperative Configurable Markov Decision Process
Authors:

Giorgia Ramponi, Alberto Maria Metelli, Alessandro Concetti, Marcello Restelli

Conference:

AAAI 2021

Abstract:

In Configurable Markov Decision Processes, there are two entities: a Reinforcement Learning agent and a configurator, which can modify some parameters of the environment to improve the performance of the agent. What if the configurator does not have the same intentions as the agent? In this paper, we introduce the Non-Cooperative Configurable Markov Decision Process, a framework that allows having two (possibly different) reward functions for the configurator and for the agent. In this setting, we consider an online learning problem, where the configurator has to find the best among a finite set of possible configurations. We propose a learning algorithm to minimize the configurator’s expected regret, which exploits the structure of the problem. While a naïve application of the UCB algorithm yields a regret that grows indefinitely over time, we show that our approach suffers only bounded regret. Finally, we empirically evaluate the performance of our algorithm in simulated domains.
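
For reference, the naïve baseline mentioned in the abstract, a plain UCB1 run over the finite set of configurations, can be sketched as follows. The reward model and parameters are illustrative, and this is the approach the paper improves upon, not the proposed algorithm.

```python
import numpy as np

# Naive baseline: treat each environment configuration as an arm and run UCB1
# on the agent's observed performance. The reward model below is an
# illustrative stand-in for "agent performance under configuration i".

rng = np.random.default_rng(0)
true_means = np.array([0.4, 0.55, 0.7])      # hypothetical configuration values

def performance(config):
    return rng.normal(true_means[config], 0.1)

n_configs, horizon = len(true_means), 2000
counts = np.zeros(n_configs)
sums = np.zeros(n_configs)

for t in range(horizon):
    if t < n_configs:                         # pull each configuration once
        config = t
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
        config = int(np.argmax(ucb))
    counts[config] += 1
    sums[config] += performance(config)

print("pulls per configuration:", counts)
```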

Full paper

NeurIPS 2020, ML cube
Inverse Reinforcement Learning from a Gradient-based Learner
Authors:

Giorgia Ramponi, Gianluca Drappo, Marcello Restelli

Conference:

NeurIPS 2020

Abstract:

Inverse Reinforcement Learning addresses the problem of inferring an expert’s reward function from demonstrations. However, in many applications, we not only have access to the expert’s near-optimal behaviour, but we also observe part of her learning process. In this paper, we propose a new algorithm for this setting, in which the goal is to recover the reward function being optimized by an agent, given a sequence of policies produced during learning. Our approach is based on the assumption that the observed agent is updating her policy parameters along the gradient direction. Then we extend our method to deal with the more realistic scenario where we only have access to a dataset of learning trajectories. For both settings, we provide theoretical insights into our algorithms’ performance. Finally, we evaluate the approach in a simulated GridWorld environment and on the MuJoCo environments, comparing it with the state-of-the-art baseline.
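
The gradient-matching idea can be made concrete with a minimal sketch: if the learner's update is assumed to follow the policy gradient of a reward that is linear in known features, the reward weights can be recovered by least squares from two consecutive policy parameters. The feature gradients, learning rate, and dimensions below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# With a reward linear in known features, one learning step reads
#   theta_next = theta + alpha * G @ w,
# where column j of G is the policy gradient of the j-th feature's expected
# return. Observing the step, w can be recovered by least squares.

rng = np.random.default_rng(0)
d_theta, d_features, alpha = 8, 3, 0.1

G = rng.normal(size=(d_theta, d_features))     # per-feature policy gradients (toy)
w_true = np.array([1.0, -0.5, 2.0])            # hidden reward weights

theta = rng.normal(size=d_theta)
theta_next = theta + alpha * G @ w_true        # one observed learning step

w_hat, *_ = np.linalg.lstsq(alpha * G, theta_next - theta, rcond=None)
print("recovered weights:", w_hat)             # ~ [1.0, -0.5, 2.0]
```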

Full paper

NeurIPS 2020, ML cube
An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits
Authors:

Andrea Tirinzoni, Matteo Pirotta, Marcello Restelli, Alessandro Lazaric

Conference:

NeurIPS 2020

Abstract:

In the contextual linear bandit setting, algorithms built on the optimism principle fail to exploit the structure of the problem and have been shown to be asymptotically suboptimal. In this paper, we follow recent approaches of deriving asymptotically optimal algorithms from problem-dependent regret lower bounds and we introduce a novel algorithm improving over the state-of-the-art along multiple dimensions. We build on a reformulation of the lower bound, where context distribution and exploration policy are decoupled, and we obtain an algorithm robust to unbalanced context distributions. Then, using an incremental primal-dual approach to solve the Lagrangian relaxation of the lower bound, we obtain a scalable and computationally efficient algorithm. Furthermore, we remove forced exploration and build on confidence intervals of the optimization problem to encourage a minimum level of exploration that is better adapted to the problem structure. We demonstrate the asymptotic optimality of our algorithm, while providing both problem-dependent and worst-case finite-time regret guarantees. Our bounds scale with the logarithm of the number of arms, thus avoiding the linear dependence common in all related prior works. Notably, we establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits. Finally, we verify that our algorithm obtains better empirical performance than state-of-the-art baselines.
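
For context, a standard form of the problem-dependent asymptotic lower bound that such algorithms target, stated here for the special non-contextual linear bandit case with finite arm set $\mathcal{A}$, gaps $\Delta_a$, and parameter $\theta$, is reproduced below as background (a well-known result, not a formula taken from this paper):

$$
\liminf_{T \to \infty} \frac{\mathbb{E}[R_T]}{\log T} \;\geq\; c^*(\theta), \qquad
c^*(\theta) \;=\; \min_{\alpha \geq 0} \sum_{a \in \mathcal{A}} \alpha(a)\,\Delta_a
\quad \text{s.t.} \quad \|a\|^2_{H_\alpha^{-1}} \leq \frac{\Delta_a^2}{2} \;\; \forall a:\ \Delta_a > 0,
\qquad H_\alpha = \sum_{a \in \mathcal{A}} \alpha(a)\, a a^{\top}.
$$

The algorithm described above works with a contextual reformulation of this kind of constrained problem and solves its Lagrangian relaxation incrementally.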

Full paper

NeurIPS 2020, ML cube
No-Regret Learning Dynamics for Extensive-Form Correlated Equilibrium
Authors:

Andrea Celli, Alberto Marchesi, Gabriele Farina, Nicola Gatti

Conference:

NeurIPS 2020

Abstract:

The existence of simple, uncoupled no-regret dynamics that converge to correlated equilibria in normal-form games is a celebrated result in the theory of multi-agent systems. Specifically, it has been known for more than 20 years that when all players seek to minimize their internal regret in a repeated normal-form game, the empirical frequency of play converges to a normal-form correlated equilibrium. Extensive-form (that is, tree-form) games generalize normal-form games by modeling both sequential and simultaneous moves, as well as private information. Because of the sequential nature and presence of partial information in the game, extensive-form correlation has significantly different properties than the normal-form counterpart, many of which are still open research directions. Extensive-form correlated equilibrium (EFCE) has been proposed as the natural extensive-form counterpart to normal-form correlated equilibrium. However, it was not known whether EFCE emerges as the result of uncoupled agent dynamics. In this paper, we give the first uncoupled no-regret dynamics that converge to the set of EFCEs in n-player general-sum extensive-form games with perfect recall. First, we introduce a notion of trigger regret in extensive-form games, which extends that of internal regret in normal-form games. When each player has low trigger regret, the empirical frequency of play is close to an EFCE. Then, we give an efficient no-trigger-regret algorithm. Our algorithm decomposes trigger regret into local subproblems at each decision point for the player, and constructs a global strategy of the player from the local solutions at each decision point.

Full paper

NeurIPS 2020, ML cube
Online Bayesian Persuasion
Authors:

Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Nicola Gatti

Conference:

NeurIPS 2020

Abstract:

In Bayesian persuasion, an informed sender has to design a signaling scheme that discloses the right amount of information so as to influence the behavior of a self-interested receiver. This kind of strategic interaction is ubiquitous in real economic scenarios. However, the original model by Kamenica and Gentzkow makes some stringent assumptions which limit its applicability in practice. One of the most limiting assumptions is arguably that, in order to compute an optimal signaling scheme, the sender is usually required to know the receiver’s utility function. In this paper, we relax this assumption through an online learning framework in which the sender faces a receiver with unknown type. At each round, the receiver’s type is chosen adversarially from a finite set of possible types. We are interested in no-regret algorithms prescribing a signaling scheme at each round of the repeated interaction, with performance close to that of the best-in-hindsight signaling scheme. First, we prove a hardness result on the per-iteration running time required to achieve the no-regret property. Then, we provide algorithms for the full and partial information models which exhibit regret sublinear in the number of rounds and polynomial in the parameters of the game.

Full paper

ICML 2020, ML cube
Sequential transfer in reinforcement learning with a generative model
Authors:

Andrea Tirinzoni, Riccardo Poiani, Marcello Restelli

Conference:

ICML 2020

Abstract:

We are interested in how to design reinforcement learning agents that provably reduce the sample complexity for learning new tasks by transferring knowledge from previously-solved ones. The availability of solutions to related problems poses a fundamental trade-off: whether to seek policies that are expected to immediately achieve high (yet sub-optimal) performance in the new task or whether to seek information to quickly identify an optimal solution, potentially at the cost of poor initial behaviour. In this work, we focus on the second objective when the agent has access to a generative model of state-action pairs. First, given a set of solved tasks containing an approximation of the target one, we design an algorithm that quickly identifies an accurate solution by seeking the state-action pairs that are most informative for this purpose. We derive PAC bounds on its sample complexity which clearly demonstrate the benefits of using this kind of prior knowledge. Then, we show how to learn these approximate tasks sequentially by reducing our transfer setting to a hidden Markov model and employing spectral methods to recover its parameters. Finally, we empirically verify our theoretical findings in simple simulated domains.
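
The "query the most informative state-action pairs" idea can be sketched as follows: with a small set of candidate (previously solved) task models, the generative model is queried where the candidates disagree the most, and candidates inconsistent with the noisy samples are discarded. The candidate Q-tables, noise level, and disagreement criterion are illustrative assumptions, not the paper's algorithm or its PAC analysis.

```python
import numpy as np

# Illustrative sketch: query a generative model at the state-action pair where
# the candidate tasks disagree the most, then discard inconsistent candidates.

rng = np.random.default_rng(0)
n_states, n_actions, noise = 5, 3, 0.05

candidates = [rng.uniform(0, 1, size=(n_states, n_actions)) for _ in range(4)]
true_task = candidates[2]                      # the target resembles one candidate

def generative_model(s, a):
    return true_task[s, a] + rng.normal(0, noise)

active = list(range(len(candidates)))
for _ in range(20):
    if len(active) == 1:
        break
    stacked = np.stack([candidates[i] for i in active])
    disagreement = stacked.max(axis=0) - stacked.min(axis=0)
    s, a = np.unravel_index(np.argmax(disagreement), disagreement.shape)
    sample = np.mean([generative_model(s, a) for _ in range(10)])
    active = [i for i in active if abs(candidates[i][s, a] - sample) < 3 * noise]

print("remaining candidate task(s):", active)   # typically [2]
```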

Full paper

ICML 2020, ML cube
Control frequency adaptation via action persistence in batch reinforcement learning
Authors:

Alberto Maria Metelli, Flavio Mazzolini, Lorenzo Bisi, Luca Sabbioni, Marcello Restelli

Conference:

ICML 2020

Abstract:

The choice of the control frequency of a system has a relevant impact on the ability of reinforcement learning algorithms to learn a high-performing policy. In this paper, we introduce the notion of action persistence that consists in the repetition of an action for a fixed number of decision steps, having the effect of modifying the control frequency. We start by analyzing how action persistence affects the performance of the optimal policy, and then we present a novel algorithm, Persistent Fitted Q-Iteration (PFQI), that extends FQI, with the goal of learning the optimal value function at a given persistence. After having provided a theoretical study of PFQI and a heuristic approach to identify the optimal persistence, we present an experimental campaign on benchmark domains to show the advantages of action persistence and to prove the effectiveness of our persistence selection method.
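
The notion of action persistence itself is easy to sketch: repeat each decision for a fixed number of base time steps, which lowers the control frequency. Below is a minimal gym-style wrapper assuming a standard reset/step interface; it illustrates only the persistence mechanism, not PFQI or the persistence-selection heuristic.

```python
class ActionPersistenceWrapper:
    """Repeat every selected action for `persistence` base time steps,
    accumulating the (discounted) reward. Assumes a gym-like env with
    reset() -> obs and step(action) -> (obs, reward, done, info)."""

    def __init__(self, env, persistence: int, gamma: float = 0.99):
        self.env = env
        self.persistence = persistence
        self.gamma = gamma

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, discount = 0.0, 1.0
        obs, done, info = None, False, {}
        for _ in range(self.persistence):
            obs, reward, done, info = self.env.step(action)
            total_reward += discount * reward
            discount *= self.gamma
            if done:
                break
        return obs, total_reward, done, info
```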

Full paper

AAMAS 2020, ML cube
Driving exploration by maximum distribution in Gaussian process bandits
Authors:

Alessandro Nuara, Francesco Trovò, Dominic Crippa, Nicola Gatti, Marcello Restelli

Conference:

AAMAS 2020

Abstract:

The problem of finding optimal solutions of stochastic functions over continuous domains is common in several real-world applications, such as, e.g., advertisement allocation, dynamic pricing, and power control in wireless networks. The optimization process is customarily performed by selecting input points sequentially and receiving a noisy observation from the function. In this paper, we resort to the Multi-Armed Bandit approach, aiming at optimizing stochastic functions while keeping the regret (i.e., the loss incurred during the learning process) under control. In particular, we focus on smooth stochastic functions, as it is known that any algorithm suffers from a constant per-round regret when the domain is continuous, and the function does not satisfy any kind of regularity. Our main original contribution is the provision of a general family of algorithms, which, under the mild assumption that stochastic functions are a realization of a Gaussian Process, provides a regret of the order of O(√(γ_T T)), where γ_T is the maximum information gain and T the time horizon used for the learning process. Furthermore, we design a specific algorithm of our family, called DAGP-UCB, which exploits the structure of GPs to select the next arm to pull more effectively than the previous algorithms available in the state of the art, thus speeding up the learning process. In particular, we show the superior performance of DAGP-UCB in both synthetic and applicative settings, comparing it with the state-of-the-art algorithms.
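
As a point of reference for the algorithm family discussed here, a plain GP-UCB selection rule over a discretized domain can be sketched as below, using scikit-learn's Gaussian process regressor. This is the generic baseline, not the DAGP-UCB rule, and the toy function, kernel, and exploration factor are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Generic GP-UCB loop over a discretized 1-D domain (not the DAGP-UCB rule).
rng = np.random.default_rng(0)
domain = np.linspace(0, 1, 200).reshape(-1, 1)

def f(x):                                  # unknown stochastic function (toy)
    return np.sin(6 * x) + rng.normal(0, 0.1, size=x.shape)

X, y = [], []
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=0.01)

for t in range(1, 31):
    if X:
        gp.fit(np.vstack(X), np.array(y))
        mean, std = gp.predict(domain, return_std=True)
        beta = 2.0 * np.log(len(domain) * t ** 2)   # illustrative exploration factor
        idx = int(np.argmax(mean + np.sqrt(beta) * std))
    else:
        idx = rng.integers(len(domain))             # first pull: random point
    x_t = domain[idx]
    X.append(x_t)
    y.append(float(f(x_t)[0]))

print("best observed point:", X[int(np.argmax(y))])
```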

Full paper

AAMAS 2020, ML cube
Learning Probably Approximately Correct Maximin Strategies in Simulation-Based Games with Infinite Strategy Spaces
Authors:

Alberto Marchesi, Francesco Trovò, Nicola Gatti

Conference:

AAMAS 2020

Abstract:

We tackle the problem of learning equilibria in simulation-based games. In such games, the players’ utility functions cannot be described analytically, as they are given through a black-box simulator that can be queried to obtain noisy estimates of the utilities. This is the case in many real-world games in which a complete description of the elements involved is not available upfront, such as complex military settings and online auctions. In these situations, one usually needs to run costly simulation processes to get an accurate estimate of the game outcome. As a result, solving these games begets the challenge of designing learning algorithms that can find (approximate) equilibria with high confidence, using as few simulator queries as possible. Moreover, since running the simulator during the game is unfeasible, the algorithms must first perform a pure exploration learning phase and, then, use the (approximate) equilibrium learned this way to play the game. In this work, we focus on two-player zero-sum games with infinite strategy spaces. Drawing from the best arm identification literature, we design two algorithms with theoretical guarantees to learn maximin strategies in these games. The first one works in the fixed-confidence setting, guaranteeing the desired confidence level while minimizing the number of queries. Instead, the second algorithm fits the fixed-budget setting, maximizing the confidence without exceeding the given maximum number of queries. First, we formally prove δ-PAC theoretical guarantees for our algorithms under some regularity assumptions, which are encoded by letting the utility functions be drawn from a Gaussian process. Then, we experimentally evaluate our techniques on a testbed made of randomly generated games and instances representing simple real-world security settings.
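
To make the setting concrete, the sketch below shows a naïve fixed-budget baseline: discretize both players' strategy spaces, spread the simulator query budget uniformly over the resulting grid, and return the empirical pure-strategy maximin. The simulator, discretization, and budget are illustrative assumptions; the paper's algorithms allocate queries adaptively over infinite strategy spaces and come with δ-PAC guarantees.

```python
import numpy as np

# Naive fixed-budget baseline for a simulation-based zero-sum game: estimate the
# payoff on a uniform grid with equally split queries, then take the empirical
# pure-strategy maximin for player 1.

rng = np.random.default_rng(0)

def simulator(x, y, noise=0.1):
    """Black-box noisy estimate of player 1's utility (toy function)."""
    return np.sin(3 * x) * np.cos(2 * y) - 0.5 * (x - y) ** 2 + rng.normal(0, noise)

grid = np.linspace(0, 1, 10)
budget = 5000
queries_per_cell = budget // (len(grid) ** 2)

payoff = np.zeros((len(grid), len(grid)))
for i, x in enumerate(grid):
    for j, y in enumerate(grid):
        payoff[i, j] = np.mean([simulator(x, y) for _ in range(queries_per_cell)])

maximin_row = int(np.argmax(payoff.min(axis=1)))
print("empirical maximin strategy for player 1: x =", grid[maximin_row])
```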

Full paper

Polimi, ML cube
A combinatorial-bandit algorithm for the online joint bid/budget optimization of pay-per-click advertising campaigns
Authors:

Alessandro Nuara, Francesco Trovò, Nicola Gatti, Marcello Restelli

Conference:

AAAI 2018

Abstract:

Pay-per-click advertising includes various formats (e.g., search, contextual, and social) with a total investment of more than 140 billion USD per year. An advertising campaign is composed of several sub-campaigns, each with a different ad, and a cumulative daily budget. The allocation of the ads is governed by auction mechanisms. In this paper, we propose, for the first time to the best of our knowledge, an algorithm for the online joint bid/budget optimization of pay-per-click multi-channel advertising campaigns. We formulate the optimization problem as a combinatorial bandit problem, in which we use Gaussian Processes to estimate stochastic functions, Bayesian bandit techniques to address the exploration/exploitation problem, and a dynamic programming technique to solve a variation of the Multiple-Choice Knapsack problem. We experimentally evaluate our algorithm both in simulation, using a synthetic setting generated from real data from Yahoo!, and in a real-world application over an advertising period of two months.
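
The budget-allocation step mentioned in the abstract solves a variant of the Multiple-Choice Knapsack problem: pick exactly one (budget, value) option per sub-campaign so that the cumulative daily budget is respected and the total expected value is maximal. A minimal dynamic-programming sketch over a discretized budget grid is shown below; the option lists and discretization are illustrative assumptions, not the paper's estimated values.

```python
# Multiple-Choice Knapsack sketch: pick exactly one (cost, value) option per
# sub-campaign, maximizing total expected value under a cumulative daily budget.
# Budgets are discretized to integer units.

def multiple_choice_knapsack(campaigns, budget):
    """campaigns: list of option lists, each option an (integer_cost, value) pair.
    Returns (best_total_value, index of the option chosen for each campaign)."""
    NEG = float("-inf")
    dp = [0.0] * (budget + 1)              # zero campaigns: value 0 at any budget
    back = []                              # back[i][b]: option picked for campaign i
    for options in campaigns:
        new_dp, picks = [NEG] * (budget + 1), [None] * (budget + 1)
        for b in range(budget + 1):
            for idx, (cost, value) in enumerate(options):
                if cost <= b and dp[b - cost] + value > new_dp[b]:
                    new_dp[b], picks[b] = dp[b - cost] + value, idx
        dp = new_dp
        back.append(picks)
    # Trace back which option was chosen for each sub-campaign.
    b, chosen = budget, []
    for options, picks in zip(reversed(campaigns), reversed(back)):
        idx = picks[b]
        chosen.append(idx)
        b -= options[idx][0]
    return dp[budget], chosen[::-1]

# Three sub-campaigns, each with (daily budget, expected value) options.
campaigns = [
    [(1, 1.2), (3, 2.9), (5, 4.0)],
    [(2, 2.0), (4, 3.1)],
    [(1, 0.8), (2, 1.9), (6, 4.2)],
]
print(multiple_choice_knapsack(campaigns, budget=8))   # ~ (6.8, [2, 0, 0])
```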

Full paper

Polimi, ML cube
Dealing with interdependencies and uncertainty in multi-channel advertising campaigns optimization
Authors:

Alessandro Nuara, Nicola Sosio, Francesco Trovò, Maria Chiara Zaccardi, Nicola Gatti, Marcello Restelli

Conference:

WWW 2019

Abstract:

In 2017, Internet ad spending reached 209 billion USD worldwide, while, e.g., TV ads brought in 178 billion USD. An Internet advertising campaign includes up to thousands of sub-campaigns on multiple channels, e.g., search, social, display, whose parameters (bid and daily budget) need to be optimized every day, subject to a (cumulative) budget constraint. Such a process is often unaffordable for humans and its automation is crucial. As also shown by marketing funnel models, the sub-campaigns are usually interdependent, e.g., display ads induce awareness, increasing the number of impressions (and, thus, also the number of conversions) of search ads. This interdependence is widely exploited by humans in the optimization process, whereas, to the best of our knowledge, no algorithm takes it into account. In this paper, we provide the first model capturing the sub-campaigns interdependence. We also provide the IDIL algorithm, which, employing Granger Causality and Gaussian Processes, learns from past data, and returns an optimal stationary bid/daily budget allocation. We prove theoretical guarantees on the loss of IDIL w.r.t. the clairvoyant solution, and we show empirical evidence of its superiority in both realistic and real-world settings when compared with existing approaches.
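
The interdependence detection mentioned above, e.g., display impressions "Granger-causing" search impressions, can be sketched with a standard Granger-causality test. The synthetic series, lag, and the use of statsmodels are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Illustrative check of whether one sub-campaign's series "Granger-causes"
# another's: here, display impressions lead search impressions by one day.
rng = np.random.default_rng(0)
days = 200
display = rng.normal(100, 10, size=days)
search = 0.5 * np.roll(display, 1) + rng.normal(20, 5, size=days)
search[0] = search[1]                       # drop the wrap-around artifact

# Column order required by statsmodels: [effect, candidate cause].
data = np.column_stack([search, display])
results = grangercausalitytests(data, maxlag=2, verbose=False)
p_value = results[1][0]["ssr_ftest"][1]
print(f"p-value (display -> search, lag 1): {p_value:.4f}")
```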

Full paper

Polimi, ML cube
Targeting optimization for internet advertising by learning from logged bandit feedback
Authors:

Margherita Gasparini, Alessandro Nuara, Francesco Trovò, Nicola Gatti, Marcello Restelli

Conference:

IJCNN 2018

Abstract:

In the last two decades, online advertising has become the most effective way to sponsor a product or an event. The success of this advertising format is mainly due to the capability of the Internet channels to reach a broad audience and to target different groups of users with specific sponsored announcements. This is of paramount importance for media agencies, companies whose primary goal is to design ad campaigns that target only those users who are interested in the sponsored product, thus avoiding unnecessary costs due to the display of ads to uninterested users. In the present work, we develop an automatic method to find the best user targets (a.k.a. contexts) that a media agency can use in a given Internet advertising campaign. More specifically, we formulate the problem of target optimization as a Learning from Logged Bandit Feedback (LLBF) problem, and we propose the TargOpt algorithm, which uses a tree expansion of the target space to learn the partition that efficiently maximizes the campaign revenue. Furthermore, since the problem of finding the optimal target is intrinsically exponential in the number of the features, we propose a tree-search method, called A-TargOpt, and two heuristics to drive the tree expansion, aiming at providing an anytime solution. Finally, we present empirical evidence, on both synthetically generated and real-world data, that our algorithms provide a practical solution to find effective targets for Internet advertising.
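
The LLBF formulation can be made concrete with the standard inverse-propensity-scoring estimator: from logs of (context, shown target, logging probability, revenue), one can estimate the revenue a different targeting rule would have obtained. The log schema and the candidate rule below are illustrative assumptions, not TargOpt itself.

```python
import numpy as np

# Standard inverse-propensity-scoring (IPS) estimate of a new targeting rule's
# revenue from logged bandit feedback. Log schema and policies are illustrative.

rng = np.random.default_rng(0)
n_logs, n_targets = 10_000, 4

# Logged data: user feature, target shown by the (uniform) logging policy,
# its propensity, and the observed revenue.
age = rng.integers(18, 70, size=n_logs)
shown = rng.integers(0, n_targets, size=n_logs)
propensity = np.full(n_logs, 1.0 / n_targets)
revenue = (shown == (age > 40).astype(int)) * 1.0 + rng.normal(0, 0.1, n_logs)

def candidate_rule(age):
    """Deterministic targeting rule to evaluate: target 1 for users over 40."""
    return (age > 40).astype(int)

match = (candidate_rule(age) == shown).astype(float)
ips_estimate = np.mean(match / propensity * revenue)
print(f"estimated revenue per impression under the candidate rule: {ips_estimate:.3f}")
```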

Full paper

Polimi, ML cube
A characterization of quasi-perfect equilibria
Authors:

Nicola Gatti, Mario Gilli, Alberto Marchesi

Journal:

Games and Economic Behavior

Abstract:

We provide a characterization of quasi-perfect equilibria in n-player games, showing that any quasi-perfect equilibrium can be obtained as limit point of a sequence of Nash equilibria of a certain class of perturbed games in sequence form, and any limit point of a sequence of Nash equilibria of these perturbed games is a quasi-perfect equilibrium. We prove that, in games with three or more players, we need trembles defined as rational functions of the perturbation magnitude ε, whereas, in two-player games with nature, trembles expressed in terms of polynomial functions of ε suffice. Exploiting the relationship between sequence form and extensive form, we also provide a similar characterization in terms of perturbed games in extensive form, though not compliant with Selten’s definition of perturbed game.

Full paper

Polimi, ML cube
Combining reinforcement learning with rule-based controllers for transparent and general decision-making in autonomous driving
Authors:

Amarildo Likmeta, Alberto Maria Metelli, Andrea Tirinzoni, Riccardo Giol, Marcello Restelli, Danilo Romano

Journal:

Robotics and Autonomous Systems

Abstract:

The design of high-level decision-making systems is a topical problem in the field of autonomous driving. In this paper, we combine traditional rule-based strategies and reinforcement learning (RL) with the goal of achieving transparency and robustness. On the one hand, the use of handcrafted rule-based controllers allows for transparency, i.e., it is always possible to determine why a given decision was made, but they struggle to scale to complex driving scenarios, in which several objectives need to be considered. On the other hand, black-box RL approaches enable us to deal with more complex scenarios, but they are usually hard to interpret. In this paper, we combine the best properties of these two worlds by designing parametric rule-based controllers, in which interpretable rules can be provided by domain experts and their parameters are learned via RL. After illustrating how to apply parameter-based RL methods (PGPE) to this setting, we present extensive numerical simulations in a highway scenario and in two urban scenarios: intersection and roundabout. For each scenario, we show the formalization as an RL problem and we discuss the results of our approach in comparison with handcrafted rule-based controllers and black-box RL techniques.
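
The parameter-based learning step mentioned above (PGPE) can be sketched as follows: controller parameters are sampled from a Gaussian hyper-distribution, whole episodes are evaluated, and the mean and standard deviation of the hyper-distribution are updated from the episodic returns. The toy objective and hyperparameters are illustrative stand-ins for the driving scenarios; only the PGPE update itself follows the standard formulation.

```python
import numpy as np

# PGPE sketch: optimize the parameters of a (rule-based) controller by sampling
# them from a Gaussian hyper-distribution and updating mean/std from episodic
# returns. The quadratic "episode return" below stands in for a driving scenario.

rng = np.random.default_rng(0)
target = np.array([2.0, -1.0, 0.5])            # unknown best rule parameters (toy)

def episode_return(params):
    return -np.sum((params - target) ** 2)     # higher is better

mu, sigma = np.zeros(3), np.ones(3)
lr_mu, lr_sigma, pop_size = 0.1, 0.02, 20

for _ in range(300):
    thetas = rng.normal(mu, sigma, size=(pop_size, 3))   # sampled controllers
    returns = np.array([episode_return(t) for t in thetas])
    advantages = (returns - returns.mean())[:, None]     # baseline-corrected
    # Standard PGPE gradient estimates for the mean and standard deviation.
    grad_mu = np.mean(advantages * (thetas - mu), axis=0)
    grad_sigma = np.mean(
        advantages * ((thetas - mu) ** 2 - sigma ** 2) / sigma, axis=0
    )
    mu += lr_mu * grad_mu
    sigma = np.maximum(sigma + lr_sigma * grad_sigma, 1e-3)

print("learned parameters:", np.round(mu, 2))   # should approach [2.0, -1.0, 0.5]
```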

Full paper
