Dr Luke Marris
I am an artificial intelligence engineer and researcher, with expertise in machine learning, optimization, deep learning, reinforcement learning, game theory, and multiagent systems. In particular, I am interested in training deep reinforcement learning agents, at scale, in many-player, mixed-motive games, with a focus on building principled learning algorithms that provably select and compute equilibria. Games are more than activities we play with our friends: any interaction between multiple self-interested players can be modelled as a game. Markets, trade negotiations, social norms, defence policy, environmental treaties, auctions, animal behaviour, and coordinated pandemic response are all complex games. Understanding and calculating fair, stable, and rewarding solutions to games is crucial to solving many real-world problems.
 Senior Research Engineer, Google DeepMind, London
 PhD, University College London, Thesis: Multiagent Training in N-Player, General-Sum Games
 Information Engineering, Bachelor's and Master's, First Class, University of Cambridge
Software and Libraries

twoxtwogame LaTeX Package
This is a package for the visualization of 2x2 normal-form games. The package is based on PGF/TikZ and produces beautiful vector graphics that are intended for use in scientific publications. The commands support the creation of graphical representations of 2x2 games, the visualization of equilibria in 2x2 games, and game embeddings for 2x2 games.
Papers and Publications

Multiagent Training in N-Player, General-Sum Games
Luke Marris UCL Discovery N-Player, General-Sum, Multiagent Training, PhD Thesis
Recent successes in two-player, purely competitive (so-called zero-sum) games have made headlines around the world. Initially, perfect information games with vast state spaces, such as Go and Chess, were conquered using a combination of search, deep neural network function approximation, and self-play reinforcement learning. Subsequently, more challenging, imperfect information games such as StarCraft were tackled using recurrent deep neural networks, human-play priors, and policy-space response oracle reinforcement learning. Although these are important advances, from a game theoretic perspective two-player zero-sum games are the simplest class, and many algorithms are known to converge to a solution concept called Nash equilibrium in this setting. For two-player zero-sum games the Nash equilibrium is unexploitable (there is no strategy that an opponent could play that would reduce one's reward) and interchangeable (if there is more than one Nash equilibrium in a game, all equilibria are equally good, and it is not necessary for players to coordinate on the equilibrium being played). But most importantly, Nash equilibrium coincides with the minimax solution in two-player zero-sum games, which is a single-objective optimization. Therefore, when playing against rational opponents, the Nash equilibrium is the obvious, perhaps fundamental, objective to optimize for when training learning agents in two-player zero-sum games. Progress on n-player, general-sum games, however, has remained limited. It is both unclear what objective to optimize for and difficult to build algorithms that converge to any interesting objective. This is disappointing because the majority of the world's interactions are not purely competitive, have interesting mixed-motive dynamics, and have more than two players.
This is an area of interest to economists, sociologists, policy makers, artificial intelligence researchers, and any other discipline that concerns multiple agents that have to compete or cooperate to achieve their objectives. Previous work has mainly involved applying single-agent techniques, or two-player zero-sum techniques, and hoping for the best. The overall aim of this work is to unlock progress beyond the narrow domain of two-player zero-sum games, to the most general space of n-player, general-sum games, using principled game theoretic techniques. To this end, this work utilizes more flexible mediated equilibrium solution concepts, correlated equilibrium and coarse correlated equilibrium, which are more suitable for general-sum games. This choice is for convenience and to enable the main goal of this work: building algorithms that scale in n-player, general-sum settings. This thesis a) builds an intuition for the space of games by studying transforms that do not alter their equilibria, b) proposes a novel normal-form game embedding, c) develops tools for visualizing large extensive-form games, d) proposes an efficient equilibrium selection criterion, e) builds fast neural network solvers for finding equilibria in all normal-form games, f) proposes population-based learning algorithms for training policies in n-player, general-sum extensive-form games, g) proposes learning algorithms for training policies in Markov games, and h) develops novel rating algorithms to evaluate strategies in n-player, general-sum games.
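The observation above, that the Nash equilibrium coincides with the minimax solution in two-player zero-sum games, can be made concrete in a few lines. The following sketch (illustrative only, not part of the thesis) solves a 2x2 zero-sum game in closed form:

```python
# Closed-form solution of a 2x2 zero-sum game (row payoffs [[a, b], [c, d]]),
# assuming a fully mixed equilibrium (no saddle point, denominator non-zero).

def solve_2x2_zero_sum(a, b, c, d):
    """Return (p, q, value): p = P(row plays action 0), q = P(col plays
    action 0), and the game value for the row player."""
    denom = a - b - c + d
    p = (d - c) / denom      # chosen to make the column player indifferent
    q = (d - b) / denom      # chosen to make the row player indifferent
    value = (a * d - b * c) / denom
    return p, q, value

# Matching Pennies: row wins +1 on a match, loses 1 on a mismatch.
p, q, v = solve_2x2_zero_sum(1, -1, -1, 1)
print(p, q, v)  # 0.5 0.5 0.0: both players mix uniformly, the value is 0
```

Because minimax is a single-objective problem, the same solution can equivalently be found by maximizing the row player's worst-case payoff, which is what makes two-player zero-sum training objectives so well-behaved.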

Visualizing 2x2 Normal-Form Games: twoxtwogame LaTeX Package
Luke Marris, Ian Gemp, Siqi Liu, Joel Z. Leibo, Georgios Piliouras arXiv 2024, GitHub, CTAN Normal-Form Games, 2x2 Games, LaTeX, Visualization
Normal-form games with two players, each with two strategies, are the most studied class of games. These so-called 2x2 games are used to model a variety of strategic interactions, and appear in game theory, economics, and artificial intelligence research. However, tools for describing and visualizing such games are lacking. This work introduces a LaTeX package for visualizing 2x2 games. It has two goals: first, to provide high-quality tools and vector graphic visualizations suitable for scientific publications; and second, to help promote standardization of the names and representations of 2x2 games. The LaTeX package, twoxtwogame, is maintained on GitHub and mirrored on CTAN, and is available under a permissive Apache 2 license.

NfgTransformer: Equivariant Representation Learning for Normal-Form Games
Siqi Liu, Luke Marris, Georgios Piliouras, Ian Gemp, Nicolas Heess arXiv 2024, Open Review 2024, GitHub 2024 Normal-Form Games, Neural Networks, Game Theory, Representations
Normal-form games (NFGs) are the fundamental model of strategic interaction. We study their representation using neural networks. We describe the inherent equivariance of NFGs (any permutation of strategies describes an equivalent game), as well as the challenges this poses for representation learning. We then propose the NfgTransformer architecture that leverages this equivariance, leading to state-of-the-art performance in a range of game-theoretic tasks, including equilibrium solving, deviation gain estimation, and ranking, with a common approach to NFG representation. We show that the resulting model is interpretable and versatile, paving the way towards deep learning systems capable of game-theoretic reasoning when interacting with humans and with each other.
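The permutation equivariance described above can be checked directly on a small example. This sketch (illustrative only, not the NfgTransformer architecture; the payoff values are made up) shows that a row player's best responses permute together with the row labels:

```python
# Equivariance of a normal-form game under strategy relabelling: permuting
# the row player's strategies permutes any strategy-level quantity the same
# way. Here we check the row player's pure best responses.

def best_responses(payoffs, col):
    """Row indices maximizing the row player's payoff against column `col`."""
    column = [row[col] for row in payoffs]
    m = max(column)
    return {i for i, v in enumerate(column) if v == m}

row_payoffs = [[3, 0],
               [5, 1],
               [4, 4]]

perm = [2, 0, 1]  # new row i holds old row perm[i]
permuted = [row_payoffs[perm[i]] for i in range(3)]

for col in range(2):
    original = best_responses(row_payoffs, col)
    relabelled = best_responses(permuted, col)
    # equivariance: old best response j reappears at new index perm^{-1}(j)
    assert relabelled == {perm.index(j) for j in original}
print("best responses permute with the strategy labels")
```

An architecture that respects this symmetry by construction does not have to relearn it from data, which is the motivation for the equivariant design.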

Approximating the Core via Iterative Coalition Sampling
Ian Gemp, Marc Lanctot, Luke Marris, Yiran Mao, Edgar Duéñez-Guzmán, Sarah Perrin, Andras Gyorgy, Romuald Elie, Georgios Piliouras, Michael Kaisers, Daniel Hennes, Kalesha Bullard, Kate Larson, Yoram Bachrach arXiv 2024, AAMAS 2024 Core, Neural Networks, Game Theory
The core is a central solution concept in cooperative game theory, defined as the set of feasible allocations or payments such that no subset of agents has incentive to break away and form their own subgroup or coalition. However, it has long been known that the core (and approximations, such as the least-core) are hard to compute. This limits our ability to analyze cooperative games in general, and to fully embrace cooperative game theory contributions in domains such as explainable AI (XAI), where the core can complement Shapley values to identify influential features or instances supporting predictions by black-box models. We propose novel iterative algorithms for computing variants of the core, which avoid the computational bottleneck of many other approaches, namely solving large linear programs. As such, they scale better to very large problems, as we demonstrate across different classes of cooperative games, including weighted voting games, induced subgraph games, and marginal contribution networks. We also explore our algorithms in the context of XAI, providing further evidence of the power of the core for such applications.
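The definition of the core above can be stated as a brute-force check on a toy game. This sketch (illustrative only; the paper's contribution is precisely avoiding this exponential enumeration via coalition sampling) enumerates all coalitions:

```python
# Brute-force core-membership check for a small cooperative game: an
# allocation is in the core iff it is efficient (sums to the grand
# coalition's value) and no coalition S can do better on its own.
from itertools import combinations

def in_core(v, allocation, players):
    """v maps coalition tuples to values; allocation[i] is player i's pay."""
    if abs(sum(allocation) - v[players]) > 1e-9:
        return False  # not an efficient split of the grand coalition
    for r in range(1, len(players) + 1):
        for S in combinations(players, r):
            if sum(allocation[i] for i in S) < v[S] - 1e-9:
                return False  # coalition S has incentive to break away
    return True

# Toy 3-player game: any pair earns 1.0 together; all three earn 1.5.
players = (0, 1, 2)
v = {S: 0.0 for r in range(1, 4) for S in combinations(players, r)}
for S in combinations(players, 2):
    v[S] = 1.0
v[players] = 1.5

print(in_core(v, (0.5, 0.5, 0.5), players))  # True: every pair receives 1.0
print(in_core(v, (1.0, 0.5, 0.0), players))  # False: pair (1, 2) gets 0.5 < 1
```

The check is O(2^n) in the number of players, which is exactly the bottleneck that iterative coalition sampling is designed to sidestep.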

States as Strings as Strategies: Steering Language Models with GameTheoretic Solvers
Ian Gemp, Yoram Bachrach, Marc Lanctot, Roma Patel, Vibhavari Dasagi, Luke Marris, Georgios Piliouras, Karl Tuyls arXiv 2024, GitHub 2024, Wired 2024, Quanta 2024 Large Language Models, Game Theory
Game theory is the study of mathematical models of strategic interactions among rational agents. Language is a key medium of interaction for humans, though it has historically proven difficult to model dialogue and its strategic motivations mathematically. A suitable model of the players, strategies, and payoffs associated with linguistic interactions (i.e., a binding to the conventional symbolic logic of game theory) would enable existing game-theoretic algorithms to provide strategic solutions in the space of language. In other words, a binding could provide a route to computing stable, rational conversational strategies in dialogue. Large language models (LLMs) have arguably reached a point where their generative capabilities can enable realistic, human-like simulations of natural dialogue. By prompting them in various ways, we can steer their responses towards different output utterances. Leveraging the expressivity of natural language, LLMs can also help us quickly generate new dialogue scenarios, which are grounded in real-world applications. In this work, we present one possible binding from dialogue to game theory, as well as generalizations of existing equilibrium finding algorithms to this setting. In addition, by exploiting LLMs' generation capabilities along with our proposed binding, we can synthesize a large repository of formally defined games in which one can study and test game-theoretic solution concepts. We also demonstrate how one can combine LLM-driven game generation, game-theoretic solvers, and imitation learning to construct a process for improving the strategic capabilities of LLMs.

Neural Population Learning beyond Symmetric Zero-sum Games
Siqi Liu, Luke Marris, Marc Lanctot, Georgios Piliouras, Joel Z Leibo, Nicolas Heess arXiv 2024, AAMAS 2024 NeuPL, CCE, n-player, general-sum, JPSRO
We study computationally efficient methods for finding equilibria in n-player, general-sum games, specifically ones that afford complex visuomotor skills. We show how existing methods would struggle in this setting, either computationally or in theory. We then introduce NeuPL-JPSRO, a neural population learning algorithm that benefits from transfer learning of skills and converges to a Coarse Correlated Equilibrium (CCE) of the game. We show empirical convergence in a suite of OpenSpiel games, validated rigorously by exact game solvers. We then deploy NeuPL-JPSRO to complex domains, where our approach enables adaptive coordination in a MuJoCo control domain and skill transfer in capture-the-flag. Our work shows that equilibrium convergent population learning can be implemented at scale and in generality, paving the way towards solving real-world games between heterogeneous players with mixed motives.

Evaluating Agents using Social Choice Theory
Marc Lanctot, Kate Larson, Yoram Bachrach, Luke Marris, Zun Li, Avishkar Bhoopchand, Thomas Anthony, Brian Tanner, Anna Koop arXiv 2023 Ranking, Rating, Social Choice Theory
We argue that many general evaluation problems can be viewed through the lens of voting theory. Each task is interpreted as a separate voter, which requires only ordinal rankings or pairwise comparisons of agents to produce an overall evaluation. By viewing the aggregator as a social welfare function, we are able to leverage centuries of research in social choice theory to derive principled evaluation frameworks with axiomatic foundations. These evaluations are interpretable and flexible, while avoiding many of the problems currently facing cross-task evaluation. We apply this Voting-as-Evaluation (VasE) framework across multiple settings, including reinforcement learning, large language models, and humans. In practice, we observe that VasE can be more robust than popular evaluation frameworks (Elo and Nash averaging), discovers properties in the evaluation data not evident from scores alone, and can predict outcomes better than Elo in a complex seven-player game. We identify one particular approach, maximal lotteries, that satisfies important consistency properties relevant to evaluation, is computationally efficient (polynomial in the size of the evaluation data), and identifies game-theoretic cycles.

Generative Adversarial Equilibrium Solvers
Denizalp Goktas, David C Parkes, Ian Gemp, Luke Marris, Georgios Piliouras, Romuald Elie, Guy Lever, Andrea Tacchetti arXiv 2023 Generalized Nash Equilibrium, Neural Networks
We introduce the use of generative adversarial learning to compute equilibria in general game-theoretic settings, specifically the generalized Nash equilibrium (GNE) in pseudo-games, and its specific instantiation as the competitive equilibrium (CE) in Arrow-Debreu competitive economies. Pseudo-games are a generalization of games in which players' actions affect not only the payoffs of other players but also their feasible action spaces. Although the computation of GNE and CE is intractable in the worst case, i.e., PPAD-hard, in practice, many applications only require solutions with high accuracy in expectation over a distribution of problem instances. We introduce Generative Adversarial Equilibrium Solvers (GAES): a family of generative adversarial neural networks that can learn GNE and CE from only a sample of problem instances. We provide computational and sample complexity bounds, and apply the framework to finding Nash equilibria in normal-form games, CE in Arrow-Debreu competitive economies, and GNE in an environmental economic model of the Kyoto mechanism.

Approximating Nash Equilibria in Normal-Form Games via Stochastic Optimization
Ian Gemp, Luke Marris, Georgios Piliouras arXiv 2023, ICLR OpenReview 2024, ICLR Honorable Mention 2024, ICLR Oral 2024 Nash Equilibrium, Polymatrix Game, Normal-Form Game
We propose the first, to our knowledge, loss function for approximate Nash equilibria of normal-form games that is amenable to unbiased Monte Carlo estimation. This construction allows us to deploy standard non-convex stochastic optimization techniques for approximating Nash equilibria, resulting in novel algorithms with provable guarantees. We complement our theoretical analysis with experiments demonstrating that stochastic gradient descent can outperform previous state-of-the-art approaches.
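The enabling observation, that expectations over joint play admit unbiased Monte Carlo estimates from sampled actions, can be sketched as follows. This is not the paper's actual loss function, just the underlying estimation idea on a made-up 2-player game:

```python
import random

# Unbiased Monte Carlo estimation of one player's deviation gain (the
# building block of Nash approximation losses): the sample average of
# single-draw gains converges to the exact expectation.

random.seed(0)

A = [[3, 0], [5, 1]]   # row player's payoffs in a 2x2 game (made up)
p = [0.5, 0.5]         # row player's mixed strategy
q = [0.5, 0.5]         # column player's mixed strategy

def exact_gain(dev_row):
    """Exact gain for row from deviating to pure action dev_row."""
    on_policy = sum(p[i] * q[j] * A[i][j] for i in range(2) for j in range(2))
    deviate = sum(q[j] * A[dev_row][j] for j in range(2))
    return deviate - on_policy

def sampled_gain(dev_row, n=20000):
    """Monte Carlo estimate from sampled joint actions; unbiased per sample."""
    total = 0.0
    for _ in range(n):
        i = random.choices([0, 1], weights=p)[0]
        j = random.choices([0, 1], weights=q)[0]
        total += A[dev_row][j] - A[i][j]
    return total / n

print(exact_gain(1))                                 # 0.75 exactly
print(abs(sampled_gain(1) - exact_gain(1)) < 0.05)   # True with high probability
```

Because each sample is unbiased, stochastic gradient methods can be driven by these estimates without ever forming the full (possibly enormous) payoff tensor.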

Equilibrium-Invariant Embedding, Metric Space, and Fundamental Set of 2×2 Normal-Form Games
Luke Marris, Ian Gemp, Georgios Piliouras arXiv 2023, DeepMind, EC@EC 2023 Poster Embedding, Invariance, Nash Equilibrium, Correlated Equilibrium, Dimensionality Reduction, Visualization
Equilibrium solution concepts of normal-form games, such as Nash equilibria, correlated equilibria, and coarse correlated equilibria, describe the joint strategy profiles from which no player has an incentive to unilaterally deviate. They are widely studied in game theory, economics, and multiagent systems. Equilibrium concepts are invariant under certain transforms of the payoffs. We define an equilibrium-inspired distance metric for the space of all normal-form games and uncover a distance-preserving equilibrium-invariant embedding. Furthermore, we propose an additional transform which defines a better-response-invariant distance metric and embedding. To demonstrate these metric spaces we study 2×2 games. The equilibrium-invariant embedding of 2×2 games has an efficient two-variable parameterization (a reduction from eight), where each variable geometrically describes an angle on a unit circle. Interesting properties can be spatially inferred from the embedding, including: equilibrium support, cycles, competition, coordination, distances, best responses, and symmetries. The best-response-invariant embedding of 2×2 games, after considering symmetries, rediscovers a set of 15 games, and their respective equivalence classes. We propose that this set of game classes is fundamental and captures all possible interesting strategic interactions in 2×2 games. We introduce a directed graph representation and name for each class. Finally, we leverage the tools developed for 2×2 games to develop game theoretic visualizations of large normal-form and extensive-form games that aim to fingerprint the strategic interactions that occur within.
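The payoff invariance mentioned above ("Equilibrium concepts are invariant under certain transforms of the payoffs") can be verified numerically. One such transform adds an arbitrary offset to a player's payoffs for each fixed opponent action; this sketch (illustrative only, with made-up offsets) checks that best responses, and hence equilibria, are unchanged:

```python
# Equilibrium-invariant payoff transform: shifting the row player's payoffs
# by a constant c_j within each column j cannot change which rows are best
# responses, because every row in that column is shifted by the same amount.

def row_best_responses(payoffs, col):
    column = [row[col] for row in payoffs]
    m = max(column)
    return {i for i, v in enumerate(column) if v == m}

game = [[1, -1],
        [-1, 1]]
offsets = [10.0, -3.5]  # arbitrary per-column shifts for the row player

shifted = [[game[i][j] + offsets[j] for j in range(2)] for i in range(2)]

for j in range(2):
    assert row_best_responses(game, j) == row_best_responses(shifted, j)
print("per-column payoff offsets preserve best responses")
```

Quotienting the space of games by such transforms is what collapses the eight payoff parameters of a 2×2 game down to the two angles of the embedding.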

Search-Improved Game-Theoretic Multiagent Reinforcement Learning in General and Negotiation Games
Zun Li, Marc Lanctot, Kevin McKee, Luke Marris, Ian Gemp, Daniel Hennes, Paul Muller, Kate Larson, Yoram Bachrach, Michael P. Wellman arXiv 2023, AAMAS 2023 Algorithms, Nash Bargaining, Correlated Equilibrium
Multiagent reinforcement learning (MARL) has benefited significantly from population-based and game-theoretic training regimes. One approach, Policy-Space Response Oracles (PSRO), employs standard reinforcement learning to compute response policies via approximate best responses, and combines them via meta-strategy selection. We augment PSRO by adding a novel search procedure with generative sampling of world states, and introduce two new meta-strategy solvers based on the Nash bargaining solution. We evaluate PSRO's ability to compute approximate Nash equilibria, and its performance in two negotiation games: Colored Trails, and Deal or No Deal. We conduct behavioral studies where human participants negotiate with our agents (N=346). We find that search with generative modeling finds stronger policies during both training time and test time, enables online Bayesian co-player prediction, and can produce agents that, when negotiating with humans, achieve social welfare comparable to that of humans trading among themselves.

Turbocharging Solution Concepts: Solving NEs, CEs and CCEs with Neural Equilibrium Solvers
Luke Marris, Ian Gemp, Thomas Anthony, Andrea Tacchetti, Siqi Liu, Karl Tuyls arXiv 2022, NeurIPS 2022, Poster and Presentation, Open Review, EC@EC 2023 Poster Deep Neural Networks, Invariant Architecture, Equivariant Architecture, Dual Optimization, Invariant Embedding, Correlated Equilibrium, Coarse Correlated Equilibrium, Semi-Supervised Learning
Solution concepts such as Nash equilibria, correlated equilibria, and coarse correlated equilibria are useful components for many multiagent machine learning algorithms. Unfortunately, solving a normal-form game can take prohibitive or non-deterministic time to converge, and can fail. We introduce the Neural Equilibrium Solver, which utilizes a special equivariant neural network architecture to approximately solve the space of all games of fixed shape, buying speed and determinism. We define a flexible equilibrium selection framework that is capable of uniquely selecting an equilibrium that minimizes relative entropy, or maximizes welfare. The network is trained without needing to generate any supervised training data. We show remarkable zero-shot generalization to larger games. We argue that such a network is a powerful component for many possible multiagent algorithms.

Game Theoretic Rating in N-Player, General-Sum Games with Equilibria
Luke Marris, Marc Lanctot, Ian Gemp, Shayegan Omidshafiei, Stephen McAleer, Jerome Connor, Karl Tuyls, Thore Graepel arXiv 2022, DeepMind Rating, Ranking, Nash Equilibrium, Correlated Equilibrium, Coarse Correlated Equilibrium, Nash Average
Rating strategies in a game is an important area of research in game theory and artificial intelligence, and can be applied to any real-world competitive or cooperative setting. Traditionally, only transitive dependencies between strategies have been used to rate strategies (e.g. Elo); however, recent work has expanded ratings to utilize game theoretic solutions to better rate strategies in non-transitive games. This work generalizes these ideas and proposes novel algorithms suitable for N-player, general-sum rating of strategies in normal-form games according to the payoff rating system. This enables well-established solution concepts, such as equilibria, to be leveraged to efficiently rate strategies in games with complex strategic interactions, which arise in multiagent training and real-world interactions between many agents. We empirically validate our methods on real-world normal-form data (Premier League) and multiagent reinforcement learning agent evaluation.
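A simplified sketch of equilibrium-based rating (not the paper's full payoff rating construction): rate each of the row player's strategies by its expected payoff against the column marginal of an equilibrium joint distribution. On rock-paper-scissors extended with a made-up dominated fourth strategy, the cyclic strategies rate equally while the weak one rates strictly lower:

```python
# Equilibrium-based ratings: unlike Elo, rating against an equilibrium
# marginal handles non-transitive (cyclic) strategy structure gracefully.

def ratings(payoffs, joint):
    n_rows, n_cols = len(payoffs), len(payoffs[0])
    col_marginal = [sum(joint[i][j] for i in range(n_rows))
                    for j in range(n_cols)]
    return [sum(col_marginal[j] * payoffs[i][j] for j in range(n_cols))
            for i in range(n_rows)]

# Rock-Paper-Scissors plus a weak fourth strategy that loses to R, P and S.
rps_w = [[0, -1, 1, 1],
         [1, 0, -1, 1],
         [-1, 1, 0, 1],
         [-1, -1, -1, 0]]
# A Nash equilibrium mixes uniformly over R, P, S and never plays the weak
# strategy; form the (independent) joint distribution from those marginals.
nash = [1 / 3, 1 / 3, 1 / 3, 0.0]
joint = [[nash[i] * nash[j] for j in range(4)] for i in range(4)]

r = ratings(rps_w, joint)
print(r)  # R, P, S rate ~0.0; the dominated strategy rates ~-1.0
```

A purely transitive rating scheme would struggle here: R, P and S each beat and lose to one another, yet the equilibrium-based rating correctly ranks all three above the dominated strategy.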

Developing, evaluating and scaling learning agents in multiagent environments
Ian Gemp, Thomas Anthony, Yoram Bachrach, Avishkar Bhoopchand, Kalesha Bullard, Jerome Connor, Vibhavari Dasagi, Bart De Vylder, Edgar A DuéñezGuzmán, Romuald Elie, Richard Everett, Daniel Hennes, Edward Hughes, Mina Khan, Marc Lanctot, Kate Larson, Guy Lever, Siqi Liu, Luke Marris, Kevin R McKee, Paul Muller, Julien Pérolat, Florian Strub, Andrea Tacchetti, Eugene Tarassov, Zhe Wang, Karl Tuyls AI Communications, arXiv 2022 Review
The Game Theory & Multi-Agent team at DeepMind studies several aspects of multiagent learning, ranging from computing approximations to fundamental concepts in game theory, to simulating social dilemmas in rich spatial environments, and training 3D humanoids in difficult team coordination tasks. A signature aim of our group is to use the resources and expertise made available to us at DeepMind in deep reinforcement learning to explore multiagent systems in complex environments, and to use these benchmarks to advance our understanding. Here, we summarise the recent work of our team and present a taxonomy that we feel highlights many important open challenges in multiagent research.

Simplex Neural Population Learning: Any-Mixture Bayes-Optimality in Symmetric Zero-sum Games
S Liu, M Lanctot, L Marris, N Heess ICML 2022, arXiv 2022, Poster Algorithms, Equilibrium Computation, Conditional Policy, Generalization
Learning to play optimally against any mixture over a diverse set of strategies is of significant practical interest in competitive games. In this paper, we propose Simplex-NeuPL, which satisfies two desiderata simultaneously: i) learning a population of strategically diverse basis policies, represented by a single conditional network; ii) using the same network, learning best-responses to any mixture over the simplex of basis policies. We show that the resulting conditional policies incorporate prior information about their opponents effectively, enabling near-optimal returns against arbitrary mixture policies in a game with tractable best-responses. We verify that such policies behave Bayes-optimally under uncertainty and offer insights into using this flexibility at test time. Finally, we offer evidence that learning best-responses to any mixture policies is an effective auxiliary task for strategic exploration, which, by itself, can lead to more performant populations.

NeuPL: Neural Population Learning
S Liu, L Marris, D Hennes, J Merel, N Heess, T Graepel ICLR 2022, arXiv 2022, Slides, Website, Talk Algorithms, Equilibrium Computation, Conditional Policy, Generalization
Learning in strategy games (e.g. StarCraft, poker) requires the discovery of diverse policies. This is often achieved by iteratively training new policies against existing ones, growing a policy population that is robust to exploitation. This iterative approach suffers from two issues in real-world games: a) under a finite budget, approximate best-response operators at each iteration need truncating, resulting in under-trained good-responses populating the population; b) repeated learning of basic skills at each iteration is wasteful and becomes intractable in the presence of increasingly strong opponents. In this work, we propose Neural Population Learning (NeuPL) as a solution to both issues. NeuPL offers convergence guarantees to a population of best-responses under mild assumptions. By representing a population of policies within a single conditional model, NeuPL enables transfer learning across policies. Empirically, we show the generality, improved performance, and efficiency of NeuPL across several test domains. Most interestingly, we show that novel strategies become more accessible, not less, as the neural population expands.

Multiagent training beyond zero-sum with correlated equilibrium meta-solvers
L Marris, P Muller, M Lanctot, K Tuyls, T Graepel ICML 2021, arXiv 2021, Slides, Talk, Code, DeepMind Algorithms, Equilibrium Computation, JPSRO, MGCE, Correlated Equilibrium, Coarse Correlated Equilibrium, PSRO, Self Play
Two-player, constant-sum games are well studied in the literature, but there has been limited progress outside of this setting. We propose Joint Policy-Space Response Oracles (JPSRO), an algorithm for training agents in n-player, general-sum extensive-form games, which provably converges to an equilibrium. We further suggest correlated equilibria (CE) as promising meta-solvers, and propose a novel solution concept, Maximum Gini Correlated Equilibrium (MGCE), a principled and computationally efficient family of solutions for solving the correlated equilibrium selection problem. We conduct several experiments using CE meta-solvers for JPSRO and demonstrate convergence on n-player, general-sum games.
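The convergence target here, a (coarse) correlated equilibrium, can be checked directly on a small game: under a CCE, no player gains by unilaterally committing to a fixed action before the joint recommendation is drawn. A sketch of that check (illustrative only; JPSRO itself constructs the game iteratively), using the classic Chicken example:

```python
# CCE check for a 2-player normal-form game: compare each player's value
# under the joint distribution against every fixed-action deviation played
# versus the opponent's marginal.

def is_cce(payoffs_row, payoffs_col, joint, tol=1e-9):
    n, m = len(payoffs_row), len(payoffs_row[0])
    v_row = sum(joint[i][j] * payoffs_row[i][j] for i in range(n) for j in range(m))
    v_col = sum(joint[i][j] * payoffs_col[i][j] for i in range(n) for j in range(m))
    col_marg = [sum(joint[i][j] for i in range(n)) for j in range(m)]
    row_marg = [sum(joint[i][j] for j in range(m)) for i in range(n)]
    row_ok = all(sum(col_marg[j] * payoffs_row[a][j] for j in range(m)) <= v_row + tol
                 for a in range(n))
    col_ok = all(sum(row_marg[i] * payoffs_col[i][b] for i in range(n)) <= v_col + tol
                 for b in range(m))
    return row_ok and col_ok

# Game of Chicken, action 0 = Dare, action 1 = Chicken; (Dare, Dare) is bad.
row = [[0, 7], [2, 6]]
col = [[0, 2], [7, 6]]
# A correlated device: weight 1/3 each on (D,C), (C,D) and (C,C).
joint = [[0.0, 1 / 3], [1 / 3, 1 / 3]]

print(is_cce(row, col, joint))  # True: no fixed deviation beats value 5.0
```

The correlated joint attains expected payoff 5 for each player, better for the pair than any Nash of the game, which is exactly why mediated solution concepts are attractive meta-solvers beyond zero-sum.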

From motor control to team play in simulated humanoid football
Siqi Liu, Guy Lever, Zhe Wang, Josh Merel, SM Eslami, Daniel Hennes, Wojciech M Czarnecki, Yuval Tassa, Shayegan Omidshafiei, Abbas Abdolmaleki, Noah Y Siegel, Leonard Hasenclever, Luke Marris, Saran Tunyasuvunakool, H Francis Song, Markus Wulfmeier, Paul Muller, Tuomas Haarnoja, Brendan D Tracey, Karl Tuyls, Thore Graepel, Nicolas Heess Science Robotics, arXiv 2021 Video, Two Minute Papers, 60 Minutes, World Economic Forum, WIRED, New Scientist, DeepMind Blog, Altmetrics MultiAgent, Physics, Football, Soccer
Intelligent behaviour in the physical world exhibits structure at multiple spatial and temporal scales. Although movements are ultimately executed at the level of instantaneous muscle tensions or joint torques, they must be selected to serve goals defined on much longer timescales, and in terms of relations that extend far beyond the body itself, ultimately involving coordination with other agents. Recent research in artificial intelligence has shown the promise of learning-based approaches to the respective problems of complex movement, longer-term planning and multiagent coordination. However, there is limited research aimed at their integration. We study this problem by training teams of physically simulated humanoid avatars to play football in a realistic virtual environment. We develop a method that combines imitation learning, single-agent and multiagent reinforcement learning and population-based training, and makes use of transferable representations of behaviour for decision making at different levels of abstraction. In a sequence of stages, players first learn to control a fully articulated body to perform realistic, human-like movements such as running and turning; they then acquire mid-level football skills such as dribbling and shooting; finally, they develop awareness of others and play as a team, bridging the gap between low-level motor control at a timescale of milliseconds, and coordinated goal-directed behaviour as a team at the timescale of tens of seconds. We investigate the emergence of behaviours at different levels of abstraction, as well as the representations that underlie these behaviours, using several analysis techniques, including statistics from real-world sports analytics. Our work constitutes a complete demonstration of integrated decision-making at multiple scales in a physically embodied multiagent setting.

Backpropagation and the brain
Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, Geoffrey Hinton Nature Reviews Neuroscience, Open Access, Singularity Hub, Altmetric Biologically Plausible
During learning, the brain modifies synapses to improve behaviour. In the cortex, synapses are embedded within multilayered networks, making it difficult to determine the effect of an individual synaptic modification on the behaviour of the system. The backpropagation algorithm solves this problem in deep artificial neural networks, but historically it has been viewed as biologically problematic. Nonetheless, recent developments in neuroscience and the successes of artificial neural networks have reinvigorated interest in whether backpropagation offers insights for understanding learning in the cortex. The backpropagation algorithm learns quickly by computing synaptic updates using feedback connections to deliver error signals. Although feedback connections are ubiquitous in the cortex, it is difficult to see how they could deliver the error signals required by strict formulations of backpropagation. Here we build on past and recent developments to argue that feedback connections may instead induce neural activities whose differences can be used to locally approximate these signals and hence drive effective learning in deep networks in the brain.

A generalized training approach for multiagent learning
Paul Muller, Shayegan Omidshafiei, Mark Rowland, Karl Tuyls, Julien Perolat, Siqi Liu, Daniel Hennes, Luke Marris, Marc Lanctot, Edward Hughes, Zhe Wang, Guy Lever, Nicolas Heess, Thore Graepel, Remi Munos ICLR 2019, arXiv 2019
This paper investigates a population-based training regime based on game-theoretic principles called Policy-Space Response Oracles (PSRO). PSRO is general in the sense that it (1) encompasses well-known algorithms such as fictitious play and double oracle as special cases, and (2) in principle applies to general-sum, many-player games. Despite this, prior studies of PSRO have been focused on two-player zero-sum games, a regime wherein Nash equilibria are tractably computable. In moving from two-player zero-sum games to more general settings, computation of Nash equilibria quickly becomes infeasible. Here, we extend the theoretical underpinnings of PSRO by considering an alternative solution concept, α-Rank, which is unique (thus faces no equilibrium selection issues, unlike Nash) and applies readily to general-sum, many-player settings. We establish convergence guarantees in several classes of games, and identify links between Nash equilibria and α-Rank. We demonstrate the competitive performance of α-Rank-based PSRO against an exact Nash solver-based PSRO in 2-player Kuhn and Leduc Poker. We then go beyond the reach of prior PSRO applications by considering 3- to 5-player poker games, yielding instances where α-Rank achieves faster convergence than approximate Nash solvers, thus establishing it as a favorable general games solver. We also carry out an initial empirical validation in MuJoCo soccer, illustrating the feasibility of the proposed approach in another complex domain.

Human-level performance in 3D multiplayer games with population-based reinforcement learning
Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, Thore Graepel Science, arXiv 2019, DeepMind Blog, Two Minute Papers, Ars Technica, Forbes, New York Times, New Scientist, Altmetric
Recent progress in artificial intelligence through reinforcement learning (RL) has shown great success on increasingly complex single-agent environments and two-player turn-based games. However, the real world contains multiple agents, each learning and acting independently to cooperate and compete with other agents, and environments reflecting this degree of complexity remain an open challenge. In this work, we demonstrate for the first time that an agent can achieve human-level performance in a popular 3D multiplayer first-person video game, Quake III Arena Capture the Flag, using only pixels and game points as input. These results were achieved by a novel two-tier optimisation process in which a population of independent RL agents are trained concurrently from thousands of parallel matches, with agents playing in teams together and against each other on randomly generated environments. Each agent in the population learns its own internal reward signal to complement the sparse delayed reward from winning, and selects actions using a novel temporally hierarchical representation that enables the agent to reason at multiple timescales. During gameplay, these agents display human-like behaviours such as navigating, following, and defending, based on a rich learned representation that is shown to encode high-level game knowledge. In an extensive tournament-style evaluation, the trained agents exceeded the win-rate of strong human players both as teammates and opponents, and proved far stronger than existing state-of-the-art agents. These results demonstrate a significant jump in the capabilities of artificial agents, bringing us closer to the goal of human-level intelligence.

Assessing the scalability of biologicallymotivated deep learning algorithms and architectures
S Bartunov, A Santoro, B Richards, L Marris, GE Hinton, T Lillicrap NeurIPS 2018, arXiv 2018
The backpropagation of error algorithm (BP) is impossible to implement in a real brain. The recent success of deep networks in machine learning and AI, however, has inspired proposals for understanding how the brain might learn across multiple layers, and hence how it might approximate BP. As of yet, none of these proposals have been rigorously evaluated on tasks where BP-guided deep learning has proved critical, or in architectures more structured than simple fully-connected networks. Here we present results on scaling up biologically motivated models of deep learning on datasets which need deep networks with appropriate architectures to achieve good performance. We present results on the MNIST, CIFAR-10, and ImageNet datasets and explore variants of target propagation (TP) and feedback alignment (FA) algorithms, and explore performance in both fully- and locally-connected architectures. We also introduce weight-transport-free variants of difference target propagation (DTP) modified to remove backpropagation from the penultimate layer. Many of these algorithms perform well for MNIST, but for CIFAR and ImageNet we find that TP and FA variants perform significantly worse than BP, especially for networks composed of locally connected units, opening questions about whether new architectures and algorithms are required to scale these approaches. Our results and implementation details help establish baselines for biologically motivated deep learning schemes going forward.
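Feedback alignment, one of the algorithms evaluated above, replaces the transposed forward weights in the backward pass with fixed random feedback weights, avoiding the "weight transport" that makes exact BP biologically implausible. A minimal scalar sketch (illustrative only, far from the paper's deep architectures):

```python
# Feedback alignment (FA) on a two-layer linear chain y = w2 * w1 * x:
# the hidden layer's error is delivered through a fixed feedback weight b
# instead of the transposed forward weight w2. The loss still decreases.

def train(steps=150, lr=0.05, use_fa=True):
    w1, w2 = 0.5, 0.5      # forward weights
    b = 1.0                # fixed feedback weight (would be random; 1.0 here)
    x, target = 1.0, 2.0   # single training example, target function y = 2x
    losses = []
    for _ in range(steps):
        h = w1 * x                      # forward pass
        y = w2 * h
        e = y - target
        losses.append(0.5 * e * e)
        feedback = b if use_fa else w2  # FA vs. exact backprop
        grad_w2 = e * h                 # output layer uses its local signal
        grad_w1 = e * feedback * x      # error delivered via feedback weight
        w2 -= lr * grad_w2
        w1 -= lr * grad_w1
    return losses

losses = train()
print(losses[0], losses[-1])  # loss falls from 1.53125 towards ~0
```

Even though the hidden-layer gradient is "wrong" relative to BP, the forward weights adapt so that the fixed feedback becomes useful, which is the alignment effect the algorithm is named for.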