Facebook develops AI algorithm that learns to play poker on the fly

On Jul 29, 2020

Facebook researchers have developed a general AI framework called Recursive Belief-based Learning (ReBeL) that they say achieves better-than-human performance in heads-up, no-limit Texas hold'em poker while using less domain knowledge than any prior poker AI. They assert that ReBeL is a step toward developing universal techniques for multi-agent interactions — in other words, general algorithms that can be deployed in large-scale, multi-agent settings. Potential applications run the gamut from auctions, negotiations, and cybersecurity to self-driving cars and trucks.

Combining reinforcement learning with search, both during AI model training and at test time, has led to a number of advances. Reinforcement learning is where agents learn to achieve goals by maximizing rewards, while search is the process of navigating from a start to a goal state. For example, DeepMind's AlphaZero employed reinforcement learning and search to achieve state-of-the-art performance in the board games chess, shogi, and Go. But the combined approach suffers a performance penalty when applied to imperfect-information games like poker (or even rock-paper-scissors), because it makes a number of assumptions that don't hold in these scenarios. In such games, the value of any given action depends on the probability that it's chosen, and more generally, on the entire play strategy.
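To make that last point concrete, here is a minimal Python sketch using rock-paper-scissors. It is an illustration only, not code from the ReBeL paper, and the helper names (action_values, exploitability) are invented for this example. It shows that an action has no fixed value: what "rock" is worth depends on how often it's played, because an opponent can adapt to the overall strategy.

```python
# Payoff to the player choosing the row action: +1 win, 0 tie, -1 loss.
# Index order: 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],   # rock     vs rock, paper, scissors
    [1, 0, -1],   # paper    vs rock, paper, scissors
    [-1, 1, 0],   # scissors vs rock, paper, scissors
]


def action_values(opponent_strategy):
    """Expected payoff of each of our actions against a mixed strategy."""
    return [
        sum(prob * PAYOFF[ours][theirs]
            for theirs, prob in enumerate(opponent_strategy))
        for ours in range(3)
    ]


def exploitability(strategy):
    """Best expected payoff an adapting opponent can get against `strategy`.

    The game is symmetric, so the opponent's action values against us are
    computed with the same payoff table. Zero means the strategy cannot
    be exploited.
    """
    return max(action_values(strategy))


always_rock = [1.0, 0.0, 0.0]
uniform = [1 / 3, 1 / 3, 1 / 3]

print(exploitability(always_rock))  # opponent answers with paper every time
print(exploitability(uniform))      # no counter-strategy gains anything
```

A player who always picks rock loses every round to an opponent who notices, while the same action is perfectly safe as one third of a uniform mix. Perfect-information methods that assign each state or action a single value miss this dependence on the whole strategy.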

The Facebook researchers propose that ReBeL offers a fix. ReBeL builds on work in which the notion of “game state” is expanded to include the agents' belief about what state they might be in, based on common knowledge and the policies of other agents. ReBeL trains two AI models — a value network and a policy network — for the states through self-play reinforcement learning. It uses both models for search during self-play. The result is a simple, flexible algorithm the researchers claim is capable of defeating top human players at large-scale, two-player imperfect-information games.

At a high level, ReBeL operates on public belief states rather than world states (i.e., the state of a game). Public belief states (PBSs) generalize the notion of “state value” to imperfect-information games like poker; a PBS is a common-knowledge probability distribution over histories, where a history is a finite sequence of possible actions and states. (A probability distribution is a function that gives the probabilities of occurrence of different possible outcomes.) In perfect-information games, PBSs can be distilled down to histories, which in two-player zero-sum games effectively reduce to world states. A PBS in poker is the array of decisions a player could make and their outcomes given a particular hand, a pot, and chips.
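The idea of a belief that follows from common knowledge and a known policy can be sketched with Bayes' rule. The Python snippet below is a toy illustration, not ReBeL's implementation: the three hand categories, the prior, and the betting policy are all made-up numbers, and update_beliefs is a hypothetical helper. It shows how observing a public action reshapes the probability distribution over an opponent's hidden hand.

```python
def update_beliefs(prior, policy, observed_action):
    """Bayes' rule: P(hand | action) is proportional to P(hand) * P(action | hand).

    prior:  dict mapping each hidden hand to its probability
    policy: dict mapping each hidden hand to {action: probability}
    """
    unnormalized = {
        hand: prior[hand] * policy[hand].get(observed_action, 0.0)
        for hand in prior
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed action has zero probability under the policy")
    return {hand: weight / total for hand, weight in unnormalized.items()}


# Hypothetical numbers, chosen only for illustration.
prior = {"strong": 0.25, "medium": 0.50, "weak": 0.25}
policy = {
    "strong": {"bet": 0.9, "check": 0.1},
    "medium": {"bet": 0.3, "check": 0.7},
    "weak": {"bet": 0.5, "check": 0.5},  # bluffs half the time
}

posterior = update_beliefs(prior, policy, "bet")
for hand, prob in posterior.items():
    print(f"{hand}: {prob:.2f}")
```

Because both the action and the policy are assumed to be known to everyone, every player can compute the same posterior, which is what makes the belief "public." A full PBS would track such a distribution for each player over complete histories rather than three coarse categories.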

ReBeL generates a “subgame” at the start of each game that's identical to the original game, except it's rooted at an initial PBS. The algorithm solves it by running iterations of an “equilibrium-finding” algorithm and using the trained value network to approximate values on every iteration. Through reinforcement learning, the values are discovered and added as training examples for the value network, and the policies in the subgame are optionally added as examples for the policy network. The process then repeats, with the PBS becoming the new subgame root until accuracy reaches a certain threshold.
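The article doesn't say which equilibrium-finding algorithm ReBeL runs, so the following Python sketch substitutes regret matching, a simple iterative procedure whose average strategy approaches an equilibrium in two-player zero-sum games. It is a stand-in under stated assumptions: the game is rock-paper-scissors, payoffs are computed exactly rather than approximated by a value network, and the function names and starting regrets are invented for the example.

```python
# Payoff to the player choosing the row action (rock, paper, scissors).
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]


def regret_matching_strategy(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)


def solve(iterations=10000):
    """Self-play with regret matching; returns each player's average strategy."""
    # Arbitrary lopsided starting regrets so play does not begin at equilibrium.
    regrets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    strategy_sums = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        strategies = [regret_matching_strategy(r) for r in regrets]
        for player in (0, 1):
            opponent = strategies[1 - player]
            # Expected payoff of each action against the opponent's current mix.
            values = [
                sum(q * PAYOFF[a][b] for b, q in enumerate(opponent))
                for a in range(3)
            ]
            expected = sum(p * v for p, v in zip(strategies[player], values))
            for a in range(3):
                # Regret: how much better action a would have done.
                regrets[player][a] += values[a] - expected
                strategy_sums[player][a] += strategies[player][a]
    return [[s / iterations for s in sums] for sums in strategy_sums]


average_strategies = solve()
print(average_strategies)  # both averages should be near 1/3 per action
```

The per-iteration strategies keep cycling, but their average settles toward the uniform mix. In a game as large as poker, the exact payoff lookup in this loop is what a learned value network would have to approximate.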

In experiments, the researchers benchmarked ReBeL on games of heads-up no-limit Texas hold'em poker, Liar's Dice, and turn endgame hold'em, which is a variant of no-limit hold'em in which both players check or call for the first two of four betting rounds. The team used up to 128 PCs with eight graphics cards each to generate simulated game data, and they randomized the bet and stack sizes (from 5,000 to 25,000 chips) during training. ReBeL was trained on the full game and had $20,000 to bet against its opponent in endgame hold'em.

The researchers report that against Dong Kim, who's ranked as one of the best heads-up poker players in the world, ReBeL played faster than two seconds per hand across 7,500 hands and never needed more than five seconds for a decision. In aggregate, they said it scored 165 (with a standard deviation of 69) thousandths of a big blind (forced bet) per game against the humans it played. By comparison, Libratus, an earlier poker-playing system developed at Carnegie Mellon University, maxed out at 147 thousandths.

For fear of enabling cheating, the Facebook team decided against releasing the ReBeL codebase for poker. Instead, they open-sourced their implementation for Liar's Dice, which they say is also easier to understand and can be more easily adjusted. “We believe it makes the game more suitable as a domain for research,” they wrote in a preprint paper. “While AI algorithms already exist that can achieve superhuman performance in poker, these algorithms generally assume that participants have a certain number of chips or use certain bet sizes. Retraining the algorithms to account for arbitrary chip stacks or unanticipated bet sizes requires more computation than is feasible in real time. However, ReBeL can compute a policy for arbitrary stack sizes and arbitrary bet sizes in seconds.”