Real-Time Sequential Decision-Making by Autonomous Agents

Autonomous agents, such as self-driving cars and drones, need to make a sequence of decisions in real time to quickly adapt to changes in the environment or in the behavior of other agents, particularly to avoid collisions and recover to safe conditions. Such real-time decision-making becomes very challenging for today's AI when it must consider how other agents behave or when essential information is not directly observable. The AI research community has been advancing the field to address these challenges by defining specific competitions and competing in them. One example is the Pommerman competition, developed as an environment for advancing the field of real-time sequential decision-making.

On December 8, 2018, a Pommerman competition was held at the Thirty-second Conference on Neural Information Processing Systems (NeurIPS 2018) in Montreal, Canada, where teams from IBM Research-Tokyo won first and third place. A key to their success lies in the new technologies we developed for real-time sequential decision-making by autonomous agents.

In Pommerman competitions, a team of two agents competes against another team on an 11×11 grid board (see Figure 1). Each agent can observe only a limited area of the board (as indicated in the small panels in Figure 1), and the agents cannot communicate with each other. The goal of a team is to knock out all of the opponents. Toward this goal, the agents place bombs to destroy wooden walls and collect the power-up items that might appear from those wooden walls, while avoiding flames and attacking the opponents. For details, see the Pommerman GitHub repository.

Figure 1. A Pommerman board after 100 steps.

What makes Pommerman difficult is the constraint on real-time decision-making: an agent needs to choose an action within 100 milliseconds. This constraint significantly limits the applicability of Monte Carlo Tree Search, which has seen success in games such as Chess and Go (for example, a player of Go is typically given 30 seconds for each move even after the main time is depleted). In Pommerman, the branching factor (the number of possibilities in the next step) can be as large as 6⁴ = 1,296, because the four agents take actions simultaneously in each step, and there are six possible actions for each agent. This may be compared against Go, whose branching factor is at most 19² = 361. The agents should plan ahead and choose actions by taking into account the explosion of bombs, whose lifetime is 10 steps. Tree search with insufficient depth (less than 10) would ignore the explosion of bombs, which in turn would make the agents easily caught in flames. Tree search with sufficient depth (at least 10) is practically infeasible to perform in real time due to the large branching factor (the number of possibilities after 10 steps is over 10³¹).
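The scale of the problem can be verified with a quick back-of-the-envelope calculation (a minimal sketch; the variable names are only illustrative):

```python
# Back-of-the-envelope size of the Pommerman search space.
actions_per_agent = 6   # stop, up, down, left, right, place bomb
num_agents = 4

branching_factor = actions_per_agent ** num_agents
print(branching_factor)                  # 1296 joint actions per step

bomb_lifetime = 10                       # steps until a placed bomb explodes
leaves_at_depth_10 = branching_factor ** bomb_lifetime
print(f"{leaves_at_depth_10:.1e}")       # ~1.3e+31 possibilities after 10 steps
```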

In the new approach we developed, tree search is performed only to a limited depth, but the leaves of the search tree are evaluated on the basis of a deterministic and pessimistic scenario (see Figure 2). The new approach keeps the size of the search tree small, because the tree branches only up to the limited depth. At the same time, the new approach can take into account critical events that might occur far ahead in the future, because the leaves are evaluated with a deterministic scenario that can be much longer than what would be possible with branching.

Figure 2. New tree search with deterministic and pessimistic scenarios.
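The following is a minimal sketch of this idea, assuming hypothetical helpers `legal_actions`, `step`, `rollout_pessimistic`, and `evaluate` that stand in for the actual game model and evaluation function; it is not the implementation used in the competition agents.

```python
def search(state, depth, scenario_length):
    """Depth-limited tree search whose leaves are scored under a single
    deterministic, pessimistic scenario (illustrative sketch)."""
    if depth == 0 or state.is_terminal():
        # No branching beyond this point: unroll one pessimistic scenario,
        # which can look much further ahead than the search depth itself.
        leaf = rollout_pessimistic(state, scenario_length)
        return evaluate(leaf), None

    best_value, best_action = float("-inf"), None
    for action in legal_actions(state):
        # In this sketch only our own action branches the tree; the other
        # agents are assumed to follow the pessimistic scenario.
        value, _ = search(step(state, action), depth - 1, scenario_length)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action
```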

The new approach uses pessimistic scenarios because good actions are often the ones that perform well under pessimistic scenarios, particularly in cases where safety is a primary concern. A pessimistic scenario can be generated in a systematic manner when the state of the environment can be represented by the positions of objects. In Pommerman, these objects are agents, bombs, flames, power-up items, and walls. Some of these objects change their positions depending on the actions of the agents, which is what forces a tree search to branch. The new approach generates a pessimistic scenario by allowing such objects to be located at multiple positions at once, even if that is unrealistic or illegal.
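One way to picture this is to track, for each movable object, the set of cells it could possibly occupy, and to grow that set deterministically at every step of the scenario. The sketch below is only illustrative; the cell code and the passability rule are simplified assumptions, not the actual Pommerman encoding.

```python
BOARD_SIZE = 11
RIGID_WALL = 1  # assumed cell code for an indestructible wall
MOVES = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]  # stay or move one cell

def expand_positions(possible_cells, board):
    """Grow the set of cells a movable object might occupy after one step,
    allowing it to be 'everywhere it could reach' at the same time."""
    expanded = set()
    for row, col in possible_cells:
        for d_row, d_col in MOVES:
            r, c = row + d_row, col + d_col
            if 0 <= r < BOARD_SIZE and 0 <= c < BOARD_SIZE and board[r][c] != RIGID_WALL:
                expanded.add((r, c))
    return expanded
```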

The effectiveness of the new approach depends on the level of pessimism, and the overall performance of our agents can be maximized by tuning this level via self-play (i.e., letting the agents play against themselves with varying levels of pessimism). Figure 3 shows the effectiveness of pessimism in the new approach. Each panel shows the number of wins, ties, and losses of our agents against a baseline, where the level of pessimism in our agents is varied as indicated along the horizontal axis. The baseline in the left panel is a team of default agents (SimpleAgents). The baseline in the right panel is a team of agents that are identical to our agents except that the level of pessimism is set to zero. With the optimized level of pessimism, our team wins 99.7% of the matches against SimpleAgents (the remaining 0.3% are ties).

Figure 3. Effectiveness of pessimism in the new approach.
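The tuning itself can be as simple as a sweep over candidate pessimism levels, playing a batch of self-play matches at each level and keeping the one with the best record. In the sketch below, `play_match` is a hypothetical callback that runs one match between our agents (at the given pessimism level) and the chosen baseline.

```python
def tune_pessimism(levels, num_matches, play_match):
    """Return the pessimism level with the most wins in self-play.

    `play_match(level)` is a hypothetical callback that runs one match and
    returns 1 for a win, 0 for a tie, and -1 for a loss.
    """
    best_level, best_wins = None, -1
    for level in levels:
        wins = sum(play_match(level) == 1 for _ in range(num_matches))
        if wins > best_wins:
            best_level, best_wins = level, wins
    return best_level
```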

Further details of the new approach can be found in a recently published research report [1]. Docker images of our agents are also available as multiagentlearning/{hakozakijunctions, dypm.1, dypm.2}. Our approach advances the field of real-time sequential decision-making toward the realization of autonomous agents that can make critical decisions in real time. For example, the new approach could enable autonomous cars to drive like experienced human drivers, who anticipate multiple potential risks and can drive safely even in dangerous situations.
