How the Computer Beat the Go Master
God moves the player, he in turn the piece.
But what god beyond God begins the round
Of dust and time and sleep and agony?
—Jorge Luis Borges
As I write this column, a computer program called AlphaGo is beating the professional go player Lee Sedol at a highly publicized tournament in Seoul. Sedol is among the top three players in the world, having attained the highest rank of nine dan. The victory over one of humanity’s best representatives of this very old and traditional board game is a crushing 3 to 1, with one more game to come. With this defeat, computers have bettered people in the last of the classical board games, a game known for both depth and simplicity. An era is over and a new one is beginning. The methods underlying AlphaGo, and its recent victory, have startling implications for the future of machine intelligence.
Coming out of nowhere
The ascent of AlphaGo to the top of the go world has been stunning and quite distinct from the trajectory of machines playing chess. Over a period of 10 years a dedicated team of hardware and software engineers, eventually hired by IBM, built and programmed a special-purpose supercomputer, named Deep Blue, that did one thing and one thing only—play chess by evaluating 200 million board positions per second. In a widely expected development, the IBM team challenged then reigning world chess champion Garry Kasparov. In a six-game match played in 1996, Kasparov prevailed against Deep Blue by three wins, two draws and one loss but lost a year later in a historic rematch, 2½ to 3½.
Chess is a classic game of strategy, like tic-tac-toe (noughts and crosses), checkers (draughts), reversi (Othello), backgammon and go, in which players take turns placing or moving pieces. Unlike card games, in which each participant sees only his or her own hand (plus everyone’s discarded cards), both players have full access to all relevant information, and chance plays no role.
The rules of go are considerably simpler than those of chess. The Black and White sides each draw from a bowl of black or white stones, respectively, and take turns placing one on a 19 by 19 grid. Once placed, stones don’t move. The intent of the game, which originated in China more than 2,500 years ago, is to completely surround the opponent’s stones. Such encircled stones are considered captured and are removed from the board. Out of this sheer simplicity great beauty arises—complex battles between Black and White armies that span from the corners to the center of the board.
Strictly logical games, such as chess and go, can be characterized by how many possible positions can arise, bounding their complexity. Depending on the phase of the game, players must pick one out of a small number of possible moves, called the game’s breadth or branching factor b. If it’s White’s turn, she needs to pick one of b possible moves; Black can respond to each of these with b countermoves of his own. That is, after one turn, there are already b times b, or b², moves that White needs to consider in her strategizing. Assuming that a game of chess lasts on average d moves (called the game’s depth), the complete game tree from any one starting position—the list of all moves, countermoves, counter-countermoves and so on until one side or the other wins—contains about b times b times b…, d times in a row, or bᵈ end positions (so-called terminal nodes or leaves of the search tree). Given that a typical chess game has a branching factor of about 35 and lasts 80 moves, the number of possible moves is vast, about 35⁸⁰ (or 10¹²³), aka the “Shannon number” after the Bell Laboratories pioneer Claude Shannon, who not only invented information theory but also wrote the first paper on how to program a machine to play chess, back in 1950. Shannon’s number, at 10¹²³, is huge, in particular considering there are only about 10⁸⁰ atoms in the entire observable universe of galaxies, stars, planets, dogs, trees and people. But go’s complexity is bigger, much bigger. With its breadth of 250 possible moves each turn (go is played on a 19 by 19 board compared with the much smaller eight by eight chess field) and a typical game depth of 150 moves, there are about 250¹⁵⁰, or 10³⁶⁰, possible moves. This is a number beyond imagination and renders any thought of exhaustively evaluating all possible moves utterly and completely unrealistic.
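For readers who want to check those figures, a quick back-of-the-envelope calculation suffices; here is a minimal sketch in Python, using the breadth and depth numbers quoted above:

```python
# Rough size of a game tree with breadth b and depth d: b**d leaves.
# Express it as a power of 10 to compare with the numbers in the text.
from math import log10

def log10_tree_size(breadth: int, depth: int) -> float:
    """Exponent of 10 in breadth**depth."""
    return depth * log10(breadth)

print(f"chess: 35^80   ~ 10^{log10_tree_size(35, 80):.1f}")    # about 10^123.5, the Shannon figure
print(f"go:    250^150 ~ 10^{log10_tree_size(250, 150):.1f}")  # about 10^359.7
print("atoms in the observable universe: about 10^80")
```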
Given this virtually illimitable complexity, go is, much more than chess, about recognizing patterns that arise when clutches of stones surround empty spaces. Players perceive, consciously or not, relationships among groups of stones and talk about such seemingly fuzzy concepts as “light” and “heavy” shapes of stones, and aji, meaning latent possibilities. Such concepts, however, are much harder to capture algorithmically than the formal rules of the game. Accordingly, computer go programs struggled compared with their chess counterparts, and none had ever beaten a professional human under regular tournament conditions. Such an event was prognosticated to be at least a decade away.
And then AlphaGo burst into public consciousness via an article in one of the world’s most respected science magazines, Nature, on January 28 of this year. Its software was developed by a 20-person team under Demis Hassabis, an erstwhile chess child prodigy and neuroscientist turned AI pioneer, at his London-based company DeepMind Technologies, acquired in 2014 by Google. Most intriguingly, the Nature article revealed that AlphaGo had played against the winner of the European go championships, Fan Hui, in October 2015 and won 5 to 0 without handicapping the human player—an unheard-of event. [Scientific American is part of Springer Nature.]
Looking under the hood
What is noteworthy is that AlphaGo’s algorithms do not contain any genuinely novel insights or breakthroughs. The software combines good old-fashioned neural network algorithms and machine-learning techniques with superb software engineering, running on powerful but fairly standard hardware—48 central processing units (CPUs) augmented by eight graphics processing units (GPUs), chips developed to render 3-D graphics for the gaming community that are also exceedingly powerful for certain mathematical operations.
At the heart of the computations are neural networks, distant descendants of the neuronal circuits operating in biological brains. Artificial neurons, arranged in overlapping layers, process the input—the positions of stones on the 19 by 19 go board—and derive increasingly abstract representations of various aspects of the game using something called convolutional networks. This same technology has made possible recent breakout performances in automatic image recognition—automatically labeling, for example, all images posted to Facebook.
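As a rough illustration of how such a network is put together, here is a minimal sketch, assuming PyTorch; the input encoding, layer sizes and two-layer depth are invented for illustration, and the actual networks described in the Nature paper are far deeper.

```python
# A toy convolutional network that reads a 19 x 19 go position (encoded as a
# few feature planes) and produces one score per board point. Illustrative
# only; this is not DeepMind's architecture.
import torch
import torch.nn as nn

class TinyGoNet(nn.Module):
    def __init__(self, planes: int = 3):        # e.g., own stones, opponent stones, empty points
        super().__init__()
        self.conv1 = nn.Conv2d(planes, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.head = nn.Linear(32 * 19 * 19, 19 * 19)

    def forward(self, board):                   # board: (batch, planes, 19, 19)
        x = torch.relu(self.conv1(board))       # local 3 x 3 patterns of stones
        x = torch.relu(self.conv2(x))           # combinations of those patterns
        x = x.flatten(start_dim=1)
        return self.head(x)                     # softmax these scores to get move probabilities
```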
For any particular board position, two neural networks operate in tandem to optimize performance. A “value network” reduces the effective depth of the search by estimating how likely a given board position is to lead to a win without chasing down every node of the search tree, and a “policy network” reduces the breadth of the search, limiting the number of moves the network considers for a particular board position by learning to choose the best moves for that position. The policy network generates possible moves that the value network then judges on their likelihood of vanquishing the opponent.
It is ironic that the most powerful techniques for this fully deterministic game—in which every move is entirely determined by earlier moves—are probabilistic. They rest on the realization that, because the vast majority of the branches of the tree can’t feasibly be explored, it’s best to pick some of the most promising branches almost at random and to evaluate them all the way to the end—that is, to a board position in which one or the other player wins. The various nodes in the game tree can then be weighted toward those that are most likely to lead to a win. Done over and over again, such pseudo-random sampling, termed a Monte Carlo tree search, can lead to optimal behavior even if only a tiny fraction of the complete game tree is explored. Monte Carlo techniques—born in sin at the Los Alamos National Laboratory in the late 1940s to design the first nuclear weapons—are extensively used in physics. Monte Carlo tree search was successfully implemented in Crazy Stone, one of the earliest programs to play go at a decent amateur level.
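To make the pairing of search and networks concrete, here is a heavily simplified sketch of a Monte Carlo tree search guided by a policy network and a value network. The objects policy_net and value_net, and the game-state methods top_moves, play and win_probability, are hypothetical placeholders rather than DeepMind’s actual interfaces, and refinements such as flipping the sign of the value between Black’s and White’s turns are omitted.

```python
# Simplified Monte Carlo tree search steered by a policy and a value network.
import math

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}      # move -> Node
        self.visits = 0
        self.value_sum = 0.0

    def ucb_score(self, child, c=1.4):
        # trade off exploiting moves with good averages against exploring rarely tried ones
        exploit = child.value_sum / (child.visits + 1e-9)
        explore = c * math.sqrt(math.log(self.visits + 1) / (child.visits + 1e-9))
        return exploit + explore

def mcts(root_state, policy_net, value_net, n_simulations=1000):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1. selection: walk down the tree along the most promising branches
        while node.children:
            parent = node
            _, node = max(parent.children.items(),
                          key=lambda item: parent.ucb_score(item[1]))
        # 2. expansion: the policy network narrows the breadth by proposing
        #    only a handful of plausible moves from this position
        for move in policy_net.top_moves(node.state):
            node.children[move] = Node(node.state.play(move), parent=node)
        # 3. evaluation: the value network cuts the depth by estimating the
        #    chance of winning without playing the game out to the end
        value = value_net.win_probability(node.state)
        # 4. backup: propagate that estimate up to the root
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # the move tried most often at the root is the one to play
    return max(root.children, key=lambda m: root.children[m].visits)
```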
Yet a Monte Carlo tree search by itself wasn’t good enough for these programs to compete at the world-class level. That required giving AlphaGo the ability to learn, initially by exposing it to previously played games of professional go players, and subsequently by enabling the program to play millions of games against itself, continuously improving its performance in the process.
In the first stage, a 13-layer policy neural network started as a blank slate—with no prior exposure to go. It was then trained on 30 million board positions from 160,000 real-life games taken from a go database. That number represents far more games than any professional player would encounter in a lifetime. Each board position was paired with the actual move chosen by the player (which is why this technique is called supervised learning), and the connections within the network were adjusted via standard deep machine-learning techniques to make the network more likely to pick the better move the next time. The network was then tested by giving it a board position from a game it had never seen before. It accurately, although far from perfectly, predicted the move the professional player had picked.
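In code, that supervised step amounts to the standard recipe sketched below: show the network a recorded position, compare its predicted move with the professional’s actual choice and nudge the weights to reduce the disagreement. It reuses the illustrative TinyGoNet from the earlier sketch, and the dataset of positions paired with expert moves is assumed to exist.

```python
# One supervised-learning update on a batch of (position, expert move) pairs.
import torch
import torch.nn as nn

policy_net = TinyGoNet()                                       # from the earlier sketch
optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()                                # penalizes disagreeing with the expert

def supervised_step(boards, expert_moves):
    # boards: (batch, 3, 19, 19) tensors; expert_moves: (batch,) indices into the 361 points
    logits = policy_net(boards)
    loss = loss_fn(logits, expert_moves)
    optimizer.zero_grad()
    loss.backward()          # the standard deep-learning weight adjustment
    optimizer.step()
    return loss.item()
```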
In a second stage, the policy network trained itself using reinforcement learning. This technique is a lasting legacy of behaviorism—a school of thought dominant in psychology and biology in the first half of the last century. It professes the idea that organisms—from worms, flies and sea slugs to rats and people—learn by relating a particular action to the specific stimuli that preceded it. Repeated over and over again, this builds up an association between stimulus and response. The learning can occur completely unconsciously, by rote.
Consider training your dog to roll over and “play dead” on command. You do so by breaking this complex behavior into smaller actions—lie on the ground, turn over, extend paws into the air. As soon as an action occurs, either spontaneously or because you show your dog how and she tries to emulate you, she is rewarded (or “reinforced,” in the lingo) with praise or a morsel of food. Done enough times, the dog will eventually act dead on command.
Reinforcement learning was implemented years ago in neural networks to mimic animal behavior and to train robots. DeepMind demonstrated this last year with a vengeance when networks were taught how to play 49 different Atari 2600 video games, including Video Pinball, Star Gunner, Robotank, Road Runner, Pong, Space Invaders, Ms Pac-Man, Alien and Montezuma’s Revenge. (In a sign of things to come, atari is a Japanese go term, signifying the imminent capture of one or more stones.)
Each time it played, the DeepMind network “saw” the same video game screen that any human player would see, including the current score. The network’s output was a command to the joystick to move the cursor on the screen. Following the diktat of the programmer to maximize the game score, the algorithm did just that, figuring out the rules of the game over thousands and thousands of trials. It learned to move, to hit alien ships and to avoid being destroyed by them. And for some games, it achieved superhuman performance. The same powerful reinforcement learning algorithm was deployed by AlphaGo for go, starting from the configuration of the policy network after the supervised learning step.
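Applied to go, the reinforcement signal is simply whether the self-played game was won or lost. The sketch below shows one basic way such an update could look: a plain policy-gradient rule, not necessarily the exact procedure DeepMind used. The function play_one_self_play_game and the data it returns are hypothetical placeholders.

```python
# One reinforcement-learning update: make moves from won games more likely
# and moves from lost games less likely.
import torch

def reinforcement_step(policy_net, optimizer, play_one_self_play_game):
    # moves_made: list of (board tensor, index of move played); outcome: +1 win, -1 loss
    moves_made, outcome = play_one_self_play_game(policy_net)
    log_probs = []
    for board, move in moves_made:
        logits = policy_net(board.unsqueeze(0))                      # scores for all 361 points
        log_probs.append(torch.log_softmax(logits, dim=1)[0, move])
    loss = -outcome * torch.stack(log_probs).sum()                   # a win reinforces, a loss discourages
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```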
In a third and final stage of training, the value network, which estimates how likely a given board position is to lead to a win, was trained on 30 million positions that the policy network generated by playing against itself. It is this feature of self-play, impossible for humans to replicate (for it would require the player’s mind to split itself into two), that enables the algorithm to relentlessly improve.
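Under the same illustrative assumptions as the earlier sketches, that third stage can be pictured as plain regression: a small value network is fit to predict, from a single self-play position, whether that game eventually ended in a win (+1) or a loss (-1).

```python
# Fitting a toy value network to self-play outcomes. Illustrative only.
import torch
import torch.nn as nn

class TinyValueNet(nn.Module):
    def __init__(self, planes: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(planes, 16, kernel_size=3, padding=1)
        self.head = nn.Linear(16 * 19 * 19, 1)

    def forward(self, board):                                  # board: (batch, planes, 19, 19)
        x = torch.relu(self.conv(board)).flatten(start_dim=1)
        return torch.tanh(self.head(x))                        # a single estimate between -1 and +1

value_net = TinyValueNet()
optimizer = torch.optim.SGD(value_net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

def value_step(boards, outcomes):
    # boards: positions sampled from self-play games; outcomes: +1 if that game was won, -1 if lost
    loss = loss_fn(value_net(boards).squeeze(1), outcomes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```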
A peculiarity of AlphaGo is that it will pick the strategy that maximizes its probability of winning, regardless of the margin. For example, AlphaGo would rather win by two stones with 90 percent probability than by 50 stones with 85 percent probability. Few people would give up a slightly riskier chance to crush their opponent in favor of eking out a narrow but surer victory.
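A toy illustration of that preference: among candidate moves, the criterion picks whichever has the highest estimated chance of winning and ignores the expected margin entirely. The numbers below are invented.

```python
# Choose the move by win probability alone; margin of victory does not matter.
candidates = [
    {"move": "A", "win_probability": 0.90, "expected_margin": 2},
    {"move": "B", "win_probability": 0.85, "expected_margin": 50},
]
best = max(candidates, key=lambda c: c["win_probability"])
print(best["move"])   # "A": the safer two-stone win beats the flashier 50-stone one
```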
The end result is a program that performed better than any competitor and beat the go master Fan Hui. Hui, however, is not among the top 300 players in the world—and among the upper echelons of players, differences in ability are so pronounced that even a lifetime of training would not enable Hui to beat somebody like Lee Sedol. Thus, based on the five publicly available games between AlphaGo and Hui, Sedol confidently predicted that he would dominate AlphaGo, winning five games to nothing or, perhaps on a bad day, four games to one. What he did not reckon with was that the program he faced in Seoul was a vastly improved version of the one Hui had encountered six months earlier, optimized by relentless self-play.
An interesting difference between Deep Blue and AlphaGo is that the evaluation engine of the former, which assigned a positive (good) or a negative (bad) value to any one chessboard position, was explicitly programmed. That approach enabled Deep Blue’s programmers to add explicit rules, such as “if this position occurs, do that,” to its repertoire of tactics. That is not possible for Deep Blue’s neural network descendant, AlphaGo, in which all knowledge is encoded implicitly in the “weights” of the network.
What next?
Deep Blue represented a triumph of machine brawn over a single human brain. Its success was almost completely predicated on very fast processors, built for this purpose. Although its victory over Kasparov was a historic event, the triumph did not lead to any practical application or to any spin-off. Indeed, IBM retired the machine soon thereafter.
The same situation is not likely to occur for AlphaGo. The program runs on off-the-shelf processors. Giving it access to more computational power (by distributing it over a network of 1,200 CPUs and GPUs) only improved its performance marginally. The feature that makes the difference is AlphaGo’s ability to split itself into two, playing against itself and continuously improving its overall performance. At this point it is not clear whether there is any limitation to the improvement AlphaGo is capable of. (If only the same could be said of our old-fashioned brains.) It may be that this constitutes the beating heart of any intelligent system, the Holy Grail that researchers are pursuing—general artificial intelligence, rivaling human intelligence in its power and flexibility.
It is likely that Hassabis’s DeepMind team is contemplating designing more powerful programs, such as versions that can teach themselves go from scratch, without having to rely on the corpus of human games as examples, versions that learn chess, programs that simultaneously play checkers, chess and go at the world-class level or ones that can tackle no-limit Texas hold ‘em poker or similar games of chance.
In a very commendable move, Hassabis and his colleagues described in exhaustive detail the algorithms and parameter settings the DeepMind team used to generate AlphaGo in the accompanying Nature publication. This will further accelerate the frenetic pace of AI research in academic and industrial laboratories worldwide. Reinforcement algorithms based on trial-and-error learning can be applied, given sufficient data, to myriad problems, be they in financial markets, medical diagnostics, robotics, warfare and so on. A new era has begun, with unknown but potentially monumental medium- and long-term consequences for employment patterns, large-scale surveillance and growing political and economic inequity.
What of the effects of AlphaGo on go itself? Despite doomsayers’ predictions to the contrary, the rise of ubiquitous chess programs revitalized chess, helping to train a generation of ever more powerful players. The same may well happen to the go community. After all, the fact that any car or motorcycle can outpace any runner did not eliminate running for fun; more people run marathons than ever. Indeed, it could be argued that with the need to continually prove oneself the best removed, people may better enjoy this supremely aesthetic and intellectual game, in its austere splendor, for its own sake. In ancient China, the game of go was one of the four arts any cultivated scholar and gentleman was expected to master.
Just as a meaningful life must be lived and justified for its own intrinsic reasons, so should go be played for its intrinsic value—for the joy it gives. As the philosopher Mark Rowlands puts it, this joy can assume many forms:
“There is the joy of focus, the experience of being completely immersed in what one is doing. There is the joy of dedication, the experience of being dedicated to the deed and not the outcome, the activity and not the goal. There is the joy of enduring, the experience of playing the game as hard as you can play it, of giving everything you have to the game and leaving nothing in the tank, no matter the experiential toll this exacts. This is the joy of defiance, wild and fierce: No, you will not break me, not here, not today.”
In his battle against a new and superior go force, Lee Sedol, representative for all of us, has shown this joy. This article is dedicated to him.