Only recently did I have the pleasure of reading the paper Mastering the Game of Go with Deep Neural Networks and Tree Search. As a result, I attempted to write up some notes explaining how AlphaGo works to a non-expert. If you are an expert reading this and you find mistakes in my explanation, please comment. At the time AlphaGo was released, the popular media hype was that the computer just “taught itself” to win. The reality was a bit more nuanced.
At the heart of AlphaGo is the MonteCarlo Search Tree (MCST). To decide on a move, AlphaGo searches this tree: it evaluates each potential move in a given position, simulates the complete game that could occur if it were to take that move, and decides whether it should take this move, given the outcome of the simulation (game won or lost). Considering the dimensions of the Go board (19x19) and all the possible moves, the size of the search space is enormous. AlphaGo uses several machine learning (ML) models to reduce the size of the search space.
The first key element of AlphaGo is a neural network trained using supervised learning (SL). As the baseline network, AlphaGo designers used a neural network akin to those employed for image classification. The input to the network is the 19x19 Go board. They train this network using the records of human expert moves. Roughly, this works as follows: the network selects a move in the given board configuration, compares it with the move that was done in the same configuration by the human, and updates the weights of the network using back propagation. The output of the SL-trained network is a set of probabilities: it attributes the probability of winning the final game to each potential move.
The MonteCarlo Search tree uses the SL-trained network to efficiently find the best move among the myriad of possibilities. The SL network tells it how likely we are to win the game given a particular move, but instead of evaluating all potential moves, we can now evaluate only the ones that have the highest probabilities of winning. That said, when AlphaGo used only the SL-trained network, it could not easily beat humans, because it relied only on human expert strategies. To make AlphaGo more powerful, the designers added another component.
In addition to looking at the winning probabilities of potential moves provided by the SL network, AlphaGo also simulates the outcome of the game after choosing this or that move. The space of all possible game rollouts could be visualized as a tree. Given the selected move, the game could go this or that way depending on the next move selected by the opponent. We already limit the width of the tree by constraining our next potential moves only to the ones considered promising by the SL network. But the depth of such a tree could still be pretty large. The trick that AlphaGo uses to trim the depth is to only consider rollouts with board configurations that are likely to lead to a win. To do that, it needs to estimate the value of a particular board configuration.
How does AlphaGo estimate the value of a board configuration? This is where the “teaching itself how to win” part comes in. AlphaGo uses another neural network, which the authors call the value network, to predict whether a particular player will win the game given a board configuration. Initially, the authors trained this network using board configurations and outcomes of actual recorded Go games, but this approach made the model “memorize” the outcome of a specific game and unable to generalize to new, unseen, board configurations. To generate better training data, the authors made AlphaGo play against itself dozens of million games. From each game they took just one board configuration and the outcome of the game, and used that as the training data for the value network.
To make those self-played games possible, AlphaGo designers began with the SL-trained network, which played against itself. After the game was over, they used the outcome of the game to update the weights of the SL network, using the approach called reinforcement learning (RL). After each update, the network became a bit better at predicting the outcome of the game. To generate the millions of data points for training the value networks, AlphaGo played against different iterations of the RL-policy, creating the illusion of playing against different experts. This self-play method, which relied on the RL-trained network to produce unbiased training data, resulted in a value network that accurately predicted the outcome of the game in a given board configuration.
Combining the recommendations of SL-trained network and the value network seems to be the secret sauce of AlphaGo. But why? The authors hypothesize that the SL network performed better than the value network “presumably because humans select a diverse beam of promising moves, whereas RL optimizes for the single best move”. On the other hand, the RL-trained value network ended up better predicting the outcome of the game given the board configuration — a task that is very difficult to do for a human.
The design of AlphaGo was more nuanced than just letting the machine to its own devices in unsupervised self-play. The gist of AlphaGo is using ML to reduce an enormous search space. This approach is becoming increasingly more common in other domains, such as searching a large configuration space of computer systems.