CS572 Final Project -- Reinforcement Learning with Mancala

Shane Oldenburger and Chris Strasburg

Abstract

In this project we explore the performance of reinforcement learning agents learning to play the game Mancala.  We implement a table-based Q-Learning agent and an Artificial Neural Network based Q-Learning agent.  We train the agents for a fixed number of trials and then play them against each other for a number of games to measure their relative fitness.  We also experiment with post-training move selection to improve their playability, i.e., we try keeping track of an opponent's past performance and making moves that are challenging at that skill level.

Section 1: Introduction

Mancala, as a game of interest to the authors, provides a relatively large state space with at most 48^14 possible states.  This provides some challenges for Q-Learning based reinforcement learners because there is not always enough memory available to store the entire mapping of state/action pairs to values.  We explored using a table-based Q-Learner with the state-action pair mappings stored in a database and an Artificial Neural Network based Q-Learner which implicitly stores this mapping in its network weights.  In Section 2 we give a brief introduction to the history and rules of Mancala.  In Section 3 we present the basic algorithms used by our two approaches.  In Section 4 we discuss the results in terms of relative time to train and relative 'fitness' of the two trained agents.

Section 2: Mancala History and Basic Rules

Mancala is possibly the oldest game still played today.  There is evidence that it was played in Egypt around 1400BC and that it may have been derived from the accounting boards in use at that time.  It gained popularity quickly, especially in Africa, and is today often referred to as an African game though it is played throughout the world.  There are many variations of Mancala, most of which have their own name.[1,2]

The version that we used in this project is probably the most commonly played in the Western world, with 2 goal buckets and 6 player buckets per side.  Each player starts out with 4 stones in each of their player buckets.

[Figure: Initial Board Configuration]

On each turn, a player selects one of their buckets, picks the stones up out of it, and distributes them counter-clockwise around the board, placing one in each successive bucket.  If the last stone is placed in a player's own goal bucket (the right-hand goal bucket) then they get to go again.  If the last stone is placed in a player's own empty player bucket then they put that stone into their goal bucket along with any stones in the player bucket directly across from that bucket.  The game ends when one side runs out of stones in their player buckets.  The opponent then adds their remaining stones to their goal bucket.  The player with the most stones in their goal bucket is the winner.
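The move rules above can be sketched in code.  This is a minimal sketch, not the project's implementation: the 14-element list layout, the names new_board and apply_move, and the choice to skip the opponent's goal bucket while sowing (as in the common Kalah rules; the text does not say) are all assumptions.

```python
# Assumed layout: indices 0-5 are player 0's buckets, 6 is player 0's goal,
# 7-12 are player 1's buckets, 13 is player 1's goal.  Increasing index
# (mod 14) is the counter-clockwise direction.
GOAL = {0: 6, 1: 13}
SIDE = {0: range(0, 6), 1: range(7, 13)}


def new_board():
    """Initial position: 4 stones in every player bucket, empty goals."""
    board = [4] * 14
    board[GOAL[0]] = board[GOAL[1]] = 0
    return board


def apply_move(board, player, bucket):
    """Play `bucket` (0-5, counted from the player's left) for `player`.

    Returns (new_board, next_player, game_over).
    """
    board = list(board)
    own_goal, other_goal = GOAL[player], GOAL[1 - player]
    first = SIDE[player][0]
    pos = first + bucket
    stones, board[pos] = board[pos], 0
    if stones == 0:
        raise ValueError("cannot play an empty bucket")

    # Sow one stone per bucket, skipping the opponent's goal (assumption).
    while stones:
        pos = (pos + 1) % 14
        if pos == other_goal:
            continue
        board[pos] += 1
        stones -= 1

    next_player = 1 - player
    if pos == own_goal:
        next_player = player  # last stone in own goal: go again
    elif pos in SIDE[player] and board[pos] == 1:
        # Last stone landed in an empty own bucket: capture it together
        # with whatever sits directly across the board.
        opposite = 12 - pos
        board[own_goal] += board[pos] + board[opposite]
        board[pos] = board[opposite] = 0

    # The game ends when either side has no stones left in its buckets;
    # remaining stones go to the goal of the side they sit on.
    game_over = any(sum(board[i] for i in SIDE[p]) == 0 for p in (0, 1))
    if game_over:
        for p in (0, 1):
            for i in SIDE[p]:
                board[GOAL[p]] += board[i]
                board[i] = 0
    return board, next_player, game_over
```

For example, from the initial position, player 0 playing their third bucket drops the last stone in their own goal and therefore moves again.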

Section 3: Algorithm Presentation and Analysis

Mancala is a fully observable, stochastic, episodic, static, discrete, and competitive multi-agent environment.  It also has a very large state space, encompassing at most 48^14 = 3.45x10^23 possible states.  In reality many of these states are unreachable, but it is still a much larger space than can be easily stored in memory or iterated through on the fly.

Table Based Q-Learner:

Because of the large state space we implemented our table-based Q-Learner with a MySQL backend.  This made table access very slow compared to an in-memory hash table: training on 100,000 games took approximately 50 hours.  This is a huge hindrance to the table-based Q-Learner because, while the opening moves are tested very frequently and their estimates tend to approximate their true Q values more closely than those of states later in the game, the agent will often find itself in states it has never seen before.  Assume an average branching factor of 3 and an average game length of 14 (as measured during 1000 trials between random players).  The opening states tend to have a branching factor of 6, while end-game states have a lower branching factor, so 3 is really a low estimate of the average; the true value is probably closer to 4 or 5.  Under these assumptions we have the following number of states at each level of the game:

Columns:
  n       - Player move number
  States  - Number of states at n (possible player moves * possible opponent moves) == 9^(n)
  Visits  - Average number of times each state is seen in 100,000 training games
  Visited - Proportion of states visited (approximate)

  n     States            Visits           Visited
  0     1                 100,000          1
  1     9                 11,111.1         1
  2     81                1234.56          1
  3     729               137.174          1
  4     6561              15.2416          1
  5     59049             1.69351          1
  6     531441            0.188168         1/5
  7     4782969           0.020908         1/50
  8     43046721          0.002323         1/450
  9     387420489         0.000258         1/4000
  10    3486784401        0.000029         1/35000
  11    31381059609       0.000003         1/333333
  12    282429536481      < 0.0000005      < 1/2000000
  13    2541865828329     < 0.00000005     < 1/20000000
  14    22876792454961    < 0.000000005    < 1/200000000

So with 100,000 trials we can only expect the agent to even have a chance at playing well for the first 6 moves it makes.  By the end of the game, the agent is just blindly guessing at which moves to make.  To have an agent with any chance of having seen all the states we would need to run about 23 trillion trials (9^14 is roughly 2.3x10^13).  Since 100,000 trials with the database-backed agent take around 50 hours, playing enough games to have seen all states would take 23,000,000,000,000 * (50 / 100,000) = 11,500,000,000 hours, or a little over 1.3 million years.
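The arithmetic behind these estimates can be reproduced with a few lines of code.  This is only a sketch of the calculation under the same assumptions as the text (branching factor 3 for each player, 14 moves, 50 hours per 100,000 games); the function names are ours.

```python
GAMES = 100_000         # training games, from the text
HOURS_FOR_GAMES = 50    # measured training time for those games
BRANCHING = 3           # assumed average branching factor per player
MAX_MOVES = 14          # average game length


def states_at(n):
    """States at move n: (player moves * opponent moves) ** n == 9 ** n."""
    return (BRANCHING * BRANCHING) ** n


def average_visits(n):
    """Average number of times each state at move n is seen in training."""
    return GAMES / states_at(n)


def years_to_cover(n=MAX_MOVES):
    """Years of training needed to play one game per state at move n."""
    hours = states_at(n) * HOURS_FOR_GAMES / GAMES
    return hours / (24 * 365)


for n in range(MAX_MOVES + 1):
    print(f"{n:2d} {states_at(n):>18,d} {average_visits(n):.6g}")
print(f"{years_to_cover():,.0f} years to see every state at move {MAX_MOVES}")
```

Raising BRANCHING to 4 or 5, as the text suggests may be more realistic, makes the coverage problem considerably worse.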

For the table-based Q Learning agent we used the following algorithm[3]:

Q - The memory of the state,action -> value mapping.  This was a database table with fields state, action, and value.
Nsa - The number of times action a has been tried in state s.  Also stored as a MySQL table with fields Action, State, and Trials.
s, a, r - The previous state, action, and reward (initially all null)
s', r' - The current state and reward, as observed from MancalaBoard


Q-Learning-Agent (MancalaBoard)
    if (s != null){
       Nsa[s,a]++;
       Q[a,s] <- Q[a,s] + (1/log(Nsa[s,a])) * (r + 1*Max a' (Q[a',s']) - Q[a,s])
    }
    if (s' is an end state){
       s <- null
       a <- null
       r <- null
    } else {
       s <- s'
       a <- f(Q[a',s'], Nsa[s',a'])
       r <- r'
    }
    return a

    f returns an action probabilistically with the probability of each action being:
       P(a) = W(a) / Sum over a' (W(a')), where W(a) = 1/log6(Nsa[s,a] + 1) + Q[a,s]

One difference we implemented for efficiency's sake is to keep track of the entire game and to update that whole game sequence when we receive the reward.  This way the agent only has to play one game to generate values for the entire game path instead of having to play n games where n is the path length.  The learning rate is inversely proportional to the number of times the state has been visited.  This compensates for the fact that the state transitions are non-deterministic, allowing the Q function to converge.
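The whole-game update can be sketched as a single backward pass over the recorded path.  This is a minimal in-memory sketch rather than the database-backed code: the dictionaries Q and N, the function update_game, and the legal_actions callback are hypothetical names, unseen pairs are assumed to have value 0, and the step size 1/N is an assumption chosen so that the first visit is well defined.

```python
Q = {}   # (state, action) -> estimated value; unseen pairs count as 0
N = {}   # (state, action) -> number of times the pair has been updated


def update_game(path, final_reward, legal_actions, gamma=1.0):
    """Update every (state, action) pair on one game path.

    path          -- list of (state, action) pairs in the order played
    final_reward  -- reward received at the end of the game
    legal_actions -- function mapping a state to its available actions
    """
    next_state = None
    # Walk backwards so the end-of-game reward reaches the whole path
    # in a single pass.
    for state, action in reversed(path):
        key = (state, action)
        N[key] = N.get(key, 0) + 1
        alpha = 1.0 / N[key]          # assumed step size (see lead-in)
        if next_state is None:
            target = final_reward     # last move: only the reward counts
        else:
            target = gamma * max(
                Q.get((next_state, a), 0.0)
                for a in legal_actions(next_state)
            )
        old = Q.get(key, 0.0)
        Q[key] = old + alpha * (target - old)
        next_state = state
```

States need to be hashable here, for example a tuple of bucket counts.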

We chose f to be decreasing in the number of times the action has been tried in this state, and increasing in the Q value of the state-action pair.  We tried several f functions for action selection: random, weighted random, and the function given above.  Random state generation will guarantee that all states are tried an equal number of times; in our environment, we also require that the opponent choose each action with some positive probability.
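A selection function with these two properties might look like the sketch below.  It follows the shape of f (an exploration bonus that shrinks with the visit count, plus the Q value, normalised over the actions), but two details are assumptions and not taken from the report: the count is offset by 6 rather than 1 so that an untried action gets a finite bonus of exactly 1, and each weight is floored at a small positive value so that negative Q values cannot yield negative probabilities.  The function names are ours.

```python
import math
import random


def action_probabilities(q_values, counts, floor=1e-6):
    """Return one selection probability per action.

    q_values -- Q estimate for each available action
    counts   -- number of times each action was tried in this state
    """
    weights = []
    for q, n in zip(q_values, counts):
        bonus = 1.0 / math.log(n + 6, 6)      # assumed offset, equals 1 at n == 0
        weights.append(max(bonus + q, floor))  # assumed floor keeps weights positive
    total = sum(weights)
    return [w / total for w in weights]


def choose_action(q_values, counts, rng=random):
    """Sample an action index according to action_probabilities."""
    probs = action_probabilities(q_values, counts)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With equal Q values, an action tried many times is selected less often than one that has never been tried; with equal counts, the action with the higher Q value is favoured.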

Each move requires at most 14 database lookups: one for the Q value of the past state/action, one each for all possible moves from the current state, one for the number of times we've visited the past state, and one for the number of times we've executed each possible action in this state.  Using the random move selection strategy instead of the f given above reduces the number of database calls required per move by at most 6, but in practice did not appreciably reduce the running time.

The table-based exploitative player chooses the move with the highest Q value.  If the Q value is unknown, it falls back on what we call the simple strategy: always pick the bucket closest to its own goal.  This may be cheating a bit as it is an application of our own domain knowledge; this strategy tends to conserve stones on the player's side of the board, thus encouraging their opponent to run out of stones.  Any stones left on the player's side of the board are then credited to that player for purposes of determining the winner.
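The exploitative player with its fallback can be sketched as follows.  The names are hypothetical, buckets are assumed to be numbered 0-5 from the player's left so that the highest number is closest to the right-hand goal, and the fallback is assumed to trigger only when none of the legal moves has a stored Q value.

```python
def simple_move(legal_buckets):
    """Simple strategy: the playable bucket closest to the player's own goal."""
    return max(legal_buckets)


def exploit_move(q_table, state, legal_buckets):
    """Pick the legal move with the highest stored Q value.

    q_table maps (state, bucket) to a value.  If no legal move has a stored
    value (assumed trigger for the fallback), use the simple strategy.
    """
    known = {
        bucket: q_table[(state, bucket)]
        for bucket in legal_buckets
        if (state, bucket) in q_table
    }
    if not known:
        return simple_move(legal_buckets)
    return max(known, key=known.get)
```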

ANN Based Q-Learner:

For the artificial neural network based agent we used the same Q-Learning algorithm as the table-based learner, but instead of storing the Q values in a table they are stored implicitly in the network structure and weights.  We chose a network topology with N input nodes, N*2 hidden nodes, and 1 output node, where N is equal to the total number of buckets on the game board plus one node for the proposed action.  Each input node takes the number of stones in its corresponding bucket as its input, except for the action input, which takes the bucket number of the bucket to play.  The output node gives the estimated Q value for the state and action given as input to the network.

[Figure: ANN Topology]

The weight updates were performed by the following equations and pseudocode:
Let y = the current output of the NN, i.e. the current Q value
Let d = the updated Q value
Let n = the learning rate
Let x_l = the value of the lth input
Let z_m = the output of the mth neuron in the hidden layer
Let w_m = the weight from the mth hidden neuron to the output
Let w_kl = the weight from the lth input neuron to the kth hidden neuron

Then the update equations we used are:
w_m <- w_m + n(d - y)z_m
w_kl <- w_kl + n(d - y)w_k(1 - z_k)z_k * x_l
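A small network using these two update rules can be sketched in plain Python.  The sketch assumes sigmoid hidden units (suggested by the z(1 - z) factor), a linear output unit, no bias terms, and small random initial weights; none of these details are stated in the text, and the class name QNetwork is ours.

```python
import math
import random


class QNetwork:
    """N inputs, 2N sigmoid hidden units, one linear output (assumed)."""

    def __init__(self, n_inputs=15, seed=0):
        rng = random.Random(seed)
        n_hidden = 2 * n_inputs
        # w_hidden[k][l] is the weight from input l to hidden unit k.
        self.w_hidden = [
            [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
            for _ in range(n_hidden)
        ]
        # w_out[m] is the weight from hidden unit m to the output.
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]

    def forward(self, x):
        """Return (y, z): the Q estimate and the hidden-layer outputs."""
        z = [
            1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, x))))
            for row in self.w_hidden
        ]
        y = sum(w * zk for w, zk in zip(self.w_out, z))
        return y, z

    def train(self, x, d, n=0.1):
        """One update toward target d with learning rate n; returns old y."""
        y, z = self.forward(x)
        err = d - y
        for k, row in enumerate(self.w_hidden):
            # Hidden-layer rule, using the output weight before it changes.
            delta_k = err * self.w_out[k] * (1.0 - z[k]) * z[k]
            for l, xl in enumerate(x):
                row[l] += n * delta_k * xl
            # Output-layer rule.
            self.w_out[k] += n * err * z[k]
        return y
```

With 14 buckets plus one action input, n_inputs is 15 and the hidden layer has 30 units.  Repeated calls to train on one example should drive the output toward the target d.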

Section 4: Results


The following table shows the number of wins for each agent versus other types of agents (the wins recorded are for player 1).  Each Q-Learning agent was trained for 100,000 games and the wins are out of 10,000 games:

Player 1 \ Player 2         Random    Simple          Table-based Exploitative    ANN-Exploitative
Random                      X         1945            1630                        1473
Simple                      7570      X               0                           0 (All ties)
Table-based Exploitative    8160      10000           X                           10000
ANN-Exploitative            8236      0 (All ties)    0                           X

Table-based learner analysis:
One interesting thing to note is that games between the simple player and the random player average 7 turns, whereas games between the table-based exploitative player and the random player average 12.5 turns.  The table-based player wins only slightly more often than the simple player (8160 versus 7570 wins out of 10,000), indicating that while it gets a minor edge by using its Q values early on, the bulk of its fitness is probably due to the move selection heuristic.  We would expect more training to improve the player's later-game performance above that of the simple player.

When the Q-learner plays against the simple player, it wins 100% of the time.  Since in an unknown situation the exploitative player will use the simple algorithm itself, this implies that the exploitative player is left in a more advantageous position when the policy of its opponent is also to use that same heuristic.  However, since there is no randomized move selection in either the simple player or the exploitative players, the result of a trial between the two will always be the same (either all wins, all losses, or all ties), so the number is not as impressive as it appears at first glance.

Playing as a human against the exploitative player was also interesting.  While not very strong in terms of game play (even Chris was able to beat it most of the time), it actually does make some intelligent initial moves.  For example it will almost always choose to make the first move the one which ends in its own bucket, thus scoring an additional turn immediately.  The discussion in Section 3 gives the explanation for the good initial performance but poor later-game performance.

ANN-based learner analysis:
Our ANN-based learner didn't fare as well as our table-based Q learner; in fact, it seems to get worse as it learns.  This is largely due to the fact that our table based learner could take advantage of the simple strategy in unknown states, but the ANN based learner would ALWAYS take a shot at approximating the Q function once it has seen even one example.  Since our network topology selection was effectively arbitrary it could either be unable to adequately approximate the optimal (or even a good) move selection function, or it could converge very slowly because we have too many hidden nodes.  We plan to experiment with the hidden layer size in the future.

Section 5: Conclusions, Discussion, and Future Work

We had planned to analyze an adaptive player which would use the estimated Q function to estimate its opponent's ability and try to match it; however, the Q estimate was never good enough for even the exploitative player to show a significant improvement over the random and simple players.  We still plan to explore this player type after we have obtained a better Q function approximation.  The major problem with both of our players seems to be the lack of training compared to the overall size of the state space.  One thought we had was that, since the table-based Q algorithm sees the first 4-5 plays very often compared to the rest of the game tree, we could build an agent which uses the table for the first few moves, but uses the ANN when it comes upon a state it hasn't seen.  The ANN will give a Q value for every state regardless of whether it has been seen or not.  This approach would only require us to keep a small subset of the table we built in this project for the table-based Q learner, so the table could be stored in memory, increasing the speed of play and making the overall agent more portable.

We would like to alter the examples sent to the neural network to use a binary representation.  Currently, the state portion of each example is sent to the network as fractions of the total number of marbles, and the action as the bucket number selected divided by the total number of actions.  This strategy conserves the relation between the magnitudes of these values, but, particularly for the actions, it seems to assume too much correlation.  After translating the actions into binary (most likely as one bit for each action), the estimated Q values would be less affected by unrelated updates.
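The two encodings can be compared side by side.  This is an illustrative sketch only: the function names are ours, the board is assumed to be a list of 14 bucket counts, and actions are assumed to be numbered 0-5.

```python
TOTAL_STONES = 48   # 12 player buckets * 4 stones
NUM_ACTIONS = 6     # one action per player bucket


def encode_scalar(board, action):
    """Current scheme: stone fractions plus the action as a single fraction."""
    return [stones / TOTAL_STONES for stones in board] + [action / NUM_ACTIONS]


def encode_one_hot(board, action):
    """Proposed scheme: stone fractions plus one bit per action."""
    bits = [0.0] * NUM_ACTIONS
    bits[action] = 1.0
    return [stones / TOTAL_STONES for stones in board] + bits
```

With one bit per action the input layer would grow from 15 to 20 nodes, and each action would get its own set of input weights instead of sharing a single scaled input.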

We would also like to explore ways to have our ANN agent learn the network structure as well as the weights.  Because we picked our network structure without any formal analysis it could be that the move selection function simply cannot be adequately approximated by our current network topology.  We plan to train the network on a much larger number of trials and see how that affects its fitness.


References

1   http://ndnd.essortment.com/mancalarulespl_rcrd.htm.  Dottie Cohen.
2   http://www.tradgames.org.uk/games/Mancala.htm.  James Masters, 1997-2002.
3   Artificial Intelligence: A Modern Approach.  Russell, Stuart J. and Norvig, Peter.  Pearson Education, Inc., 2003.