QuartoRL | Gerard Calvo Bartra

An AlphaZero-style reinforcement learning agent trained through self-play to play the Quarto tabletop game.

Challenge: Mastering a deceptively complex board game with reinforcement learning

As an avid board game player in a local gaming association, I usually invest time in understanding rules and developing winning strategies. However, one game consistently defeated me: Quarto. Despite its simple rules, the game's deep complexity made conventional strategic thinking ineffective, leading to repeated losses against a particular friend.

Driven by a mix of competitive spirit and curiosity, I decided to apply cutting-edge reinforcement learning techniques to crack this puzzle. The goal was to create an AI agent capable of discovering optimal Quarto strategies through self-play, similar to DeepMind's AlphaZero approach that conquered chess, shogi, and Go.

Quarto's unique mechanics presented several interesting challenges for reinforcement learning:

The game alternates between two distinct action types (choosing a piece for your opponent, then placing a piece you've been given)
Each piece has four binary attributes (tall/short, light/dark, square/circular, hollow/solid), creating complex pattern recognition requirements
The victory condition requires detecting when four pieces in a line share at least one common attribute
The branching factor is significant: up to 16 pieces to choose from and up to 16 board positions to place on, with both counts shrinking as the game progresses

Game Mechanics: Understanding Quarto's distinctive gameplay

Quarto is played on a 4×4 board with 16 unique pieces, each possessing four binary attributes:

Height: tall or short
Color: light or dark-stained wood
Shape: square or circular
Top: hollow or solid

The gameplay follows an unusual pattern. On each turn, a player places the piece their opponent selected for them, then selects one of the remaining unplayed pieces for the opponent to place. A player wins by placing a piece that completes a horizontal, vertical, or diagonal line of four pieces sharing at least one common attribute (e.g., all tall or all circular).
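The win condition can be checked compactly if each piece is stored as a 4-bit integer, one bit per attribute. The sketch below illustrates that idea; it is not taken from the project code, and the bit layout and the names `shares_attribute`, `winning_lines`, and `has_quarto` are assumptions made for the example.

```python
from typing import Iterator, List, Optional, Sequence

# Hypothetical bit layout (an assumption): bit 0 = tall, bit 1 = dark,
# bit 2 = circular, bit 3 = solid. Pieces are the integers 0..15.
ATTRIBUTE_MASK = 0b1111


def shares_attribute(pieces: Sequence[int]) -> bool:
    """Return True if all pieces agree on at least one of the four attributes."""
    common_ones = ATTRIBUTE_MASK   # bits that are set in every piece so far
    common_zeros = ATTRIBUTE_MASK  # bits that are clear in every piece so far
    for piece in pieces:
        common_ones &= piece
        common_zeros &= ~piece & ATTRIBUTE_MASK
    return bool(common_ones or common_zeros)


def winning_lines() -> Iterator[List[int]]:
    """Yield the 10 lines of a 4x4 board as lists of cell indices 0..15."""
    for i in range(4):
        yield [4 * i + j for j in range(4)]  # row i
        yield [4 * j + i for j in range(4)]  # column i
    yield [0, 5, 10, 15]  # main diagonal
    yield [3, 6, 9, 12]   # anti-diagonal


def has_quarto(board: Sequence[Optional[int]]) -> bool:
    """board holds 16 cells in row-major order; None marks an empty cell."""
    for line in winning_lines():
        cells = [board[i] for i in line]
        if None not in cells and shares_attribute(cells):
            return True
    return False
```

Two bitwise accumulators are needed because a shared attribute can be either the "1" value (all tall) or the "0" value (all short).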

This creates an interesting strategic dynamic: players must simultaneously consider where to place the piece they've been given and which piece to hand to their opponent. Quarto is also a perfect-information game: the full board and all remaining pieces are visible to both players, and there is no element of chance. This makes it a natural candidate for AlphaZero-style reinforcement learning.

Approach: Adapting AlphaZero for Quarto

I implemented an AlphaZero-style reinforcement learning approach, which combines Monte Carlo Tree Search (MCTS) with deep neural networks trained through self-play. This method has proven exceptionally effective for perfect information board games, as it balances exploration of the game tree with exploitation of learned patterns.

The core components of my implementation included:

A neural network that evaluated board states and predicted move probabilities
MCTS with decaying Dirichlet noise to encourage exploration
Self-play to generate training data without human examples
A replay buffer to store and sample from game experiences
A combined loss function addressing both policy accuracy and value prediction
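To illustrate the exploration component listed above, here is a minimal sketch of mixing Dirichlet noise into the root priors of the search, with the noise weight decaying over training iterations. The exponential decay schedule, the default values for `alpha`, `eps_start`, and `decay`, and the function name `noisy_root_priors` are all assumptions chosen for the example; the original write-up does not specify them.

```python
import numpy as np


def noisy_root_priors(priors, legal_mask, iteration, rng,
                      alpha=0.3, eps_start=0.25, decay=0.99):
    """Mix Dirichlet noise into the root priors, over legal actions only.

    All default hyperparameters here are illustrative assumptions.
    """
    priors = np.asarray(priors, dtype=np.float64)
    legal = np.flatnonzero(legal_mask)
    # The noise weight shrinks geometrically as training progresses.
    epsilon = eps_start * decay ** iteration
    noise = rng.dirichlet([alpha] * len(legal))
    mixed = np.zeros_like(priors)
    mixed[legal] = (1.0 - epsilon) * priors[legal] + epsilon * noise
    # Renormalize so the result is a distribution over legal actions.
    return mixed / mixed.sum()


rng = np.random.default_rng(0)
uniform = np.full(32, 1.0 / 32)
mask = np.zeros(32, dtype=bool)
mask[:16] = True  # e.g. only piece-selection actions are legal
root_priors = noisy_root_priors(uniform, mask, iteration=0, rng=rng)
```

Noise is applied only at the root, so the search still trusts the network's priors deeper in the tree.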

One of the key design decisions was how to represent Quarto's unique alternating action types. I chose to encode all possible actions in a single space, with indices 0-15 representing 'choose a piece for your opponent' and indices 16-31 representing 'place the given piece on the board'. During gameplay, invalid actions were masked to ensure only legal moves could be selected.

Implementation: Technical details of the reinforcement learning framework

The implementation involved several technical challenges to effectively represent Quarto's state and action space for the neural network:

State Representation: I encoded the game state as an 8×4×4 float32 array with the following channels:

Channels 0-3: Binary piece attributes (height, color, shape, top) on the board
Channel 4: Available pieces (1 for available, 0 for used)
Channel 5: Currently selected piece (if applicable)
Channel 6: Game phase (0 for piece selection, 1 for piece placement)
Channel 7: Current player (0 or 1)
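The channel layout above could be built along the following lines. This is a sketch under several assumptions that the description leaves open: pieces are 4-bit integers (bit `k` feeds channel `k`), piece `p` is mapped to cell `(p // 4, p % 4)` in the "available" and "selected" planes, and the phase and player flags are broadcast over the whole plane. The function name `encode_state` is hypothetical.

```python
import numpy as np


def encode_state(board, available, selected, phase, player):
    """Encode a Quarto position as an 8x4x4 float32 array.

    board:     16 cells in row-major order, each a piece id 0..15 or None
    available: iterable of piece ids not yet played or handed over
    selected:  piece id waiting to be placed, or None
    phase:     0 for piece selection, 1 for piece placement
    player:    0 or 1
    """
    state = np.zeros((8, 4, 4), dtype=np.float32)
    # Channels 0-3: one binary attribute per channel, for pieces on the board.
    for cell, piece in enumerate(board):
        if piece is not None:
            row, col = divmod(cell, 4)
            for bit in range(4):
                state[bit, row, col] = (piece >> bit) & 1
    # Channel 4: available pieces, laid out by piece id (an assumption).
    for piece in available:
        row, col = divmod(piece, 4)
        state[4, row, col] = 1.0
    # Channel 5: the currently selected piece, if any.
    if selected is not None:
        row, col = divmod(selected, 4)
        state[5, row, col] = 1.0
    # Channels 6-7: constant planes for game phase and current player.
    state[6, :, :] = float(phase)
    state[7, :, :] = float(player)
    return state
```

One caveat of this layout: in channels 0-3 an empty cell looks the same as a piece whose attribute bit is 0, so the network has to infer occupancy from channels 4 and 5.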

Action Space: All possible actions were represented in a single space with 32 potential actions:

Actions 0-15: Selecting one of the 16 possible pieces for the opponent
Actions 16-31: Placing the given piece on one of the 16 board positions
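A small sketch of how legal actions might be masked in this 32-action space. The helper names `legal_action_mask` and `decode_action` are hypothetical, and the board is assumed to be a list of 16 cells holding a piece id or `None`.

```python
NUM_ACTIONS = 32
PLACE_OFFSET = 16  # actions 16..31 are placements, 0..15 are selections


def legal_action_mask(board, available, phase):
    """Return a list of 32 booleans marking the legal actions.

    phase 0: any still-available piece may be handed to the opponent.
    phase 1: the given piece may be placed on any empty cell.
    """
    mask = [False] * NUM_ACTIONS
    if phase == 0:
        for piece in available:
            mask[piece] = True
    else:
        for cell, piece in enumerate(board):
            if piece is None:
                mask[PLACE_OFFSET + cell] = True
    return mask


def decode_action(action):
    """Split an action index into its kind and its piece id or cell index."""
    kind = "select" if action < PLACE_OFFSET else "place"
    return kind, action % 16
```

Because the two halves of the action space are never legal at the same time, the mask also tells the network implicitly which phase the game is in.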

Neural Network: The policy/value network took the encoded state as input and output:

A probability distribution over all 32 possible actions (policy head)
A scalar value prediction estimating the win probability from the current position (value head)
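For illustration, a two-headed network with this input and output shape could look like the PyTorch sketch below. The architecture (two small convolutional layers, 64 channels) and the class name `PolicyValueNet` are assumptions, not the network used in the project. The value head here uses `tanh`, giving a score in [-1, 1] as in the AlphaZero convention; a sigmoid would be the natural choice if the output is meant to be read directly as a win probability.

```python
import torch
from torch import nn


class PolicyValueNet(nn.Module):
    """Hypothetical policy/value network for an 8x4x4 Quarto state."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # padding=1 keeps the 4x4 spatial size through both convolutions.
        self.body = nn.Sequential(
            nn.Conv2d(8, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.policy_head = nn.Linear(channels * 16, 32)
        self.value_head = nn.Sequential(nn.Linear(channels * 16, 1), nn.Tanh())

    def forward(self, state, legal_mask):
        """state: (B, 8, 4, 4) float tensor; legal_mask: (B, 32) bool tensor.

        Each row of legal_mask must contain at least one True entry,
        otherwise the softmax below produces NaN.
        """
        features = self.body(state)
        logits = self.policy_head(features)
        # Illegal actions get -inf so they receive exactly zero probability.
        logits = logits.masked_fill(~legal_mask, float("-inf"))
        policy = torch.softmax(logits, dim=-1)
        value = self.value_head(features).squeeze(-1)
        return policy, value
```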

Training Process: The agent improved through iterative self-play:

Generate games through self-play, using MCTS guided by the current neural network
Store game states, MCTS policy distributions, and game outcomes in a replay buffer
Sample batches from the replay buffer to train the neural network
Update the neural network using a loss function combining policy loss (comparing MCTS policy to predicted policy) and value loss (comparing actual game outcome to predicted value)
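The combined loss in the last step can be sketched as follows. This assumes the usual AlphaZero formulation (cross-entropy between the MCTS visit distribution and the predicted policy, plus mean squared error on the value) with equal weighting of the two terms; the write-up does not state the exact form or weights used, and the function name `alphazero_loss` is hypothetical.

```python
import torch
import torch.nn.functional as F


def alphazero_loss(policy_logits, value_pred, mcts_policy, outcome):
    """Combined policy and value loss for one training batch.

    policy_logits: (B, 32) raw, unmasked network outputs
    value_pred:    (B,) predicted values
    mcts_policy:   (B, 32) MCTS visit distributions (training targets)
    outcome:       (B,) final game results from the mover's perspective
    """
    # Cross-entropy between the MCTS policy and the predicted policy.
    log_probs = F.log_softmax(policy_logits, dim=-1)
    policy_loss = -(mcts_policy * log_probs).sum(dim=-1).mean()
    # Mean squared error between predicted value and actual outcome.
    value_loss = F.mse_loss(value_pred, outcome)
    # Equal weighting of the two terms is an assumption.
    return policy_loss + value_loss
```

The logits are deliberately left unmasked here: masking them with negative infinity before `log_softmax` would produce `0 * -inf = NaN` wherever the MCTS target is zero.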

The implementation was trained on Google Colab with GPU acceleration, allowing for reasonably fast iteration despite the computational demands of MCTS-based reinforcement learning.

Results: Assessing the AI's performance

After several hours of training on Google Colab, I evaluated the agent's performance through direct gameplay. The results were mixed:

The agent successfully learned the basic rules and avoided obvious blunders
It understood legal moves and didn't hand over immediate winning opportunities
However, it lacked sophisticated strategy and could be consistently outplayed by a human opponent
When tested against my friend (the original motivation for the project), both the AI and I continued to lose

While the agent showed signs of improvement during training, its performance plateaued below the level needed to defeat skilled human players. This outcome highlighted the challenges of applying reinforcement learning to complex games with limited computational resources.

Observing the agent's play revealed that it had developed a basic understanding of the game mechanics but struggled with long-term planning and recognizing subtle patterns across the four different piece attributes.

Analysis: Investigating potential limitations

Following the somewhat disappointing performance, I analyzed several potential factors that might have limited the agent's capabilities:

Training duration: The few hours of GPU time may have been insufficient for mastering Quarto's complexity
Hyperparameters: The effectiveness of reinforcement learning is highly sensitive to parameters like replay buffer size, MCTS simulation count, exploration constants, and learning rates
Network architecture: The neural network architecture may not have been optimal for capturing the specific patterns relevant to Quarto
State/action representation: The chosen encoding might not have highlighted the critical features efficiently for the network
Reward structure: The simple win/loss signal might not provide enough granularity for effective learning, especially for longer games
Long-term planning: The agent might struggle with the credit assignment problem over many turns, particularly when considering that it must both choose pieces and place them

A particularly interesting question was whether the agent's performance was limited by its ability to look ahead. In Quarto, a winning strategy often involves thinking several moves ahead, considering both piece selection and placement. If the agent's MCTS depth or neural network capacity was insufficient, it might correctly predict a loss 10+ moves in the future against an optimal opponent, yet still fail to find the complex line of play that avoids it.

Future Improvements: Paths to enhancing the Quarto AI

Based on the analysis, several avenues for improvement emerge for a potential next iteration of the project:

Extended training: Significantly increase training time, potentially using distributed computing for more MCTS simulations
Hyperparameter optimization: Systematically explore different hyperparameter configurations to find optimal settings
Alternative architecture: Explore different neural network architectures, potentially with attention mechanisms to better capture the relationship between piece attributes
Dual agent approach: Implement separate networks for the piece selection and placement phases, allowing specialization
Curriculum learning: Train the agent first on simplified versions of Quarto (e.g., with fewer attributes) before tackling the full game
Human demonstrations: Incorporate learning from human expert games to jumpstart the learning process

Although the project did not achieve its ultimate goal of defeating my friend, it provided valuable insights into the challenges of applying reinforcement learning to complex board games with unique mechanics. The experience highlighted the gap between theoretical approaches and practical implementation, especially when working with limited computational resources.

For now, I'll have to continue improving my own Quarto skills the old-fashioned way – through practice and observation – while waiting for another opportunity to revisit this AI challenge with more resources.

Technologies

This project was built with:

Python
PyTorch
NumPy
