Lila Scavenger Hunt
Let's study software architecture by learning how Lichess, the open-source website that hosts over 100,000,000 rated chess games every month, is made.
To win the scavenger hunt, complete the missing items. Please use any code
search tools, language servers, or documentation you'd like for this
scavenger hunt. I originally answered all of these questions using just
GitHub search (the / and t keys) and the online editor (the . key).
Venture out and explore! We do scavenger hunts not intrinsically to find the answers, but because we're curious to see, experience, and learn about the scenery along the way.
Made by Eric for a meeting of the Harvard Systems Reading Group.
Getting situated
Begin at the “Starting from scratch” blog post and source code. The biggest challenge in Lichess's backend is that it requires serving a lot of real-time features that scale to millions of users. It's primarily written in the programming language and uses , a library for asynchronous streams and actors.
Notice that the entire core Lila application, which handles all chess games from every single user on lichess.org, is hosted on just a single machine, named . This machine has 96 logical cores and 192 GiB of RAM. By remaining a monolith and only using a single big machine, Lichess is able to reduce hosting and infrastructure management costs greatly, while simplifying development. The downside is that it's limited to vertical scaling, but computers are fast! Lila is written in a very performant way, to handle hundreds of thousands of concurrent active users just fine.
UI rendering in Lila is done in a hybrid manner, some on the server and some
in the browser. On the server side, you can check the app/ui
folder to see that it uses the library for static templating
of HTML fragments. On the frontend side, helps with dynamic
rendering from TypeScript. The site is styled using , which
compiles to CSS.
Lichess has a library called that renders all chessboards
on its website, in 2D and 3D styles, on both web and mobile. If you inspect one
of the boards on the website, you'll see that it's wrapped in a custom
<cg-board> element with <piece> elements inside it.
Internationalization (i18n) is the process of offering software in multiple languages. Within the lila repository, the folder has variants in every language for all written text on the website.
Now let's get curious and examine a feature. On the lobby page of
lichess.org, you notice that the bottom-right has a couple of pieces of text
that say “X players / Y games in play,” and you're curious what file this
belongs to and how it works. After some searching and inspection, you find
that this UI element, as well as the entire right column with the dark mode
toggle and three “Create a game” buttons, is defined in the file (full path).
Real-time web communication
Let's dive into the "X players" indicator. It's updated in real time on the
frontend, meaning that there should be a WebSocket connection sending
messages. Use your Chrome DevTools Network tab to look for open WebSockets
on lichess.org. There should be a bunch of real-time updates streaming in,
distinguished by their "t" field. The messages with the numbers of
players and games in play have this field set to "t": .
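The idea of routing messages by their "t" field can be sketched as a small dispatcher. This is a minimal illustration in Python, chosen for brevity; it is not Lichess's client code. The tag "example_stats", the "d" payload field, and the `on` and `dispatch` helpers are all assumptions made up for the example, so the real tag is still yours to find.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical registry mapping a message tag to its handler.
Handler = Callable[[Any], None]
handlers: Dict[str, Handler] = {}
seen: List[str] = []


def on(tag: str, handler: Handler) -> None:
    """Register a handler for messages whose "t" field equals `tag`."""
    handlers[tag] = handler


def dispatch(raw: str) -> bool:
    """Parse one JSON text frame and route it by its "t" field.

    Returns True if a handler ran, False if the tag was unknown.
    Assumes the frame is a JSON object; "d" as the payload key is an assumption.
    """
    message = json.loads(raw)
    handler = handlers.get(message.get("t"))
    if handler is None:
        return False
    handler(message.get("d"))
    return True


# "example_stats" is a made-up tag, not the answer to the hunt.
on("example_stats", lambda d: seen.append(f"players={d['players']}"))
dispatch('{"t": "example_stats", "d": {"players": 3}}')
```

Keeping the handlers in a dictionary keyed by tag means new message types can be added without touching the dispatch loop.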
These messages are handled on the client side by several TypeScript files,
but the one relevant for this particular statistics message is
ui/lobby/src/boot.ts. This file contains the class, which Lichess uses to
manage a socket connection and register event handlers.
On the server side, WebSocket connections are handled by a separate Scala service in a different repository called .
Let's move to the WebSocket service repository for a while. The
src/main/scala/ipc/ folder contains shared logic for sending and receiving
messages. It connects
inputs and outputs from the browser WebSocket, ClientIn/ClientOut, with data
updates from the server, LilaIn/LilaOut. The class is
responsible for the "number of users and games" message we saw earlier. This
message is sent from the singleton, which responds
to client pings.
We already established that ClientIn/ClientOut messages are sent over WebSocket. LilaIn/LilaOut messages are actually transported over an internal database. They get sent over streams in real time.
We can examine some examples of how this communication works. Look at the
file src/main/scala/Tv.scala. This defines handlers for Lichess
TV. A connection to Lichess TV returns all recent “bullet”-speed games with
activity in the past , while other games are stored
for .
Now let's look at how in-game chat is implemented. In ClientOut, there is a type of message called “talk” that is used by the browser to send a chat. This maps to the case class of ClientOut, as an object in Scala. When it is forwarded to the Lila side, this case class maps to the command.
Databases
Go back to the main Lila repository. Lila uses a specific MongoDB driver called that intends to avoid blocking operations, running fully asynchronously to handle large numbers of clients.
On top of this driver, Lila defines its own domain-specific language (DSL)
for making queries to its database from Scala. This is defined in the lila.db.dsl
package. Search the codebase for imports of this package; notice that it's used
in many files to make complex queries to the database. When stored in MongoDB,
the field in the Game collection contains all user IDs that are
playing in a given game.
Another place the database is used is in choosing the daily chess puzzle. The query for this automatically filters for puzzles that have been played at least some number of times. A puzzle must have at least plays to be selected as the daily puzzle.
Gameplay logic
Let's move to the scalachess repository now, which implements the rules of
chess. Looking around, there is a chess opening database hardcoded into this
package, in parts. We can also see in the
file that there are 10 chess variants
supported by Lichess, each with their own custom rule definitions.
Now let's look at the compression repository. It uses a coding algorithm to efficiently encode indices in the legal move list. This reduces the amount of space required to store and re-simulate chess games, as they can be represented very compactly.
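To see why indexing into the legal move list saves space, here is a minimal Python sketch. It deliberately swaps in a plain fixed-width code (just enough bits to name one of the legal moves) rather than the variable-length coding algorithm the repository uses, so it does not give away the blank above. The toy move lists are hypothetical; a real encoder and decoder would regenerate the legal moves by replaying the game with a chess library.

```python
from typing import List, Sequence


def index_bits(n_legal: int) -> int:
    """Bits needed to name one of n_legal moves with a fixed-width code."""
    return (n_legal - 1).bit_length() if n_legal > 1 else 0


def encode(played: Sequence[str], legal_lists: Sequence[List[str]]) -> str:
    """Encode each played move as its index in that position's legal list."""
    bits = []
    for move, legal in zip(played, legal_lists):
        width = index_bits(len(legal))
        if width:  # a forced move (one legal option) costs zero bits
            bits.append(format(legal.index(move), f"0{width}b"))
    return "".join(bits)


def decode(bits: str, legal_lists: Sequence[List[str]]) -> List[str]:
    """Invert encode(), given the same legal move lists in the same order."""
    moves, pos = [], 0
    for legal in legal_lists:
        width = index_bits(len(legal))
        index = int(bits[pos:pos + width], 2) if width else 0
        moves.append(legal[index])
        pos += width
    return moves


# Toy positions with made-up move names, purely for illustration.
legal_lists = [["a", "b", "c", "d"], ["x"], ["p", "q", "r"]]
encoded = encode(["c", "x", "q"], legal_lists)
```

Three moves fit in four bits here. A variable-length code can do better still by giving likely moves shorter codes, which only works if the legal move list is ordered so that likely moves come first.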
For many chess algorithms, it's useful to get an even more compact representation of the board state and history, which can then be used as a hash table key, among other uses. The same technique is common in other games, such as Go. Lichess uses the hash algorithm.
Rating system
Curious about how to get more Internet points, you snoop around and find out that Lichess uses the rating system. The default deviation parameter is set to , and games are treated as having 1 point for wins, 0.5 points for draws, and 0 points for losses.
That module defined the mathematical formula for rating calculations, but you want to understand where this fits into the system as a whole. You find the class in the same rating module, which specifies how ratings are read and stored in the MongoDB database. It also has support for rating refunds!
How are players matched with each other using this rating field? It turns out that the matchmaking function is actually another simple mathematical formula. It's specified in the function, which takes into account rating ranges, how close their ratings are, and how many “waves” of matchmaking a player has missed. The maximum score difference to pair two users depends on rating, but for ratings above , it's set to the numerical rating value divided by .
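A pairing score of this general shape might look like the Python sketch below. It is an illustration under stated assumptions, not Lila's formula: the `Waiting` class, the `MISS_BONUS` constant, and the way missed waves are subtracted are all hypothetical, and the maximum score is left as a parameter because finding the real threshold is part of the hunt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Waiting:
    """A player waiting in a pool (hypothetical shape, for illustration)."""
    rating: int
    missed_waves: int = 0
    min_rating: int = 0      # lowest opponent rating this player accepts
    max_rating: int = 9999   # highest opponent rating this player accepts


MISS_BONUS = 20  # made-up value: rating points forgiven per missed wave


def in_range(seeker: Waiting, other: Waiting) -> bool:
    return seeker.min_rating <= other.rating <= seeker.max_rating


def pair_score(a: Waiting, b: Waiting) -> Optional[int]:
    """Lower is better; None when the rating ranges are incompatible."""
    if not (in_range(a, b) and in_range(b, a)):
        return None
    gap = abs(a.rating - b.rating)
    return gap - MISS_BONUS * (a.missed_waves + b.missed_waves)


def can_pair(a: Waiting, b: Waiting, max_score: int) -> bool:
    """True when the pair is compatible and its score is within the cap."""
    score = pair_score(a, b)
    return score is not None and score <= max_score
```

Subtracting a bonus for missed waves means a player who has waited longer gradually becomes eligible for wider rating gaps, which trades match quality for shorter queue times.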
Waves of matchmaking are specified for every pool of games, which are split by clock time control. A wave happens either after some number of seconds or when a threshold number of players is waiting for a game, whichever comes first. The pools are defined in the object, while the class is constructed once for each pool and is responsible for actually scheduling waves.
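The "whichever comes first" rule can be sketched as follows. This Python example is only an illustration: the `WaveScheduler` class and its parameter names are hypothetical, and time is passed in explicitly so the behavior is easy to follow. The actual implementation is actor-based Scala.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WaveScheduler:
    """Hypothetical per-pool scheduler: fires on a timer or when full."""
    every_seconds: float     # assumed wave interval
    player_threshold: int    # assumed number of waiting players that forces a wave
    last_wave_at: float = 0.0
    waiting: List[str] = field(default_factory=list)

    def join(self, player: str, now: float) -> List[str]:
        """Add a player, then check whether a wave should fire."""
        self.waiting.append(player)
        return self.tick(now)

    def tick(self, now: float) -> List[str]:
        """Return the batch of players to pair, or [] if no wave fires yet."""
        due = now - self.last_wave_at >= self.every_seconds
        full = len(self.waiting) >= self.player_threshold
        if not (due or full):
            return []
        batch, self.waiting = self.waiting, []
        self.last_wave_at = now
        return batch


scheduler = WaveScheduler(every_seconds=10, player_threshold=3)
```

With these made-up settings, three players joining in quick succession trigger a wave immediately, while a lone player waits at most ten seconds.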
Analytical services
Lichess offers free access to the Stockfish chess engine to analyze games. This is deployed on their distributed “AI cluster” on donated hardware called since Stockfish is compute-intensive. This cluster is separated from Lila. Actually, anyone can run their own client and help analyze Lichess games if they have spare CPU cycles. The client is written in Rust.
If cloud analysis is not available or not yet generated, Lichess will run Stockfish in the user's browser through WebAssembly on up to threads, with shared memory. The package does this. JavaScript interacts with the Stockfish C code by calling the function, which acts as an entrypoint, handling commands in the Universal Chess Interface.
Lichess also offers a free API for developers to access game data. The API is implemented in the module. It's a REST API that uses JSON for data serialization. The API is documented on the Lichess website.
Other services, monitoring, and metrics
Although the core application logic is in Lila, which acts as a monolith, there are many other services that handle smaller tasks, mainly for scalability or isolation reasons. For example, is written in Rust and handles trillions of unique positions for all variants of chess. It compacts opening data in , an embedded database tuned for performance.
Another side service is , which grew out of need
because some Lichess tournaments have tens of thousands of connected clients
at once, and Lila was overloaded with HTTP traffic. It's also written in
Rust, using the axum library, and the only route it handles is .
Search happens in the package, which uses the open-source search engine to index games, forums, teams, and studies.
All cloud monitoring happens on the machine, which runs Prometheus, InfluxDB, and Grafana.
Backpressure and fault tolerance
We saw a lot of infrastructure here in this scavenger hunt, but that's just a very high-level picture of how the application works. There's a lot more to dig into. Here's some food for thought.
Based on what we've seen so far, imagine a scenario where Lichess's load quickly increases to an unusually high level. What system would break first? Would it crash, or would it just temporarily stop responding to requests? How long would it take to restart if it did crash? Would your answer change if one of the machines lost power suddenly, or there was somehow a segmentation fault in Lila?
If one of the services is overloaded, would it return errors to requests, or would it just let them time out? Also, would this cause a chain reaction in upstream services? Pick a service: where in the codebase would you look to predict the effects of it going down?
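As a reference point for the "errors versus timeouts" question, here is a minimal Python sketch of load shedding with a bounded queue. It is a generic illustration, not taken from any Lichess service; the `BoundedInbox` class is hypothetical. When the queue is full, new requests are rejected right away so that callers see a fast error instead of waiting for a timeout.

```python
from collections import deque
from typing import Any, Deque, Optional


class BoundedInbox:
    """Hypothetical request queue that sheds load once it is full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: Deque[Any] = deque()
        self.rejected = 0

    def offer(self, request: Any) -> bool:
        """Enqueue the request, or reject it immediately when at capacity."""
        if len(self.items) >= self.capacity:
            self.rejected += 1
            return False
        self.items.append(request)
        return True

    def poll(self) -> Optional[Any]:
        """Take the oldest queued request, or None if the queue is empty."""
        return self.items.popleft() if self.items else None


inbox = BoundedInbox(capacity=2)
results = [inbox.offer(i) for i in range(4)]
```

An unbounded queue behaves differently under the same burst: every request is accepted, latency grows with queue length, and callers eventually time out while memory usage keeps climbing.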
If you were running Lichess and the website stopped working, where would you look first to diagnose the problem, and how could you make it easier to resolve the problem quickly?