Parallel Checkout Design Notes

The "Parallel Checkout" feature attempts to use multiple processes to parallelize the work of uncompressing the blobs, applying in-core filters, and writing the resulting contents to the working tree during a checkout operation. It can be used by all checkout-related commands, such as clone, checkout, reset, sparse-checkout, and others.

These commands share the following basic structure:

Step 1: Read the current index file into memory.
Step 2: Modify the in-memory index based upon the command, and temporarily mark all cache entries that need to be updated.
Step 3: Populate the working tree to match the new candidate index. This includes iterating over all of the to-be-updated cache entries and deleting, creating, or overwriting the associated files in the working tree.
Step 4: Write the new index to disk.

Step 3 is the focus of the "parallel checkout" effort described here.

Sequential Implementation

For the purposes of discussion here, the current sequential implementation of Step 3 is divided into 3 parts, each one implemented in its own function:

Step 3a: unpack-trees.c:check_updates() contains a series of sequential loops iterating over the array of cache entries. The main loop in this function calls the Step 3b function for each of the to-be-updated entries.
Step 3b: entry.c:checkout_entry() examines the existing working tree for file conflicts, collisions, and unsaved changes. It removes files and creates leading directories as necessary. It calls the Step 3c function for each entry to be written.
Step 3c: entry.c:write_entry() loads the blob into memory, smudges it if necessary, creates the file in the working tree, writes the smudged contents, calls fstat() or lstat(), and updates the associated cache_entry struct with the stat information gathered.

It wouldn’t be safe to perform Step 3b in parallel, as there could be race conditions between file creations and removals. Instead, the parallel checkout framework lets the sequential code handle Step 3b, and uses parallel workers to replace the sequential entry.c:write_entry() calls from Step 3c.

Rejected Multi-Threaded Solution

The most "straightforward" implementation would be to spread the set of to-be-updated cache entries across multiple threads. But due to the thread-unsafe functions in the object database code, we would have to use locks to coordinate the parallel operation. An early prototype of this solution showed that the multi-threaded checkout would bring performance improvements over the sequential code, but there was still too much lock contention. Profiling with perf indicated that around 20% of the runtime during a local Linux clone (on an SSD) was spent in locking functions. For this reason, this approach was rejected in favor of using multiple child processes, which led to better performance.

Multi-Process Solution

Parallel checkout alters the aforementioned Step 3 to use multiple checkout--worker background processes to distribute the work. The long-running worker processes are controlled by the foreground Git command using the existing run-command API.

Overview

Step 3b is only slightly altered; for each entry to be checked out, the main process performs the following steps:

M1: Check whether there is any untracked or unclean file in the working tree which would be overwritten by this entry, and decide whether to proceed (removing the file(s)) or not.
M2: Create the leading directories.
M3: Load the conversion attributes for the entry’s path.
M4: Check, based on the entry’s type and conversion attributes, whether the entry is eligible for parallel checkout (more on this later). If it is eligible, enqueue the entry and the loaded attributes to later write the entry in parallel. If not, write the entry right away, using the default sequential code.

Note: we save the conversion attributes associated with each entry because the workers don’t have access to the main process' index state, so they can’t load the attributes by themselves (and the attributes are needed to properly smudge the entry). Additionally, this has a positive impact on performance as (1) we don’t need to load the attributes twice and (2) the attributes machinery is optimized to handle paths in sequential order.

After all entries have passed through the above steps, the main process checks if the number of enqueued entries is sufficient to spread among the workers. If not, it just writes them sequentially. Otherwise, it spawns the workers and distributes the queued entries uniformly in contiguous chunks. This aims to minimize the chances of two workers writing to the same directory simultaneously, which could increase lock contention in the kernel.
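The distribution described above can be sketched as follows. This is a minimal illustration, not Git's actual code: the names split_queue and struct chunk are hypothetical, and the exact threshold semantics (falling back to sequential mode when the queue is smaller than the threshold) are an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one worker's share of the queue. */
struct chunk {
	size_t start; /* index of the first queued item for this worker */
	size_t len;   /* number of consecutive items assigned to it */
};

/*
 * Split `nr_items` queued entries into at most `nr_workers` contiguous
 * chunks whose sizes differ by at most one. Returns the number of chunks
 * written to `out`, or 0 when the caller should fall back to the
 * sequential code (assumed here: queue below `threshold`, or fewer than
 * two workers).
 */
size_t split_queue(size_t nr_items, size_t nr_workers, size_t threshold,
		   struct chunk *out)
{
	size_t base, extra, next = 0, i;

	assert(out != NULL);
	if (nr_items == 0 || nr_items < threshold || nr_workers < 2)
		return 0;
	if (nr_workers > nr_items)
		nr_workers = nr_items; /* never create empty chunks */

	base = nr_items / nr_workers;
	extra = nr_items % nr_workers;
	for (i = 0; i < nr_workers; i++) {
		out[i].start = next;
		/* the first `extra` workers take one additional item */
		out[i].len = base + (i < extra ? 1 : 0);
		next += out[i].len;
	}
	return nr_workers;
}
```

Because the queue is filled in index order, each contiguous chunk tends to cover neighboring paths, so two workers rarely end up writing into the same directory.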

Then, for each assigned item, each worker:

W1: Checks if there is any non-directory file in the leading part of the entry’s path or if there already exists a file at the entry’s path. If so, marks the entry with PC_ITEM_COLLIDED and skips it (more on this later).
W2: Creates the file (with O_CREAT and O_EXCL).
W3: Loads the blob into memory (inflating and delta reconstructing it).
W4: Applies any required in-process filter, like end-of-line conversion and re-encoding.
W5: Writes the result to the file descriptor opened at W2.
W6: Calls fstat() or lstat() on the just-written path, and sends the result back to the main process, together with the end status of the operation and the item’s identification number.

Note that, when possible, steps W3 to W5 are delegated to the streaming machinery, removing the need to keep the entire blob in memory.

If the worker fails to read the blob or to write it to the working tree, it removes the created file to avoid leaving empty files behind. This is the only time a worker is allowed to remove a file.
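Steps W2, W5, and W6, together with the cleanup rule above, could look roughly like the sketch below. It is an illustration under stated assumptions, not the worker's real code: write_item and the status names other than PC_ITEM_COLLIDED are hypothetical, the blob is assumed to be already loaded and filtered into buf, and only EEXIST is treated as a collision here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* PC_ITEM_COLLIDED comes from the text; the other two names are assumed. */
enum pc_item_status { PC_ITEM_WRITTEN, PC_ITEM_COLLIDED, PC_ITEM_FAILED };

/*
 * Create `path` exclusively, write `len` bytes from `buf` to it, and
 * store the fstat() data in `st`. On a write or stat failure the newly
 * created file is removed so that no empty file is left behind.
 */
enum pc_item_status write_item(const char *path, const char *buf, size_t len,
			       mode_t mode, struct stat *st)
{
	size_t done = 0;
	int fd;

	assert(path != NULL && st != NULL);

	/* W2: O_EXCL makes the open fail if something is already there. */
	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
	if (fd < 0)
		return errno == EEXIST ? PC_ITEM_COLLIDED : PC_ITEM_FAILED;

	/* W5: write the whole buffer, retrying on short writes. */
	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		done += (size_t)n;
	}

	/* W6: gather the stat data that goes back to the main process. */
	if (fstat(fd, st) < 0)
		goto fail;
	if (close(fd) < 0) {
		unlink(path);
		return PC_ITEM_FAILED;
	}
	return PC_ITEM_WRITTEN;

fail:
	close(fd);
	unlink(path); /* the only case in which a worker removes a file */
	return PC_ITEM_FAILED;
}
```

Calling write_item() twice with the same path yields PC_ITEM_WRITTEN the first time and PC_ITEM_COLLIDED the second, and the second call leaves the first file untouched.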

As mentioned earlier, it is the responsibility of the main process to remove any file that blocks the checkout operation (or abort if the removal(s) would cause data loss and the user didn’t ask to --force). This is crucial to avoid race conditions and also to properly detect path collisions at Step W1.

After the workers finish writing the items and sending back the required information, the main process handles the results in two steps:

First, it updates the in-memory index with the lstat() information sent by the workers. (This must be done first as this information might be required in the following step.)
Then it writes the items which collided on disk (i.e. items marked with PC_ITEM_COLLIDED). More on this below.
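The two-step result handling can be sketched as below. The structures and the function apply_results are hypothetical stand-ins for the real index and result types, and the status names other than PC_ITEM_COLLIDED are assumed. The sketch only records which entries collided; feeding them back to the sequential code is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* PC_ITEM_COLLIDED comes from the text; the other two names are assumed. */
enum pc_item_status { PC_ITEM_WRITTEN, PC_ITEM_COLLIDED, PC_ITEM_FAILED };

/* Hypothetical result record sent back by a worker. */
struct item_result {
	enum pc_item_status status;
	struct stat st;     /* meaningful only for PC_ITEM_WRITTEN */
	size_t entry_index; /* which index entry this result belongs to */
};

/* Hypothetical, heavily simplified in-memory index entry. */
struct index_entry {
	struct stat st;
	int stat_valid;
};

/*
 * First pass: copy the workers' stat data into the index.
 * Second pass: collect the collided entries, in queue order, so the
 * caller can write them sequentially afterwards.
 * Returns the number of collided entries stored in `collided`;
 * `*failed` is set to 1 if any item failed.
 */
size_t apply_results(const struct item_result *res, size_t nr,
		     struct index_entry *index, size_t *collided, int *failed)
{
	size_t i, nr_collided = 0;

	assert(res != NULL && index != NULL && collided != NULL && failed != NULL);
	*failed = 0;

	for (i = 0; i < nr; i++) {
		if (res[i].status == PC_ITEM_WRITTEN) {
			index[res[i].entry_index].st = res[i].st;
			index[res[i].entry_index].stat_valid = 1;
		} else if (res[i].status == PC_ITEM_FAILED) {
			*failed = 1;
		}
	}

	for (i = 0; i < nr; i++)
		if (res[i].status == PC_ITEM_COLLIDED)
			collided[nr_collided++] = res[i].entry_index;

	return nr_collided;
}
```

Keeping the passes separate mirrors the ordering constraint in the text: all stat data is in the index before any collided entry is rewritten.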

Path Collisions

Path collisions happen when two different paths correspond to the same entry in the file system. E.g. the paths a and A would collide in a case-insensitive file system.

The sequential checkout deals with collisions in the same way that it deals with files that were already present in the working tree before checkout. Basically, it checks if the path that it wants to write already exists on disk, makes sure the existing file doesn’t have unsaved data, and then overwrites it. (To be more pedantic: it deletes the existing file and creates the new one.) So, if there are multiple colliding files to be checked out, the sequential code will write each one of them but only the last will actually survive on disk.

Parallel checkout aims to reproduce the same behavior. However, we cannot let the workers racily write to the same file on disk. Instead, the workers detect when the entry that they want to check out would collide with an existing file, and mark it with PC_ITEM_COLLIDED. Later, the main process can sequentially feed these entries back to checkout_entry() without the risk of race conditions. On clone, this also has the effect of marking the colliding entries to later emit a warning for the user, like the classic sequential checkout does.

The workers are able to detect both collisions among the entries being concurrently written and collisions between a parallel-eligible entry and an ineligible entry. The general idea for collision detection is quite straightforward: for each parallel-eligible entry, the main process must remove all files that prevent this entry from being written (before enqueueing it). This includes any non-directory file in the leading path of the entry. Later, when a worker gets assigned the entry, it looks again for the non-directory files and for an already existing file at the entry’s path. If any of these checks finds something, the worker knows that there was a path collision.
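The worker-side check might be approximated as in the sketch below. The function path_collides and the buffer size are hypothetical, and lstat() errors other than "does not exist" are ignored for brevity, so this is an illustration of the idea rather than the actual detection code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/stat.h>

#define PATH_BUF_SIZE 4096 /* assumed upper bound for this sketch */

/*
 * Return 1 if a leading component of `path` exists but is not a
 * directory (a regular file or a symlink, for example), or if something
 * already exists at `path` itself. Return 0 if the path is free, and -1
 * if the path does not fit into the local buffer.
 */
int path_collides(const char *path)
{
	char buf[PATH_BUF_SIZE];
	struct stat st;
	size_t len;
	char *slash;

	assert(path != NULL);
	len = strlen(path);
	if (len >= sizeof(buf))
		return -1;
	memcpy(buf, path, len + 1);

	/* Check each leading component by cutting the path at every '/'. */
	for (slash = strchr(buf, '/'); slash; slash = strchr(slash + 1, '/')) {
		*slash = '\0';
		/* lstat() does not follow symlinks, so a symlink counts as a non-directory */
		if (lstat(buf, &st) == 0 && !S_ISDIR(st.st_mode))
			return 1;
		*slash = '/';
	}

	/* Finally, check whether something already sits at the full path. */
	return lstat(buf, &st) == 0;
}
```

Since the main process already removed every blocking file before enqueueing the entry, any hit here can only come from another entry written during this checkout, which is what makes it a collision.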

Because parallel checkout can distinguish path collisions from the case where the file was already present in the working tree before checkout, we could alternatively choose to skip the checkout of colliding entries. However, each entry that doesn’t get written would have NULL lstat() fields on the index. This could cause performance penalties for subsequent commands that need to refresh the index, as they would have to go to the file system to see if the entry is dirty. Thus, if we have N entries in a colliding group and we decide to write and lstat() only one of them, every subsequent git-status will have to read, convert, and hash the written file N - 1 times. By checking out all colliding entries (like the sequential code does), we only pay the overhead once, during checkout.

Eligible Entries for Parallel Checkout

As previously mentioned, not all entries passed to checkout_entry() will be considered eligible for parallel checkout. More specifically, we exclude:

Symbolic links. They are excluded to avoid race conditions that, in combination with path collisions, could cause workers to write files at the wrong place. For example, if we were to concurrently check out a symlink a → b and a regular file A/f in a case-insensitive file system, we could potentially end up writing the file A/f at a/f, due to a race condition.
Regular files that require external filters (either "one shot" filters or long-running process filters). These filters are black boxes to Git and may have their own internal locking or may assume that they are never run concurrently. So it might not be safe to run multiple instances in parallel.

Besides, long-running filters may use the delayed checkout feature to postpone the return of some filtered blobs. The delayed checkout queue and the parallel checkout queue are not compatible and should remain separate.

Note: regular files that only require internal filters, like end-of-line conversion and re-encoding, are eligible for parallel checkout.

Ineligible entries are checked out by the classic sequential codepath before spawning workers.
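The eligibility rules above amount to a small decision function. The sketch below uses hypothetical enums and a hypothetical function name; the real check inspects the cache entry's mode and its conversion attributes rather than these simplified types.

```c
#include <assert.h>

/* Hypothetical, simplified classification of an entry. */
enum entry_kind { ENTRY_REGULAR, ENTRY_SYMLINK };

/* Hypothetical classification of the filters an entry needs. */
enum filter_kind {
	FILTER_NONE,              /* contents are written as-is */
	FILTER_INTERNAL,          /* end-of-line conversion, re-encoding */
	FILTER_EXTERNAL_ONE_SHOT, /* external "one shot" filter */
	FILTER_EXTERNAL_PROCESS   /* long-running process filter */
};

/* Return 1 if the entry may be enqueued for parallel checkout, else 0. */
int is_eligible_for_parallel_checkout(enum entry_kind kind,
				      enum filter_kind filter)
{
	/* Symbolic links are always written by the sequential code. */
	if (kind != ENTRY_REGULAR)
		return 0;

	switch (filter) {
	case FILTER_NONE:
	case FILTER_INTERNAL:
		return 1;
	case FILTER_EXTERNAL_ONE_SHOT:
	case FILTER_EXTERNAL_PROCESS:
		/* external filters are black boxes; never run them in parallel */
		return 0;
	}
	assert(0 && "unknown filter kind");
	return 0;
}
```

An entry for which this returns 0 is written right away by the sequential code; all others are enqueued.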

Note: submodules' files are also eligible for parallel checkout (as long as they don’t fall into any of the excluding categories mentioned above). But since each submodule is checked out in its own child process, we don’t mix the superproject’s and the submodules' files in the same parallel checkout process or queue.

The API

The parallel checkout API was designed with the goal of minimizing changes to the current users of the checkout machinery. This means that they don’t have to call a different function for sequential or parallel checkout. As already mentioned, checkout_entry() will automatically insert the given entry in the parallel checkout queue when this feature is enabled and the entry is eligible; otherwise, it will just write the entry right away, using the sequential code. In general, callers of the parallel checkout API should look similar to this:

int pc_workers, pc_threshold, err = 0;
struct checkout state;

get_parallel_checkout_configs(&pc_workers, &pc_threshold);

/*
 * This check is not strictly required, but it
 * should save some time in sequential mode.
 */
if (pc_workers > 1)
        init_parallel_checkout();

for (each cache_entry ce to-be-updated)
        err |= checkout_entry(ce, &state, NULL, NULL);

err |= run_parallel_checkout(&state, pc_workers, pc_threshold, NULL, NULL);