Damian's notes – A race condition in Noiz.

Damian Kula

A race condition in Noiz.

Posted on 2023.11.01

I was recently alerted to a problem with Noiz that only occurred when running calculations in parallel mode. The problem was that a lot of files were missing when running calculations in parallel. This was particularly noticeable with some fast calculations such as PSD calculations.

To give you a little background, noiz has two calculation modes: sequential and parallel. The parallelism is handled by Dask, as I wanted to make noiz Kubernetes friendly. Unfortunately, I never had enough time to make sure that it could really run in the cloud. I also had no experience with distributed computing. However, I tried to design noiz in such a way that it would be relatively easy to make it Kubernetes compatible.

One such k8s-hopeful approach was to centralise a generation of new file paths where processed data would be written. After generating a file path for a given piece of data, a directory tree for that path needs to be materialised. That's a lot of words to describe what this function does.

def parent_directory_exists_or_create(filepath: Path) -> bool:
    """
    Checks if directory of a filepath exists. If doesn't, it creates it.
    Returns bool that indicates if the directory exists in the end. Should be always True.

    :param filepath: Path to the file you want to save and check it the parent directory exists.
    :type filepath: Path
    :return: If the directory exists in the end.
    :rtype: bool
    """
    directory = filepath.parent
    logger.debug(f"Checking if directory {directory} exists")
    if not directory.exists():
        logger.debug(f"Directory {directory} does not exists, trying to create.")
        directory.mkdir(parents=True)
    return directory.exists()

Now a tricky question. What's wrong with this function? Where is the race condition? What can go wrong with such a simple function? The answer to the last question is: quite a lot.

The fixed function looks like this:

def parent_directory_exists_or_create(filepath: Path) -> bool:
    """
    Checks if directory of a filepath exists. If doesn't, it creates it.
    Returns bool that indicates if the directory exists in the end. Should be always True.

    :param filepath: Path to the file you want to save and check it the parent directory exists.
    :type filepath: Path
    :return: If the directory exists in the end.
    :rtype: bool
    """
    directory = filepath.parent
    logger.debug(f"Checking if directory {directory} exists")
    if not directory.exists():
        logger.debug(f"Directory {directory} does not exists, trying to create.")
        directory.mkdir(parents=True, exist_ok=True)
    return directory.exists()

Did you notice the difference? Yes, it's exist_ok=True in a call to create a directory. Why is it a race condition, even though a has checked that this particular directory doesn't exist? Because we are in a parallel computing world and things are being computed in several places at the same time. More precisely, while process A is checking the existence of the path, process B manages to create this directory. Without explicitly taking the existence of the directory into account when trying to create the directory branch, this kind of error will often occur. In the case of noiz, this problem is particularly pronounced because our directory structure groups multiple files within each directory, for example based on the date the data was created.

In general, this function can now be simplified even further, as it doesn't actually need to check for existence before attempting to create. It could be written as follows:

def parent_directory_exists_or_create(filepath: Path) -> bool:
    """
    Checks if directory of a filepath exists. If doesn't, it creates it.
    Returns bool that indicates if the directory exists in the end. Should be always True.

    :param filepath: Path to the file you want to save and check it the parent directory exists.
    :type filepath: Path
    :return: If the directory exists in the end.
    :rtype: bool
    """
    directory = filepath.parent
    logger.debug(f"Trying to create {directory} directory branch.")
    directory.mkdir(parents=True, exist_ok=True)
    return directory.exists()

What's the conclusion of this post? Every function, if it has side effects, can lead to race conditions when working in a distributed way.