Damian's notes – Using conda cache in GitlabCI

Damian Kula


Posted on 2020.05.07

How do you speed up CI, especially unit tests that do not require an elaborate build environment? Cache!

If you work with conda, you have probably already dealt with its cache, most likely by cleaning it with conda clean. So how do you use it inside GitLab CI?

Let's find out where conda keeps its cache. For that purpose, I used the Docker image I later want to use in my CI.

$ conda info

      active environment : base
      active env location : /opt/conda
              shell level : 1
        user config file : /root/.condarc
  populated config files :
            conda version : 4.8.2
      conda-build version : not installed
          python version : 3.7.6.final.0
        virtual packages : __glibc=2.28
        base environment : /opt/conda  (writable)
            channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                            https://repo.anaconda.com/pkgs/main/noarch
                            https://repo.anaconda.com/pkgs/r/linux-64
                            https://repo.anaconda.com/pkgs/r/noarch
            package cache : /opt/conda/pkgs
                            /root/.conda/pkgs
        envs directories : /opt/conda/envs
                            /root/.conda/envs
                platform : linux-64
              user-agent : conda/4.8.2 requests/2.22.0 CPython/3.7.6 Linux/4.15.0-96-generic debian/10 glibc/2.28
                  UID:GID : 0:0
              netrc file : None
            offline mode : False

In this case, conda keeps the cache in two directories. If you now run, for example, conda install numpy, conda downloads all the necessary package archives to /opt/conda/pkgs and then unpacks them in the same directory.

$ conda install numpy
  (...)
$ ls -lahF /opt/conda/pkgs
  total 142M
  drwxr-xr-x 15 root root  4.0K May  5 15:19 ./
  drwxr-xr-x  1 root root  4.0K May  5 15:19 ../
  drwxr-xr-x  3 root root  4.0K May  5 15:19 blas-1.0-mkl/
  -rw-r--r--  1 root root  5.9K May  5 15:19 blas-1.0-mkl.conda
  drwxrwsr-x  2 root root  4.0K May  5 15:18 cache/
  drwxr-xr-x  4 root root  4.0K May  5 15:19 certifi-2020.4.5.1-py37_0/
  -rw-r--r--  1 root root  156K May  5 15:19 certifi-2020.4.5.1-py37_0.conda
  drwxr-xr-x  8 root root  4.0K May  5 15:19 conda-4.8.3-py37_0/
  -rw-r--r--  1 root root  2.9M May  5 15:19 conda-4.8.3-py37_0.conda
  drwxr-xr-x  4 root root  4.0K May  5 15:19 intel-openmp-2020.0-166/
  -rw-r--r--  1 root root  757K May  5 15:19 intel-openmp-2020.0-166.conda
  drwxr-xr-x  6 root root  4.0K May  5 15:19 libgfortran-ng-7.3.0-hdf63c60_0/
  -rw-r--r--  1 root root 1007K May  5 15:19 libgfortran-ng-7.3.0-hdf63c60_0.conda
  drwxr-xr-x  4 root root  4.0K May  5 15:19 mkl-2020.0-166/
  -rw-r--r--  1 root root  129M May  5 15:19 mkl-2020.0-166.conda
  drwxr-xr-x  4 root root  4.0K May  5 15:19 mkl-service-2.3.0-py37he904b0f_0/
  -rw-r--r--  1 root root  219K May  5 15:19 mkl-service-2.3.0-py37he904b0f_0.conda
  drwxr-xr-x  4 root root  4.0K May  5 15:19 mkl_fft-1.0.15-py37ha843d7b_0/
  -rw-r--r--  1 root root  154K May  5 15:19 mkl_fft-1.0.15-py37ha843d7b_0.conda
  drwxr-xr-x  4 root root  4.0K May  5 15:19 mkl_random-1.1.0-py37hd6b4f25_0/
  -rw-r--r--  1 root root  322K May  5 15:19 mkl_random-1.1.0-py37hd6b4f25_0.conda
  drwxr-xr-x  3 root root  4.0K May  5 15:19 numpy-1.18.1-py37h4f9e942_0/
  -rw-r--r--  1 root root  5.3K May  5 15:19 numpy-1.18.1-py37h4f9e942_0.conda
  drwxr-xr-x  5 root root  4.0K May  5 15:19 numpy-base-1.18.1-py37hde5b4d6_1/
  -rw-r--r--  1 root root  4.2M May  5 15:19 numpy-base-1.18.1-py37hde5b4d6_1.conda
  drwxr-xr-x  7 root root  4.0K May  5 15:19 openssl-1.1.1g-h7b6447c_0/
  -rw-r--r--  1 root root  2.6M May  5 15:19 openssl-1.1.1g-h7b6447c_0.conda
  -rw-r--r--  1 root root     0 May  5 15:18 urls
  -rw-r--r--  1 root root   923 May  5 15:19 urls.txt

Consider, then, a simple GitLab CI job running unit tests with pytest.

unit_tests:
  stage: unit-tests
  image:
    name: continuumio/miniconda3:latest
  script:
    - conda install python=3.7 -c conda-forge --yes --file requirements.txt
    - python -m pip install -e .
    - pytest .

GitLab CI provides a simple way of adding a local cache to your CI. You have a bunch of options for sharing this cache only between stages, branches, and so on. [1] My project is not a big one, so I decided to use a single shared cache for the unit-test job.

unit_tests:
  stage: unit-tests
  image:
    name: continuumio/miniconda3:latest
  cache:
    key: unit-test-cache
    paths:
      - /opt/conda/pkgs
  script:
    - conda install python=3.7 -c conda-forge --yes --file requirements.txt
    - python -m pip install -e .
    - pytest .

The problem is, it doesn't work. At the moment, GitLab CI does not support caching paths outside of the working directory. [2] But don't worry, there is a workaround. Conda supports changing the directory where it stores its cache, either through its config or via the CONDA_PKGS_DIRS environment variable. [3] The latter is easier to employ: just set a job-wide variable.
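Outside of CI, this override is easy to sanity-check locally. A minimal sketch (the directory name here is illustrative, and the `conda info` call is only shown as a comment in case conda is not installed):

```shell
# Redirect conda's package cache via an environment variable.
export CONDA_PKGS_DIRS="$PWD/.conda-pkgs-cache"
mkdir -p "$CONDA_PKGS_DIRS"

# With conda installed, the new location would now be listed first
# under "package cache" in the output of:
#   conda info
echo "$CONDA_PKGS_DIRS"
```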

unit_tests:
  stage: unit-tests
  image:
    name: continuumio/miniconda3:latest
  variables:
    CONDA_PKGS_DIRS: "$CI_PROJECT_DIR/.conda-pkgs-cache/"
  cache:
    key: unit-test-cache
    paths:
      - $CONDA_PKGS_DIRS
  script:
    - conda install python=3.7 -c conda-forge --yes --file requirements.txt
    - python -m pip install -e .
    - pytest .

Now we face a different problem. Since CI copies the contents of the repository into the working directory, and our cache now lives in the working directory too, pytest invoked this way will also crawl the cache directory for tests. That would end badly, so let's tell pytest to ignore that directory.

unit_tests:
  stage: unit-tests
  image:
    name: continuumio/miniconda3:latest
  variables:
    CONDA_PKGS_DIRS: "$CI_PROJECT_DIR/.conda-pkgs-cache/"
  cache:
    key: unit-test-cache
    paths:
      - $CONDA_PKGS_DIRS
  script:
    - conda install python=3.7 -c conda-forge --yes --file requirements.txt
    - python -m pip install -e .
    - pytest --ignore=$CONDA_PKGS_DIRS .

Now your CI runner will no longer download all the packages on every run.

But wait, why does the job log say that 50k files were cached? That's because conda's pkgs directory contains not only the package archives but also their extracted versions, presumably to save disk space when packages are shared between environments. I decided to limit my cache to the archives themselves, plus a few additional files without which conda did not recognize those archives.
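To preview which files such glob patterns would keep, here is a quick sketch against a dummy directory that mimics the pkgs layout (all names are made up for illustration):

```shell
# Build a dummy pkgs-style directory: one extracted package directory,
# two archives, and the metadata files conda needs.
mkdir -p demo-pkgs/numpy-1.18.1-py37h4f9e942_0 demo-pkgs/cache
touch demo-pkgs/numpy-1.18.1-py37h4f9e942_0.conda \
      demo-pkgs/openssl-1.1.1g-h7b6447c_0.tar.bz2 \
      demo-pkgs/urls demo-pkgs/urls.txt

# The same globs as in the cache config below: archives and metadata
# match, the extracted package directory does not.
(cd demo-pkgs && ls -d *.conda *.tar.bz2 urls* cache)
```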

unit_tests:
  stage: unit-tests
  image:
    name: continuumio/miniconda3:latest
  variables:
    CONDA_PKGS_DIRS: "$CI_PROJECT_DIR/.conda-pkgs-cache/"
  cache:
    key: unit-test-cache
    paths:
      - $CONDA_PKGS_DIRS/*.conda
      - $CONDA_PKGS_DIRS/*.tar.bz2
      - $CONDA_PKGS_DIRS/urls*
      - $CONDA_PKGS_DIRS/cache
  script:
    - conda install python=3.7 -c conda-forge --yes --file requirements.txt
    - python -m pip install -e .
    - pytest --ignore=$CONDA_PKGS_DIRS .

This, in my case, limited the number of cached files to ~100. If you want to install dependencies through both pip and conda, and cache both of them as well, there is no problem with that either. Pip's cache location can be overridden in the same way, through the PIP_CACHE_DIR environment variable, so we can move it into the working directory too.
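If you want to double-check that pip honours the variable, recent versions of pip (>= 20.1) can print the cache location currently in effect:

```shell
# pip picks up PIP_CACHE_DIR from the environment; `pip cache dir`
# (available since pip 20.1) prints the cache location in effect.
export PIP_CACHE_DIR="$PWD/.cache/pip"
python -m pip cache dir
```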

unit_tests:
  stage: unit-tests
  image:
    name: continuumio/miniconda3:latest
  variables:
    CONDA_PKGS_DIRS: "$CI_PROJECT_DIR/.conda-pkgs-cache/"
    PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
  cache:
    key: unit-test-cache
    paths:
      - $CONDA_PKGS_DIRS/*.conda
      - $CONDA_PKGS_DIRS/*.tar.bz2
      - $CONDA_PKGS_DIRS/urls*
      - $CONDA_PKGS_DIRS/cache
      - $PIP_CACHE_DIR
  script:
    - conda install python=3.7 -c conda-forge --yes --file requirements-conda.txt
    - python -m pip install -r requirements-pip.txt
    - python -m pip install -e .
    - pytest --ignore=$CONDA_PKGS_DIRS .

Now we are caching everything that can be cached for this CI job. How much time did we save by not downloading all those packages on every run? Well, in my case, the runtime of the job actually increased.

In a scientific spirit, I ran a little experiment covering all the caching cases described above. I prepared a CI file running the same job 5 times, each in a separate stage, always on the same local runner with only one job running at a time. The jobs were run with:

1. no caching,
2. caching the whole conda pkgs dir,
3. caching the whole conda pkgs dir and the whole pip cache dir,
4. caching only selected conda files and the whole pip cache.

Between each series of runs, the runner's cache was cleared. Because of that, the first runs were longer, since packages had to be downloaded first and then pushed to the cache at the end. The results are as follows:

Sample   No Cache   Conda   Conda + Pip   Conda-tar + Pip
1        242s       330s    322s          280s
2        245s       294s    292s          270s
3        242s       294s    296s          263s
4        245s       297s    297s          259s
5        245s       298s    301s          263s
Mean     244s       303s    302s          267s

Why did that happen? My guess is that it all comes down to the cost of IO operations and cache verification at both the pull and push stages. In the end, I did not merge that MR. I did think about the possible energy cost of sending those few megabytes over the network every time, but I doubt that traffic outweighs the cost of the extra IO that caching introduces.

[1] https://docs.gitlab.com/ee/ci/caching/
[2] https://gitlab.com/gitlab-org/gitlab/-/issues/14151
[3] https://conda.io/projects/conda/en/latest/user-guide/configuration/use-condarc.html#specify-package-directories-pkgs-dirs