Configuration Files

Two files specify a TXPipe pipeline: a pipeline file and a configuration file.

Examples of both can be found in the examples folder. You can run a pipeline with the ceci command:

ceci examples/metacal/pipeline.yml

Pipeline files

The pipeline file defines which pipeline stages are run, how they are run, and where the overall inputs and outputs are located. It has these options:

Stages list

The stages list describes which stages to run, e.g.:

stages:
- name: TXSourceSelector
...
- name: TXTwoPoint
  nodes: 1
  nprocess: 4
  threads_per_process: 2
...

Each entry in the list names a pipeline class to be run.

We can optionally specify parallelization information if the stage can make use of multiple processors. The available options are:

  1. threads_per_process: The number of OpenMP threads to use per process.

  2. nprocess: The number of MPI (or Dask) processes to use.

  3. nodes: If running on a cluster or supercomputer, the number of nodes to use.

Path information

The config parameter points to the configuration file to be used (see below).

The inputs dictionary supplies paths for files that are not generated by the pipeline itself but are instead overall inputs to it. It maps each tag used in the stage classes' inputs to a file path.

The log_dir path is a directory where logs for individual stages are saved.

The pipeline_log path is a file for the top-level pipeline output.
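As a rough sketch, the path entries described above might look like this in a pipeline file. The four keys come from the text; the shear_catalog tag and every path shown are hypothetical placeholders, not values taken from a real TXPipe example.

```yaml
# Configuration file used by the stages (hypothetical path)
config: examples/metacal/config.yml

# Overall inputs to the pipeline, as tag: path pairs
# (the tag name and the path are both hypothetical)
inputs:
    shear_catalog: data/example/inputs/shear_catalog.hdf5

# Directory where logs for individual stages are saved (hypothetical path)
log_dir: data/example/logs

# File for the top-level pipeline output (hypothetical path)
pipeline_log: data/example/log.txt
```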

Python information

The modules string should contain a space-separated list of packages that should be imported to search for pipeline stages. Here, that usually just includes TXPipe:

modules: txpipe

The python_paths list can contain a set of paths to add to the PYTHONPATH variable so that stages can import modules from them. In TXPipe these are mostly included in the submodules directory, for example:

python_paths:
    - submodules/WLMassMap/python/desc/
    - submodules/TJPCov

Restarting pipelines

The resume parameter can be used to resume partially completed pipelines that crashed or stopped partway through. It can be set to true or false.

If resume is set to true then the system will skip any stages whose outputs already exist. If it is set to false then it will always start the pipeline from the beginning.
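For illustration, a pipeline file that should pick up where a previous run left off would contain a line like the one below. This is only a minimal sketch of the setting described above.

```yaml
# Skip any stage whose outputs already exist;
# set to false to always start from the beginning.
resume: true
```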

Site information

Two dictionaries of parameters control the launch of the pipeline itself. One, site, describes the machine on which the pipeline is to be run.

The site dictionary requires a name option. This can currently be set to one of:

  1. local for personal and other machines

  2. cori for NERSC’s Cori machine in batch mode (i.e. when submitting with sbatch)

  3. cori-interactive for NERSC’s Cori machine in interactive mode

All sites also accept the options image and volume. Setting image makes ceci run TXPipe using Docker (or Shifter at NERSC), which behaves a bit like a virtual machine and makes it easy to share environments. See https://www.docker.com/ for installation instructions. The image joezuntz/txpipe contains a full TXPipe environment. The volume option sets directories to be made available inside the container, in the form /path/on/your/machine:/path/on/virtual/machine.

The cori site can also accept these options, with the defaults shown:

mpi_command: srun -un
cpu_type: haswell
queue: debug
max_jobs: 2
account: m1727
walltime: 00:30:00

The local and cori-interactive sites can accept the option max_threads to limit the number of cores used.
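Putting the site options together, a site section for a run on a personal machine under Docker might look roughly like the sketch below. The option names and the image name are taken from the text above; the max_threads value is an arbitrary illustrative choice, and the volume value is the placeholder form given in the text, to be replaced with real directories.

```yaml
site:
    # One of: local, cori, cori-interactive
    name: local
    # Limit the number of cores used (illustrative value)
    max_threads: 2
    # Run inside this Docker image
    image: joezuntz/txpipe
    # host_path:container_path (placeholder paths; quoted because of the colon)
    volume: "/path/on/your/machine:/path/on/virtual/machine"
```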

Launcher information

The second pipeline dictionary, launcher, controls what workflow engine is used to run it. It too should have a name option, set to one of these:

  1. mini for the simple built-in pipeline runner (best in most current cases)

  2. parsl for the Parsl workflow system

  3. cwl for the Common Workflow Language <https://www.commonwl.org/> launcher (not currently very useful)

The mini launcher takes the option interval, which is the time in seconds between checks for completed tasks. Unless you have very fast jobs, stick to the default of 3 seconds.
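As a small sketch, a launcher section selecting the mini runner could be written as follows. The interval is set explicitly to the default quoted above purely for illustration; it can presumably be left out to get the same behaviour.

```yaml
launcher:
    # One of: mini, parsl, cwl
    name: mini
    # Seconds between checks for completed tasks (the stated default)
    interval: 3
```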

The cwl launcher takes these options (with the defaults as shown):

# The folder in which to put CWL files:
dir: ./cwl

# The command used to launch the pipeline. If this is left as the default,
# some additional flags are added:
launch: cwltool

The parsl launcher currently does not take any other options, though this may change in the future.

Config files

The configuration file specifies options and parameters for each stage. Options must be defined in the config_options attribute of the corresponding pipeline class.

Options in the global section are shared across all stages, but can be overridden by individual stages:

global:
  chunk_rows: 100000
  pixelization: healpix
  nside: 512

Other sections are specific to a single pipeline stage, for example:

TXTwoPoint:
  binslop: 0.1
  delta_gamma: 0.02
  do_pos_pos: True
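To illustrate how a stage can override a global option, the sketch below sets chunk_rows globally and then gives one stage a different value. It assumes that TXSourceSelector declares chunk_rows among its config options, which is not stated above, and the override value is arbitrary.

```yaml
global:
  # Shared by every stage that uses this option
  chunk_rows: 100000

# Assumption: this stage declares a chunk_rows option
TXSourceSelector:
  # Overrides the global value for this stage only
  chunk_rows: 50000
```

With a file like this, stages other than TXSourceSelector would still see the global chunk_rows value.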