Configuration Files
Two files specify a TXPipe pipeline: a pipeline file and a configuration file.
Examples of both can be found in the examples folder. You can run a pipeline with the ceci command:
ceci examples/metacal/pipeline.yml
Pipeline files
The pipeline file defines what pipeline stages are to be run, how they should be run, and where overall inputs and outputs should be. It has these options:
Stages list
The stages list describes which stages to run, e.g.:
stages:
    - name: TXSourceSelector
    ...
    - name: TXTwoPoint
      nodes: 1
      nprocess: 4
      threads_per_process: 2
    ...
Each name entry indicates a pipeline stage class to be run.
We can optionally specify parallelization information if the stage can make use of multiple processors. The available options are:
threads_per_process
    The number of OpenMP threads to use.
nprocess
    The number of MPI (or Dask) processes to use.
nodes
    If running on a cluster or supercomputer, the number of nodes to use.
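As a rough illustration of how these options combine, the sketch below estimates how many cores a stage asks for, assuming each process runs its OpenMP threads on separate cores. The helper cores_requested is hypothetical and is not part of ceci or TXPipe; the default of 1 for missing options is also an assumption.

```python
def cores_requested(stage):
    """Estimate the total cores used by one entry of the stages list.

    Assumes one core per OpenMP thread, and (for this example only)
    that options left out of the entry behave as if set to 1.
    """
    nprocess = stage.get("nprocess", 1)
    threads = stage.get("threads_per_process", 1)
    return nprocess * threads


# The TXTwoPoint entry from the stages list above, written as a dict.
two_point = {
    "name": "TXTwoPoint",
    "nodes": 1,
    "nprocess": 4,
    "threads_per_process": 2,
}

print(cores_requested(two_point))  # 8
```

Under these assumptions, the TXTwoPoint entry asks for 4 processes of 2 threads each, so 8 cores on a single node.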
Path information
The config parameter points to the configuration file to be used (see below).
The inputs dictionary supplies paths for files that are not generated by the pipeline itself, but are instead overall inputs to it. It maps the tag used in class inputs to paths.
The log_dir path is a directory where logs for individual stages are saved.
The pipeline_log path is a file for the top-level pipeline output.
Python information
The modules string should contain a space-separated list of packages that should be imported to search for pipeline stages. Here, that usually just includes TXPipe:
modules: txpipe
The python_paths list can contain a set of paths to add to the PYTHONPATH variable so that stages can import modules from them. In TXPipe these are mostly included in the submodules directory, for example:
python_paths:
- submodules/WLMassMap/python/desc/
- submodules/TJPCov
Restarting pipelines
The resume parameter can be used to resume partially completed pipelines which crashed or stopped part way through. It can be set to true or false.
If resume is set to true then the system will skip any stages whose outputs already exist. If it is set to false then it will always start the pipeline from the beginning.
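The behaviour described above can be sketched in a few lines. This is only an illustration of the idea, not ceci's actual implementation: the function stages_to_run, the (name, outputs) layout, and the file names are all hypothetical.

```python
import os
import tempfile


def stages_to_run(stages, resume):
    """Return the names of the stages that still need to run.

    `stages` is a list of (name, output_paths) pairs. This layout is
    an assumption made for the example, not ceci's internal format.
    """
    if not resume:
        # resume: false always starts from the beginning.
        return [name for name, _ in stages]
    # resume: true skips a stage only if all of its outputs exist.
    return [
        name
        for name, outputs in stages
        if not all(os.path.exists(path) for path in outputs)
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Pretend the first stage finished and the second did not.
    finished = os.path.join(tmp, "selector_output.hdf5")
    open(finished, "w").close()
    pending = os.path.join(tmp, "twopoint_output.sacc")

    stages = [("TXSourceSelector", [finished]), ("TXTwoPoint", [pending])]
    print(stages_to_run(stages, resume=True))   # ['TXTwoPoint']
    print(stages_to_run(stages, resume=False))  # ['TXSourceSelector', 'TXTwoPoint']
```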
Site information
Two dictionaries of parameters control the launch of the pipeline itself. One, site, describes the machine on which the pipeline is to be run.
The site parameter requires a name option. This can currently be set to one of:
local
    for personal and other machines
cori
    for NERSC’s cori machine in batch mode (i.e. when submitting with sbatch)
cori-interactive
    for NERSC’s cori machine in interactive mode
All the sites can also accept the options image and volume. Setting image makes ceci run TXPipe using docker (or shifter at NERSC), which is a bit like a virtual machine, and lets us share environments easily. See https://www.docker.com/ for installation instructions. In particular, the image joezuntz/txpipe contains a full TXPipe environment. The volume option sets directories to be available under docker, in the form /path/on/your/machine:/path/on/virtual/machine.
The cori site can also accept these options, with the defaults shown:
mpi_command: srun -un
cpu_type: haswell
queue: debug
max_jobs: 2
account: m1727
walltime: 00:30:00
The local and cori-interactive sites can accept the option max_threads to limit the number of cores used.
Launcher information
The second pipeline dictionary, launcher, controls what workflow engine is used to run it. It too should have a name option, set to one of these:
mini
    for the simple built-in pipeline runner (best in most current cases)
parsl
    for the Parsl workflow system
cwl
    for the Common Workflow Language (https://www.commonwl.org/) launcher (not currently very useful).
The mini pipeline launcher takes the option interval, which is the time in seconds between its checks for completed tasks. Unless you have very fast jobs, stick to the default of 3 seconds.
The cwl pipeline launcher takes these options (with the defaults as shown):
# The folder in which to put CWL files:
dir: ./cwl
# The command used to launch the pipeline. If this is left as the default some
# additional flags are added:
launch: cwltool
The parsl pipeline launcher currently does not take any other options, though this may change in future.
Config files
The configuration file specifies options and parameters for each stage. Options must be defined in the config_options attribute of pipeline classes.
Options in the global section are shared across all stages, but can be overridden by individual stages:
global:
    chunk_rows: 100000
    pixelization: healpix
    nside: 512
Other sections are specific to a single pipeline stage, for example:
TXTwoPoint:
    binslop: 0.1
    delta_gamma: 0.02
    do_pos_pos: True
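To make the override rule concrete, here is a minimal sketch of how the options seen by one stage could be assembled from the two kinds of section. It is not ceci's actual code: the function stage_options is hypothetical, and the nside entry under TXTwoPoint is added purely to show a stage overriding a global value.

```python
def stage_options(config, stage_name):
    """Combine the global section with one stage's own section.

    Stage-specific values win when the same key appears in both.
    """
    merged = dict(config.get("global", {}))
    merged.update(config.get(stage_name, {}))
    return merged


# The configuration file above, written as a Python dict.
config = {
    "global": {"chunk_rows": 100000, "pixelization": "healpix", "nside": 512},
    "TXTwoPoint": {
        "binslop": 0.1,
        "delta_gamma": 0.02,
        "do_pos_pos": True,
        "nside": 1024,  # hypothetical override, not in the original example
    },
}

options = stage_options(config, "TXTwoPoint")
print(options["nside"])       # 1024, the stage value replaces the global one
print(options["chunk_rows"])  # 100000, inherited from the global section
```

A stage with no section of its own, such as TXSourceSelector in this dict, would simply receive the global values.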