Configuration Files
===================

Two files specify a TXPipe pipeline: a *pipeline file* and a *configuration file*. Examples of both can be found in the ``examples`` folder.

You can run a pipeline with the ``ceci`` command:

.. code-block:: bash

    ceci examples/metacal/pipeline.yml

Pipeline files
--------------

The pipeline file defines which pipeline stages are to be run, how they should be run, and where the overall inputs and outputs should be. It has these options:

Stages list
^^^^^^^^^^^

The ``stages`` list describes which stages to run, e.g.:

.. code-block:: yaml

    stages:
        - name: TXSourceSelector
        ...
        - name: TXTwoPoint
          nodes: 1
          nprocess: 4
          threads_per_process: 2
        ...

Each entry names a pipeline class to be run. We can optionally specify parallelization information if the stage can make use of multiple processors. The available options are:

#. ``threads_per_process`` The number of OpenMP threads to use.

#. ``nprocess`` The number of MPI (or Dask) processes to use.

#. ``nodes`` If running on a cluster or supercomputer, the number of nodes to use.

Path information
^^^^^^^^^^^^^^^^

The ``config`` parameter points to the configuration file to be used (see below).

The ``inputs`` dictionary supplies paths for files that are not generated by the pipeline itself, but are instead overall inputs to it. It maps the tags used in class inputs to paths.

The ``log_dir`` path is a directory where logs for individual stages are saved.

The ``pipeline_log`` path is a file for the top-level pipeline output.

Python information
^^^^^^^^^^^^^^^^^^

The ``modules`` string should contain a space-separated list of packages that should be imported to search for pipeline stages. Here, that usually just includes TXPipe:

.. code-block:: yaml

    modules: txpipe

The ``python_paths`` list can contain a set of paths to add to the ``PYTHONPATH`` variable so that stages can import modules from them. In TXPipe these are mostly included in the ``submodules`` directory, for example:

.. code-block:: yaml

    python_paths:
        - submodules/WLMassMap/python/desc/
        - submodules/TJPCov

Restarting pipelines
^^^^^^^^^^^^^^^^^^^^

The ``resume`` parameter can be used to resume partially completed pipelines that crashed or stopped part way through. It can be set to ``true`` or ``false``.

If ``resume`` is set to ``true`` then the system will skip any stages whose outputs already exist. If it is set to ``false`` then it will always start the pipeline from the beginning.

Site information
^^^^^^^^^^^^^^^^

Two dictionaries of parameters control the launch of the pipeline itself. The first, ``site``, describes the machine on which the pipeline is to be run.

The ``site`` parameter requires an option ``name``. This can currently be set to one of:

#. ``local`` for personal and other machines

#. ``nersc`` for NERSC machines in batch mode (i.e. when submitting with sbatch)

#. ``nersc-interactive`` for NERSC machines in interactive mode

All the sites can also accept the options ``image`` and ``volume``. Setting ``image`` makes ceci run TXPipe using Docker (or Shifter at NERSC), which is a bit like a virtual machine and lets us share environments easily. See https://www.docker.com/ for installation instructions. In particular, the image ``joezuntz/txpipe`` contains a full TXPipe environment.

The ``volume`` option sets directories to be available under Docker, in the form ``/path/on/your/machine:/path/on/virtual/machine``.

The ``nersc`` site can also accept these options, with the defaults shown:

.. code-block:: yaml

    mpi_command: srun -un
    queue: debug
    max_jobs: 2
    account: m1727
    walltime: 00:30:00

The ``local`` and ``nersc-interactive`` sites can accept the option ``max_threads`` to limit the number of cores used.

Launcher information
^^^^^^^^^^^^^^^^^^^^

The second pipeline dictionary, ``launcher``, controls which workflow engine is used to run the pipeline. It too should have a ``name`` option, set to one of these:

#. ``mini`` for the simple built-in pipeline runner (best in most current cases)

#. ``parsl`` for the Parsl workflow system

#. ``cwl`` for the Common Workflow Language launcher (not currently very useful).

The ``mini`` pipeline launcher takes the option ``interval``, which is the time between its checks for completed tasks. Unless you have very fast jobs, stick to the default of 3 seconds.

The ``cwl`` pipeline launcher takes these options (with the defaults as shown):

.. code-block:: yaml

    # The folder in which to put CWL files:
    dir: ./cwl
    # The command used to launch the pipeline. If this is left as the default some
    # additional flags are added:
    launch: cwltool

The ``parsl`` pipeline launcher currently does not take any other options, though this may change in future.

Config files
------------

The *configuration file* specifies options and parameters for each stage. Options must be defined in the ``config_options`` attribute of pipeline classes.

Options in the ``global`` section are shared across all stages, but can be overridden by individual stages:

.. code-block:: yaml

    global:
        chunk_rows: 100000
        pixelization: healpix
        nside: 512

Other sections are specific to a single pipeline stage, for example:

.. code-block:: yaml

    TXTwoPoint:
        binslop: 0.1
        delta_gamma: 0.02
        do_pos_pos: True
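
The way stage sections override the ``global`` section can be illustrated with a small sketch. This is not ceci's actual implementation: the helper ``resolve_stage_options`` is a hypothetical name, the configuration is written as a plain dictionary rather than loaded from YAML, and the ``chunk_rows`` entry under ``TXTwoPoint`` is an assumed override added only to show the precedence.

.. code-block:: python

    def resolve_stage_options(config, stage_name):
        """Return the options one stage would see (hypothetical helper).

        Global options are copied first, then any options in the stage's
        own section replace them.
        """
        options = dict(config.get("global", {}))
        options.update(config.get(stage_name, {}))
        return options


    # The same structure as the YAML sections above, as a Python dictionary.
    config = {
        "global": {"chunk_rows": 100000, "pixelization": "healpix", "nside": 512},
        "TXTwoPoint": {
            "binslop": 0.1,
            "delta_gamma": 0.02,
            "do_pos_pos": True,
            "chunk_rows": 50000,  # assumed override, for illustration only
        },
    }

    two_point = resolve_stage_options(config, "TXTwoPoint")
    selector = resolve_stage_options(config, "TXSourceSelector")

    print(two_point["chunk_rows"])  # 50000: the stage section wins
    print(selector["chunk_rows"])   # 100000: falls back to the global value

A stage with no section of its own, such as ``TXSourceSelector`` here, simply sees the ``global`` values.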