Components of a Pipeline

A standard Pipeline consists of a Data Asset followed by a series of Capsules that write results to a Results Bucket.

Just as the Code Ocean Environment Editor keeps a Dockerfile in sync with your changes, the Visual Pipeline Editor keeps a Nextflow file in sync with the Pipeline you build.

Nextflow is a workflow manager that integrates with Docker and other tools to ensure reproducibility independent of the computing platform. Below is an example of part of a Nextflow file for a Pipeline consisting of two Capsules named RSEM and MultiQC:
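The sketch below is a simplified, hypothetical illustration of what such a file can look like; the repository URLs, process names, and the code/run script path are placeholders, and the file the Visual Pipeline Editor actually generates will differ in detail.

```nextflow
// Hypothetical sketch only: repository URLs, process names, and the code/run
// path are placeholders; the generated file will differ in detail.
nextflow.enable.dsl = 2

process capsule_rsem {
    input:
    path input_data

    output:
    path 'results/*'

    script:
    """
    # clone the Capsule's git repository and execute its reproducible run script
    # (the run script is assumed to write its outputs into results/)
    git clone https://git.example.org/capsule-rsem.git capsule
    mkdir -p results
    bash capsule/code/run
    """
}

process capsule_multiqc {
    input:
    path rsem_results

    output:
    path 'results/*'

    script:
    """
    git clone https://git.example.org/capsule-multiqc.git capsule
    mkdir -p results
    bash capsule/code/run
    """
}

workflow {
    // RSEM results flow into MultiQC through a Nextflow channel
    capsule_rsem(Channel.fromPath(params.input_data))
    capsule_multiqc(capsule_rsem.out)
}
```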

The Nextflow file describes the actions executed to run each Capsule. Data are passed between Capsules via Nextflow channels. The git repository corresponding to each Capsule is cloned and its reproducible run script is executed, with each run script running in its own AWS Batch job. Whether all data is processed in a single job or split across separate jobs running in parallel is controlled by Connection Types. The Nextflow file can be unlocked and customized manually, but doing so permanently disables the Pipeline Editor.
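As a hedged illustration of the underlying idea, the difference between running separate parallel jobs and running all data in one job can be expressed with standard Nextflow channel operators. The fragment below reuses the hypothetical capsule_rsem and capsule_multiqc processes from the sketch above; it is not the code Code Ocean generates.

```nextflow
workflow {
    // one job per input file: each channel item becomes its own task
    samples = Channel.fromPath('data/*.fastq.gz')
    capsule_rsem(samples)

    // one job for all data: collect every RSEM result into a single task
    capsule_multiqc(capsule_rsem.out.collect())
}
```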

Running Capsules via git cloning matters because only files that are in a Capsule's core file tree and tracked by git can be referenced and executed in a Pipeline. In addition, all Capsule changes must be committed to git, and a Reproducible Run must have been performed, before the Capsule can be used in a Pipeline.

Each Capsule is a standalone, fully reproducible process that reads data from the /data folder and writes results to the /results folder. When a Capsule runs in a Pipeline, the contents of its own /data folder are ignored. Instead, input data is specified by attaching a Capsule or Data Asset upstream, so that the results of the upstream Capsule are passed to the /data folder of the downstream Capsule. Results from a Capsule are only saved if it is connected to the Results Bucket.
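As a rough analogy in standard Nextflow (not necessarily the mechanism Code Ocean uses for its Results Bucket), only outputs that are explicitly published end up in a final results location; everything else remains an intermediate file. The process below reuses the placeholder names from the earlier sketch, and params.results_dir is an assumed parameter.

```nextflow
process capsule_multiqc {
    // copy this Capsule's outputs to a final results location; outputs of
    // unpublished Capsules stay in Nextflow's intermediate work directory
    publishDir params.results_dir, mode: 'copy'

    input:
    path rsem_results

    output:
    path 'results/*'

    script:
    """
    git clone https://git.example.org/capsule-multiqc.git capsule
    mkdir -p results
    bash capsule/code/run
    """
}
```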

This section covers the main components of a Pipeline.