Bash Scripting for Pipeline Capsules

When embedded in a pipeline, a capsule may receive inputs that differ from the contents of the capsule’s data folder. Two approaches let you pass different inputs to a capsule without changing the capsule itself:

  • Mapping the source path to the destination path

  • Using arguments to replace App Panel arguments

If the capsule will be used in many pipelines, or if many different datasets will be run through the same pipeline, you may want to use Bash commands to make the capsule work regardless of the input data's naming convention.

The example below demonstrates how Bash commands can make your capsule easier to reuse in a pipeline.

Example - FastQC Capsule

FastQC is a command line tool for quality control of sequencing data. The general syntax is:

fastqc seqfile1 seqfile2 .. seqfileN

In a Code Ocean capsule, you must specify the results folder as the output directory, and you may choose to hardcode filenames, for example:

fastqc --outdir ../results read1.fastq read2.fastq

The hardcoded file names may cause problems when the capsule is embedded in a pipeline whose input data uses different files or naming conventions. Although the Map Paths menu can be used as a workaround (see Map Paths example), a more robust solution is to use Bash commands to search for suitable files in the data directory, for example:

fastqc --outdir ../results $(find -L ../data -name "*.fastq*")
  • The command substitution $( ... ) runs find, which returns every file in any directory within the data folder whose name contains .fastq (for example sample.fastq or sample.fastq.gz). This ensures that once the capsule is added to a pipeline, any FASTQ dataset with any naming convention can be used without needing to make changes to the FastQC capsule.

  • The -L flag ensures the find command follows symbolic links. This is necessary when using data assets with the Global toggle on because symbolic links will be used to pass data to the destination capsule.
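
The command substitution above splits the output of find on whitespace, so a path containing a space would reach FastQC as two separate arguments. One possible alternative, shown here as a minimal sketch, is to let find build the command line itself with -exec ... {} +. The function name run_on_fastq is hypothetical and is not part of Code Ocean or FastQC.

```shell
# run_on_fastq DIR CMD [ARGS...]
# Hypothetical helper: runs CMD once, with every FASTQ file found under DIR
# appended as arguments. find passes each path as a single argument, so
# file names containing spaces are handled correctly.
run_on_fastq() {
    dir=$1
    shift
    # -L follows symbolic links; -type f skips directories
    find -L "$dir" -type f -name "*.fastq*" -exec "$@" {} +
}

# Intended use in the capsule (same paths as the example above):
# run_on_fastq ../data fastqc --outdir ../results
```

With this form, find does not run the command at all when no file matches, so FastQC is never started with an empty file list.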

Additional Bash commands can make the capsule faster and easier to debug in a pipeline:

for file in $(find -L ../data -name "*.fastq*"); do
    echo "Running FastQC on $(basename "$file")"
    fastqc --outdir ../results "$file" &
done
wait
  • Adding an echo command to indicate which step is running makes your code easier to debug when something goes wrong.

  • Adding & to the end of a command tells Bash to execute the command asynchronously, meaning the next FastQC command in the loop will begin without waiting for the previous one to finish. This saves time when computations are independent of each other.

  • The wait command ensures all running background processes are complete before proceeding to the next step. Use it together with the & operator when the next step in the code, or the next capsule in the pipeline, requires all the outputs from the loop.
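
A plain wait with no arguments returns success even if one of the background jobs failed, so a failed FastQC run can go unnoticed by the next step. The sketch below shows one way to detect this: it records the process ID of each background job and waits on each one individually. The function names process_file and run_all are hypothetical, and the body of process_file simply repeats the FastQC call from the loop above.

```shell
# Hypothetical helper: the command to run on a single file.
process_file() {
    fastqc --outdir ../results "$1"
}

# run_all FILE...
# Starts process_file on every FILE in the background, then waits for each
# job and returns non-zero if any of them failed.
run_all() {
    pids=""
    for file in "$@"; do
        echo "Running FastQC on $(basename "$file")"
        process_file "$file" &
        pids="$pids $!"          # $! is the PID of the job just started
    done
    failed=0
    for pid in $pids; do
        # wait PID returns the exit status of that specific job
        wait "$pid" || failed=1
    done
    return "$failed"
}

# Intended use in the capsule (splits on whitespace, like the loop above):
# run_all $(find -L ../data -type f -name "*.fastq*") || exit 1
```

Exiting with a non-zero status when any job fails lets the run stop instead of continuing with incomplete outputs.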
