How-to-Guide: Bioinformatics Tools

This guide extends the Getting Started Guide and demonstrates a bioinformatics workflow in Code Ocean. Two capsules will be created to run popular bioinformatics tools: FastQC and Cutadapt. A pipeline is then built from those two capsules as a minimal working end-to-end analysis for single-end reads. The workflow assesses the initial quality with FastQC, removes adapter sequences with Cutadapt, and then reassesses quality with FastQC.

Step 1: Creating Capsules

Both FastQC and Cutadapt are open-source tools. They can be installed in individual capsules using the Environment Editor or a postInstall script.

  1. Install FastQC using the postInstall script

  2. Install Cutadapt using the Environment Editor

  3. Ensure that you have the correct fastq files. For this example pipeline, 4 single-end reads have been compiled. Datasets can be compiled via the Dataset dashboard or within a capsule; for additional information, see the Compute Capsule Guide.
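The postInstall step above might be sketched as follows; the package-manager route and package name are assumptions and depend on the capsule's base image:

```shell
#!/usr/bin/env bash
set -ex

# Hypothetical postInstall sketch: install FastQC from the base image's
# package manager. Adjust the package source for your capsule's base image.
apt-get update
apt-get install -y --no-install-recommends fastqc
```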

Now the commands for each capsule in the pipeline can be set up.

Capsule scripts should be written to accept any file as input and should not rely on a specific file naming convention.

Example:

If the capsule is set up to receive an input file named read1.fastq, jobs with read2.fastq or read3.fastq will return a "No such file or directory" error. For more information, see the Bash Scripting for Pipeline Capsules example.
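The difference can be sketched as follows (the file names here are hypothetical):

```shell
#!/usr/bin/env bash

# Brittle: only works when the input happens to be named read1.fastq.
#   fastqc --outdir ../results ../data/read1.fastq

# Robust: process whatever .fastq files the data directory contains,
# regardless of how they are named.
for file in $(find -L ../data -name "*.fastq" 2>/dev/null); do
    echo "Found input: $(basename "$file")"
done
```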

To ensure data with any fastq naming convention can be used, these capsules will use a Bash file search. The script will look for any file in the data directory with a specific suffix and will run the tool on each file found. This occurs in a for loop. For both FastQC and Cutadapt, the script should look for any present fastq files.

In each capsule:

  1. Click New File in the dropdown menu of the code folder to create a shell script.

  2. Enter the shell script for FastQC:

fastqc.sh
#!/usr/bin/env bash

for file in $(find -L ../data -name "*.fastq"); do
    echo "Running FastQC on $(basename "$file")"
    fastqc --outdir ../results "$file" &
done
wait

  3. Enter the shell script for Cutadapt:

cutadapt.sh
#!/usr/bin/env bash

for file in $(find -L ../data -name "*.fastq"); do
    filename=$(basename "$file")
    echo "Running cutadapt on $filename"
    cutadapt -o "../results/${filename%.fastq}_trimmed.fastq" "$file"
done

  4. Attach the single-end reads dataset and set the shell scripts as the file to run in both capsules.

  5. Execute a Reproducible Run and confirm that the expected output is produced.

  6. Commit the successful Reproducible Run changes in the capsule timeline.

The bioinformatics capsules are complete and the pipeline can be created.

Step 2: Creating the Pipeline

  1. Find the FastQC and Cutadapt capsules from the Add Capsules menu.

Now that the capsules are configured, the data must be attached.

  4. Click Manage Datasets from the data folder and select the single-end reads dataset.

Before running the pipeline, the capsule-to-Results-bucket connections need to be configured. This ensures that each capsule's results are saved to separate folders and do not overwrite each other.

  6. Open Map Paths for the FastQC to Results bucket connection on the path without Cutadapt.

  7. Change the Destination path from pipeline/results to pipeline/results/unadapted_QC/

  8. Toggle Generate indexed folders to On. The results from read1.fastq will be in folder 1, read2.fastq in folder 2, and likewise for reads 3 and 4.

  9. Open Map Paths for the FastQC to Results bucket connection on the path with Cutadapt.

  10. Change the Destination path from pipeline/results to pipeline/results/adapted_QC/

  11. Toggle Generate indexed folders to On.

  12. Click Reproducible Run. See the results below:

There are two results folders, unadapted_QC and adapted_QC.

Within each folder, the subfolders are 1, 2, 3, and 4; these indexed folders cannot currently be renamed. Each contains the results for an individual read from the dataset. Compare the results to see the effect of Cutadapt on the quality of the single-end reads.
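Once the pipeline results are downloaded or mounted, the before/after FastQC reports could be located with a listing such as the following sketch (the local path is an assumption; the unadapted_QC/adapted_QC folder names follow the destination mappings configured earlier):

```shell
#!/usr/bin/env bash

# List the FastQC HTML reports produced before and after adapter trimming.
for stage in unadapted_QC adapted_QC; do
    echo "== $stage =="
    find "../results/$stage" -name "*_fastqc.html" 2>/dev/null | sort
done
```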

Once familiar with how to construct and run this pipeline, try it with paired-end reads. Organize the data a bit differently by placing each pair in its own subfolder. You can continue to build on this example to create a full end-to-end bioinformatics analysis.
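A paired-end variant of the Cutadapt script might look like the sketch below; the `_R1`/`_R2` naming convention and per-pair subfolders are assumptions about how the data is organized:

```shell
#!/usr/bin/env bash

# Hypothetical paired-end sketch: each pair lives in its own subfolder,
# e.g. ../data/sample1/sample1_R1.fastq and ../data/sample1/sample1_R2.fastq.
for r1 in $(find -L ../data -name "*_R1.fastq" 2>/dev/null); do
    r2="${r1%_R1.fastq}_R2.fastq"          # matching mate file
    name=$(basename "${r1%_R1.fastq}")     # shared sample prefix
    echo "Running cutadapt on pair $name"
    cutadapt -o "../results/${name}_R1_trimmed.fastq" \
             -p "../results/${name}_R2_trimmed.fastq" \
             "$r1" "$r2"
done
```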
