RNASeq Quantification Pipeline

This RNASeq Pipeline aligns the reads (single or paired), sorts and indexes the alignment (.bam), counts features and conducts a differential gene expression analysis.

This guide assumes that the reads are aligned to hg38, but any reference can be used.

The pipeline uses the following Apps Library Capsules:

STAR Alignment
Sambamba Sort & Index
FeatureCounts
DESeq2

Creating Prerequisite Data Assets

Code Ocean has supplied the datasets needed to run the Pipeline on the codeocean-public-data S3 bucket.

To create datasets from S3:

Navigate to the datasets tab
Click New Data
Select AWS S3.

Example Sequencing Reads

Create an example dataset containing reads from different samples. All reads for a given sample must be in their own folder. The pipeline passes each folder to the downstream alignment step, so the samples can be processed in parallel.

Bucket Name - codeocean-public-data
Path - example_datasets/Normox
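
As a rough illustration of the layout described above, the following Python sketch builds a throwaway dataset with one folder per sample and lists the reads found in each folder. The sample names, file names, and the `reads_by_sample` helper are hypothetical; they are not part of Code Ocean or of the example dataset.

```python
import tempfile
from pathlib import Path

# File endings treated as sequencing reads in this sketch (an assumption).
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def reads_by_sample(dataset_root):
    """Map each sample folder name to the sorted read file names inside it."""
    samples = {}
    for folder in sorted(Path(dataset_root).iterdir()):
        if not folder.is_dir():
            continue  # only folders count as samples in this layout
        samples[folder.name] = sorted(
            f.name for f in folder.iterdir() if f.name.endswith(FASTQ_SUFFIXES)
        )
    return samples


# Build a tiny hypothetical paired-end dataset to show the expected layout.
with tempfile.TemporaryDirectory() as root:
    for sample in ("sampleA", "sampleB"):
        folder = Path(root) / sample
        folder.mkdir()
        for mate in ("R1", "R2"):
            (folder / f"{sample}_{mate}.fastq.gz").touch()
    layout = reads_by_sample(root)

for sample, files in layout.items():
    print(sample, files)
```

Each key of `layout` corresponds to one folder, which is the unit the pipeline hands to the alignment step.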

hg38 Annotation

Create a dataset containing the gene annotations for hg38 from GENCODE.

Bucket Name - codeocean-public-data
Path - genomes/hg38_Annotation

hg38 STAR Index

Create a dataset containing the hg38 STAR reference.

Bucket Name - codeocean-public-data
Path - example_datasets/STAR_GRCh38_GENCODE_Release_21_Index/star_index/

Creating your own index

To create a reference for a different genome, open the STAR Generate Genome Index Capsule in the Apps Library and follow its README to build a compatible STAR index.

Create a Pipeline

From your dashboard:

Select Pipelines
Click on Create New

Attach Data Assets

Click Manage Data Assets

Attach the STAR Index, Annotation and Reads Data Assets created above to the Pipeline.

Design Matrix

The design matrix specifies metadata associated with the samples, e.g. tumor vs. normal, tissue type, etc. To create the design matrix:

Create a Folder named DesignMatrix
Create metadata.csv with the following 3 columns:
- Run
- Condition
- Batch

Run must match the prefix of the .bam file output for that sample. Condition and Batch hold the metadata that DESeq2 should take into account when differentiating the samples.
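
The design matrix described above could be produced with a short script instead of by hand. The sketch below writes a metadata.csv with the three columns and checks that every Run value is a prefix of some .bam file name. The sample names, the condition and batch values, the .bam file names, and both helper functions are hypothetical placeholders.

```python
import csv
import tempfile
from pathlib import Path

COLUMNS = ["Run", "Condition", "Batch"]

# Hypothetical samples; replace with your own run names and metadata.
rows = [
    {"Run": "sampleA", "Condition": "control", "Batch": "1"},
    {"Run": "sampleB", "Condition": "treated", "Batch": "1"},
]


def write_metadata(path, rows):
    """Write the design matrix with a header row followed by one row per sample."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


def unmatched_runs(rows, bam_names):
    """Return the Run values that are not a prefix of any .bam file name."""
    return [
        row["Run"]
        for row in rows
        if not any(name.startswith(row["Run"]) for name in bam_names)
    ]


with tempfile.TemporaryDirectory() as tmp:
    design_dir = Path(tmp) / "DesignMatrix"
    design_dir.mkdir()
    write_metadata(design_dir / "metadata.csv", rows)
    header = (design_dir / "metadata.csv").read_text().splitlines()[0]

# Hypothetical .bam names; the real names depend on your samples.
bam_names = ["sampleA.sorted.bam", "sampleB.sorted.bam"]
missing = unmatched_runs(rows, bam_names)
print(header)
print("Runs without a matching .bam:", missing)
```

An empty `missing` list means every row of the design matrix can be paired with an alignment file.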

Assemble Pipeline

To assemble the pipeline:

Open Code Ocean Apps
Enter STAR
Drag and drop STAR Alignment onto the UI

Repeat this process for each subsequent Capsule in the Pipeline.

Configure Connections

In order to configure the connection for each step:

Click Settings
Select Default to attach the Reads Dataset to STAR Alignment
Select Collect to attach the STAR Index to STAR Alignment
Select Default to connect STAR Alignment to Sambamba Sort & Index
Select Collect to connect Sambamba Sort & Index to FeatureCounts
Select Default to attach the Annotation Dataset to FeatureCounts
Select Default to connect FeatureCounts to DESeq2. Set the destination to Counts_data

Create the DesignMatrix folder containing metadata.csv (see Design Matrix above) and attach it to DESeq2. Select Default and set the destination to Counts_data.
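
To keep track of the connections listed above, it can help to write them down as a small graph and confirm that the steps resolve into a sensible order. The sketch below is only a bookkeeping aid in plain Python; it does not call any Code Ocean API, and the tuple layout and the `execution_order` helper are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# (source, destination, connection type) as listed in the steps above.
CONNECTIONS = [
    ("Reads Dataset", "STAR Alignment", "Default"),
    ("STAR Index", "STAR Alignment", "Collect"),
    ("STAR Alignment", "Sambamba Sort & Index", "Default"),
    ("Sambamba Sort & Index", "FeatureCounts", "Collect"),
    ("Annotation Dataset", "FeatureCounts", "Default"),
    ("FeatureCounts", "DESeq2", "Default"),
    ("DesignMatrix", "DESeq2", "Default"),
]


def execution_order(connections):
    """Return the nodes ordered so that every source precedes its destination."""
    sorter = TopologicalSorter()
    for source, destination, _kind in connections:
        sorter.add(destination, source)  # destination depends on source
    return list(sorter.static_order())


order = execution_order(CONNECTIONS)
print(" -> ".join(order))
```

If a connection were accidentally wired in a loop, `static_order` would raise `graphlib.CycleError`, which makes this a cheap sanity check before assembling the pipeline in the UI.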

Completed Pipeline

Click in the Pipeline UI to show the connections.

Configure App Panel

To create the app:

Click Create App

Click Create App and Finish

Parameters

Refer to the READMEs of the Capsules used to learn more about their parameters. Adjust the values in the yellow-highlighted fields on the App Panel.

Configure the App Panel as follows:

DESeq2

Note that the “Design formula” is based on the columns supplied in the design matrix.
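
Because the design formula refers to design matrix columns by name, a quick check can catch a misspelled column before the pipeline runs. The sketch below assumes an R-style formula such as `~ Batch + Condition`, which is the usual DESeq2 convention; the exact text the App Panel field expects is an assumption, and `formula_terms` and `missing_columns` are hypothetical helpers.

```python
import re


def formula_terms(formula):
    """Extract variable names from an R-style formula such as '~ Batch + Condition'."""
    if not formula.strip().startswith("~"):
        raise ValueError("a design formula is expected to start with '~'")
    return re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", formula)


def missing_columns(formula, columns):
    """Return the formula terms that are not columns of the design matrix."""
    return [term for term in formula_terms(formula) if term not in columns]


columns = ["Run", "Condition", "Batch"]
print(missing_columns("~ Batch + Condition", columns))  # []
print(missing_columns("~ Batch + Treatment", columns))  # ['Treatment']
```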

FeatureCounts

Sambamba Sort & Index

STAR Alignment

Results

Nextflow
- Logs describing the actions taken by Nextflow.
DESeq2_results.csv
- CSV file with the results table
MA_plot.png
- An MA plot displays the log ratio (M) against the average (A) to visualize the differences between two groups. In general, we expect the expression of most genes to remain consistent between conditions, so the plot should resemble the shape of a trumpet, with most points lying along the line y = 0.
PCA.png
- Visualizes how the samples group by treatment.
volcano_plot.png
- The volcano plot simultaneously captures the effect size and the significance of each tested gene.
plots_by_gene
- A folder containing one file per gene. Each file plots the normalized counts for that gene to show how it behaves across the sample cohort.
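
To make the plot axes described above concrete, the sketch below computes the coordinates of a single gene on an MA plot and on a volcano plot. It uses the textbook definitions (M as a log2 ratio, A as the average of the two log2 values, and significance as -log10 of the p-value); the DESeq2 Capsule may scale its axes differently, for example by plotting the mean of normalized counts on the x-axis of the MA plot. All numbers are made up for illustration.

```python
import math


def ma_point(mean_a, mean_b):
    """Return (M, A): the log2 ratio of two group means and their average log2 value."""
    log_a, log_b = math.log2(mean_a), math.log2(mean_b)
    return log_b - log_a, 0.5 * (log_a + log_b)


def volcano_point(log2_fold_change, p_value):
    """Return (effect size, significance) with significance as -log10(p)."""
    return log2_fold_change, -math.log10(p_value)


# A gene with the same expression in both groups sits on the line M = 0.
unchanged_m, unchanged_a = ma_point(50.0, 50.0)

# A gene expressed four times higher in the second group has M close to 2.
changed_m, changed_a = ma_point(100.0, 400.0)

effect, significance = volcano_point(changed_m, 0.001)
print(f"unchanged gene: M={unchanged_m:.2f}, A={unchanged_a:.2f}")
print(f"changed gene:   M={changed_m:.2f}, A={changed_a:.2f}")
print(f"volcano point:  x={effect:.2f}, y={significance:.2f}")
```

Genes far from M = 0 on the MA plot and high on the y-axis of the volcano plot are the ones most worth inspecting in plots_by_gene.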

