RNASeq Quantification Pipeline
This RNASeq Pipeline aligns the reads (single- or paired-end), sorts and indexes the alignments (.bam), counts features, and performs a differential gene expression analysis.
The examples here assume the reads are aligned to hg38, but any reference can be used.
The pipeline uses the following Apps Library Capsules: STAR Alignment, Sambamba Sort & Index, FeatureCounts, and DESeq2.
Creating Prerequisite Data Assets
Code Ocean has supplied the datasets needed to run the Pipeline on the codeocean-public-data S3 bucket.
To create datasets from S3:
Navigate to the datasets tab
Click New Data
Select AWS S3.
Example Sequencing Reads
Create an example dataset containing reads from different samples. All reads for each sample will be in a separate folder. The pipeline will pass each folder to the downstream alignment, allowing each sample to process in parallel.
Bucket Name - codeocean-public-data
Path - example_datasets/Normox
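The folder-per-sample layout described above can be checked with a short script. This is a minimal sketch: the sample folders and file names (`sample_A`, `_R1`/`_R2` suffixes) are hypothetical and only illustrate how reads are grouped by their parent folder.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_reads_by_sample(paths):
    """Group read files by their parent folder (one folder per sample)."""
    samples = defaultdict(list)
    for p in paths:
        path = PurePosixPath(p)
        samples[path.parent.name].append(path.name)
    return {name: sorted(files) for name, files in samples.items()}


# Hypothetical listing of a reads dataset; the real file names may differ.
example_paths = [
    "Normox/sample_A/sample_A_R1.fastq.gz",
    "Normox/sample_A/sample_A_R2.fastq.gz",
    "Normox/sample_B/sample_B_R1.fastq.gz",
    "Normox/sample_B/sample_B_R2.fastq.gz",
]

grouped = group_reads_by_sample(example_paths)
for sample, files in grouped.items():
    print(sample, files)
```

Each key of `grouped` corresponds to one folder that would be passed to the alignment step on its own.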
Create a dataset containing the gene annotations for hg38 from GENCODE.
Bucket Name - codeocean-public-data
Path - genomes/hg38_Annotation
hg38 STAR Index
Create a dataset containing the hg38 STAR reference.
Bucket Name - codeocean-public-data
Path - example_datasets/STAR_GRCh38_GENCODE_Release_21_Index/star_index/
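For reference, the three bucket and path pairs above can be collected in one place and turned into `s3://` URIs. This is only a convenience sketch: the dictionary keys (`reads`, `annotation`, `star_index`) are labels chosen for this example, not names used by Code Ocean.

```python
# Bucket and path values are copied from the steps above.
# The dictionary keys are illustrative labels only.
DATA_ASSETS = {
    "reads": ("codeocean-public-data", "example_datasets/Normox"),
    "annotation": ("codeocean-public-data", "genomes/hg38_Annotation"),
    "star_index": (
        "codeocean-public-data",
        "example_datasets/STAR_GRCh38_GENCODE_Release_21_Index/star_index/",
    ),
}


def s3_uri(bucket, path):
    """Join a bucket name and a path into an s3:// URI."""
    return f"s3://{bucket}/{path.lstrip('/')}"


uris = {name: s3_uri(bucket, path) for name, (bucket, path) in DATA_ASSETS.items()}
for name, uri in uris.items():
    print(f"{name}: {uri}")
```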
Creating your own index
To create a reference for a different genome, open the STAR Generate Genome Index Capsule in the Apps Library and follow its README to build a compatible STAR index.
Create a Pipeline
From your dashboard:
Select Pipelines
Click on Create New
Attach Data Assets
Click Manage Data Assets
Attach the STAR Index, Annotation, and Reads Data Assets created above to the Pipeline.
Design Matrix
The design matrix specifies metadata associated with the samples, e.g. tumor vs. normal or tissue type. To create the design matrix:
Create a folder named DesignMatrix
Create metadata.csv with the following 3 columns: Run, Condition, Batch.
Run must match the prefix of the .bam file output for the sample. Condition and Batch hold the metadata that DESeq2 should take into account when differentiating the samples.
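The design matrix can also be generated with a script. The sketch below writes the three columns named above and checks that every Run value is a prefix of some .bam file name. The sample names, conditions, batches, and .bam file names are hypothetical placeholders.

```python
import csv
import io

COLUMNS = ["Run", "Condition", "Batch"]

# Hypothetical samples; replace with your own runs and metadata.
rows = [
    {"Run": "sample_A", "Condition": "normal", "Batch": "1"},
    {"Run": "sample_B", "Condition": "tumor", "Batch": "1"},
    {"Run": "sample_C", "Condition": "tumor", "Batch": "2"},
]


def write_metadata(rows, handle):
    """Write the design matrix with a Run,Condition,Batch header."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def runs_without_bam(rows, bam_files):
    """Return Run values that are not a prefix of any .bam file name."""
    return [
        row["Run"]
        for row in rows
        if not any(b.endswith(".bam") and b.startswith(row["Run"]) for b in bam_files)
    ]


buffer = io.StringIO()
write_metadata(rows, buffer)
metadata_text = buffer.getvalue()
print(metadata_text)

# Hypothetical .bam outputs; sample_C has no matching file here.
unmatched = runs_without_bam(rows, ["sample_A.bam", "sample_B.bam"])
print("Runs without a matching .bam:", unmatched)
```

To produce the actual file, pass `open("DesignMatrix/metadata.csv", "w", newline="")` to `write_metadata` instead of the in-memory buffer.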
Assemble Pipeline
To assemble the pipeline:
Open Code Ocean Apps
Enter STAR
Drag and drop STAR Alignment onto the UI
Repeat this process for every subsequent Capsule in the Pipeline
Configure Connections
In order to configure the connection for each step:
Select Default to attach the Reads Dataset to STAR Alignment
Select Collect to attach STAR Index to STAR Alignment
Select Default to connect STAR Alignment to Sambamba Sort & Index
Select Collect to connect Sambamba Sort & Index to FeatureCounts
Select Default to attach Annotation Dataset to FeatureCounts
Select Default to connect FeatureCounts to DESeq2. Set the destination to Counts_data
Create and attach DesignMatrix - metadata.csv to DESeq2. Select Default, set the destination to Counts_data
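The connections listed above form a directed graph. As a rough sanity check, they can be written down as edges and sorted topologically to confirm the order in which the Capsules run. This sketch only records the Default and Collect labels as given in the steps; it does not model what those connection types do.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# (source, destination, connection type), transcribed from the steps above.
CONNECTIONS = [
    ("Reads Dataset", "STAR Alignment", "Default"),
    ("STAR Index", "STAR Alignment", "Collect"),
    ("STAR Alignment", "Sambamba Sort & Index", "Default"),
    ("Sambamba Sort & Index", "FeatureCounts", "Collect"),
    ("Annotation Dataset", "FeatureCounts", "Default"),
    ("FeatureCounts", "DESeq2", "Default"),
    ("DesignMatrix", "DESeq2", "Default"),
]

CAPSULES = {"STAR Alignment", "Sambamba Sort & Index", "FeatureCounts", "DESeq2"}

# Map each node to the set of nodes it depends on.
graph = {}
for source, destination, _kind in CONNECTIONS:
    graph.setdefault(source, set())
    graph.setdefault(destination, set()).add(source)

order = list(TopologicalSorter(graph).static_order())
capsule_order = [node for node in order if node in CAPSULES]
print(capsule_order)
# → ['STAR Alignment', 'Sambamba Sort & Index', 'FeatureCounts', 'DESeq2']
```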
Completed Pipeline
Configure App Panel
To create the app:
Click Create App
Click Create App and Finish
Parameters
Refer to the READMEs of the Capsules used to learn more about their parameters. Adjust the values in the yellow-highlighted fields on the App Panel.
Configure the App Panel as follows:
Note that the “Design formula” is based on the columns supplied in the design matrix.
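Because the design formula refers to design matrix columns by name, a quick check can catch a misspelled term before the pipeline runs. The sketch below handles only simple formulas built from `+`, `:` and `*`. The example formula `~ Batch + Condition` is an assumption for illustration, not necessarily the value used in the App Panel.

```python
import re


def formula_terms(formula):
    """Extract variable names from a simple formula such as '~ Batch + Condition'."""
    if "~" not in formula:
        raise ValueError("a design formula must contain '~'")
    right_hand_side = formula.split("~", 1)[1]
    tokens = re.split(r"[+:*\s]+", right_hand_side)
    # Drop empty tokens and intercept markers such as 0 or 1.
    return [t for t in tokens if t and not t.isdigit()]


def missing_columns(formula, columns):
    """Return formula terms that are not columns of the design matrix."""
    return [term for term in formula_terms(formula) if term not in columns]


columns = ["Run", "Condition", "Batch"]
print(missing_columns("~ Batch + Condition", columns))  # → []
print(missing_columns("~ Treatment", columns))          # → ['Treatment']
```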
Results
Nextflow
Contains logs describing the actions taken by Nextflow.
DESeq2_results.csv
CSV file with the results table
MA_plot.png
An MA plot displays the log ratio (M) against the average (A) to visualize the differences between two groups. Most genes are expected to keep a consistent expression level between conditions, so the plot typically resembles a trumpet, with most points lying near y = 0.
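The M and A values can be illustrated with the classic definitions: M is the difference of the log2 expression values and A is their mean. This is a sketch of the general idea only. The pseudocount and the example counts are assumptions, and the plot produced by the Capsule may scale its axes differently.

```python
import math


def ma_values(mean_a, mean_b, pseudocount=1.0):
    """Return (M, A) for one gene from its mean counts in two conditions.

    M is the log2 ratio (condition B over condition A) and A is the
    average of the two log2 values. The pseudocount avoids log2(0).
    """
    log_a = math.log2(mean_a + pseudocount)
    log_b = math.log2(mean_b + pseudocount)
    return log_b - log_a, 0.5 * (log_a + log_b)


# Hypothetical mean counts for a single gene.
m, a = ma_values(7, 31)
print(m, a)  # → 2.0 4.0
```

A gene with identical counts in both conditions has M = 0, which is why most points cluster around the horizontal axis.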
PCA.png
Shows how the samples group by treatment.
volcano_plot.png
The volcano plot shows the effect size and the statistical significance of each tested gene at the same time.
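A volcano plot usually places the log2 fold change (effect size) on the x axis and the negative log10 of the p-value (significance) on the y axis. The sketch below computes those coordinates for a few made-up genes. The gene names, values, and cutoffs are hypothetical, and the column names in DESeq2_results.csv should be checked before adapting this.

```python
import math


def volcano_point(log2_fold_change, p_value):
    """Return (x, y) volcano coordinates for one gene."""
    return log2_fold_change, -math.log10(p_value)


def is_hit(log2_fold_change, p_value, lfc_cutoff=1.0, p_cutoff=0.05):
    """Flag genes with both a large effect and a small p-value (illustrative cutoffs)."""
    return abs(log2_fold_change) >= lfc_cutoff and p_value <= p_cutoff


# Hypothetical results: gene -> (log2 fold change, p-value).
results = {
    "GENE1": (2.0, 0.001),
    "GENE2": (-0.2, 0.6),
    "GENE3": (-1.5, 0.01),
}

points = {gene: volcano_point(lfc, p) for gene, (lfc, p) in results.items()}
hits = sorted(gene for gene, (lfc, p) in results.items() if is_hit(lfc, p))

for gene, (x, y) in points.items():
    print(f"{gene}: x={x:.2f}, y={y:.2f}")
print("hits:", hits)
```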
plots_by_gene
A folder containing one file per gene, each plotting that gene's normalized counts to show how it behaves across the sample cohort.