Visualize and Share Large Raw Sequencing Datasets


BIOINFORMATICS WORKSHOP (APR 2019)


Raw sequencing datasets generated by Genome-seq, Exome-seq, RNA-seq, ChIP-seq and similar experiments consist of a compilation of sequences, each assigned to a genomic location (BAM files). These files are usually too large for non-bioinformaticians to manipulate. Nonetheless, a wider audience can assess the quality of an experiment and get a first overview of the data. Manipulating BAM files can help biologists understand their data and troubleshoot their experiments.

In this mini-workshop, we will explain how these files can be handled by biologists without bioinformatics knowledge using a conventional computer.

In detail, attendees will be trained to:

  • Perform and interpret quality control on BAM files
  • Create their own UCSC genome browser track to navigate through BAM files
  • Transfer/share large BAM files via public servers.

Speakers:

Touati Benoukraf, Ph.D
Canada Research Chair (Tier II) in Bioinformatics for Personalised Medicine
Assistant Professor, Faculty of Medicine, Discipline of Genetics
Memorial University of Newfoundland
Visiting Assistant Professor
Cancer Science Institute of Singapore, NUS

Denis Thieffry, Ph.D
Group Leader and Professor
Institute of Biology, Ecole Normale Superieure de Paris

Venue: NUS, Centre for Translational Medicine (MD6), #04-01 SMART Classroom, 14 Medical Drive, S117599
Date: 16 April 2019, Tuesday
Time: 1 pm – 5 pm

Workshop Exercises

The aim of this workshop is to learn how to preprocess a raw sequencing file (fastq) and visualize it. As an example, we will use an RNA-seq dataset from the ENCODE consortium, generated in the K562 cell line.

To speed up every step of this workshop, we provide only a fragment of the dataset (chromosome 19).

As explained during the lecture, you will use the Galaxy platform (usegalaxy.org) to perform quality control and to generate the files needed for visualization. The files will then be uploaded to Cyverse, a cloud system that can connect data to genome browsers such as the UCSC Genome Browser.

Step 1:

Download both fastq files from the following link, then run FastQC on them using Galaxy (https://usegalaxy.org/).

https://tinyurl.com/y2yq7fka


Step 2:

Go to usegalaxy.org and upload your fastq files.


Step 3: Perform a QC using FastQC
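
To help read the FastQC report, here is a minimal sketch of the idea behind a per-base quality summary: each quality character is decoded into a Phred score, and the scores are averaged at each read position. It assumes the common Phred+33 encoding and strict four-line FASTQ records. The toy reads and the function names are illustrative only and are not part of FastQC or Galaxy.

```python
from statistics import mean


def parse_fastq(text):
    """Yield (read_id, sequence, quality_string) from FASTQ text.

    Assumes exactly four lines per record (header, sequence, '+', qualities).
    """
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, qual


def phred_scores(quality_string, offset=33):
    """Decode a quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def mean_quality_per_position(records):
    """Average the Phred scores of all reads at each read position."""
    columns = {}
    for _read_id, _seq, qual in records:
        for pos, score in enumerate(phred_scores(qual)):
            columns.setdefault(pos, []).append(score)
    return [mean(columns[pos]) for pos in sorted(columns)]


# Two made-up reads: the second one loses quality towards its 3' end.
EXAMPLE_FASTQ = """@read1
ACGT
+
IIII
@read2
ACGT
+
II5!
"""

per_position = mean_quality_per_position(parse_fastq(EXAMPLE_FASTQ))
print(per_position)
```

A drop in the mean score towards the end of the reads, as in this toy input, is the kind of pattern to look for in the real report.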


Step 4: Load BAM files

Due to time constraints, we will not align the fastq reads to the reference genome.

Please download the BAM files from the previous link.

Here, we will use bamCoverage, a tool that piles up reads into a coverage signal. In this example, the piled-up reads represent transcription intensity.
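
To make the notion of a pileup concrete, the sketch below counts how many reads cover each base of a small region, then averages that per-base depth over fixed-size bins. It is a toy illustration, not bamCoverage's implementation: reads are reduced to made-up (start, end) intervals on one chromosome, and spliced alignments, normalization and strand are ignored.

```python
def pileup(reads, region_start, region_end):
    """Per-base read depth over the half-open region [region_start, region_end).

    `reads` is a list of (start, end) half-open intervals on one chromosome.
    """
    length = region_end - region_start
    # Difference array: +1 where a read starts, -1 just after it ends.
    diff = [0] * (length + 1)
    for start, end in reads:
        lo = max(start, region_start) - region_start
        hi = min(end, region_end) - region_start
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    coverage = []
    depth = 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


def bin_signal(coverage, bin_size):
    """Average the per-base depth over consecutive bins of `bin_size` bases."""
    bins = []
    for i in range(0, len(coverage), bin_size):
        chunk = coverage[i:i + bin_size]
        bins.append(sum(chunk) / len(chunk))
    return bins


# Three hypothetical reads overlapping around positions 2 to 5.
toy_reads = [(0, 4), (2, 6), (3, 8)]
coverage = pileup(toy_reads, 0, 10)
binned = bin_signal(coverage, 5)
print(coverage)
print(binned)
```

Binning trades resolution for file size: the larger the bin, the smaller the coverage file and the coarser the track shown in the genome browser.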

Important note: After generating a BAM file, the BAM has to be “optimized” in two steps: i) sorting and ii) indexing. Indexing creates a separate index file next to the main file, which lets tools jump directly to a genomic region instead of reading the whole BAM.
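
Galaxy takes care of these two steps in the workshop. For readers who later work outside Galaxy, the sketch below shows one common way to do the same thing with samtools, assuming samtools is installed and on the PATH. The file name `sample_chr19.bam` and the helper names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def sort_and_index_commands(bam_path):
    """Build the two samtools commands: coordinate sort, then index."""
    bam = Path(bam_path)
    sorted_bam = bam.with_name(bam.stem + ".sorted.bam")
    return [
        ["samtools", "sort", "-o", str(sorted_bam), str(bam)],
        ["samtools", "index", str(sorted_bam)],
    ]


def sort_and_index(bam_path):
    """Run both commands in order; indexing requires the sorted file."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on the PATH")
    for command in sort_and_index_commands(bam_path):
        subprocess.run(command, check=True)


# Only build and display the commands here; nothing is executed.
commands = sort_and_index_commands("sample_chr19.bam")
for command in commands:
    print(" ".join(command))
```

The order matters: the index describes where each region sits in the sorted file, so sorting has to come first.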

Step 5: Convert BAM files into coverage files

Note: For RNA-seq, the two strands can be separated into distinct coverage tracks.

Then, convert your coverage file into BigWig.

Step 6: Create a Cyverse Hub for UCSC Genome Browser.

Log in to Cyverse.

Create the following folder and files:

folder hg38

file genomes.txt

file hub.txt


File contents:

hub.txt

hub Project-Name
shortLabel RNAseq-test
longLabel RNAseq-test
genomesFile https://de.cyverse.org/dl/d/0D9281C4-D365-433C-9A1E-765AEDD717B2/genomes.txt
email tbenoukraf@mun.ca
descriptionUrl ucscHub.html
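
Each line of hub.txt is a setting name followed by a space and its value, so a small check before uploading can save a round of troubleshooting. The sketch below parses such a file and reports missing settings. The list of required settings reflects our reading of the UCSC track hub conventions and should be treated as an assumption, and the function names are illustrative.

```python
# Settings assumed to be mandatory in hub.txt (descriptionUrl is optional).
REQUIRED_HUB_FIELDS = ("hub", "shortLabel", "longLabel", "genomesFile", "email")


def parse_hub_text(text):
    """Parse 'key value' lines into a dict.

    Raises ValueError when a line has no space between key and value.
    """
    settings = {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        key, separator, value = line.partition(" ")
        if not separator or not value.strip():
            raise ValueError(f"line {number}: expected 'key value', got {line!r}")
        settings[key] = value.strip()
    return settings


def missing_fields(settings):
    """Return the required settings that are absent from the parsed file."""
    return [field for field in REQUIRED_HUB_FIELDS if field not in settings]


HUB_TEXT = """hub Project-Name
shortLabel RNAseq-test
longLabel RNAseq-test
genomesFile https://de.cyverse.org/dl/d/0D9281C4-D365-433C-9A1E-765AEDD717B2/genomes.txt
email tbenoukraf@mun.ca
descriptionUrl ucscHub.html
"""

settings = parse_hub_text(HUB_TEXT)
print(missing_fields(settings))
```

A line where the key and value run together, such as `longLabelRNAseq-test`, makes `parse_hub_text` raise an error that names the offending line.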
