Session 1: QC and reads mapping


Quality control of the reads and statistics

Goal: Get some basic information on the data (read length, number of reads, global quality of dataset)

1 – Getting the FASTQC report

We are going to use the command line version of Fastqc.

1. Create the directory ‘QC’ for saving the QC report:

$ mkdir QC

2. Run fastqc on your fastq file locate at : (Dataset/Level_1_Sequenced_Reads/GMP_WT_Cebpa_mm9_chr19andRandom.fq)

Use the option ‘-o’ to specify the results output directory. For more information about fastqc parameters type:

$ fastqc -h

To run the program, type:

$ fastqc -o QC Dataset/Level_1_Sequenced_Reads/GMP_WT_Cebpa_mm9_chr19andRandom.fq

Check the results stored in the QC directory by using the commands ‘cd’ (change directory) and ‘ls’ (list content).

$ cd QC
$ ls
$ cd GMP_WT_Cebpa_mm9_chr19andRandom.fq_fastqc
$ ls

Read the summary file using the command ‘less’ (file streaming)

$ less summary.txt

Press ‘q’ to quit the streaming view.

What is the read length ?

Is the overall quality good ?
Are there any concerns raised by the report ? If so, can you tell where the problem might come from ?

 

The graphical report can be seen here.

fastqc

 

 

Note: You can go back to the parent directory by typing:

$ cd ../

2 – Reads alignment using Bowtie2

Goal: Obtain the coordinates of each read on the reference genome.

Let’s see the parameters of bowtie2 before launching the mapping:
bowtie_index/mm9_chr19 is the location/name of our genome index file

1. First of all, you have to create a folder to store your results:

$ mkdir aligned_reads

-U indicates the input file is in FASTQ format. data/GMP_WT_Cebpa_chr19.fastq is the location/name of our FASTQ file.
-x indicates the bowtie genome reference.
-S will output the result in SAM format

We are going to use the default parameters. For more information about bowtie2 type:

$ bowtie2 -h

2. Run bowtie2 using the following command:

$ bowtie2 -x programs/bowtie2_index/mm9 -U data/GMP_WT_Cebpa_chr19.fastq -S aligned_reads/GMP_WT_Cebpa_mm9_chr19.sam

This should take few minutes as we work with a single chromosome. For the human genome, we would need either more time, or a dedicated server.

The bowtie process will terminate by printing the alignement statistics in the screen:

2546690 reads; of these:
 2546690 (100.00%) were unpaired; of these:
 123047 (4.83%) aligned 0 times
 2106893 (82.73%) aligned exactly 1 time
 316750 (12.44%) aligned >1 times
95.17% overall alignment rate

3 – Filtering and compressing sam file

The .bam format is binary version of the .sam format. It is not directly human readable, but this conversion enable a much more efficient processing of the data

$ samtools view -S -b -o aligned_reads/GMP_WT_Cebpa_mm9_chr19.bam aligned_reads/GMP_WT_Cebpa_mm9_chr19.sam

or

$ samtools view -S -b aligned_reads/GMP_WT_Cebpa_mm9_chr19.sam > aligned_reads/GMP_WT_Cebpa_mm9_chr19.bam

The ‘samtools view’ command with the “-b” will convert the sam file (text) to a bam file (binary, i.e. compressed) The ‘samtools view’ command with the “-b” will convert the sam file (text) to a bam file (binary, i.e. compressed)
The same command can be used to convert back a .bam file into a human readable .sam file

$ samtools view aligned_reads/GMP_WT_Cebpa_mm9_chr19.bam | less -S

Press ‘q’ to leave the less mode.

4. Having converted the file into bam format, allows for using other tools of samtools
4.1 Filtering mapping with a low mapQ value
We filter all low quality reads (bellow q20) using samtools.

$ samtools view -b -q 20 -o aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.bam aligned_reads/GMP_WT_Cebpa_mm9_chr19.bam

The ‘samtools view’ command with the “-b” will convert the sam file (text) to a bam file (binary, i.e. compressed)

4.2.Removing duplicates
samtools can be used to remove duplicates, that is reads mapped at the very same location. Those duplicated reads can arise from the PCR amplification step of the library construction and are therefore to be interpreted with caution and are typically removed either “by default automatically” by peak calling algorithms (i.e. MACS) or explicitly. Removing duplicates also facilitate data visualization.
Removing duplicated reads is accomplished using the command

$ samtools rmdup -s aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.bam aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.rmdup.bam

4.3.Indexing a .bam file for fast processing
we can also use samtools to index the reads. This will allow for an efficient retrieval of reads mapped onto a location. Indexing is done on (position) sorted reads.
So we first sort the reads using the “samtools sort” command

$ samtools sort aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.rmdup.bam aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.rmdup.psort

 

And then index the file using the “samtools index” command

$ samtools index aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.rmdup.psort.bam

Note the creation of the index file : GMP_WT_Cebpa_mm9_chr19_q20.rmdup.psort.bam.bai

4.4. Creating a bigWig file
converting .sam to .bam reduced the size of the file without any loss of data. Actually the .bam file is a compressed version of the .sam as a ZIP file. We have then removed unreliable mapping (reads that could be mapped to more then one locations on the genome) and removed PCR generated duplicates. For special purpose such as data visualization, further simplifying the data content to limit it to what is strictly necessary. In particular the most popular visualization softwares and web services use the .biWig format. To convert .bam files into bigWig requires a multi-step process involving the following tools :
Here you will need a file containing the length of all the mm9 chromosomes which is located in the folder workshop/data/mm9.chrom_len.txt

$ bedtools genomecov -g  data/mm9.chrom_len.txt -ibam aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.rmdup.psort.bam -bg > aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.rmdup.bg
$ cd programs

$ ./bedGraphToBigWig ../aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.rmdup.bg ../data/mm9.chrom_len.txt ../aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.rmdup.bigWig

… We can similarly create a bigWig formatted file for the non-filtered (mapping quality and PCR duplicate removed data)

$ samtools sort aligned_reads/GMP_WT_Cebpa_mm9_chr19.bam aligned_reads/GMP_WT_Cebpa_mm9_chr19.psort
$ samtools view -b aligned_reads/GMP_WT_Cebpa_mm9_chr19.psort.bam \
 	 	 | genomeCoverageBed -g  mm9.chrom_len.txt -ibam stdin  -bg \
                 > aligned_reads/GMP_WT_Cebpa_mm9_chr19.bedGraph
$ bedGraphToBigWig  aligned_reads/GMP_WT_Cebpa_mm9_chr19.bg mm9.chrom_len.txt  aligned_reads/GMP_WT_Cebpa_mm9_chr19.bigWig
$ samtools view -b -q20 aligned_reads/GMP_WT_Cebpa_mm9_chr19.psort.bam \
 	 	 | genomeCoverageBed -g  mm9.chrom_len.txt -ibam stdin  -bg \
                 > aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.bedGraph
$ bedGraphToBigWig aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.bg mm9.chrom_len.txt aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.bigWig