workshop: template

Signal extraction: peak calling

Goal: Get some basic information on the data (read length, number of reads, global quality of dataset)

1 - Getting the FASTQC report

Open the experiment file (SRR576933.fastq) in the FASTQC program.

Analyze the result of the FASTQC program: How many reads are present in the file? What is the read length? Is the overall quality good?

Are there any concerns raised by the report?

If so, can you tell where the problem might come from?

There are 3,603,544 reads of 36 bp. The overall quality is good, although it drops at the last position; this is usual with Illumina sequencing, so it does not raise serious concerns. There are several “red lights” in the report. In particular, the per-sequence GC content and the duplication level are problematic. If you check the “overrepresented sequences” section, you’ll notice a high percentage of adapters (29%!). Ideally, we would remove these adapters (i.e. trim the reads) and then re-run FASTQC. In practice, we often skip this step, as these reads will not be mapped anyway. Warning: this will lower the “% of mapped reads” calculated later!
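To cross-check the read count and read length reported by FASTQC, the file can also be parsed directly. The following is a minimal Python sketch, assuming an uncompressed FASTQ file with exactly four lines per record (no wrapped sequences). The function name `fastq_stats` is a hypothetical helper introduced here for illustration; it is not part of FASTQC.

```python
from collections import Counter


def fastq_stats(lines):
    """Count reads and tally read lengths from an iterable of FASTQ lines.

    Assumes strict 4-line records: header, sequence, '+', qualities.
    Returns (number_of_reads, Counter mapping read length -> count).
    """
    lengths = Counter()
    for line_number, line in enumerate(lines):
        # The sequence is the second line (index 1) of every 4-line record.
        if line_number % 4 == 1:
            lengths[len(line.rstrip("\r\n"))] += 1
    return sum(lengths.values()), lengths


# Example usage (file name taken from the exercise above):
# with open("SRR576933.fastq") as handle:
#     n_reads, lengths = fastq_stats(handle)
#     print(n_reads, dict(lengths))
```

If the file matches the FASTQC report, this should give the same number of reads and a single read length of 36. It does not replace the quality, GC content, or duplication checks that FASTQC performs.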

2 - Organism length
Knowing the genome size of your organism is important to evaluate whether your dataset has sufficient coverage to continue your analyses. For the human genome (3 Gb), we usually aim for at least 10 million reads.
Go to the NCBI Genome website and search for the organism Escherichia coli.
 Click on Escherichia coli str. K-12 substr. MG1655 to access statistics on this genome.
 How long is the genome?
 Do both FASTQ files contain enough reads for a proper analysis?
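
The comparison with the human guideline is simple arithmetic, sketched below in Python. The function names are hypothetical, and the E. coli genome size used here (about 4.6 Mb) is an approximate placeholder: replace it with the exact value you read on the NCBI page. The read count and read length are those reported by FASTQC in section 1.

```python
def fold_coverage(n_reads, read_length, genome_size):
    """Average number of times each base of the genome is covered by reads."""
    return n_reads * read_length / genome_size


def equivalent_read_target(reference_reads, reference_genome_size, genome_size):
    """Scale a read-count guideline from a reference genome to another genome size."""
    return reference_reads * genome_size / reference_genome_size


HUMAN_GENOME_SIZE = 3_000_000_000  # 3 Gb, as stated in the text
HUMAN_READ_TARGET = 10_000_000     # guideline from the text
ECOLI_GENOME_SIZE = 4_600_000      # ASSUMPTION: approximate, check on NCBI

# Reads needed in E. coli to match the human guideline, per base of genome.
target = equivalent_read_target(HUMAN_READ_TARGET, HUMAN_GENOME_SIZE, ECOLI_GENOME_SIZE)

# Fold coverage of the experiment file (3,603,544 reads of 36 bp).
coverage = fold_coverage(3_603_544, 36, ECOLI_GENOME_SIZE)

print(f"Equivalent read target for E. coli: {target:.0f} reads")
print(f"Fold coverage of the experiment: {coverage:.1f}x")
```

Under these assumptions, the human guideline scales down to roughly 15,000 reads for E. coli, and the experiment file reaches about 28-fold coverage, so the number of reads is more than sufficient. The same calculation can be repeated with the read count of any other FASTQ file.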