Signal extraction: peak calling
Goal: Get some basic information on the data (read length, number of reads, global quality of dataset)
1 – Getting the FASTQC report
Open the experiment file (SRR576933.fastq) in the FASTQC program.
Analyze the result of the FASTQC program: How many reads are present in the file? What is the read length? Is the overall quality good?
Are there any concerns raised by the report?
If so, can you tell where the problem might come from?
bowtie2 -U bloup.fq -x mm9 --very-sensitive
There are 3,603,544 reads of 36 bp. The overall quality is good, although it drops at the last position; this is usual with Illumina sequencing, so it does not raise serious concerns. There are several “red lights” in the report. In particular, the per-sequence GC content and the duplication level are problematic. If you check the “overrepresented sequences”, you’ll notice a high percentage of adapters (29%!). Ideally, we would remove these adapters (i.e. trim the reads) and then re-run FASTQC. In practice, we often skip this step, as these reads will not be mapped anyway. Warning: this will affect the “% of mapped reads” calculated later!
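The read count and read length reported by FASTQC can be cross-checked directly from the FASTQ file. The following is a minimal sketch, not part of the original tutorial: the function name `fastq_summary` is hypothetical, and it assumes the usual four-lines-per-read FASTQ layout (sequences not wrapped over several lines).

```python
import gzip
import sys
from collections import Counter


def fastq_summary(path):
    """Return (number of reads, Counter of read lengths) for a FASTQ file.

    Assumes each record spans exactly four lines (header, sequence, '+', qualities).
    """
    # Transparently handle gzip-compressed files.
    opener = gzip.open if path.endswith(".gz") else open
    n_reads = 0
    lengths = Counter()
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record is the sequence
                n_reads += 1
                lengths[len(line.rstrip("\r\n"))] += 1
    return n_reads, lengths


if __name__ == "__main__":
    # Usage: python fastq_summary.py SRR576933.fastq
    n_reads, lengths = fastq_summary(sys.argv[1])
    print(f"{n_reads} reads")
    for length, count in sorted(lengths.items()):
        print(f"{count} reads of {length} bp")
```

If the numbers differ from the FASTQC report, the file is probably truncated or does not follow the four-line layout assumed here.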
2 – Organism length
Knowing the genome size of your organism is important to evaluate whether your dataset has sufficient coverage to continue your analyses. For the human genome (3 Gb), we usually aim for at least 10 million reads.
Go to the NCBI Genome website and search for the organism Escherichia coli.
Click on Escherichia coli str. K-12 substr. MG1655 to access statistics on this genome.
How long is the genome? Do both FASTQ files contain enough reads for a proper analysis?
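One way to reason about "enough reads" is to compare the expected mean coverage (number of reads × read length / genome size) between the human guideline and this dataset. The sketch below is only illustrative: `mean_coverage` is a hypothetical helper, the E. coli genome size is rounded to roughly 4.6 Mb (read the exact value from the NCBI page), and 36 bp reads are assumed for the human case as well.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Expected mean depth if every read mapped uniformly across the genome."""
    return n_reads * read_length / genome_size


READ_LENGTH = 36                # bp, from the FASTQC report
HUMAN_GENOME = 3_000_000_000    # 3 Gb, as stated in the text
ECOLI_GENOME = 4_600_000        # ASSUMPTION: approximate size, check on NCBI

# Guideline from the text: 10 million reads for the human genome.
human = mean_coverage(10_000_000, READ_LENGTH, HUMAN_GENOME)

# This dataset: 3,603,544 reads on the E. coli genome.
ecoli = mean_coverage(3_603_544, READ_LENGTH, ECOLI_GENOME)

# Number of reads that would give E. coli the same depth as the human guideline.
equivalent_reads = 10_000_000 * ECOLI_GENOME / HUMAN_GENOME

print(f"human guideline: {human:.2f}x")
print(f"this dataset:    {ecoli:.1f}x")
print(f"reads needed for the same depth in E. coli: {equivalent_reads:.0f}")
```

Under these assumptions the dataset is far deeper than the human guideline would require. Keep in mind that this is an upper bound, since adapter-containing and unmapped reads do not contribute to coverage.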