Taf15 KD: RNAseq analysis (expression and splicing)


Samuel COLLOMBET (samuel.collobmet@ens.fr)

23.05.2014

(1) Mapping QC

mapping was performed with STAR, on hg19 genome, using ensembl_Homo_sapiens.GRCh37.75 genes annotation to improve spliced-reads mapping, keeping uniquely mapped reads only.

For siScr bio2, bio5 and siTaf15 bio6 more than 80% of reads were mapped uniquely, which is good. For siTaf15 bio9 more than 70% of reads were mapped uniquely, which is acceptable (see Tab.1).

siScr bio2 siScr bio5 siTaf15 bio6 siTaf15 bio9
Sequenced reads: 173,489,143 171,159,041 168,444,564 172,011,688
Uniquely mapped reads: 144,686,786 83.40% 141,783,903 82.84% 138,412,640 82.17% 123,167,742 71.60%
Reads mapping multiple location: 6,015,500 3.47% 5,633,031 3.29% 5,205,900 3.09% 4,792,045 2.79%
Unmapped reads: 22,786,857 13.13% 23,742,107 13.87% 24,826,024 14.74% 44,051,901 25.61%

Tab.1: Results of reads mapping.

Technical information:
- genome: homo sapiens GRC37 (hg19), chr1 to 2, X and Y ; chrM and other contigs were excluded. STAR genome index was build using ensembl_Homo_sapiens.GRCh37.75 annotation for splicing sites information.
Parameters: require 95% of the paired-reads to map (default=66%) ; filter out reads that map more than once ; other as default.
- Software: STAR_2.3.0e_r291
- command line: STAR --genomeDir <genome> --readFilesIn <read1> <reads2> --runThreadN <20> --outFilterMatchNminOverLread 0.95 --outFilterMultimapNmax 1 --outFileNamePrefix <output>

 

(2) Genes expression quantification and differential analysis:

Genes expression quantification was measured by counting the number of reads mapped to a region annotated as an exon (from ensembl_Homo_sapiens.GRCh37.75), using HTseq-count. Reads mapped to regions covering annotated exon from different genes are considered as ambiguous and not counted.

Technical information:
- Parameters: no strand-specificity required (reads mapping same of reverse strand than genes are counted), 
- software: htseq-count 0.6.0
- command line: htseq-count -f bam -r name -s no -t exon -i gene_id -m union <bamFile> <annotationFIle> > <output>

Normalization and differential analysis were performed with the DESeq2 method (R package), based on the negative binomial model.

Reads count were normalized for i) library sequencing depth (sizeFactor) and ii) dispersion estimation (rlog transformation). Note that the rlog transformation is a log transformation that correct for low count over-dispersion (i.e. small/very lowly expressed genes).

Calculating distance between all samples and clustering them, it appears that replicates show relatively high distance, comparable to inter-experiments distance, although replicates still cluster together (Fig.1). A principal component analysis shows that more than 40% of the variance between all samples comes from variation between the 2 conditions (Fig.2), but inter-replicates variance still represent more than 50% of the variance (principale component 2 and 3). THis means that replicates already have relatively high difference at the level of gene expression.

SampleCorrelation_Heatmap

Fig.2: Euclidean distance between samples based on rlog transformed genes count.

PCA_ExpVar  PCA_pc1VSpc2PCA_pc1VSpc3

Fig.3: Principle component analysis. Left: percentage of total inter-samples variance explained by 4 first principal component (PC); middle: projection of the 4 samples on PC2 vs PC1; right: projection on PC3 vs PC1.

Differentially expressed genes (DEG) were identify using DESeq2 negative-binomial Wald test. Multiple test correction of p-value was performed with the Benjamini-Hochberg method. Controlling the false discovery rate (FDR) at 1%, 1450 DEG were found.

 

(3) Differential exon usage:

Exon usage correspond to the level of reads count (i.e. expression) of an exon, relative to reads count of all other exon of a gene. Differential exon usage between different condition mean different expression of specific exon (implying different splicing), relative to gene expression change. Differential exon usage analysis was performed using the DEXseq method (R package).

Selecting differentially used exon for no false positive expected (i.e. false discovery rate * number of accepted hits < 1), we found 1145 differentially used exon, associated to 1330 genes (this is due to the fact that some exons are associated to several genes, when genes overlap each others).

Looking at the list of genes you found with higher expression of the last exon, it appears that, for most genes, this difference is not significant when corrected for the global level of gene expression (see Fig.4).

ENSG00000119922_IFIT2ENSG00000159388_BTG2 ENSG00000034152_MAP2K3

Fig.4: Example of genes. Left: Ifit2, middle: Btg2, right: Map2k3. Top panel: exon expression per condition; middle pannel: exon usage, i.e. exon expression normalized to global gene expression ; bottom: exon reads count per samples (show variability between replicates). Under the plots are represented the different transcript. Significantly differentially used exon are in pink.