Npm+p53 KO vs p53 KO: RNAseq data analysis



(1) Sequencing QC

Sequencing quality was good for most of the reads, but a large proportion of the reads show relatively reduced quality (see Fig.1). The data are usable but not optimal.

If this is recurrent in your sequencing experiments, you might want to discuss it with your sequencing facility.


Fig.1: per base sequencing quality distribution (from the software fastqc). Each plot shows the distribution of sequencing quality at each base of all reads, for each library (columns), pair-reads 1 and 2 (top and bottom) . Red line: median value; yellow box: 25-75 percentiles; black bars: 10-90% percentiles.

For all libraries, the quality is good for at least 75% of the reads, but drops to very low quality around the 20th base, for at least 10% of the reads.


(2) Mapping QC

mapping was performed with STAR, on mm9 genome, using ensembl_Mus_musculus.NCBIM37.67 (mm9) genes annotation to improve spliced-reads mapping, keeping uniquely mapped reads only.

For all libraries more than 80% of reads were mapped uniquely, which is good (see Tab.1).

dkoNPMp53_bio1 dkoNPMp53_bio2 kop53_bio1 kop53_bio2
Input reads: 87,485,029 115,700,236 78,900,673 95,380,853
Uniquely mapped: 73,391,907 (83.89%) 97,755,874 (84.49%) 65,588,892 (83.13%) 80,756,908 (84.67%)
Mapped to more than 1 locus: 5,511,815 (6.30%) 8,206,793 (7.09%) 4,790,167 (6.07%) 6,670,922 (6.99%)
Unmapped: 8,581,307 (9.81%) 9,737,569 (8.41%) 8,521,614 (10.80%) 7,953,023 (8.33%)

Tab.1: Results of reads mapping.

Technical information:
- genome: mus musculus NCBIM37 (mm9), chr1 to 19, X and Y ; chrM and other contigs were excluded. STAR genome index was build using ensembl_Mus_musculus.NCBIM37.67 annotation for splicing sites information, and an over hang distance of 49 base (--sjdbOverhang 49).
Parameters: require 80% of the paired-reads to map (default=66%) ; filter out reads that map more than once ; other as default.
- Software: STAR_2.3.0e_r291
- command line: STAR --genomeDir <mm9_ensembl_Mus_musculus.NCBIM37.67> --readFilesIn <read1> <reads2> --runThreadN <20> --outFilterMatchNminOverLread 0.80 --outFilterMultimapNmax 1 --outFileNamePrefix <output>

Ensembl annotation used in this analysis can be downloaded here.


(3) Genes expression quantification:

Genes expression quantification was measured by counting the number of reads mapped to a region annotated as an exon (from ensembl_Mus_musculus.NCBIM37.67), using the FeatureCounts program. Reads mapped to regions covering annotated exon from different genes are considered as ambiguous and not counted.

For all libraries more than 90% of the reads were counted unambiguously on exons, which is a good results for polyA+ RNAseq (Tab.2).

dkoNPMp53_bio1 dkoNPMp53_bio2 kop53_bio1 kop53_bio2
Assigned 67,825,725 92.42% 90,698,503 92.78% 60,788,244 92.68% 74,774,825 92.59%
Unassigned: ambiguous 1,363,875 1.86% 1,907,847 1.95% 1,137,899 1.73% 1,520,727 1.88%
Unassigned: no features 4,202,307 5.73% 5,149,524 5.27% 3,662,749 5.58% 4,461,356 5.52%

Tab.2: Results of gene expression estimation by reads count on exons.

Technical information:
Parameters: no strand-specificity required (reads mapping same of reverse strand than genes are counted), 
- Software: featureCounts v1.4.4
- command line: featureCounts -a ensembl_Mus_musculus.NCBIM37.67.gtf -F gtf -t exon -s 0 -T 20 -p <aligned.bam> -o <output.tsv>

Reads counts per genes can be downloaded here


(4) Genes expression normalization and samples comparison:

Normalization and differential analysis were performed with the DESeq2 method (R package), based on the negative binomial model.

Estimation of dispersion shows a good fit of the statistical model, without overdispersion (Fig.2).


Fig.2: Genes expression dispersion estimation and model fitting.

Reads count were normalized for i) library sequencing depth (sizeFactor) and ii) dispersion estimation (rlog transformation). Note that the rlog transformation is a log transformation that correct for low count over-dispersion (i.e. small/very lowly expressed genes).

Based on these values, samples show a good similarity between replicates (Fig.3), and a principale component analysis shows that most of the variance between all samples comes from variation between the 2 conditions (Figure.4).



Fig.3: Euclidean distance between samples based on rlog transformed genes count.



Fig.4: Principle component analysis. Left: percentage of total inter-samples variance explained by 4 first principal component (PC). PC1 explain a large majority of the variance (>70%); middle: projection of the 4 samples on PC2 vs PC1; right: projection on PC3 vs PC1.

Reads count normalized by size factor (library depth) can be downloaded here.

Reads count normalized by DESeq rlog transformation can be downloaded here (this transformation correct for lowly expressed genes and should not be biais by gene size. I would advise to use it as the best estimation of gene expression for further analysis. Note that these values are in a log2 scale).


(5) Differentially expressed genes:

Differentially expressed genes (DEG) were identify using DESeq2 negative-binomial Wald test. Multiple test correction of p-value was performed with the Benjamini-Hochberg method. Controlling the false discovery rate (FDR) at 1%, 1340 DEG were found. Controlling for no false discovery expected (i.e. FDR * (number of selected DEG) < 1) 908 genes were found, with a fold change going from 1.5 to 194 fold up-regulated genes,and from 1.5 to 104 fold down-regulated (see Fig.5).


Fig.5: Differentially expressed genes. Log2 of gene expression fold change (Npm+p53 KO / p53 KO), versus mean expression over all samples. Less lowly expressed genes are found significant because their expression estimation is more subject to noise.

Technical information:
script is available on demand 
- Software: DESeq2 version 1.14.5

Table with the results of statistical test for all genes can be downloaded here.

A list of selected deferentially expressed genes for a false discovery rate of 1% can be downloaded here, and one for no false discovery expected here.


(6) Gene set enrichment analysis:

Gene set enrichment analysis was performed with the GSEA software on the MSigDB database, using the list of genes in the annotation, ranked by log2-FoldChange / log2-FoldChange-ErrorEstimation (this way, fold change are weighted by their dispersion, so high fold change with high variability are weighted down).

Several Gene sets were found significantly enriched. In particular, pathways associated to mitosis and cell division were found enriched in genes down-regulated in Npm+p53 KO vs p53 KO (see Fig.6).


Fig.6: Selected results of gene set enrichment analysis. Examples of genes sets associated to cell division, enriched in the down-regulated genes.

Technical information:
Parameters: score weighted by the ranking metric (lfc/lfcSE)
- Software: GSEA 2-2.0.13 , MSigDB v4
- command line: java -cp /home/samuel/programs/GSEA2.0.13/gsea2-2.0.13.jar -Xmx2g xtools.gsea.GseaPreranked -gmx <database> -rnk <rankedList> -collapse false -out <outputFolde> -rpt_label <outputName> -gui false -scoring_scheme weighted

 Full results of the gene set enrichment analysis can be visualised there:

results on MSigDB gene sets c2.all (all curated gene sets)

results on MSigDB gene sets c2.cp (canonical pathways from KEGG, BIOCARTA and REACTOME database)

results on MSigDB gene sets c4 (computationally predicted gene sets; I do not usually use it but you can look at it)

results on MSigDB gene sets c5.all (all Gene Ontology gene sets)

results on MSigDB gene sets c5.bp (biological processes from the Gene Ontology)

results on MSigDB gene sets (molecular functions from the Gene Ontology)

results on MSigDB gene sets c6 (oncogenic signatures; I do not usually use it but you can look at it)

You can find further description of the gene sets used here.


(7) Imprinted genes:

From the list of imprinted genes available here, the following were found significantly differentially expressed (FDR<1%):