Signal extraction: peak calling
Goal: to identify region in the genome where the ChIPed protein is bound, i.e. regions where reads are enriched over the background.
There are different methods to find peaks in ChIP-seq data. We will the software MACS (Model-based Analysis of ChIP-seq) version 2, because i) it is widely used, ii) it is very efficient and rely on clear statistical standard, and iii) version 2 implemented the most up-to-date methods of peak calling.
Note: version 2 is the most up-to-date, and is more efficient than previous versions. However it is still in development (and not published yet). Arguments and default options may change in the (near) future.
Introductory slides
To see the slides, click on this link (and then click on the picture of the slide)
1 – MACS2 manual
MACS can be called with the command macs2. Use the argument -h to see the available arguments (A more eye-candy version can be found at https://github.com/taoliu/MACS/ )
macs2 -h
At the bottom of the manual, you can see
For command line options of each command, type: macs2 COMMAND -h
The function we want to use is callpeak. Have a look to the available option for this function:
macs2 callpeak -h
2 – Peak calling using ChIP data only (no control)
MACS allowed one to call peaks using only the reads from the ChIP experiment and not the control.
!!! ATTENTION !!!
This is NOT a good idea!! Peak calling MUST be done using a control (IgG or input) for background estimation. Background control can exhibit local biais (i.e. artifactual peaks) and IS NECESSARY.
However, we will try to do peak calling with no control to see the results.
First make a link in your working directory to the .bam file of the ChIP sample, which is in your Datasets directory:
ln -s aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.bam .
Now run MACS on the ChIP data, using the following parameters:
– bandwidth of 300 bases to scan the genome for model building
– keeping only 1 reads when duplicates are present
– a threshold of 0.001 on the pValue for reads enrichment.
Other parameters as default.
Remember that you have to tell macs the size of the genome your are working on. Here we are only working on the mouse chr19, which is 61,431,566 bases long.
First create a directory in which MACS will write its results.
mkdir macs2_bw300dup1p0.001_Cebpa macs2 callpeak -t GMP_WT_Cebpa_mm9_chr19_q20.bam -n macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa -g 61431566 --bw 300 --keep-dup 1 -p 0.001 2> macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa.log
2> macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa.log will write MACS messages (results, errors,…) in a text file that you can keep and read later.
use the ls -l command to see the files that MACS has created.
ls -l macs2_bw300dup1p0.001_Cebpa
total 1636 -rw-r--r-- 1 root root 3213 Mar 23 09:33 macs2_bw300dup1p0.001_Cebpa.log -rw-r--r-- 1 root root 78314 Mar 23 09:33 macs2_bw300dup1p0.001_Cebpa_model.r -rw-r--r-- 1 root root 407877 Mar 23 09:33 macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak -rw-r--r-- 1 root root 409054 Mar 23 10:01 macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak.osc -rw-r--r-- 1 root root 441690 Mar 23 09:33 macs2_bw300dup1p0.001_Cebpa_peaks.xls -rw-r--r-- 1 root root 323716 Mar 23 09:33 macs2_bw300dup1p0.001_Cebpa_summits.bed
Read the log file to see if the peak calling worked:
more macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa.log
INFO @ Sun, 23 Mar 2014 09:33:22: # Command line: callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -n macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa -g 61431566 --bw 300 --keep-dup 1 -p 0.001 # ARGUMENTS LIST: # name = macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa # format = AUTO # ChIP-seq file = ['GMP_WT_Cebpa_mm9chr19_q20.bam'] # control file = None # effective genome size = 6.13e+07 # band width = 300 # model fold = [5, 50] # pvalue cutoff = 1.00e-03 # qvalue will not be calculated and reported as -1 in the final output. # Larger dataset will be scaled towards smaller dataset. # Range for calculating regional lambda is: 10000 bps # Broad region calling is off INFO @ Sun, 23 Mar 2014 09:33:22: #1 read tag files... INFO @ Sun, 23 Mar 2014 09:33:22: #1 read treatment tags... INFO @ Sun, 23 Mar 2014 09:33:22: Detected format is: BAM INFO @ Sun, 23 Mar 2014 09:33:22: * Input file is gzipped. INFO @ Sun, 23 Mar 2014 09:33:26: 1000000 INFO @ Sun, 23 Mar 2014 09:33:31: 2000000 INFO @ Sun, 23 Mar 2014 09:33:32: #1 tag size is determined as 51 bps INFO @ Sun, 23 Mar 2014 09:33:32: #1 tag size = 51 INFO @ Sun, 23 Mar 2014 09:33:32: #1 total tags in treatment: 2263023 INFO @ Sun, 23 Mar 2014 09:33:32: #1 user defined the maximum tags... INFO @ Sun, 23 Mar 2014 09:33:32: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s) INFO @ Sun, 23 Mar 2014 09:33:33: #1 tags after filtering in treatment: 709326 INFO @ Sun, 23 Mar 2014 09:33:33: #1 Redundant rate of treatment: 0.69 INFO @ Sun, 23 Mar 2014 09:33:33: #1 finished! INFO @ Sun, 23 Mar 2014 09:33:33: #2 Build Peak Model... INFO @ Sun, 23 Mar 2014 09:33:33: #2 number of paired peaks: 983 WARNING @ Sun, 23 Mar 2014 09:33:33: Fewer paired peaks (983) than 1000! Model may not be build well! Lower your MFOLD parameter may erase this warning. Now I will use 983 pairs to build model! INFO @ Sun, 23 Mar 2014 09:33:33: start model_add_line... INFO @ Sun, 23 Mar 2014 09:33:35: start X-correlation... INFO @ Sun, 23 Mar 2014 09:33:35: end of X-cor INFO @ Sun, 23 Mar 2014 09:33:35: #2 finished! INFO @ Sun, 23 Mar 2014 09:33:35: #2 predicted fragment length is 116 bps INFO @ Sun, 23 Mar 2014 09:33:35: #2 alternative fragment length(s) may be 116 bps INFO @ Sun, 23 Mar 2014 09:33:35: #2.2 Generate R script for model : macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_model.r INFO @ Sun, 23 Mar 2014 09:33:35: #3 Call peaks... INFO @ Sun, 23 Mar 2014 09:33:35: #3 Call peaks with given -log10pvalue cutoff: 3.00000 ... INFO @ Sun, 23 Mar 2014 09:33:35: #3 Pre-compute pvalue-qvalue table... INFO @ Sun, 23 Mar 2014 09:33:40: #3 Call peaks for each chromosome... INFO @ Sun, 23 Mar 2014 09:33:46: #4 Write output xls file... macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.xls INFO @ Sun, 23 Mar 2014 09:33:46: #4 Write peak in narrowPeak format file... macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak INFO @ Sun, 23 Mar 2014 09:33:46: #4 Write summits bed file... macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_summits.bed INFO @ Sun, 23 Mar 2014 09:33:46: Done!
Look at the beginning of the file macs2_bw300dup1p0.001_Cebpa_peaks.xls
more macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.xls
# This file is generated by MACS version 2.0.10.20131216 (tag:beta) # Command line: callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -n macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa -g 61431566 --bw 300 --keep-dup 1 -p 0.001 # ARGUMENTS LIST: # name = macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa # format = AUTO # ChIP-seq file = ['GMP_WT_Cebpa_mm9chr19_q20.bam'] # control file = None # effective genome size = 6.13e+07 # band width = 300 # model fold = [5, 50] # pvalue cutoff = 1.00e-03 # qvalue will not be calculated and reported as -1 in the final output. # Larger dataset will be scaled towards smaller dataset. # Range for calculating regional lambda is: 10000 bps # Broad region calling is off # tag size is determined as 51 bps # total tags in treatment: 2263023 # tags after filtering in treatment: 709326 # maximum duplicate tags at the same position in treatment = 1 # Redundant rate in treatment: 0.69 # d = 116 # alternative fragment length(s) may be 116 bps chr start end length abs_summit pileup -log10(pvalue) fold_enrichment -log10(qvalue) name chr19 3046825 3046940 116 3046882 10.00 6.72976 4.69814 4.38045 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_1 chr19 3053207 3053322 116 3053257 8.00 4.93277 3.84393 2.73420 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_2 chr19 3083546 3083661 116 3083610 9.00 5.81107 4.27104 3.52522 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_3 chr19 3084109 3084354 246 3084188 8.00 4.93277 3.84393 2.73420 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_4 chr19 3085835 3085952 118 3085903 10.00 6.72976 4.69814 4.38045 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_5
Look at the beginning of the file macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak
head macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak
chr19 3046824 3046940 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_1 67 . 4.69814 6.72976 4.38045 57 chr19 3053206 3053322 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_2 49 . 3.84393 4.93277 2.73420 50 chr19 3083545 3083661 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_3 58 . 4.27104 5.81107 3.52522 64 chr19 3084108 3084354 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_4 49 . 3.84393 4.93277 2.73420 79
This is a .narrowPeak file, a format use by the ENCODE consortium. Columns correspond to:
chromosome | start | end | name | Tags count | strand | foldChange vs background | pVal (-log) | qVal | Position of Summit
(you can find these informations on MACS website: https://github.com/taoliu/MACS/)
Count the number of of peaks found by macs
wc -l macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak
3307 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak
order them by pValue and look at the first hits
sort -rgk8 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak | head
chr19 32641557 32641851 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_1704 2674 . 37.80951 267.46173 259.67395 157 chr19 46975862 46976218 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_2626 2549 . 30.24302 254.94597 248.48042 192 chr19 40963403 40963746 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_2183 2400 . 37.83377 240.02473 234.13460 149 chr19 5313233 5313550 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_176 2394 . 38.93952 239.48555 233.61687 174
Go Further …
3 – Peak calling using ChIP data vs Control
Make a link in your working directory to the .bam file of the control sample, which is in your Datasets directory:
ln -s Dataset/Level_2_Aligned_Reads/LSK_WT_igG_mm9chr19_q20.bam LSK_WT_IgG_mm9chr19_q20.bam
Now call peaks using the IgG dataset as control (first create a directory to write MACS results in):
mkdir macs2_bw300dup1p0.001_Cebpa_vs_IgG macs2 callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -c LSK_WT_IgG_mm9chr19_q20.bam -n macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG -g 61431566 --bw 300 --keep-dup 1 -p 0.001 2> macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG.log
Count the number of peaks found with control and compare it to the number of peak called without control:
wc -l macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak
3307 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak
wc -l macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak
1569 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak
We would like to get all the peaks that were filtered when using the control dataset. We can use the program intersectBed from BEDTOOLS (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html) for that:
intersectBed -a macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak -b macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak -v
Normally, intersectBed will report all the genomic intervals in -a that overlap one in -b; option -v tell him to report those in -a who do NOT overlap any ib -b.
This result is directly reported on your terminal. You can look at the top peaks called without control that are filtered out by sorting then by pValue and just getting the top one:
intersectBed -a macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak -b macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak -v | sort -r -g -k8 | head
Note: in sort, the option -r give you the reverse sorted list (ie highest value first), -g tell sort to use general numerical notation (decimal, scientific notation…), and -k8 to sort on the 8th column.
chr19 11410560 11410713 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_335 164 . 8.16740 16.48031 13.73139 68 chr19 37000082 37000216 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_995 163 . 4.87581 16.30774 13.57057 68 chr19 29231289 29231432 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_740 154 . 7.47886 15.43415 12.71137 69 chr19 59532009 59532142 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_1530 152 . 7.45955 15.25735 12.53735 69 chr19 5376457 5376594 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_102 152 . 7.45955 15.25735 12.53735 62 chr19 30516971 30517108 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_794 148 . 7.62319 14.84488 12.13363 57 chr19 7191010 7191153 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_189 143 . 6.03521 14.31085 11.61076 49 chr19 44619348 44619481 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_1214 142 . 7.62480 14.25801 11.56026 64 chr19 41988362 41988553 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_1128 142 . 7.19562 14.21410 11.51678 105 chr19 29443503 29443641 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_754 142 . 7.19562 14.21410 11.51678 59
Note the coordinates of the first peak, you will look at the reads distribution profile there in the next session.
Do they seems like lowly enriched peaks?
A control sample allowed to filter artifact for several reason:
– Experimental artifact (real peak, non specific): some region of the chromatin are highly accessible, and just get enriched after sonication without any immuno-precipitation. By note performing IP (input) or a non-specific IP (IgG), one can identify these non specifically enriched regions.
– Statistical artifact (false peaks): ChIP sample should have a high fractions of reads in peaks region, and therefore much less in background regions. Less reads higher variation in reads enrichment and a more difficult estimation of the statistical (random) distribution of reads. Control sample should be more uniform and allowed better estimation of the distribution (IF SEQUENCED DEEP ENOUGH !! do not be miserly on it, it is not worth it.)
5 – Statistical threshold: pValue vs qValue
When doing peak calling, we performed a lot of statistical tests, and it becomes easy to catch some event with low probability. Therefore, when doing genome-wide analysis, it is better to define the level of significance not on the individual pValue but on a global estimator like the False Discovery Rate (FDR). The qValue given by MACS (pValue corrected by the Benjamini-Hochberg correction) allow to control the FDR at a define threshold (ie, if one select all events with qValue < 0.001, the expected false discovery rate is 0.001).
Note:
– It can be justify to use a pValue threshold to identify individual peaks that “look enriched over the background”, and to confirm those of interest with additional experiment (ChIP-PCR, gel-shift,…).
However, really good peak usually have very high pValue and therefore high qValue.
– If you do genome-wide analysis (motif discovery, peaks-genes association, genome annotations…), you MUST take into account the expected rate of error due to multiple testing.
Run MACS again with a threshold on the qValue at 0.001.
mkdir macs2_bw300dup1q0.001_Cebpa_vs_IgG macs2 callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -c LSK_WT_IgG_mm9chr19_q20.bam -n macs2_bw300dup1q0.001_Cebpa_vs_IgG/macs2_bw300dup1q0.001_Cebpa_vs_IgG -g 61431566 --bw 300 --keep-dup 1 -q 0.001 2> macs2_bw300dup1q0.001_Cebpa_vs_IgG/macs2_bw300dup1q0.001_Cebpa_vs_IgG.log
Compare the number of peaks found with FDR < 0.001 to those with pVal < 0.001.
wc -l macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak
1569 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak
wc -l macs2_bw300dup1q0.001_Cebpa_vs_IgG/macs2_bw300dup1q0.001_Cebpa_vs_IgG_peaks.narrowPeak
1034 macs2_bw300dup1q0.001_Cebpa_vs_IgG/macs2_bw300dup1q0.001_Cebpa_vs_IgG_peaks.narrowPeak
Look at the lowest qValue you got with pVal < 0.001. How many false positive would you expect if selecting all these peaks?
You need to rank the peaks found with pVal < 0.001 by pValue and look at the bottom one.
sort -g -k8 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak | head
chr19 53933711 53933911 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_1399 30 . 2.57541 3.03847 0.90463 0 chr19 42912219 42912364 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_1160 31 . 2.41011 3.15615 0.97725 56 chr19 41337608 41337748 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_1092 35 . 2.73377 3.56270 1.34256 91
You can see that the less significant peaks has a qValue of 0.90463, meaning that you false discovery rate is controlled at ~ 90% (i.e. you can not be sure that more than ~10% of the selected peaks are true positive!! ).
6 – Finding peaks summits (new feature from MACS)
In MACS 2, a new algorithm has been implemented to find summits of peaks, with the option –call-summit.
Often several “individual peaks” are merged in one, because the peaks are too close and therefore the region between them is significantly enriched, to MACS do not find any reason not to merge them. This algorithm is very useful to separate close, overlapping peaks.
Run MACS with this option.
mkdir macs2_bw300dup1q0.001summit_Cebpa_vs_IgG macs2 callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -c LSK_WT_igG_mm9chr19_q20.bam -n macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG -g 61431566 --bw 300 --keep-dup 1 -p 0.001 --call-summit 2> macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG.log
Look at the result file and find if for some peaks, several summits has been found.
When MACS find several summits for a peak, he will put several for the peak, and add to its name a letter a, b, c …
more macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG_peaks.narrowPeak
... chr19 4680432 4680558 macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG_peak_59 148 . 9.31077 17.65865 14.88698 57 chr19 4794008 4794451 macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG_peak_60a 528 . 20.28850 56.20490 52.89221 11 0 chr19 4794008 4794451 macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG_peak_60b 534 . 12.17567 56.76663 53.45085 33 6 chr19 4811159 4811308 macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG_peak_61 293 . 10.74138 32.38012 29.36629 64
Note the coordinates of this peak, you will look at the reads distribution profile there in the next session.
7 – MACS filtering and model
After these basic analysis, we will know go a bit deeper in the quality assessment of the data quality (signal/noise) and the peak calling.
Find the redundancy rate of the alignment file (correspond to the proportion of duplicated reads). How would you interpret it?
INFO @ Sat, 22 Mar 2014 20:41:35: #1 Redundant rate of treatment: 0.69
This mean that 69% of the reads were duplicated. This is due to the fact that the ChIP was performed with a low quantity of chromatin (low number of cells), and the PCR amplification gave a high level of artifacts.
Find the number of peaks used by MACS to build its model
INFO @ Sat, 22 Mar 2014 20:41:35: #2 number of paired peaks: 983 WARNING @ Sat, 22 Mar 2014 20:41:35: Fewer paired peaks (983) than 1000! Model may not be build well! Lower your MFOLD parameter may erase this warning. Now I will use 983 pairs to build model!
MACS only found 983 enriched regions to build its model. As we are using the chromosome 19 only, this is normal. However, a similar result on all the genome would means that the ChIP enrichment is very low.
Find the fragment length inferred by MACS model.
INFO @ Sat, 22 Mar 2014 20:41:36: #2 predicted fragment length is 116 bps INFO @ Sat, 22 Mar 2014 20:41:36: #2 alternative fragment length(s) may be 116 bps
This should correspond (roughly) to the average size of fragment length after sonication. It is always good to verify this predicted value to library quality control (gel or bioanalyser).
The file macs2_bw300dup1p0.001_Cebpa_vs_IgG_model.r is a script in R (a programming language for statistical analysis) that create figures to visualize reads strand cross-correlation used to build the model.
R --vanilla < macs2_bw300dup1q0.001_Cebpa_vs_IgG/macs2_bw300dup1q0.001_Cebpa_vs_IgG_model.r
A pdf file as been created (macs2_bw300dup1q0.001_Cebpa_vs_IgG/macs2_bw300dup1q0.001_Cebpa_vs_IgG_model.pdf)
You can not open it here, but this is how it looks:
To verify if the prediction of MACS modelling is robust, re-run MACS with a band-width of 500 and 1000b.
The fragment length inferred by MACS should not vary too much if the bandwidth is change, as long as the bandwidth is a bit longer than the fragment length, but not too much so the signal is not lost.
mkdir macs2_bw500dup1q0.001_Cebpa_vs_IgG macs2 callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -c LSK_WT_IgG_mm9chr19_q20.bam -n macs2_bw500dup1q0.001_Cebpa_vs_IgG/macs2_bw500dup1q0.001_Cebpa_vs_IgG -g 61431566 --bw 500 --keep-dup 1 -q 0.001 2> macs2_bw500dup1q0.001_Cebpa_vs_IgG/macs2_bw500dup1q0.001_Cebpa_vs_IgG.log R --vanilla < macs2_bw500dup1q0.001_Cebpa_vs_IgG/macs2_bw500dup1q0.001_Cebpa_vs_IgG_model.r
mkdir macs2_bw1000dup1q0.001_Cebpa_vs_IgG macs2 callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -c LSK_WT_IgG_mm9chr19_q20.bam -n macs2_bw1000dup1q0.001_Cebpa_vs_IgG/macs2_bw1000dup1q0.001_Cebpa_vs_IgG -g 61431566 --bw 1000 --keep-dup 1 -q 0.001 2> macs2_bw1000dup1q0.001_Cebpa_vs_IgG/macs2_bw1000dup1q0.001_Cebpa_vs_IgG.log R --vanilla < macs2_bw1000dup1q0.001_Cebpa_vs_IgG/macs2_bw1000dup1q0.001_Cebpa_vs_IgG_model.r
Visualize the result of these 2 models.
Is MACS model robust over parameter tuning?
You could also try to to change the fold change parameter (-mfold) and see how it impact the model.
Bibliography
1. Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9(9):R137 (2008).
2. https://github.com/taoliu/MACS/ (MACS2 website, for the most up-do-date information about last version)