Session 2: signal extraction – peak calling (course)


 

Signal extraction: peak calling

Goal: to identify region in the genome where the ChIPed protein is bound, i.e. regions where reads are enriched over the background.

There are different methods to find peaks in ChIP-seq data. We will the software MACS (Model-based Analysis of ChIP-seq) version 2, because i) it is widely used, ii) it is very efficient and rely on clear statistical standard, and iii) version 2 implemented the most up-to-date methods of peak calling.

Note: version 2 is the most up-to-date, and is more efficient than previous versions. However it is still in development (and not published yet). Arguments and default options may change in the (near) future.

 

Introductory slides

To see the slides, click on this link (and then click on the picture of the slide)

1 – MACS2 manual

MACS can be called with the command macs2. Use the argument -h to see the available arguments (A more eye-candy version can be found at https://github.com/taoliu/MACS/ )

macs2 -h

At the bottom of the manual, you can see

For command line options of each command, type: macs2 COMMAND -h

The function we want to use is callpeak. Have a look to the available option for this function:

macs2 callpeak -h

 

2 – Peak calling using ChIP data only (no control)

MACS allowed one to call peaks using only the reads from the ChIP experiment and not the control.

!!! ATTENTION !!!

This is NOT a good idea!! Peak calling MUST be done using a control (IgG or input) for background estimation. Background control can exhibit local biais (i.e. artifactual peaks) and IS NECESSARY.

However, we will try to do peak calling with no control to see the results.
First make a link in your working directory to the .bam file of the ChIP sample, which is in your Datasets directory:

ln -s aligned_reads/GMP_WT_Cebpa_mm9_chr19_q20.bam .

Now run MACS on the ChIP data, using the following parameters:

– bandwidth of 300 bases to scan the genome for model building

– keeping only 1 reads when duplicates are present

– a threshold of 0.001 on the pValue for reads enrichment.

Other parameters as default.

Remember that you have to tell macs the size of the genome your are working on. Here we are only working on the mouse chr19, which is 61,431,566 bases long.

First create a directory in which MACS will write its results.

mkdir macs2_bw300dup1p0.001_Cebpa

macs2 callpeak -t GMP_WT_Cebpa_mm9_chr19_q20.bam -n macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa -g 61431566 --bw 300 --keep-dup 1 -p 0.001 2> macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa.log

2> macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa.log will write MACS messages (results, errors,…) in a text file that you can keep and read later.

use the ls -l command to see the files that MACS has created.

ls -l macs2_bw300dup1p0.001_Cebpa
total 1636
-rw-r--r-- 1 root root 3213 Mar 23 09:33 macs2_bw300dup1p0.001_Cebpa.log
-rw-r--r-- 1 root root 78314 Mar 23 09:33 macs2_bw300dup1p0.001_Cebpa_model.r
-rw-r--r-- 1 root root 407877 Mar 23 09:33 macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak
-rw-r--r-- 1 root root 409054 Mar 23 10:01 macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak.osc
-rw-r--r-- 1 root root 441690 Mar 23 09:33 macs2_bw300dup1p0.001_Cebpa_peaks.xls
-rw-r--r-- 1 root root 323716 Mar 23 09:33 macs2_bw300dup1p0.001_Cebpa_summits.bed

Read the log file to see if the peak calling worked:

more macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa.log
INFO @ Sun, 23 Mar 2014 09:33:22: 
# Command line: callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -n macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa -g 61431566 --bw 300 --keep-dup 1 -p 0.001
# ARGUMENTS LIST:
# name = macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa
# format = AUTO
# ChIP-seq file = ['GMP_WT_Cebpa_mm9chr19_q20.bam']
# control file = None
# effective genome size = 6.13e+07
# band width = 300
# model fold = [5, 50]
# pvalue cutoff = 1.00e-03
# qvalue will not be calculated and reported as -1 in the final output.
# Larger dataset will be scaled towards smaller dataset.
# Range for calculating regional lambda is: 10000 bps
# Broad region calling is off

INFO @ Sun, 23 Mar 2014 09:33:22: #1 read tag files... 
INFO @ Sun, 23 Mar 2014 09:33:22: #1 read treatment tags... 
INFO @ Sun, 23 Mar 2014 09:33:22: Detected format is: BAM 
INFO @ Sun, 23 Mar 2014 09:33:22: * Input file is gzipped. 
INFO @ Sun, 23 Mar 2014 09:33:26: 1000000 
INFO @ Sun, 23 Mar 2014 09:33:31: 2000000 
INFO @ Sun, 23 Mar 2014 09:33:32: #1 tag size is determined as 51 bps 
INFO @ Sun, 23 Mar 2014 09:33:32: #1 tag size = 51 
INFO @ Sun, 23 Mar 2014 09:33:32: #1 total tags in treatment: 2263023 
INFO @ Sun, 23 Mar 2014 09:33:32: #1 user defined the maximum tags... 
INFO @ Sun, 23 Mar 2014 09:33:32: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s) 
INFO @ Sun, 23 Mar 2014 09:33:33: #1 tags after filtering in treatment: 709326 
INFO @ Sun, 23 Mar 2014 09:33:33: #1 Redundant rate of treatment: 0.69 
INFO @ Sun, 23 Mar 2014 09:33:33: #1 finished! 
INFO @ Sun, 23 Mar 2014 09:33:33: #2 Build Peak Model... 
INFO @ Sun, 23 Mar 2014 09:33:33: #2 number of paired peaks: 983 
WARNING @ Sun, 23 Mar 2014 09:33:33: Fewer paired peaks (983) than 1000! Model may not be build well! Lower your MFOLD parameter may erase this warning. Now I will use 983 pairs to build
 model! 
INFO @ Sun, 23 Mar 2014 09:33:33: start model_add_line... 
INFO @ Sun, 23 Mar 2014 09:33:35: start X-correlation... 
INFO @ Sun, 23 Mar 2014 09:33:35: end of X-cor 
INFO @ Sun, 23 Mar 2014 09:33:35: #2 finished! 
INFO @ Sun, 23 Mar 2014 09:33:35: #2 predicted fragment length is 116 bps 
INFO @ Sun, 23 Mar 2014 09:33:35: #2 alternative fragment length(s) may be 116 bps 
INFO @ Sun, 23 Mar 2014 09:33:35: #2.2 Generate R script for model : macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_model.r 
INFO @ Sun, 23 Mar 2014 09:33:35: #3 Call peaks... 
INFO @ Sun, 23 Mar 2014 09:33:35: #3 Call peaks with given -log10pvalue cutoff: 3.00000 ... 
INFO @ Sun, 23 Mar 2014 09:33:35: #3 Pre-compute pvalue-qvalue table... 
INFO @ Sun, 23 Mar 2014 09:33:40: #3 Call peaks for each chromosome... 
INFO @ Sun, 23 Mar 2014 09:33:46: #4 Write output xls file... macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.xls 
INFO @ Sun, 23 Mar 2014 09:33:46: #4 Write peak in narrowPeak format file... macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak 
INFO @ Sun, 23 Mar 2014 09:33:46: #4 Write summits bed file... macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_summits.bed 
INFO @ Sun, 23 Mar 2014 09:33:46: Done!

 

Look at the beginning of the file macs2_bw300dup1p0.001_Cebpa_peaks.xls

more macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.xls
# This file is generated by MACS version 2.0.10.20131216 (tag:beta)
# Command line: callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -n macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa -g 61431566 --bw 300 --keep-dup 1 -p 0.001
# ARGUMENTS LIST:
# name = macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa
# format = AUTO
# ChIP-seq file = ['GMP_WT_Cebpa_mm9chr19_q20.bam']
# control file = None
# effective genome size = 6.13e+07
# band width = 300
# model fold = [5, 50]
# pvalue cutoff = 1.00e-03
# qvalue will not be calculated and reported as -1 in the final output.
# Larger dataset will be scaled towards smaller dataset.
# Range for calculating regional lambda is: 10000 bps
# Broad region calling is off

# tag size is determined as 51 bps
# total tags in treatment: 2263023
# tags after filtering in treatment: 709326
# maximum duplicate tags at the same position in treatment = 1
# Redundant rate in treatment: 0.69
# d = 116
# alternative fragment length(s) may be 116 bps
chr start end length abs_summit pileup -log10(pvalue) fold_enrichment -log10(qvalue) name
chr19 3046825 3046940 116 3046882 10.00 6.72976 4.69814 4.38045 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_1
chr19 3053207 3053322 116 3053257 8.00 4.93277 3.84393 2.73420 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_2
chr19 3083546 3083661 116 3083610 9.00 5.81107 4.27104 3.52522 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_3
chr19 3084109 3084354 246 3084188 8.00 4.93277 3.84393 2.73420 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_4
chr19 3085835 3085952 118 3085903 10.00 6.72976 4.69814 4.38045 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_5

 

Look at the beginning of the file macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak

head macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak
chr19 3046824 3046940 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_1 67 . 4.69814 6.72976 4.38045 57
chr19 3053206 3053322 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_2 49 . 3.84393 4.93277 2.73420 50
chr19 3083545 3083661 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_3 58 . 4.27104 5.81107 3.52522 64
chr19 3084108 3084354 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_4 49 . 3.84393 4.93277 2.73420 79

 

This is a .narrowPeak file, a format use by the ENCODE consortium. Columns correspond to:

chromosome   |   start   |   end   |   name   |   Tags count   |   strand   |   foldChange vs background   |   pVal (-log)   |   qVal   |   Position of Summit

(you can find these informations on MACS website: https://github.com/taoliu/MACS/)

Count the number of of peaks found by macs

wc -l macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak
3307 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak

 

order them by pValue and look at the first hits

sort -rgk8 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak | head
chr19 32641557 32641851 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_1704 2674 . 37.80951 267.46173 259.67395 157
chr19 46975862 46976218 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_2626 2549 . 30.24302 254.94597 248.48042 192
chr19 40963403 40963746 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_2183 2400 . 37.83377 240.02473 234.13460 149
chr19 5313233 5313550 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peak_176 2394 . 38.93952 239.48555 233.61687 174

Go Further …

3 – Peak calling using ChIP data vs Control

Make a link in your working directory to the .bam file of the control sample, which is in your Datasets directory:

ln -s Dataset/Level_2_Aligned_Reads/LSK_WT_igG_mm9chr19_q20.bam LSK_WT_IgG_mm9chr19_q20.bam

Now call peaks using the IgG dataset as control (first create a directory to write MACS results in):

mkdir macs2_bw300dup1p0.001_Cebpa_vs_IgG

macs2 callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -c LSK_WT_IgG_mm9chr19_q20.bam -n macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG -g 61431566 --bw 300 --keep-dup 1 -p 0.001 2> macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG.log

Count the number of peaks found with control and compare it to the number of peak called without control:

wc -l macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak
3307 macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak

 

wc -l macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak
1569 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak

 

We would like to get all the peaks that were filtered when using the control dataset. We can use the program intersectBed from BEDTOOLS (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html) for that:

intersectBed -a macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak -b macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak -v

Normally, intersectBed will report all the genomic intervals in -a that overlap one in -b; option -v tell him to report those in -a who do NOT overlap any ib -b.

This result is directly reported on your terminal. You can look at the top peaks called without control that are filtered out by sorting then by pValue and just getting the top one:

intersectBed -a macs2_bw300dup1p0.001_Cebpa/macs2_bw300dup1p0.001_Cebpa_peaks.narrowPeak -b macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak -v | sort -r -g -k8 | head

Note: in sort, the option -r give you the reverse sorted list (ie highest value first), -g tell sort to use general numerical notation (decimal, scientific notation…), and -k8 to sort on the 8th column.

chr19 11410560 11410713 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_335 164 . 8.16740 16.48031 13.73139 68
chr19 37000082 37000216 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_995 163 . 4.87581 16.30774 13.57057 68
chr19 29231289 29231432 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_740 154 . 7.47886 15.43415 12.71137 69
chr19 59532009 59532142 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_1530 152 . 7.45955 15.25735 12.53735 69
chr19 5376457 5376594 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_102 152 . 7.45955 15.25735 12.53735 62
chr19 30516971 30517108 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_794 148 . 7.62319 14.84488 12.13363 57
chr19 7191010 7191153 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_189 143 . 6.03521 14.31085 11.61076 49
chr19 44619348 44619481 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_1214 142 . 7.62480 14.25801 11.56026 64
chr19 41988362 41988553 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_1128 142 . 7.19562 14.21410 11.51678 105
chr19 29443503 29443641 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_754 142 . 7.19562 14.21410 11.51678 59

 

Note the coordinates of the first peak, you will look at the reads distribution profile there in the next session.

Do they seems like lowly enriched peaks?

A control sample allowed to filter artifact for several reason:

– Experimental artifact (real peak, non specific): some region of the chromatin are highly accessible, and just get enriched after sonication without any immuno-precipitation. By note performing IP (input)  or a non-specific IP (IgG), one can identify these non specifically enriched regions.

Statistical artifact (false peaks): ChIP sample should have a high fractions of reads in peaks region, and therefore much less in background regions. Less reads higher variation in reads enrichment and a more difficult estimation of the statistical (random) distribution of reads. Control sample should be more uniform and allowed better estimation of the distribution (IF SEQUENCED DEEP ENOUGH !! do not be miserly on it, it is not worth it.)

 

5 – Statistical threshold: pValue vs qValue

When doing peak calling, we performed a lot of statistical tests, and it becomes easy to catch some event with low probability. Therefore, when doing genome-wide analysis, it is better to define the level of significance not on the individual pValue but on a global estimator like the False Discovery Rate (FDR). The qValue given by MACS (pValue corrected by the Benjamini-Hochberg correction) allow to control the FDR at a define threshold (ie, if one select all events with qValue < 0.001, the expected false discovery rate is 0.001).

Note:

– It can be justify to use a pValue threshold to identify individual peaks that “look enriched over the background”, and to confirm those of interest with additional experiment (ChIP-PCR, gel-shift,…).

However, really good peak usually have very high pValue and therefore high qValue.

 

– If you do genome-wide analysis (motif discovery, peaks-genes association, genome annotations…), you MUST take into account the expected rate of error due to multiple testing.

Run MACS again with a threshold on the qValue at 0.001.

mkdir macs2_bw300dup1q0.001_Cebpa_vs_IgG

macs2 callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -c LSK_WT_IgG_mm9chr19_q20.bam -n macs2_bw300dup1q0.001_Cebpa_vs_IgG/macs2_bw300dup1q0.001_Cebpa_vs_IgG -g 61431566 --bw 300 --keep-dup 1 -q 0.001 2> macs2_bw300dup1q0.001_Cebpa_vs_IgG/macs2_bw300dup1q0.001_Cebpa_vs_IgG.log

Compare the number of peaks found with FDR < 0.001 to those with pVal < 0.001.

wc -l macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak
1569 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak

 

 wc -l macs2_bw300dup1q0.001_Cebpa_vs_IgG/macs2_bw300dup1q0.001_Cebpa_vs_IgG_peaks.narrowPeak
1034 macs2_bw300dup1q0.001_Cebpa_vs_IgG/macs2_bw300dup1q0.001_Cebpa_vs_IgG_peaks.narrowPeak

 

Look at the lowest qValue you got with pVal < 0.001. How many false positive would you expect if selecting all these peaks?

You need to rank the peaks found with pVal < 0.001 by pValue and look at the bottom one.

sort -g -k8 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peaks.narrowPeak | head
chr19 53933711 53933911 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_1399 30 . 2.57541 3.03847 0.90463 0
chr19 42912219 42912364 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_1160 31 . 2.41011 3.15615 0.97725 56
chr19 41337608 41337748 macs2_bw300dup1p0.001_Cebpa_vs_IgG/macs2_bw300dup1p0.001_Cebpa_vs_IgG_peak_1092 35 . 2.73377 3.56270 1.34256 91

 

You can see that the less significant peaks has a qValue of  0.90463, meaning that you false discovery rate is controlled at ~ 90% (i.e. you can not be sure that more than ~10% of the selected peaks are true positive!! ).

 

6 – Finding peaks summits (new feature from MACS)

In MACS 2, a new algorithm has been implemented to find summits of peaks, with the option –call-summit.

Often several “individual peaks” are merged in one, because the peaks are too close and therefore the region between them is significantly enriched, to MACS do not find any reason not to merge them. This algorithm is very useful to separate close, overlapping peaks.

Run MACS with this option.

mkdir macs2_bw300dup1q0.001summit_Cebpa_vs_IgG

macs2 callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -c LSK_WT_igG_mm9chr19_q20.bam -n macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG -g 61431566 --bw 300 --keep-dup 1 -p 0.001 --call-summit 2> macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG.log

Look at the result file and find if for some peaks, several summits has been found.

When MACS find several summits for a peak, he will put several for the peak, and add to its name a letter a, b, c …

more macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG_peaks.narrowPeak
...
chr19 4680432 4680558 macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG_peak_59 148 . 9.31077 17.65865 14.88698 57
chr19 4794008 4794451 macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG_peak_60a 528 . 20.28850 56.20490 52.89221 11 0
chr19 4794008 4794451 macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG_peak_60b 534 . 12.17567 56.76663 53.45085 33 6
chr19 4811159 4811308 macs2_bw300dup1q0.001summit_Cebpa_vs_IgG/macs2_bw300dup1q0.001summit_Cebpa_vs_IgG_peak_61 293 . 10.74138 32.38012 29.36629 64

 

Note the coordinates of this peak, you will look at the reads distribution profile there in the next session.

 

7 – MACS filtering and model

After these basic analysis, we will know go a bit deeper in the quality assessment of the data quality (signal/noise) and the peak calling.

Find the redundancy rate of the alignment file (correspond to the proportion of duplicated reads). How would you interpret it?

INFO @ Sat, 22 Mar 2014 20:41:35: #1 Redundant rate of treatment: 0.69

 

This mean that 69% of the reads were duplicated. This is due to the fact that the ChIP was performed with a low quantity of chromatin (low number of cells), and the PCR amplification gave a high level of artifacts.

Find the number of peaks used by MACS to build its model

INFO @ Sat, 22 Mar 2014 20:41:35: #2 number of paired peaks: 983 
WARNING @ Sat, 22 Mar 2014 20:41:35: Fewer paired peaks (983) than 1000! Model may not be build well! Lower your MFOLD parameter may erase this warning. Now I will use 983 pairs to build
 model!

 

MACS only found 983 enriched regions to build its model. As we are using the chromosome 19 only, this is normal. However, a similar result on all the genome would means that the ChIP enrichment is very low.

Find the fragment length inferred by MACS model.

INFO @ Sat, 22 Mar 2014 20:41:36: #2 predicted fragment length is 116 bps 
INFO @ Sat, 22 Mar 2014 20:41:36: #2 alternative fragment length(s) may be 116 bps

 

This should correspond (roughly) to the average size of fragment length after sonication. It is always good to verify this predicted value to library quality control (gel or bioanalyser).

The file macs2_bw300dup1p0.001_Cebpa_vs_IgG_model.r is a script in R (a programming language for statistical analysis) that create figures to visualize reads strand cross-correlation used to build the model.

R --vanilla < macs2_bw300dup1q0.001_Cebpa_vs_IgG/macs2_bw300dup1q0.001_Cebpa_vs_IgG_model.r

A pdf file as been created (macs2_bw300dup1q0.001_Cebpa_vs_IgG/macs2_bw300dup1q0.001_Cebpa_vs_IgG_model.pdf)
You can not open it here, but this is how it looks:
macs2_bw300dup1q0.001_Cebpa_vs_IgG_model-0macs2_bw300dup1q0.001_Cebpa_vs_IgG_model-1

To verify if the prediction of MACS modelling is robust, re-run MACS with a band-width of 500 and 1000b.

The fragment length inferred by MACS should not vary too much if the bandwidth is change, as long as the bandwidth is a bit longer than the fragment length, but not too much so the signal is not lost. 

mkdir macs2_bw500dup1q0.001_Cebpa_vs_IgG

macs2 callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -c LSK_WT_IgG_mm9chr19_q20.bam -n macs2_bw500dup1q0.001_Cebpa_vs_IgG/macs2_bw500dup1q0.001_Cebpa_vs_IgG -g 61431566 --bw 500 --keep-dup 1 -q 0.001 2> macs2_bw500dup1q0.001_Cebpa_vs_IgG/macs2_bw500dup1q0.001_Cebpa_vs_IgG.log

R --vanilla < macs2_bw500dup1q0.001_Cebpa_vs_IgG/macs2_bw500dup1q0.001_Cebpa_vs_IgG_model.r

macs2_bw500dup1q0.001_Cebpa_vs_IgG_model-0macs2_bw500dup1q0.001_Cebpa_vs_IgG_model-1

mkdir macs2_bw1000dup1q0.001_Cebpa_vs_IgG

macs2 callpeak -t GMP_WT_Cebpa_mm9chr19_q20.bam -c LSK_WT_IgG_mm9chr19_q20.bam -n macs2_bw1000dup1q0.001_Cebpa_vs_IgG/macs2_bw1000dup1q0.001_Cebpa_vs_IgG -g 61431566 --bw 1000 --keep-dup 1 -q 0.001 2> macs2_bw1000dup1q0.001_Cebpa_vs_IgG/macs2_bw1000dup1q0.001_Cebpa_vs_IgG.log

R --vanilla < macs2_bw1000dup1q0.001_Cebpa_vs_IgG/macs2_bw1000dup1q0.001_Cebpa_vs_IgG_model.r

 Visualize the result of these 2 models.

macs2_bw1000dup1q0.001_Cebpa_vs_IgG_model-0macs2_bw1000dup1q0.001_Cebpa_vs_IgG_model-1

Is MACS model robust over parameter tuning?

You could also try to to change the fold change parameter (-mfold) and see how it impact the model.

 

Bibliography

1. Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9(9):R137 (2008).

2. https://github.com/taoliu/MACS/ (MACS2 website, for the most up-do-date information about last version)