4.2 : RSAT peak-motifs for motif discovery in large datasets



Goal: Define binding motif(s) for the ChIPed transcription factor and identify potential cofactors

We will work on the peaks found on chromosome 19 (which you obtained at the peak-calling step yesterday). For easy access to this file, download the peak file here, and save it on your Windows computer (on the Desktop, for example) under the name: Cebpa_vs_igG_peaks.narrowPeak

1 – Retrieve the peak sequences corresponding to the peak coordinate file (BED)

For the motif analysis, you first need to extract the sequences underlying the peaks. There are several ways to do this (as usual…). If you work on a UCSC-supported organism, the easiest is to use RSAT fetch-sequences or Galaxy. For organisms not supported by UCSC, the best option is to use Bedtools, after downloading the genome of interest to your computer. Here, we will use the RSAT suite of tools [1-3].
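
To make the idea of "extracting the sequences underlying the peaks" concrete, here is a minimal Python sketch of what such a tool does conceptually. It is not the RSAT or Bedtools implementation. The function names (`parse_peaks`, `extract_sequences`) and the toy genome are hypothetical, and the genome is assumed to be already loaded in memory as a dictionary of chromosome name to sequence. The only format fact relied upon is that BED/narrowPeak start coordinates are 0-based and end coordinates are exclusive.

```python
from typing import Dict, Iterable, List, Tuple


def parse_peaks(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Read (chrom, start, end) from BED/narrowPeak lines.

    Blank lines, comments and track/browser lines are skipped.
    Only the first three columns are used.
    """
    peaks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        peaks.append((fields[0], int(fields[1]), int(fields[2])))
    return peaks


def extract_sequences(
    genome: Dict[str, str], peaks: List[Tuple[str, int, int]]
) -> Dict[str, str]:
    """Slice each peak out of the genome.

    BED starts are 0-based and ends are exclusive, which matches
    Python slicing directly.
    """
    sequences = {}
    for chrom, start, end in peaks:
        sequences[f"{chrom}:{start}-{end}"] = genome[chrom][start:end]
    return sequences


# Toy data (hypothetical): a 20 bp "chromosome" and two peaks.
toy_genome = {"chr19": "ACGTTGCGCAATACGTACGT"}
toy_peaks = parse_peaks([
    "chr19\t4\t12\tpeak_1\t100\t.",
    "chr19\t0\t4\tpeak_2\t50\t.",
])
toy_sequences = extract_sequences(toy_genome, toy_peaks)
for name, seq in toy_sequences.items():
    print(name, seq)
```

The sketch also illustrates why the assembly matters: the same coordinates sliced out of a different genome string give unrelated sequences.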

 

  1. In your web browser, open the Regulatory Sequence Analysis Tools (RSAT) Singapore server: tblab-csi.nus.edu.sg/rsat.
  2. At the bottom of the page, you can choose between various website mirrors. To avoid overloading any single server during this training, you will be assigned a particular server. Click on the picture of your assigned server.
  3. Make sure you are on your assigned server by looking at the picture at the top left of the page.
  4. In the left menu, click on NGS ChIP-seq and then click on fetch sequences from UCSC. A new page opens, with a form.
  5. In the Genome select menu (—UCSC genome—), choose the mouse mm9 assembly: the mapping step was performed on this assembly, so the peak coordinates refer to mm9.

    This step is crucial: choosing a different assembly will retrieve completely different sequences.


  6. Specify the coordinates of the peaks to use for the motif analysis. In Upload a file from your computer, click on the button to upload/choose the file Cebpa_vs_igG_peaks.narrowPeak.
  7. Keep Galaxy as the Header Format.
  8. Click on the button GO.

    Depending on the server, either a new page appears or the form page stays displayed until the job is finished. Wait until the job completes (do not go back or click several times on the GO button).

  9. A new page appears with links to the results. Click on the fasta file to quickly view the retrieved sequences. Return to the result page.
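
If you prefer to inspect the retrieved FASTA file programmatically rather than in the browser, the following Python sketch reads FASTA records and splits the headers back into coordinates. It rests on an assumption that you should check against your own file: that the Galaxy header format produces headers shaped like `>mm9_chr19_3204562_3204800_+` (assembly, chromosome, start, end, strand, joined by underscores). The function names `read_fasta` and `parse_galaxy_header` are hypothetical helpers, not part of RSAT.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_galaxy_header(header: str) -> Tuple[str, str, int, int, str]:
    """Split an assumed 'assembly_chrom_start_end_strand' header.

    ASSUMPTION: the header layout described in the text above.
    The chromosome name may itself contain underscores
    (e.g. chr1_random), so the last three fields are split off
    from the right.
    """
    name = header.split()[0]
    assembly, rest = name.split("_", 1)
    chrom, start, end, strand = rest.rsplit("_", 3)
    return assembly, chrom, int(start), int(end), strand


# Toy FASTA content (hypothetical), as a list of lines.
toy_fasta = [
    ">mm9_chr19_100_108_+",
    "ACGTAC",
    "GT",
    ">mm9_chr19_200_204_+",
    "TTGC",
]
records = list(read_fasta(toy_fasta))
for header, seq in records:
    print(parse_galaxy_header(header), len(seq))
```

To run it on the real file, replace `toy_fasta` with an open file handle, for example `open("peaks.fasta")` (a hypothetical file name).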

 

 

2 – Motif discovery with RSAT

  1. At the bottom of the result page, there is a section Next step, with a button peak-motifs. Click on this button, so that the sequences are automatically sent to the form of the motif analysis program called peak-motifs [4,5].
  2. The default peak-motifs web form only displays the essential options. There are only two mandatory parameters:
    a. The title box, which has been automatically set to Cebpa_vs_igG_peaks.narrowPeak_20140322_113647_a3e.narrowPeak
    b. The sequences, which are linked through a URL on the server.

    Make sure the Output option is set to display and not email, as some recently installed servers have not yet been configured for email.

  3. We could launch the analysis like this, but we will now modify some of the advanced options in order to fine-tune the analysis according to your data set.
      • Open the “Reduce peak sequences” title, and make sure the “Cut peak sequences: +/- ” option is set to 0 (we wish to analyse our full dataset)
      • Open the “Motif Discovery parameters” title, and check the oligomer sizes 6 and 7 (but not 8). Check “Discover over-represented spaced word pairs [dyad-analysis]”


      • Under the “Locate motifs and export predicted sites as custom UCSC tracks” title, check Peak coordinates specified in fasta headers of the test sequence file (Galaxy format)


      • Click on the button GO.

      Depending on the server, either a new page appears or the form page stays displayed until the job is finished. Wait until the job completes; it will take several minutes (do not go back or click several times on the GO button).

       

  4. While waiting, read the protocol “A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs” by Thomas-Chollier et al., Nature Protocols (2012) [5] (PDF). In particular, read the introduction, the boxes and the anticipated results.
  5. The web page also displays a link to the report. You can already click on this link: the report is progressively updated while the workflow is being processed.
  6. Analyze the sequence composition panel

    How many peaks were given as input? What is the sequence length range (smallest/largest peaks), and what is the mean length? Look at the dinucleotide profile: there is a bias in the CG dinucleotide. Do you know why?

    Tip:

    The mean length of the peaks is about 200 bp. All peaks have a length between 116 bp and 639 bp, so they are relatively small. The CG dinucleotide is less frequent than the other dinucleotides because, in mammals, the cytosine of CG tends to be methylated (a specific regulatory signal), and methylated cytosines are prone to deamination into thymine, which has depleted CG over evolutionary time. This must be taken into account when calculating the expected frequency of a given motif, and it shows that a naïve model with equal nucleotide frequencies (A=C=G=T) is not representative of a real biological sequence.

  7. Analyze the discovered motifs


    Click on the panel Discovered motifs (by algorithm). You will see all the motifs that were found by the various algorithms selected in the form. Look at the links to the discovered words, which were used as seeds to build each motif as a matrix.

  8. Analyze the motif comparison

    Do we discover significant motifs? Are these motifs biologically relevant? In particular, did the program discover motifs related to CEBPA, the ChIPed factor?

    Tip:

    The motif of CEBPA, the ChIPed factor, is found. In addition, it is found by multiple algorithms, which gives us more confidence in the discovered motifs. Another motif is found that is similar to the ELF motif. This factor belongs to the Ets family, and it is well known that PU.1 is an Ets family cofactor of CEBPA.

    A copy of the results is available here: http://tblab-csi.nus.edu.sg/rsat/tmp/www-data/2014/03/22/peak-motifs.2014-03-22.210602_2014-03-22.210602_eBWWtp/peak-motifs_synthesis.html
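
The questions of the sequence composition step (number of peaks, length range, mean length, dinucleotide bias) can also be checked by hand on the retrieved sequences. The Python sketch below shows one simple way to do so. It is not the computation performed by peak-motifs, whose report may count and normalise differently; the function names and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, List


def length_summary(seqs: List[str]) -> Dict[str, float]:
    """Number of sequences, plus min, max and mean length."""
    lengths = [len(s) for s in seqs]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


def dinucleotide_frequencies(seqs: List[str]) -> Dict[str, float]:
    """Relative frequency of each of the 16 dinucleotides.

    Overlapping dinucleotides are counted on the given strand only;
    dinucleotides containing anything other than A, C, G, T
    (e.g. N) are ignored.
    """
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - 1):
            di = seq[i:i + 2]
            if set(di) <= set("ACGT"):
                counts[di] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {a + b: counts[a + b] / total for a, b in product("ACGT", repeat=2)}


# Toy sequences (hypothetical), standing in for the peak sequences.
toy_seqs = ["ACGTACGT", "AATT"]
summary = length_summary(toy_seqs)
freqs = dinucleotide_frequencies(toy_seqs)
print(summary)
print("CG frequency:", freqs["CG"])
```

On real mammalian peak sequences, one would expect the CG entry to be clearly lower than most of the other fifteen values, in line with the tip of the sequence composition step.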

 

Bibliography


1. van Helden, J. Regulatory sequence analysis tools. Nucleic Acids Res 31, 3593–3596 (2003).

2. Thomas-Chollier, M. et al. RSAT: regulatory sequence analysis tools. Nucleic Acids Res 36, W119–27 (2008).

3. Thomas-Chollier, M. et al. RSAT 2011: regulatory sequence analysis tools. Nucleic Acids Res 39, W86–91 (2011).

4. Thomas-Chollier, M. et al. RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res 40, e31 (2012).

5. Thomas-Chollier, M. et al. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nature Protocols 7, 1551–1568 (2012).