Representing Transcription Factor Dimer Binding Sites Using Forked-Position Weight Matrices and Forked-Sequence Logos

Roberto Tirado-Magallanes^1∗, Aida Ghayour-Khiavi^2∗, Quy Xiao Xuan Lin¹, Hamid Usefi³, Sudhakar Jha¹, Morgane Thomas-Chollier⁴, Denis Thieffry⁴ and Touati Benoukraf^1,2

¹Cancer Science Institute of Singapore, National University of Singapore, Singapore; ²Discipline of Genetics, Faculty of Medicine, Memorial University of Newfoundland, St. John’s, NL, Canada; ³Departement of Mathematics and Statistics, Faculty of Science, Memorial University of Newfoundland, St. John’s, NL, Canada; ⁴Institut de Biologie de l’Ecole Normale Supérieure, Paris, France

Correspondence to tbenoukraf@mun.ca

∗ Co-first author

Last update: 22 March 2020 - Version: 0.1.0

Abstract

Motivation: Current position weight matrices and sequence logos are not adequate to model transcription factor binding sites that are composed by a mixture of homodimer and heterodimer complexes. Results: We introduce forkedTF, an R-library that combined functions to design Forked-Position Weight Matrices (FPWM) and Forked-Sequence Logos to better portray TF dimers. forkedTF uses and produces standard objects facilitating the exportation of FPWM to third-party software.

`forkedTF` integration with `TFregulomeR`

forkedTF utilizes many of the resources and funtions in TFregulomeR. Therefore, TFregulomeR is automatically called when loading forkedTF with library(forkedTF)

Co-Factor exploration

In order to explore the diversity of binding partners of that a give TF might have in a particular cell line, we implemented the function miniCofactorReport(). This function serves as a convenient lite version of cofactorReport included in TFregulomeR. Just by specifying the TF and cell line of interest, miniCofactorReport() will access the “TFregulomeR compendium”, identify the top 9 co-binding partners and generate a PDF report.

library(forkedTF)

miniCofactorReport( TF = "CEBPB", cell = "K562" )

MiniCoFactor report for CEBPB in K562.

You can inspect all the different cellLine-TF combinations with TFregulomeR’s dataBrowser() function. For example, to retireve all the TF available for “K562” run:

all_record <- dataBrowser() # Extract all the entries in the TFregulomeR compendium
cell_tf <- cbind(all_record$cell_tissue_name, all_record$TF) # keep the relevant information in a data frame

cell_of_interest <- "K562" # cell line of interest
cell_tf[grep(cell_of_interest,cell_tf),2] 
  [1] "AFF1"     "ARID2"    "ARID3A"   "ATF1"     "ATF2"     "ATF3"    
  [7] "ATF4"     "ATF7"     "BACH1"    "BHLHE40"  "C11orf30" "CBFA2T3" 
 [13] "CEBPB"    "CEBPD"    "CREB1"    "CREB3L1"  "CREM"     "CTCF"    
 [19] "CTCFL"    "CUX1"     "DACH1"    "DPF2"     "E2F1"     "E2F4"    
 [25] "E2F6"     "E2F7"     "E2F8"     "E4F1"     "EGR1"     "ELF1"    
 [31] "ELF4"     "ELK1"     "ESRRA"    "ETV6"     "FOS"      "FOSL1"   
 [37] "FOXA1"    "FOXK2"    "FOXM1"    "GABPA"    "GABPB1"   "GATA1"   
 [43] "IRF1"     "IRF2"     "JUN"      "JUNB"     "JUND"     "KLF1"    
 [49] "KLF13"    "MAFF"     "MAFG"     "MAFK"     "MAX"      "MAZ"     
 [55] "MEF2A"    "MEF2D"    "MEIS2"    "MITF"     "MNT"      "MXI1"    
 [61] "MYBL2"    "MYC"      "MYNN"     "NCOA1"    "NCOA2"    "NFATC3"  
 [67] "NFE2"     "NFE2L1"   "NFE2L2"   "NFIA"     "NFIC"     "NFYB"    
 [73] "NR2C1"    "NR2F1"    "NR2F2"    "NR3C1"    "NRF1"     "NUFIP1"  
 [79] "PBX2"     "PHF21A"   "PKNOX1"   "RBM25"    "REST"     "RFX1"    
 [85] "RFX5"     "RUNX1"    "SIX5"     "SMAD1"    "SMARCA5"  "SMARCC2" 
 [91] "SP1"      "SPI1"     "SRF"      "TAL1"     "TCF12"    "TCF7"    
 [97] "TEAD2"    "TFDP1"    "THAP1"    "THRA"     "USF1"     "USF2"    
[103] "VEZF1"    "YY1"      "ZBTB2"    "ZBTB40"   "ZBTB5"    "ZBTB7A"  
[109] "ZFP36"    "ZFP91"    "ZFX"      "ZHX1"     "ZKSCAN1"  "ZKSCAN8" 
[115] "ZMYM3"    "ZNF143"   "ZNF148"   "ZNF184"   "ZNF24"    "ZNF263"  
[121] "ZNF316"   "ZNF354B"  "ZNF407"   "ZNF410"   "ZNF507"   "ZNF592"  
[127] "ZNF639"   "ZNF740"   "ZNF75A"   "ZNF83"    "ZSCAN29"

Creation of a FPWM

In order to create a Forked Positioning Weight Matrix (FPWM), we construct multiple PWM, one for each group of peaks from the main TF that overlap with the binding partners of interest. We have created a function called createFPWM that takes as an input the main TF of interest mainTF, a list or vector of binding partners partners, a cell line cell and the postion of the binding site to be forked forkPosition. We foresee that a common application would be to find the most common co-binding partners of a TF in a cell line using miniCofactorReport() and then create FPWM object based on the co-factor report. Here we create a FPWM object based on the top binding partners we observed in the co-factor report above.

createFPWM can take as inputs the IDs in the TFregulomeR compendium.

FPWM_CEBPB_K562 <- createFPWM(mainTF ="CEBPB",
                            partners = c("CEBPD","ATF4","ATF7","ATF3","JUND","FOS"),
                                cell = "K562", 
                        forkPosition = 5)

Aditionally, we can create the same FPWM providing as input the IDs in the TFregulomeR compendium. In this mode use mainTF_MMID to provide the ID of the main TF and partners_MMID for the list or vector of IDs for the binding partners. It is not necessary to specify the cell line as that information is already included in the IDs.

FPWM_CEBPB_K562 <- createFPWM(mainTF_MMID ="MM1_HSA_K562_CEBPB",
                            partners_MMID = c("MM1_HSA_K562_CEBPD",
                                              "MM1_HSA_K562_ATF4",
                                              "MM1_HSA_K562_ATF7",
                                              "MM1_HSA_K562_ATF3",
                                              "MM1_HSA_K562_JUND",
                                              "MM1_HSA_K562_FOS"),
                             forkPosition = 5)

It is very common that a particular TF is not found in a cell line but is present in another. To overcome this limitation, it is possible to compare peaks from different cell lines. To create a FPWM with ChIP-Seq peaks from different cell lines give as input the IDs in TFregulomeR as shown below:

FPWM_different_celllines <- createFPWM(mainTF_MMID ="MM1_HSA_K562_CEBPB",
                                     partners_MMID = c("MM1_HSA_HCT116_CEBPB",
                                                       "MM1_HSA_K562_ATF4",
                                                       "MM1_HSA_GM12878_ATF3"),
                                      forkPosition = 5)

Visualizing FPWM: Forked-Logo

FPWM can be visualized using the function plotFPWM. This will generate a PDF with a forked-logo, were the first bases before the fork correspond to the mainTF of interest. The mainTF motif is connected via a fork with continuations of the logo depending on the binding partner. The logos show the methylation levels for each of the nucleotides in the binding site. The percentage on top of each forked arrow displays the amount of overlapping between the TF of interest and each of the binding partners. The output can be specified with pdfName, as well as the width and height of the PDF.

plotFPWM( FPWM_CEBPB_K562, pdfName = "FPWM_CEBPB_K562.pdf" )

$Plot of the Forked-PWM.$

Plot of the Forked-PWM.

Reverse complement FPWM when the core motif is on the second half of the motif positions.

It might happen that the core motif appears in the second half of the motifs positions while the partners motif is at the beginning. Such is the case for the cofactors of JUND in HepG2 as can be seen in this miniCofactorReport:

miniCofactorReport( TF = "JUND", cell = "HepG2" )

MiniCoFactor report for CEBPB in K562.

In such cases, FPWM must be created with the parameter flipMatrix = TRUE to reverse-complement the individual PWM. forkPosition should be set to the position just before the core main TF motif, in this case position 6.

flipped_fpwm <- createFPWM(mainTF ="JUND",
                        partners = c("ATF7","ATF2","JUN","FOSL2","FOS"),
                        cell = "HepG2", 
                        forkPosition = 6,
                        flipMatrix = TRUE)

The resulting FPWM can be drawn with plotFPWM.

plotFPWM(flipped_fpwm, pdfName="flipped_fpwm.pdf")

MiniCoFactor report for CEBPB in K562.

Writing FPWMs into a FPWM-Transfac format and standard Transfac format.

One of the most common ways to represent a TF binding motif is through the Transfac format. Therefore, we propose a modified version of this format called FPWMtransfac. In this format the mainTF ID is stored as the “parentLogo” whereas the partners are called “leafLogos”. Information regarding overlap percentage, the number of bases pairs found in the overlapped peaks and the total number of overlapped peaks are found in the “overlappingScore”, “numberOfBasePairs” and “numberOfOverlappingPeaks” fields respectively. Each comma separated value in those fields corresponds to the TF in that position in leafLogos. For example, the first value in overlappingScore corresponds to the first TF in leafLogos, the second value to the second TF and so forth. This is also the case for the nucleotide frequency matrix. From position 1 until the forkPosition corresponds to the mainTF shown in the parentLogo field whereas the looping values after the forkPosition correspond to the binding partners in the order shown in the leafLogos field.

FPWM_CEBPB_2partners <- createFPWM(mainTF = "CEBPB",
                                 partners = c("ATF4","JUND"),
                                     cell = "K562", 
                             forkPosition = 5)

write.FPWM( FPWM = FPWM_CEBPB_2partners, format = "FPWMtransfac", fileName = "FPWM_CEBPB_2partners.FPWMtransfac")

This is how the output in FPWMtransfac format looks like:

FPWMtransfac format.

There is a myriad of sequence analysis tools that employ the Transfac format as input. Therefore, we have provided the option to write FPWM in the standard transfact format where each combination of mainTF + partner is given as an independent matrix together with all the other information found in a FPWM.

write.FPWM( FPWM = FPWM_CEBPB_2partners, format = "transfac", fileName = "FPWM_CEBPB_2partners.transfac")

This is how the output looks in transfac format:

FPWM in Transfac format.

Creating FPWMs with three different matrix value options.

In a standard transfact matrix, the sum of all the elements in a row gives the same result for every row. This is because all the positions in the matrix are built from the same number of peaks/binding regions. However, in FPWMs, parts of the matrix are constructed with variable numbers of peaks/binding regions depending on the overlap between the main TF and their partners. We have provided 3 different options to output the matrix values:

Count matrix: Raw counts from the intersection of peaks. The sum of elements in a row can differ across rows
Probability matrix: The counts in a row are divided by the sum of the elements in that row. The sum of elements in a row is always = 1
Scaled Count matrix: Counts are scaled to the highest value from the sum of elements across all rows. The sum of elements in a row is always the same.

# Count Matrix option [Default]
FPWM_CEBPB_2partners <- createFPWM(mainTF = "CEBPB",
                                             partners = c("ATF4","JUND"),
                                                 cell = "K562", 
                                         forkPosition = 5)
                                    
write.FPWM( FPWM = FPWM_CEBPB_2partners, 
            format = "FPWMtransfac", 
            fileName = "FPWM_CEBPB_2partners.FPWMtransfac")

# Probability Matrix option
FPWM_CEBPB_2partners_probability <- createFPWM(mainTF = "CEBPB",
                                             partners = c("ATF4","JUND"),
                                                 cell = "K562", 
                                         forkPosition = 5,
                                    probabilityMatrix = TRUE) # Probability option

write.FPWM( FPWM = FPWM_CEBPB_2partners_probability, 
            format = "FPWMtransfac", fileName = 
              "FPWM_CEBPB_2partners_probability.FPWMtransfac")

# Scaled Count Matrix option
FPWM_CEBPB_2partners_scaledCounts <- createFPWM(mainTF = "CEBPB",
                                              partners = c("ATF4","JUND"),
                                                  cell = "K562", 
                                          forkPosition = 5,
                                  scaleFrequencyCounts = TRUE) # Scaled Count option

write.FPWM( FPWM = FPWM_CEBPB_2partners_scaledCounts, 
            format = "FPWMtransfac", 
            fileName = "FPWM_CEBPB_2partners_scaledCounts.FPWMtransfac")

Different options for the values in the matrix.

Saving FPWMs as RDS.

Lastly, FPWM objects can be easily saved/read into an RDS using R built-in functions:

# To save in RDS:
saveRDS(FPWM_CEBPB_K562,"FPWM_CEBPB_K562.rds")

# To read RDS:
FPWM_CEBPB_K562 <- readRDS("FPWM_CEBPB_K562.rds")

Representing Transcription Factor Dimer Binding Sites Using Forked-Position Weight Matrices and Forked-Sequence Logos

Roberto Tirado-Magallanes1∗, Aida Ghayour-Khiavi2∗, Quy Xiao Xuan Lin1, Hamid Usefi3, Sudhakar Jha1, Morgane Thomas-Chollier4, Denis Thieffry4 and Touati Benoukraf1,2

Correspondence to tbenoukraf@mun.ca

Last update: 22 March 2020 - Version: 0.1.0

forkedTF integration with TFregulomeR

Co-Factor exploration

Creation of a FPWM

Visualizing FPWM: Forked-Logo

Reverse complement FPWM when the core motif is on the second half of the motif positions.

Writing FPWMs into a FPWM-Transfac format and standard Transfac format.

Creating FPWMs with three different matrix value options.

Saving FPWMs as RDS.

Roberto Tirado-Magallanes^1∗, Aida Ghayour-Khiavi^2∗, Quy Xiao Xuan Lin¹, Hamid Usefi³, Sudhakar Jha¹, Morgane Thomas-Chollier⁴, Denis Thieffry⁴ and Touati Benoukraf^1,2

`forkedTF` integration with `TFregulomeR`