∗ Co-first author
Abstract
Motivation: Current position weight matrices and sequence logos are not adequate to model transcription factor binding sites that are composed by a mixture of homodimer and heterodimer complexes. Results: We introduceforkedTF
, an R-library that combined functions to design Forked-Position Weight Matrices (FPWM) and Forked-Sequence Logos to better portray TF dimers. forkedTF
uses and produces standard objects facilitating the exportation of FPWM to third-party software.
forkedTF
integration with TFregulomeR
forkedTF
utilizes many of the resources and funtions in TFregulomeR
. Therefore, TFregulomeR
is automatically called when loading forkedTF
with library(forkedTF)
In order to explore the diversity of binding partners of that a give TF might have in a particular cell line, we implemented the function miniCofactorReport()
. This function serves as a convenient lite version of cofactorReport
included in TFregulomeR
. Just by specifying the TF and cell line of interest, miniCofactorReport()
will access the “TFregulomeR compendium”, identify the top 9 co-binding partners and generate a PDF report.
library(forkedTF)
miniCofactorReport( TF = "CEBPB", cell = "K562" )
MiniCoFactor report for CEBPB in K562.
You can inspect all the different cellLine-TF combinations with TFregulomeR
’s dataBrowser()
function. For example, to retireve all the TF available for “K562” run:
all_record <- dataBrowser() # Extract all the entries in the TFregulomeR compendium
cell_tf <- cbind(all_record$cell_tissue_name, all_record$TF) # keep the relevant information in a data frame
cell_of_interest <- "K562" # cell line of interest
cell_tf[grep(cell_of_interest,cell_tf),2]
[1] "AFF1" "ARID2" "ARID3A" "ATF1" "ATF2" "ATF3"
[7] "ATF4" "ATF7" "BACH1" "BHLHE40" "C11orf30" "CBFA2T3"
[13] "CEBPB" "CEBPD" "CREB1" "CREB3L1" "CREM" "CTCF"
[19] "CTCFL" "CUX1" "DACH1" "DPF2" "E2F1" "E2F4"
[25] "E2F6" "E2F7" "E2F8" "E4F1" "EGR1" "ELF1"
[31] "ELF4" "ELK1" "ESRRA" "ETV6" "FOS" "FOSL1"
[37] "FOXA1" "FOXK2" "FOXM1" "GABPA" "GABPB1" "GATA1"
[43] "IRF1" "IRF2" "JUN" "JUNB" "JUND" "KLF1"
[49] "KLF13" "MAFF" "MAFG" "MAFK" "MAX" "MAZ"
[55] "MEF2A" "MEF2D" "MEIS2" "MITF" "MNT" "MXI1"
[61] "MYBL2" "MYC" "MYNN" "NCOA1" "NCOA2" "NFATC3"
[67] "NFE2" "NFE2L1" "NFE2L2" "NFIA" "NFIC" "NFYB"
[73] "NR2C1" "NR2F1" "NR2F2" "NR3C1" "NRF1" "NUFIP1"
[79] "PBX2" "PHF21A" "PKNOX1" "RBM25" "REST" "RFX1"
[85] "RFX5" "RUNX1" "SIX5" "SMAD1" "SMARCA5" "SMARCC2"
[91] "SP1" "SPI1" "SRF" "TAL1" "TCF12" "TCF7"
[97] "TEAD2" "TFDP1" "THAP1" "THRA" "USF1" "USF2"
[103] "VEZF1" "YY1" "ZBTB2" "ZBTB40" "ZBTB5" "ZBTB7A"
[109] "ZFP36" "ZFP91" "ZFX" "ZHX1" "ZKSCAN1" "ZKSCAN8"
[115] "ZMYM3" "ZNF143" "ZNF148" "ZNF184" "ZNF24" "ZNF263"
[121] "ZNF316" "ZNF354B" "ZNF407" "ZNF410" "ZNF507" "ZNF592"
[127] "ZNF639" "ZNF740" "ZNF75A" "ZNF83" "ZSCAN29"
In order to create a Forked Positioning Weight Matrix (FPWM), we construct multiple PWM, one for each group of peaks from the main TF that overlap with the binding partners of interest. We have created a function called createFPWM
that takes as an input the main TF of interest mainTF
, a list or vector of binding partners partners
, a cell line cell
and the postion of the binding site to be forked forkPosition
. We foresee that a common application would be to find the most common co-binding partners of a TF in a cell line using miniCofactorReport()
and then create FPWM object based on the co-factor report. Here we create a FPWM object based on the top binding partners we observed in the co-factor report above.
createFPWM
can take as inputs the IDs in the TFregulomeR
compendium.
FPWM_CEBPB_K562 <- createFPWM(mainTF ="CEBPB",
partners = c("CEBPD","ATF4","ATF7","ATF3","JUND","FOS"),
cell = "K562",
forkPosition = 5)
Aditionally, we can create the same FPWM providing as input the IDs in the TFregulomeR
compendium. In this mode use mainTF_MMID
to provide the ID of the main TF and partners_MMID
for the list or vector of IDs for the binding partners. It is not necessary to specify the cell line as that information is already included in the IDs.
FPWM_CEBPB_K562 <- createFPWM(mainTF_MMID ="MM1_HSA_K562_CEBPB",
partners_MMID = c("MM1_HSA_K562_CEBPD",
"MM1_HSA_K562_ATF4",
"MM1_HSA_K562_ATF7",
"MM1_HSA_K562_ATF3",
"MM1_HSA_K562_JUND",
"MM1_HSA_K562_FOS"),
forkPosition = 5)
It is very common that a particular TF is not found in a cell line but is present in another. To overcome this limitation, it is possible to compare peaks from different cell lines. To create a FPWM with ChIP-Seq peaks from different cell lines give as input the IDs in TFregulomeR
as shown below:
FPWM_different_celllines <- createFPWM(mainTF_MMID ="MM1_HSA_K562_CEBPB",
partners_MMID = c("MM1_HSA_HCT116_CEBPB",
"MM1_HSA_K562_ATF4",
"MM1_HSA_GM12878_ATF3"),
forkPosition = 5)
FPWM can be visualized using the function plotFPWM
. This will generate a PDF with a forked-logo, were the first bases before the fork correspond to the mainTF of interest. The mainTF motif is connected via a fork with continuations of the logo depending on the binding partner. The logos show the methylation levels for each of the nucleotides in the binding site. The percentage on top of each forked arrow displays the amount of overlapping between the TF of interest and each of the binding partners. The output can be specified with pdfName
, as well as the width
and height
of the PDF.
plotFPWM( FPWM_CEBPB_K562, pdfName = "FPWM_CEBPB_K562.pdf" )
Plot of the Forked-PWM.
It might happen that the core motif appears in the second half of the motifs positions while the partners motif is at the beginning. Such is the case for the cofactors of JUND
in HepG2
as can be seen in this miniCofactorReport:
miniCofactorReport( TF = "JUND", cell = "HepG2" )
MiniCoFactor report for CEBPB in K562.
In such cases, FPWM must be created with the parameter flipMatrix = TRUE
to reverse-complement the individual PWM. forkPosition
should be set to the position just before the core main TF motif, in this case position 6
.
flipped_fpwm <- createFPWM(mainTF ="JUND",
partners = c("ATF7","ATF2","JUN","FOSL2","FOS"),
cell = "HepG2",
forkPosition = 6,
flipMatrix = TRUE)
The resulting FPWM
can be drawn with plotFPWM
.
plotFPWM(flipped_fpwm, pdfName="flipped_fpwm.pdf")
MiniCoFactor report for CEBPB in K562.
One of the most common ways to represent a TF binding motif is through the Transfac format. Therefore, we propose a modified version of this format called FPWMtransfac. In this format the mainTF ID is stored as the “parentLogo” whereas the partners are called “leafLogos”. Information regarding overlap percentage, the number of bases pairs found in the overlapped peaks and the total number of overlapped peaks are found in the “overlappingScore”, “numberOfBasePairs” and “numberOfOverlappingPeaks” fields respectively. Each comma separated value in those fields corresponds to the TF in that position in leafLogos. For example, the first value in overlappingScore corresponds to the first TF in leafLogos, the second value to the second TF and so forth. This is also the case for the nucleotide frequency matrix. From position 1 until the forkPosition corresponds to the mainTF shown in the parentLogo field whereas the looping values after the forkPosition correspond to the binding partners in the order shown in the leafLogos field.
FPWM_CEBPB_2partners <- createFPWM(mainTF = "CEBPB",
partners = c("ATF4","JUND"),
cell = "K562",
forkPosition = 5)
write.FPWM( FPWM = FPWM_CEBPB_2partners, format = "FPWMtransfac", fileName = "FPWM_CEBPB_2partners.FPWMtransfac")
This is how the output in FPWMtransfac format looks like:
FPWMtransfac format.
There is a myriad of sequence analysis tools that employ the Transfac
format as input. Therefore, we have provided the option to write FPWM in the standard transfact format where each combination of mainTF + partner is given as an independent matrix together with all the other information found in a FPWM.
write.FPWM( FPWM = FPWM_CEBPB_2partners, format = "transfac", fileName = "FPWM_CEBPB_2partners.transfac")
This is how the output looks in transfac format:
FPWM in Transfac format.
In a standard transfact matrix, the sum of all the elements in a row gives the same result for every row. This is because all the positions in the matrix are built from the same number of peaks/binding regions. However, in FPWMs, parts of the matrix are constructed with variable numbers of peaks/binding regions depending on the overlap between the main TF and their partners. We have provided 3 different options to output the matrix values:
# Count Matrix option [Default]
FPWM_CEBPB_2partners <- createFPWM(mainTF = "CEBPB",
partners = c("ATF4","JUND"),
cell = "K562",
forkPosition = 5)
write.FPWM( FPWM = FPWM_CEBPB_2partners,
format = "FPWMtransfac",
fileName = "FPWM_CEBPB_2partners.FPWMtransfac")
# Probability Matrix option
FPWM_CEBPB_2partners_probability <- createFPWM(mainTF = "CEBPB",
partners = c("ATF4","JUND"),
cell = "K562",
forkPosition = 5,
probabilityMatrix = TRUE) # Probability option
write.FPWM( FPWM = FPWM_CEBPB_2partners_probability,
format = "FPWMtransfac", fileName =
"FPWM_CEBPB_2partners_probability.FPWMtransfac")
# Scaled Count Matrix option
FPWM_CEBPB_2partners_scaledCounts <- createFPWM(mainTF = "CEBPB",
partners = c("ATF4","JUND"),
cell = "K562",
forkPosition = 5,
scaleFrequencyCounts = TRUE) # Scaled Count option
write.FPWM( FPWM = FPWM_CEBPB_2partners_scaledCounts,
format = "FPWMtransfac",
fileName = "FPWM_CEBPB_2partners_scaledCounts.FPWMtransfac")
Different options for the values in the matrix.
Lastly, FPWM objects can be easily saved/read into an RDS using R built-in functions:
# To save in RDS:
saveRDS(FPWM_CEBPB_K562,"FPWM_CEBPB_K562.rds")
# To read RDS:
FPWM_CEBPB_K562 <- readRDS("FPWM_CEBPB_K562.rds")