R Sys.Date()
Cancer is driven by somatic mutations. A few of these mutations confer a survival advantage to the tumour (drivers) while most mutations play a passive role in tumour development (passengers). Most known driver mutations are located in protein-coding genes in whole exome sequencing datasets. As whole genome sequencing datasets are increasingly available, new methods are required to discover driver mutations in the vast noncoding regulatory genome. ActiveDriverWGS is a statistical model to detect candidate driver mutations genome-wide.
ActiveDriverWGS is a recurrence-based method which builds on the idea that driver mutations are subject to positive selection and should appear more frequently than expected by chance alone. This method analyzes the mutational burden of SNVs and short indels (less than 50bps) in functionally defined genomic elements. These elements can include the protein-coding regions of genes (i.e., exons), as well as noncoding regulatory regions such as promoters, enhancers and untranslated regions. Each element can include one or multiple sub-elements; for example protein-coding genes have multiple exons tested together in ActiveDriverWGS while the introns of the are considered as background.
Optionally, the tested genomic elements may also include active sites of interest which reside within the elements themselves. Examples of such active sites include post-translational modification (PTM) sites in protein-coding genes, miRNA binding sites in mRNA sequences and transcription factor binding sites (TFBS) in enhancers. If sites are specified in addition to the elements, enrichment of site-specific mutations are estimated in elements with enriched mutational burden. This additional test asks if the sites in a predicted driver element are enriched in mutations even beyond the mutations seen in the given driver region.
ActiveDriverWGS uses a Poisson generalized linear regression to compare the mutational burden of elements against the expected mutational burden of a background window (Figure 1). In our work, we have optimized the background to be 50,000 bps upstream and downstream of the elements. For an element comprising a single sub-element the background sequence would total to 100 kbps.
If an element is segmented as multiple sub-elements, such as the exons of a protein coding gene, the inter-segment sequences are also used to calculate the expected background rate, up to +/- 50 kbps around every segment. ActiveDriverWGS also incorporates the effect of mutational signatures, specifically the probability distribution of SNVs arising across trinucleotide contexts which vary with mutational processes. Users also have the option to exclude hyper-mutated samples which decrease the accuracy of driver discovery.
The default genome build for ActiveDriverWGS is hg19. Additional options for human (hg38) and mouse (mm9, mm10) are now supported. Please refer to the user manual and the option ref_genome
of ActiveDriverWGS()
.
For a more detailed reference on ActiveDriverWGS, please refer to the publication. Helen Zhu, Liis Uuskula-Reimand, Keren Isaev*, Lina Wadi, Azad Alizada, Shimin Shuai, Vincent Huang, Dike Aduluso-Nwaobasi, Marta Paczkowska, Diala Abd-Rabbo, Oliver Ocsenas, Minggao Liang, J. Drew Thompson, Yao Li, Luyao Ruan, Michal Krassowski, Irakli Dzneladze, Jared T. Simpson, Mathieu Lupien, Lincoln D. Stein, Paul C. Boutros, Michael D. Wilson, Jüri Reimand. Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks. Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027 .
ActiveDriverWGS requires a file for somatic mutations
and a file for genomic elements
. A third optional file for sites
can be specified by the user. Elements may contain multiple segments, each represented in a separate row. Segments belonging to the same element must share a common id unique to the element. Sites are only incorporated in the model if they reside within elements but not all elements need to contain sites. Elements may contain multiple sites. Site ids have to match element ids.
The mutations
data must be in a data frame containing the columns with the correct column names chr
, pos1
, pos2
, ref
, alt
, and patient
. Additional columns may be included but will not be analyzed. Patient ID is required since per element, only one mutation per patient is counted towards element-specific mutation rate and driver prediction. This allows us to avoid inflation of significance in cases where an element is locally hypermutated in a single patient.
chr
: autosomal chromosomes as chr1 to chr22 and sex chromosomes as chrX and chrY for the human genome, and chr1 to chr19 for mouse
pos1
: the start position of the mutation in base 1 coordinates
pos2
: the end position of the mutation in base 1 coordinates
ref
: the reference allele as a string containing the bases A, T, C or G
alt
: the alternate allele as a string containing the bases A, T, C or G
patient
: the patient identifier as a string
The elements
and sites
data must be in data frames containing the columns with the correct column names chr
, start
, end
, and id
. Additional columns may be included but will not be analyzed.
chr
: autosomal chromosomes as chr1 to chr22 and sex chromosomes as chrX and chrY
start
: the start position of the element or site in base 0 coordinates (BED format)
end
: the end position of the element or site in base 0 coordinates (BED format)
id
: the element identifier - if the element contains multiple segments such as exons, each segment should be a separate row with the segment coordinates and the element identifier as id. Elements can be coding or noncoding such as exons of protein coding genes or active enhancers.
library(ActiveDriverWGS)
data("cll_mutations")
head(cll_mutations)
## chr pos1 pos2 ref alt patient
## 779701 chr6 96651182 96651183 AA T 001-0002-03TD
## 779702 chr10 106556005 106556005 C T 001-0002-03TD
## 779703 chr10 107457352 107457352 C T 001-0002-03TD
## 779704 chr10 108111334 108111334 C A 001-0002-03TD
## 779705 chr10 109605024 109605024 T A 001-0002-03TD
## [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
data("cancer_genes")
head(cancer_genes)
## chr start end id
## 648 chr1 2488103 2488172 TNFRSF14
## 649 chr1 2489164 2489273 TNFRSF14
## 650 chr1 2489781 2489907 TNFRSF14
## 651 chr1 2491261 2491417 TNFRSF14
## 652 chr1 2492062 2492153 TNFRSF14
## 653 chr1 2492932 2492963 TNFRSF14
data("cancer_gene_sites")
head(cancer_gene_sites)
## chr start end id
## 1 chr1 11169412 11169416 MTOR
## 2 chr1 11169418 11169422 MTOR
## 3 chr1 11169424 11169427 MTOR
## 4 chr1 11169706 11169709 MTOR
## 5 chr1 11169711 11169718 MTOR
## 6 chr1 11169720 11169724 MTOR
For elements and sites written in BED12 files and BED4 files, the functions prepare_elements_from_BED12()
and prepare_elements_from_BED4()
can be used to read the file and adapt it to fulfill the format requirements for the elements
and sites
parameters of the ActiveDriverWGS function. For more information on the BED12 or BED4 format, please refer to the UCSC guidelines. In this example, elements are adapted from annotations for protein coding genes from GENCODE.v19 for chromosome 17 and sites are adapted from post-translational modification (TPM) sites from ActiveDriverDB.
prepare_elements_from_BED12(
elements =system.file(
"extdata",
"chr17.coding_regions.bed",
package = "ActiveDriverWGS",
mustWork = TRUE))
##
## 1189 Rows :: Processing row 100 200 300 400 500 600 700 800 900 1000 1100
## Preparing Elements Complete
## RM 0 lines
head(elements)
## chr start end id
## 1 chr17 6006 6168 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 2 chr17 11205 11332 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 3 chr17 11871 11981 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 4 chr17 13920 13995 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 5 chr17 22327 22407 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 6 chr17 30897 31270 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
prepare_elements_from_BED4(
sites =system.file(
"extdata",
"chr17.PTM_sites.bed",
package = "ActiveDriverWGS",
mustWork = TRUE))
##
## Preparing Elements Complete
## RM 0 lines
head(sites)
## chr start end id
## 1 chr17 63526097 63526102 gc19_pc.cds::gencode::AXIN2::ENSG00000168646.8
## 2 chr17 63526104 63526108 gc19_pc.cds::gencode::AXIN2::ENSG00000168646.8
## 3 chr17 63526110 63526114 gc19_pc.cds::gencode::AXIN2::ENSG00000168646.8
## 4 chr17 63526116 63526117 gc19_pc.cds::gencode::AXIN2::ENSG00000168646.8
## 5 chr17 63526119 63526123 gc19_pc.cds::gencode::AXIN2::ENSG00000168646.8
## 6 chr17 63526125 63526126 gc19_pc.cds::gencode::AXIN2::ENSG00000168646.8
ActiveDriverWGS can be run with the mutations file, the elements file and an optional sites file. In this example, mutations are adapted from the Alexandrov et al, 2013 dataset for chronic lymphocytic leukemia (CLL) patients. Regions are adapted from the cancer gene census and annotations for protein coding genes are adapted from GENCODE.v19.
c("ATM", "MYD88", "NOTCH1")
some_genes =
ActiveDriverWGS(mutations = cll_mutations,
results =elements = cancer_genes[cancer_genes$id %in% some_genes,],
sites = cancer_gene_sites)
## 0 remove hypermut, n= 0 , 0 %
## hypermuted samples:
##
## reversing 0 positions
## Removing 0 invalid SNVs & indels
##
## Number of Elements with 0 Mutations: 0
## Tests to do: 3
## ...
All three genes have a significant enrichment of somatic mutations as indicated by the fdr_element
field. Note that NOTCH1 is also highlighted for its site-related mutations since two of two mutations in NOTCH1 affect a PTM site and the FDR value fdr_site
is also significant.
ActiveDriverWGS has several adjustable parameters:
window_size
: A background window both upstream and downstream of the element in which mutation rates are assumed to remain constant. We have optimized this parameter on the PCAWG dataset to be 50,000 bps for SNVs and indels. For a single non-fragmented element the background sequence is thus 100 kbps, while elements with multiple fragments capture more sequence space and therefore their background sequence is added up to a maximum of 2x50 kbps per sub-element.
filter_hyper_MB
: The threshold for the number of somatic mutations (SNVs and indels combined) per megabase above which a sample is considered hypermutated. We define the default to be 30 mutations/megabase according to published literature. The genome is assumed to be 3000 mbps.
recovery.dir
: The directory for writing recovery files for ActiveDriverWGS. If the directory does not exist, it will be created. If the parameter is unspecified, recovery files will not be saved. As an ActiveDriverWGS query for large datasets may be computationally heavy, specifying a recovery directory will recover previously computed results if a query is interrupted. The results of each element are stored in this folder and can be recovered if the process is terminated while in progress.
mc.cores
: The number of cores that the user wishes to allocate to running ActiveDriverWGS. For more information, refer to the R package parallel.
results
## id pp_element element_muts_obs element_muts_exp element_enriched
## 1 ATM 0.001807467 2 0 TRUE
## 2 MYD88 0.004976353 1 0 TRUE
## pp_site site_muts_obs site_muts_exp site_enriched fdr_element fdr_site
## 1 0.5398703 0 0 FALSE 0.005422402 0.6510669
## 2 0.6510669 0 0 FALSE 0.005641867 0.6510669
## has_site_mutations
## 1
## 2
## [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
ActiveDriverWGS will return results in a data frame format with the following columns.
id
: The identifier for the element.
pp_element
: The p-value associated with enrichment of mutations in the element. A value of NA indicates that no mutations fall within the element.
element_muts_obs
: Number of patients with mutations in the element. A value of NA indicates that no mutations fall within the element.
element_muts_exp
: Number of expected patients with mutations in the element. A value of NA indicates that no mutations fall within the element.
element_enriched
: Boolean indicating whether an enrichment of mutations in the element is observed. A value of NA indicates that no mutations fall within the element.
pp_site
: The p-value associated with enrichment of mutations in the sites.
site_muts_obs
: Number of patients with mutations in the sites. A value of 0 means that sites exist but are unaffected by mutations. A value of NA indicates that no site resides within the element.
site_muts_exp
: Number of expected patients with mutations in the sites. A value of 0 means that sites exist but are unaffected by mutations. A value of NA indicates that no site resides within the element.
site_enriched
: Boolean indicating whether an enrichment of mutations in the sites is observed. A value of NA indicates that no site resides within the element.
fdr_element
: FDR corrected p-value associated with the element.
fdr_site
: FDR corrected p-value associated with the site.
has_site_mutations
: “V” indicating the presence of site mutations. An empty string indicates that no site mutations are present.
In ActiveDriverWGS, multiple testing is performed for all given regions using the Benjamini-Hochberg FDR method. We encourage the users to filter by FDR corrected values rather than p-values to eliminate false positives. FDR correction makes the conservative assumption that genes/regions with 0 mutations have P = 1. FDR correction for sites is conducted over a restricted set of hypotheses, comprising genes/regions that have a significant enrichment of mutations (FDR < 0.05) at the level of genes or regions.
Compute time increases with the number of samples, mutations and regions. Hence, the two main functions integral to ActiveDriverWGS have also been made available in the package and can be adapted to individual local high performance computing clusters (HPCCs). One approach is to assign a subset of testable elements (and/or cancer types) to individual compute nodes and later use an additional script that collects the data from these nodes.
This function formats the mutations data frame, removes hyper-mutated samples and removes non-mitochondrial mutations in extrachromosomal regions. It adds an additional column to the mutations data frame that provides the trinucleotide context of the given mutation which will be later used to estimate the mutational distribution across signatures.
This function calculates the enrichment of mutations for a particular region id. It applies a Poisson generalized linear regression model across mutation signatures to identify enriched regions.
The following example demonstrates how to build an ActiveDriverWGS pipeline which can be adapted to HPCCs. Parallel jobs are executed for a list of element ids whereas mutation data and element coordinates are prepared prior to the process. Note that the creation of GRanges objects is part of the ActiveDriverWGS()
wrapper function and must be completed manually by users wishing to create personalized pipelines. The function ADWGS_test()
needs a link to the reference genome object (from the R package BSgenome
) as an argument, and assumes that the mutations and element coordinates match the reference genome. Also, note that multiple testing corrections will need to be recalculated by the user after the results have been collected from individual jobs.
library(GenomicRanges)
# Loading elements & creating a GRanges object
data(cancer_genes)
GRanges(seqnames = cancer_genes$chr,
gr_element_coords =IRanges(start = cancer_genes$start,
end = cancer_genes$end),
mcols = cancer_genes$id)
# Loading sites & creating a GRanges object
data(cancer_gene_sites)
GRanges(seqnames = cancer_gene_sites$chr,
gr_site_coords =IRanges(start = cancer_gene_sites$start,
end = cancer_gene_sites$end),
mocols = cancer_gene_sites$id)
# Loading mutations, format muts & creating a GRanges object
data(cll_mutations)
# load the default reference genome
BSgenome.Hsapiens.UCSC.hg19::Hsapiens
this_genome =
# format_muts
format_muts(cll_mutations, this_genome,
cll_mutations =filter_hyper_MB = 30)
## 0 remove hypermut, n= 0 , 0 %
## hypermuted samples:
##
## reversing 0 positions
## Removing 0 invalid SNVs & indels
GRanges(cll_mutations$chr,
gr_maf =IRanges(start = cll_mutations$pos1,
end = cll_mutations$pos2),
mcols = cll_mutations[,c("patient", "tag")])
# Examplifying the ATM Element
"ATM" id =
Note that when splitting tasks using the ADWGS_test
function, only the parameter id
needs to modified for each element while gr_element_coords
, gr_sites
and gr_maf
can be the complete datasets. However, the compute time and memory can be saved in very large analyses by providing more-optimal subsets of these datasets for each test, for example those limited to specific chromosomes.
# Result of 1 input element
ADWGS_test(id = id,
result =gr_element_coords = gr_element_coords,
gr_site_coords = gr_site_coords,
gr_maf = gr_maf,
win_size = 50000, this_genome = this_genome)
## .
result
## id pp_element element_muts_obs element_muts_exp element_enriched pp_site
## 1 ATM 0.001807467 2 0 TRUE 0.5398703
## site_muts_obs site_muts_exp site_enriched
## 1 0 0 FALSE
For questions, technical support or to report bugs and errors, please use our GitHub.