Sequenza User Guide

About
Introduction
Getting started
- Minimum requirements
- Installation
Running sequenza
- Preprocessing of input files
- Sequenza analysis (in R)
Plots and Results

About

Sequenza: Copy Number Estimation from Tumor Genome Sequencing Data

Sequenza is a tool for analyzing genomic sequencing data from paired normal-tumor samples. It covers cellularity and ploidy estimation as well as the detection, quantification and visualization of mutations and copy number (allele-specific and total).

Introduction

Deep sequencing of tumor DNA along with corresponding normal DNA can provide a valuable perspective on the mutations and aberrations that characterize the tumor. However, analysis of these data can be impeded by tumor cellularity and heterogeneity and by unwieldy data. Here we describe Sequenza, an R package that enables the efficient estimation of tumor cellularity and ploidy, and the generation of copy number, loss-of-heterozygosity, and mutation frequency profiles.

This document details a typical analysis of matched tumor-normal exome sequence data using sequenza.

Getting started

Minimum requirements

Software: R, Python, SAMtools, tabix
Operating system: Linux, OS X, Windows
Memory: Minimum 4 GB of RAM. Recommended >8 GB.
Disk space: 1.5 GB per sample (depending on sequencing depth)
R version: 3.2.0
Python version: 2.7, 3.4, 3.5, 3.6 (or PyPy)

Installation

The R package can be installed by:

setRepositories(graphics = FALSE, ind = 1:6)
install.packages("sequenza")

To install the Python companion package sequenza-utils, which preprocesses BAM files, refer to the sequenza-utils project page, or simply use the Python package manager from the command prompt:

pip install sequenza-utils

Running sequenza

Preprocessing of input files

In order to obtain precise mutational and aberration patterns in a tumor sample, Sequenza requires a matched normal sample from the same patient. Typically, the following files are needed to get started with Sequenza:

A BAM file (or a derived pileup file) from the tumor specimen.
A BAM file (or a derived pileup file) from the normal specimen.
A FASTA reference genomic sequence file

The normal and tumor BAM files are processed together to generate a seqz file, which is the required input for the analysis. It is also possible to generate a seqz file from other processed data, such as pileup or VCF files. The available options are described in the sequenza-utils manual pages.

The sequenza-utils command provides various tools; here we highlight only the basic usage:

Process a FASTA file to produce a GC Wiggle track file:

sequenza-utils gc_wiggle -w 50 --fasta hg19.fa -o hg19.gc50Base.wig.gz

Process BAM and Wiggle files to produce a seqz file:

sequenza-utils bam2seqz -n normal.bam -t tumor.bam --fasta hg19.fa \
    -gc hg19.gc50Base.wig.gz -o out.seqz.gz

Post-process by binning the original seqz file:

sequenza-utils seqz_binning --seqz out.seqz.gz -w 50 -o out_small.seqz.gz

Sequenza analysis (in R)

library(sequenza)

The package provides a small example seqz file:

data.file <- system.file("extdata", "example.seqz.txt.gz", package = "sequenza")

The main interface consists of 3 functions:

sequenza.extract: process seqz data, normalization and segmentation

test <- sequenza.extract(data.file, verbose = FALSE)

sequenza.fit: run grid-search approach to estimate cellularity and ploidy

CP <- sequenza.fit(test)

sequenza.results: write files and plots using suggested or selected solution

sequenza.results(sequenza.extract = test,
    cp.table = CP, sample.id = "Test",
    out.dir="TEST")
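
The three calls above use the solution suggested by the fit. The following is a minimal sketch of how a selected solution might be passed on instead. It assumes that get.ci returns the elements max.cellularity and max.ploidy and that sequenza.results accepts cellularity and ploidy arguments; check ?get.ci and ?sequenza.results for the installed version. The output directory name is hypothetical, and the block is skipped when sequenza is not installed.

```r
# Sketch only: runs the example analysis when sequenza is available.
if (requireNamespace("sequenza", quietly = TRUE)) {
  library(sequenza)
  data.file <- system.file("extdata", "example.seqz.txt.gz",
                           package = "sequenza")
  test <- sequenza.extract(data.file, verbose = FALSE)
  CP <- sequenza.fit(test)

  # Point estimates taken from the grid search result
  # (assumed element names: max.cellularity, max.ploidy)
  cint <- get.ci(CP)
  cellularity <- cint$max.cellularity
  ploidy <- cint$max.ploidy
  print(c(cellularity = cellularity, ploidy = ploidy))

  # Write results for an explicitly selected solution.
  # "TEST_selected" is a hypothetical directory name used for illustration.
  out.dir <- file.path(tempdir(), "TEST_selected")
  sequenza.results(sequenza.extract = test, cp.table = CP,
                   sample.id = "Test", out.dir = out.dir,
                   cellularity = cellularity, ploidy = ploidy)
}
```

Any other cellularity/ploidy pair, for example one listed among the alternative solutions, could be supplied in the same way.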

Plots and Results

The function sequenza.results writes several files to the specified path, either as PDF or as plain text. They include quality control assessments (e.g. evaluation of the GC correction), visualizations of the data, and result files such as the segmentation with copy number calls and the mutation list.

Result files

Each generated file is briefly explained in the following table:

Files	Description
Test_alternative_fit.pdf	Alternative solutions fit to the segments. One solution per slide
Test_alternative_solutions.txt	List of all alternative ploidy/cellularity solutions
Test_chromosome_depths.pdf	Visualization of sequencing coverage in the normal and in the tumor samples, before and after normalization
Test_chromosome_view.pdf	Visualization per chromosome of depth.ratio, B-allele frequency and mutations, using the selected or estimated solution. One chromosome per slide
Test_CN_bars.pdf	Bar plot representing the percentage of genome in the detected copy number states
Test_confints_CP.txt	Table of the confidence interval of the best solution from the model
Test_CP_contours.pdf	Visualization of the likelihood density for each pair of cellularity/ploidy solution. The local maximum-likelihood points and confidence interval of the best estimate are also visualized
Test_gc_plots.pdf	Visualization of the GC correction in the normal and in the tumor sample
Test_genome_view.pdf	Genome-wide visualization of the allele-specific and absolute copy number results, and raw profile of the depth ratio and allele frequency
Test_model_fit.pdf	model_fit.pdf
Test_mutations.txt	Table with mutation and estimated number of mutated alleles (Mt)
Test_segments.txt	Table listing the detected segments, with estimated copy number state at each segment
Test_sequenza_cp_table.RData	RData object dump of the maxima a posteriori computation
Test_sequenza_extract.RData	RData object dump of all the sample information
Test_sequenza_log.txt	Log with version and time information

Segments results

The segmentation file with the allele-specific copy number calls is one of the main results of the analysis. A sample of the file is shown in the table below:

chromosome	start.pos	end.pos	Bf	N.BAF	sd.BAF	depth.ratio	N.ratio	sd.ratio	CNt	A	B	LPP
1	881992	54694219	0.499	1636	0.092	1.130	2172	0.557	2	1	1	-7.988
1	54700724	60223464	0.345	73	0.080	1.530	101	1.482	3	2	1	-6.739
1	60381518	67890614	0.047	94	0.034	1.155	114	0.492	2	2	0	-7.187
1	68151686	92262955	0.478	264	0.072	2.052	338	1.626	4	2	2	-6.831
1	92445264	118165373	0.344	315	0.083	1.631	434	1.021	3	2	1	-6.955
1	118165645	121485317	0.495	62	0.088	1.167	85	0.489	2	1	1	-6.928

The columns represent:

chromosome: Chromosome
start.pos: Start position of the segment
end.pos: End position of the segment
Bf: B-allele frequency value
N.BAF: Number of observations used to compute Bf in the segment
sd.BAF: Standard deviation of Bf
depth.ratio: Adjusted and normalized depth ratio tumor / normal
N.ratio: Number of observations used to compute depth.ratio in the segment
sd.ratio: Standard deviation of depth.ratio
CNt: Estimated total copy number value
A: Estimated number of A-alleles
B: Estimated number of B-alleles (minor allele)
LPP: Log-posterior probability of the segment
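
As an illustration of how these columns can be used, the sketch below loads the six example rows shown above with base R, checks that CNt equals A + B, and summarizes the share of covered sequence per copy number state. Segment length is approximated as end.pos - start.pos, which is an assumption about the coordinate convention.

```r
# Example rows copied from the segments table in this guide
seg <- read.table(header = TRUE, text = "chromosome start.pos end.pos Bf N.BAF sd.BAF depth.ratio N.ratio sd.ratio CNt A B LPP
1 881992 54694219 0.499 1636 0.092 1.130 2172 0.557 2 1 1 -7.988
1 54700724 60223464 0.345 73 0.080 1.530 101 1.482 3 2 1 -6.739
1 60381518 67890614 0.047 94 0.034 1.155 114 0.492 2 2 0 -7.187
1 68151686 92262955 0.478 264 0.072 2.052 338 1.626 4 2 2 -6.831
1 92445264 118165373 0.344 315 0.083 1.631 434 1.021 3 2 1 -6.955
1 118165645 121485317 0.495 62 0.088 1.167 85 0.489 2 1 1 -6.928")

# Consistency check: the total copy number is the sum of both alleles
stopifnot(all(seg$CNt == seg$A + seg$B))

# Approximate segment length (assumed convention: end.pos - start.pos)
seg$length <- seg$end.pos - seg$start.pos

# Percentage of the covered sequence in each total copy number state
cn_pct <- 100 * tapply(seg$length, seg$CNt, sum) / sum(seg$length)
print(round(cn_pct, 1))

# Segments with loss of heterozygosity: no copies of the minor allele
loh <- seg[seg$B == 0, c("chromosome", "start.pos", "end.pos", "CNt")]
print(loh)
```

For a real analysis the inline text would be replaced by the result file, for instance read.table("TEST/Test_segments.txt", header = TRUE, sep = "\t") when the output directory and sample id from the example above are used.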

Genome-wide overview

Allele-specific copy number

Total copy number

Raw profile

Grid search maximum likelihood

cp.plot(CP)
cp.plot.contours(CP, add = TRUE,
   likThresh = c(0.999, 0.95),
   col = c("lightsalmon", "red"), pch = 20)
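
To give an intuition for what the grid search compares against the data, the sketch below computes the depth ratio expected for a segment under a simple two-population mixture of tumor and normal cells. The function expected_depth_ratio is a hypothetical helper written for this guide, and the formula is an assumption about the general form of such a model, not code taken from the sequenza package.

```r
# Expected tumor/normal depth ratio of a segment with tumor copy number CNt.
# Assumption: the sample is a mixture of tumor cells (fraction = cellularity)
# and normal cells; illustrative only, not the package implementation.
expected_depth_ratio <- function(CNt, cellularity, ploidy,
                                 CNn = 2, normal.ploidy = 2,
                                 avg.depth.ratio = 1) {
  # DNA contributed at the segment: normal cells carry CNn copies,
  # tumor cells carry CNt copies
  segment_term <- (1 - cellularity) + cellularity * CNt / CNn
  # Genome-wide average DNA content of the mixed sample
  genome_term <- (1 - cellularity) + cellularity * ploidy / normal.ploidy
  avg.depth.ratio * segment_term / genome_term
}

# Expected ratios for copy numbers 0 to 4, using the cellularity and ploidy
# values that appear in the chromosome.view example of this guide
ratios <- expected_depth_ratio(CNt = 0:4, cellularity = 0.89, ploidy = 1.9)
print(round(ratios, 3))
```

Under this assumed model, a grid search would evaluate many cellularity/ploidy pairs and score how well the expected values match the observed segment ratios; lower cellularity compresses the spacing between copy number states, which is why the two parameters have to be estimated jointly.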

Chromosome view

Chromosome view displays mutations, B-allele frequency and depth ratio chromosome by chromosome. It makes the segmentation results easier to inspect by comparing them with a binned profile of the raw data. It also visualizes the copy number calls obtained with the chosen cellularity and ploidy solution, which is useful for assessing whether the calls are accurate. In addition, it shows the mutation frequencies, which can help to corroborate the solution.

chromosome.view(mut.tab = test$mutations[[1]], baf.windows = test$BAF[[1]],
                ratio.windows = test$ratio[[1]],  min.N.ratio = 1,
                segments = test$segments[[1]],
                main = test$chromosomes[1],
                cellularity = 0.89, ploidy = 1.9,
                avg.depth.ratio = 1)
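
The call above plots the first chromosome only. A minimal sketch of how the same call could be repeated for every chromosome and collected in one PDF follows. The helper plot_all_chromosomes and its pdf.file argument are hypothetical names introduced here; the chromosome.view arguments are the ones used above.

```r
# Hypothetical helper (not part of sequenza): one chromosome.view page per
# chromosome, written to a single PDF file.
plot_all_chromosomes <- function(extract, cellularity, ploidy, pdf.file,
                                 avg.depth.ratio = 1) {
  pdf(pdf.file)
  on.exit(dev.off())  # close the device even if a plot fails
  for (i in seq_along(extract$chromosomes)) {
    sequenza::chromosome.view(
      mut.tab = extract$mutations[[i]],
      baf.windows = extract$BAF[[i]],
      ratio.windows = extract$ratio[[i]], min.N.ratio = 1,
      segments = extract$segments[[i]],
      main = extract$chromosomes[i],
      cellularity = cellularity, ploidy = ploidy,
      avg.depth.ratio = avg.depth.ratio)
  }
  invisible(pdf.file)
}
```

With the objects from the analysis above, it might be called as plot_all_chromosomes(test, cellularity = 0.89, ploidy = 1.9, pdf.file = "chromosomes.pdf").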