Documentation

SiPhy

The SiPhy software package comprises several analysis modules and utility programs. All are Java programs executed from the command line. The primary executable file, bin/scaler_0_5.jar, includes the following modules (the -task parameter identifies modules by task number). The general methods are described below and their full description, inputs and outputs are described in the appropriate sections:

Task Description
1 Slide a fixed window to testimate local rate of vatiation [ω in Equation (10) of Garber, et al.].
10 Estimates local rate of variation for a set of fixed regions (such as exons).
7 Estimates substitution patterns bias [π in Equation (10) of Garber, et al.] for each base pair.
11 Estimates the substitution patterns for a set of fixed regions by calculating the sum of the scores produced by Task=7.
16 Estimates a conservation score for a given sequence motif using a simple generalization of the substitution pattern estimation.

Java 1.5 or later: The SiPhy software requires Java 1.5 or later. It was written and tested using Java 1.5.

Input Files

As a typical constraint estimation tool, SiPhy requires two main inputs: a multiple alignment file and neutral "average" evolutionary model file.

Multiple Alignment File

SiPhy accepts three multiple alignment formats:

  • FASTA
  • MAF (Requires a matching index file, which can be created using the mafutils utility provided by SiPhy).
  • PHYLIP

Multiple alignment files for well-studied phylogenies can be found on public websites such as the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu).

Missing data and alignment gaps:

Mulitple sequence alignments of genome sequence allow for missing (or inserted) sequence in one or several species to align as "gaps". Such gaps are abundant and may be the result evolutionary insertion and deletions events that, either by random drift or due to selective pressure, become fixed in a species, or they may be due to artifacts of the multiple alignment process. Further, genome assemblies vary in their completenes, and contain in many "gapped" regions of missing sequence. In this version, SiPhy treats both missing data and alignment gaps by ignoring the species with missing or gapped alignments from the computations.

Model File

The model file specifies the parameters of the probabilistic model describing the evolutionary process: A matrix describing the rate at which each nucleotide mutates into another, the equilibrium distribution of bases, and the phylogentic tree topology connecting the species under study.

SiPhy modules estimate the local divergence of different statistics from the neutral "average" evolutionary model provided by the model file. The modules work with the output of two programs commonly used to generate such models:

  • Phast. The output file format of this program is used as the input model file for SiPhy.
  • PAML. SiPhy provides a conversion utility that converts the output file format of this program to the Phast model file format.

Two types of genomic sequence are understood to have evolved under no or very weak selection: (1) 4D-sites, the four-fold degenerate third codon position of protein-coding genes, and (2) Ancestral Repeats (AR), mobile elements that are ancestral to all sequences under study. Several software programs, such as Phast and PAML, estimate the average substitution rate from the extracted alignment crresponding to these sequences. The resulting model file includes the scaled phylogenetic tree with edge length proportional to the estimates average substitutions per site, a rate matrix Q representing the rate of substitutions between the four nucleotides, and a stationary distribution.

Following is an example model file. The identifiers in the TREE field of the model file must match identifiers in the multiple alignment file.

BACKGROUND: 0.2 0.3 0.3 0.2
RATE_MAT:
 -1.0860163084198684 0.25225991842162027  0.6937865599085344 0.13996983008971356
 0.16817327894774686 -0.9225558555676914  0.2866374459445824 0.46774513067536205
  0.4625243732723563  0.2866374459445824 -0.9192377889311856 0.17007596971424688
 0.13996983008971356   0.701617696013043  0.2551139545713703 -1.0967014806741269
TREE: (((((hg18:0.006,panTro2:0.007):0.024,rheMac2:0.036):0.08,tarSyr1:0.152):0.009,(micMur1:0.099,otoGar1:0.143):0.037):0.004,((mm9:0.089,rn4:0.098):0.239,speTri1:0.159):0.028):0.02

                  

Estimating local rate of variation (ω)

SiPhy provides two ways to estimate the local rate of variation [ω in Equation (10) of Garber, et al.]: by sliding a fixed-size window or by providing a set of annotations in BED format. To estimate variation for the whole genome or a generic distribution of constraint in the alignment, use a fixed-size window. To estimate variation for specific genomic areas, use annotations. The result is a tab-delimited output file with the estimated rate of variation for each window or element. The paper discusses only the fixed-size window method.

Sliding a fixed-size window

Estimate the local rate of variation [ω in Equation (10) of Garber, et al.] by sliding a fixed-size window.

Command:

java -jar bin/scaler_0.5.jar -task 1

Parameters:

  • -in: Required. Multiple alignment file as described above.
  • -format (default=FASTA): Alignment format of multiple alignment file: FASTA, PHYLIP, or MAF.
  • -mod: Required. Neutral model file as described above.
  • -out: Output file name
  • -ref: Required for formats other than MAF. Reference sequence id. The identifier must match an identifier in the multiple alignment file.
  • -ignore: Comma separated list of species to ignore, if the model's phylogenetic tree contains species not desired in the analysis.
  • -minTreeLength (default=1): Minimum branch length to actually compute ω, sites with lesser total tree length are ignored.
  • -window (default=1): Size of sliding window, where tree is locally scaled.
  • -windowOverlap (default=window - 1): Sliding window overlap.
  • -start: Required for MAF. Start coordinate of an interval in the multiple alignment file. SiPhy estimates the local rate of variation for this interval.
  • -end: Required for MAF. End coordinate of the interval in the multiple alignment file.

Result:

The output is a tab delimited file with the following columns:

  • Start position, zero based for FASTA and PHYLIP formatted alignments or absolute for MAF formatted alignments.
  • Estimated scalar rescaling (ω) of the neutral rate that maximizes the likelihood of the model in the window alignment.
  • Minimum branch length in the window after removing species aligned with gaps or missing sequence.
  • Log-odds ratio of the fitted vs the neutral likelihodds (SiPhy writes negative likelihodds for ω greater than 1 to distinguish rapid vs slow evolving sequence).
  • Theoretical chis-sequared p-value of the obtained log-odds ratio.

Sample command:

 /seq/mgarber/tools/runJava edu.mit.broad.prodinfo.conservation.TreeScaler -task 1 -ref hg17 \\
-in ENm002_compressed.fa -window 12 -mod combined.AR.mod -out ENm002_w12.omegas -windowOverlap 11 -minTreeLenght 0.5

Sample output:

This command generated the data describbed in Garber et al. for Encode region ENm002. Both the input and results of the command are available in the data directory.

Estimating varying size elements

Estimate the local rate of variation [ω in Equation (10) of Garber, et al.] providing a set of annotations in BED format. The paper discusses how to estimate the local rate of variation using fixed-size windows; it does not discuss this method.

Command:

java -jar bin/scaler_0.5.jar -task 10

Parameters:

  • -alignment: Required. Multiple alignment file as described above.
  • -in: BED file annotating elements in the multiple alignment file. SiPhy estimates the local rate of variation for these elements. This tasks does not support reading from standard input, you must provide file.
  • -mod: Required. Neutral model file as described above.
  • -format (default=FASTA): Alignment format of multiple alignment file: FASTA, PHYLIP, or MAF.
  • -out if a specific file name is desired, if this parameter is not specified output goes to standard out.
  • -ref: Required for formats other than MAF. Reference sequence id. The identifier must match an identifier in the multiple alignment file.
  • -ignore: Comma separated list of species to ignore, if the model's phylogenetic tree contains species not desired in the analysis.

Result:

The output is a tab delimited file with the following columns

  • Start position of the annotation.
  • End position of the annotation.
  • Annotation name.
  • Annotation orientation.
  • Estimated scalar rescaling (ω) of the neutral rate that maximizes the likelihood of the model in the window alignment.
  • Log-odds ratio of the fitted vs the neutral likelihodds (SiPhy writes negative likelihodds for ω greater than 1 to distinguish rapid vs slow evolving sequence).
  • Theoretical chis-sequared p-value of the obtained log-odds ratio.
  • Minimum branch length in annotation after removing species aligned with gaps or missing sequence.

Sample command:

java -jar bin/scaler_0.5.jar -task 10 -alignment chr22.maf -in sample_data/tug1.exons.bed -mod sample_data/autosomal.mod -format MAF \\
-ignore danRer5,oryLat2,gasAcu1,fr2,taeGut1,tetNig1,xenTro2,anoCar1,galGal3,ornAna1,monDom4 > sample_data/tug1.exons.exon.omegas

This command computes the change in substitutuion rate (ω) for each of the three exons of the TUG1 non-coding RNA. The model file and output can be found in the data directory. The multiple alignment for chromosome 22 can be downloaded from UCSC genome browser: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz44way/maf/chr22.maf.gz . The -ignore flag used here restricts the program to eutherian mammals rather than using all vertebrate species in the alignment.

Estimating substitution patterns (π)

SiPhy estimates substitution patterns [π in Equation (10) of Garber, et al.] per column, essentially using a fixed-size window of length 1. SiPhy then sums the resulting scores to estimate substitution patterns across larger fixed-size windows.

Estimating site-wise substitution patterns

Estimate substitution patterns [π in Equation (10) of Garber, et al.] per column.

Command:

java -jar bin/scaler_0.5.jar -task 7

Parameters:

  • -in: Required. Multiple alignment file as described above.
  • -format (default=FASTA): Alignment format of multiple alignment file: FASTA, PHYLIP, or MAF.
  • -mod: Required. Neutral model file as described above.
  • -out if a specific file name is desired
  • -ref: Required for formats other than MAF. Reference sequence id. The identifier must match an identifier in the multiple alignment file.
  • -ignore: Comma separated species to ignore, if the model's phylogenetic tree contains species not desired in the analysis.
  • -minTreeLength (default=1): Minimum branch length to actually compute ω, sites with lesser total tree length are ignored.
  • -start: Required for MAF. Start coordinate of an interval in the multiple alignment file. SiPhy estimates site-wise substitution patterns for this interval.
  • -end: Required for MAF. End coordinate of the interval in the multiple alignment file.
  • -priorOmega: To adjust for local substitution rate variation you may specifying the ω obtained as described above. If omitted, ω is set to 1 (neutral).

Result:

The output is a tab delimited file with the following columns:

  • Base position (zero-based).
  • Estimated equilibrium Adenine frequency.
  • Estimated equilibrium Cytosine frequency.
  • Estimated equilibrium Guanine frequency.
  • Estimated equilibrium Thyamine frequency.
  • Estimated annotation ω.
  • Log-odds ratio of the fitted vs the neutral likelihodds (SiPhy writes negative likelihodds for ω greater than 1 to distinguish rapid vs slow evolving sequence).
  • Actual branch length available at the site after removing gapped and missing sequences.
  • Unique identifier for the phylogenetic tree obtained by prunning all species that aligned with a gap or had missing sequence at the site.

Sample command:

 java -jar bin/scaler_0.5.jar -task 7 -ref hg17 -in  ENm002_compressed.fa \\ 
               -mod combined.AR.mod -minTreeLength 0.5 -out ENm002.pi

Sample output:

This command will estimate the stationary distribution (π) and its log-odds ratio score for each column in the alignment. Both input files and the output of this command can be found in the data directory.

Integrating site-wise substitution patterns log-odds score in windows

After estimating substitution patterns per column as described above, use this method to sum the resulting scores to estimate substitution patterns across a larger region or a fixed-sized window.

Command:

java -jar bin/scaler_0.5.jar -task 11

Parameters:

  • -in: π estimation file (the result of site-wise base substitution pattern estimation). This task does not support standard input, you must provide a file name
  • -out: Output file. If omitted, SiPhy writes results to standard output.
  • -window: Required. Size of the sliding window in which site-wise log-odds scores will be summed.
  • -windowOverlap (default=window - 1): Sliding window overlap.

Result:

The output is a BED file contaiig all windows with score equal to the sum of the individual site log-odds ratio score.

Estimating substitution patterns (π) using a Hidden Markov Model (HMM)

As describbed in Garber, et al.] Siphy can define conserved elements whose bases are evolving in unexpected subsitituion patterns using a hidden Markov model. Siphy supports two commands, the first to find the Viterbi path, the second to estimate the posterior probability that a base is evolving under the biased substitution pattern model rather than the neutral model. Both commands have the same parameters and only differ in their task number.

Command:

java -jar bin/pihmm.jar -task 1 (or -task 2 to compute posterior probabilities )

Parameters:

  • -gamma: Required. Prior expected coverage of the alignment by conserved elements (as a percent of the genome).
  • -l (default=FASTA): Prior expected element length
  • -mod: Required. Neutral model file as described above.
  • -out if a specific file name is desired otherwise output is written to standard out
  • -ref: Required for formats other than MAF. Reference sequence id. The identifier must match an identifier in the multiple alignment file.
  • -in: Alignment file.
  • -format (default=FASTA): Alignment format of multiple alignment file: FASTA, PHYLIP, or MAF.
  • -start: Required for MAF. Start coordinate of an interval in the multiple alignment file. SiPhy estimates site-wise substitution patterns for this interval.
  • -end: Required for MAF. End coordinate of the interval in the multiple alignment file.

Sample command:

java -jar bin/pihmm.jar -task 1 -gamma 0.08 -l 10 -mod sample_data/combined.AR.mod -in sample_data/ENm002_compressed.fa -ref hg17 -out ENm002.viterbi
and
java -jar bin/pihmm.jar -task 2 -gamma 0.08 -l 10 -mod sample_data/combined.AR.mod -in sample_data/ENm002_compressed.fa -ref hg17 -out ENm002.viterbi

Computes the viterbipath and posterior probabilities using the π HMM on the encode region ENm002. The input and output used in this example can be downloaded from the data directory.

Scoring motif conservation

Estimates a conservation score for a given sequence motif using a simple generalization of the substitution pattern estimation. The algorithm is similar to the one described by Moses, et al. in MONKEY (Genome Biology 2004, 5:R98). This method was not discussed in Garber, et al. but was used in Guttman, et al. (Nature 2009, 458:223-227) to rank lincRNAs by their potential to be regulated by known transcription binding sites based on occurrence of conserved binding sites.

SiPhy estimates the conservation score by sliding a fixed-size window equal in length to the motif. You can analyze a single genomic interval by specifying the start and end coordinates of that interval (-start and -end parameters) or analyze a set of coordinates by specifying a set of annotations in a BED file (-in parameter). The input multiple alignment files must be in the MAF file format.

Command:

java -jar bin/scaler_0.5.jar -task 16

Parameters:

  • -pwm: Required. Position weight matrix (pwm) of the sequence motif in pwm file format. See p53.pwm for a sample input file.
  • -indir: Input directory that contains a MAF file for each chromosome. Alignment files are assumed to have a .maf extension, if they have a different extension use the -mafSuffix to specify it. If omitted, working dir is used?
  • -outdir: Output directory. If omitted, SiPhy writes output files to the working directory.
  • -mod: Required. Neutral model file as described above.
  • -seedMinScore: For long motifs, provide a minimum reference score. SiPhy computes a reference score for motif tuples then analyzes only those motifs that score higher than the specified minimum reference score. Scores depend on the motif information content. For example, a minSeedScore of 0 would seed on tuples that are equally likely under both the neutral and PWM models, a positive score indicates that the tuple is more likely under the PWM specified base frequencies.
  • -ignore: Comma separated species to ignore, if the model's phylogenetic tree contains species not desired in the analysis.
  • -minTreeLength (default=1): Minimum branch length to actually compute ω, sites with lesser total tree length are ignored.
  • -start: Start coordinate of an interval in the multiple alignment file. SiPhy analyzes this interval. Specify -start and -end or -in.
  • -end: End coordinate of the interval in the multiple alignment file.
  • -in: BED file annotating elements in the multiple alignment file. SiPhy analyzes these elements. Specify -start and -end or -in.
  • -priorOmega: For a large interval, adjust for local substitution rate variation by specifying the ω obtained as described above. If omitted, ω is set to 1 (neutral).
  • -outprefix: To append a used defined prefix to the generated output file name.
  • -mafSuffix: By default the command assume MAF files withing the indir have a .maf extension. Use this parameter to sepcify a different file extension.

Result:

The ouput is one BED file per motif (the file name being the motif name as it appears in the pwm input file) with the locations that pass the minimum seed score, the orientation of the entry reflects the best orientation of the motif match. The score is sum of the log-odds ratio of the likelihood that each column of the element was generated by the evolutionary model with stationary distribution (π) equals to the corresponding column in the motif or by the null model.

Sample command:

 java -jar bin/scaler_0.5.jar  -task 16 -pwm p53.pwm -indir multiz44way/ -chr 22 -start 29695034 -end 29705980 -mod autosomal.mod \\
-ignore danRer5,oryLat2,gasAcu1,fr2,taeGut1,tetNig1,xenTro2,anoCar1,galGal3,ornAna1,monDom4 -minSeedScore 2 -outdir outdir

Sample output:

This command will scan the TUG1 non-coding RNA loci for conserved instances of the transfac P53 motif. The -minSeedScore = 2 improves performance by scanning the reference first, only running the maximum likelihood estimation for windows where the log-odds ratio of the likelihood of the reference (hg18) bases being generated by the motif vs the neutral model specified is grater than 2. All input files, except for the directory with the genome wide alignments can be found in the data directory, to run this command it is sufficient to create a multiz44way directory and download the chr22 alignment from the UCSC browser: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz44way/maf/chr22.maf.gz the -ignore flag restricts the program to eutherian mammals rather than using all vertebrate species in the alignment.

Utilities

MAF Indexing

Generates an index file for a MAF file.

Command:

java -jar mafutils.jar -task 3 < alignment_file > alignment_file.index

The MAF index file must have the same name as the MAF file name and the .index suffix.

PAML outout to Model file conversion

Converts a PAML output file to the file format used by Phast (and SiPhy). Requires PERL

Command: perl extract_model_from_paml_out.pl

Parameters:

  • -indir: Directory containing PAML output files
  • -suffix: Suffix of PAML output files in the indir above
  • -outdir: Output directory. Were a model file per PAML output file in the input directory will be written out with a .mod extension