How the sgRNA Designer Works (all versions)
Overview
This tool ranks and picks candidate sgRNA sequences for the targets provided, attempting to maximize on-target activity and minimize off-target activity. It uses the "Rule Set 2" scoring model from Doench, Fusi et al., Nature Biotechnology 2016 (developed in conjunction with the Azimuth project at Microsoft Research) to assess sgRNA on-target activity, and the CFD (Cutting Frequency Determination) score to evaluate off-target sites.
Note: November 4, 2016: We have now updated to Microsoft's latest on-target scoring model, Azimuth 2.0 (see link for detailed list of changes). The bug fixes in the new implementation do not affect the overall performance of the model, though the individual numeric scores do vary slightly.
Target Resolution
In this initial phase, we look up each input gene or transcript identifier and attempt to match it to a known entity in our database (current NCBI Gene and RefSeq catalogs). If the match succeeds, we retrieve any sequence and genomic locus information we have available for the target. Since the quality of the source annotations varies, we may not always have perfect data on how a transcript maps to genomic loci, or even on its exonic structure. In the worst case, the tool infers a putative genomic sequence from the known RNA sequence of a transcript. If exon boundary locations are not known, there is a good chance that a very small number of candidate sgRNA sequences will be biologically nonsensical (i.e. they correspond to a sequence that spans an exon-exon boundary and is therefore discontiguous in the genomic DNA).
sgRNA Candidate Sequence Generation
Once the target has been successfully resolved and sequence information gathered, the tool cycles through the sequence looking for appropriate PAM sites along both strands, generating an initial list of "candidate" sgRNA target sequences.
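The scan described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: it assumes an NGG PAM (as for SpCas9) and a 20-nt protospacer, neither of which is stated on this page, and the function names are hypothetical. Coordinates for the "-" strand are relative to the reverse-complemented sequence.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_candidates(seq, guide_len=20):
    """List (strand, start, protospacer, pam) for every NGG PAM on both strands.

    Assumption: NGG PAM located immediately 3' of a guide_len-nt protospacer.
    'start' is the 0-based protospacer offset on the scanned strand.
    """
    results = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        # i is the index of the first PAM base; a full protospacer must precede it.
        for i in range(guide_len, len(s) - 2):
            pam = s[i:i + 3]
            if pam[1:] == "GG":
                results.append((strand, i - guide_len, s[i - guide_len:i], pam))
    return results
```

For example, find_candidates("A" * 20 + "TGG") yields a single "+" strand candidate whose protospacer is the 20 leading A bases.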
sgRNA Candidate Sequence Annotation and Ranking
The candidate sequences must be annotated and ranked in order to prioritize the picking process. First we calculate two independent dimensions: On-Target Rank and Off-Target Rank. The on-target and off-target ranks of each sgRNA are then combined at equal weight to provide a final rank for each sgRNA targeting a particular transcript.
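One way to picture the equal-weight combination is to sum the two ranks and re-rank the sums. The sketch below assumes exactly that; the tie-breaking rule (input order) and the function names are illustrative assumptions, since this page does not specify them.

```python
def rank_positions(values, reverse=False):
    """Return a 1-based rank per item (1 = best); ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def combined_rank(on_ranks, off_ranks):
    """Combine two rank lists at equal weight by ranking their sums (lower is better)."""
    sums = [a + b for a, b in zip(on_ranks, off_ranks)]
    return rank_positions(sums)


# Higher on-target score is better, so rank in descending score order.
on_ranks = rank_positions([0.9, 0.5, 0.7], reverse=True)
off_ranks = [2, 1, 3]  # hypothetical off-target ranks for the same three sgRNAs
final = combined_rank(on_ranks, off_ranks)
```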
On-Target Efficacy Scoring (Azimuth 2.0)
We use the Azimuth 2.0 model to calculate the on-target score for each candidate sgRNA target sequence, and use these scores to assign a per-transcript On-Target Ranking. Our implementation uses the version of Azimuth 2.0 that does not incorporate protein target site information, as that criterion is used later as a relaxable constraint during the "picking" phase.
For detailed information about Rule Set 2 scoring methods please refer to Doench, Fusi et al., Nature Biotechnology 2016.
Off-Target Analysis ("Threat Matrix")
We annotate each candidate sgRNA sequence by the number of potential off-target sites along two dimensions:
(1a) CRISPRko: Match Tiers ("Tiers I - IV" in the output file):
- Tier I: coding regions only
- Tier II: non-coding regions (exonic UTR or intronic) of coding genes
- Tier III: exonic or intronic regions of non-coding genes
- Tier IV: all regions not in Tier I-III
(1b) CRISPRa/i: Match Tiers ("Tiers I - III" in the output file):
- Tier I: TSS-relative region of a protein-coding gene
- Tier II: TSS-relative region of a non-coding gene
- Tier III: all regions not in Tier I-II
(2) CFD scores ("Match Bins I - IV" in the output file):
- Match Bin I: CFD = 1.0
- Match Bin II: 0.2 ≤ CFD < 1.0
- Match Bin III: 0.05 ≤ CFD < 0.2
- Match Bin IV: CFD < 0.05
Combining the Tier dimension with the CFD Match Bin dimension yields an off-target "Threat Matrix" (4 x 4 for CRISPRko, 3 x 4 for CRISPRa/i), presented as 16 or 12 columns in the output file. The counts in these columns are used to create an off-target rank-ordering (with column precedence in the order displayed in the file).
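The binning and rank-ordering can be sketched as below, using the Match Bin thresholds listed above. The input format (a list of (tier, CFD) pairs per sgRNA) and the flattening order (tier by tier, bin by bin) are assumptions made for illustration; the actual column order is whatever the output file displays.

```python
def cfd_bin(cfd):
    """Map a CFD score to a 0-based Match Bin index (0 = Bin I ... 3 = Bin IV)."""
    if cfd >= 1.0:
        return 0
    if cfd >= 0.2:
        return 1
    if cfd >= 0.05:
        return 2
    return 3


def threat_matrix(hits, n_tiers=4):
    """Count off-target hits per (tier, bin). Tiers are 1-based in 'hits'.

    Use n_tiers=4 for CRISPRko and n_tiers=3 for CRISPRa/i.
    """
    matrix = [[0] * 4 for _ in range(n_tiers)]
    for tier, cfd in hits:
        matrix[tier - 1][cfd_bin(cfd)] += 1
    return matrix


def off_target_key(matrix):
    """Flatten the matrix into a sort key (assumed order: Tier I bins first)."""
    return tuple(count for row in matrix for count in row)


# Hypothetical hits for two sgRNAs; fewer hits in earlier columns sorts first.
guide_a = threat_matrix([(1, 1.0), (4, 0.01)])
guide_b = threat_matrix([(2, 0.1), (4, 0.01)])
ordered = sorted([("a", guide_a), ("b", guide_b)], key=lambda g: off_target_key(g[1]))
```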
Off-Target Cutting Frequency Determination (CFD) Score Calculation
The Cutting Frequency Determination (CFD) score is calculated from a matrix of percent-activity values, which assigns a penalty to each possible mismatch type at each position within the guide RNA sequence. This matrix, together with a full description of it, will become available pending publication.
For example, if the interaction between the sgRNA and DNA has a single rG:dA ("RNA G aligning with DNA A") mismatch at position 6, then that interaction receives a score of 0.67. If there are two or more mismatches, the individual mismatch values are multiplied together. For example, an rG:dA mismatch at position 7 coupled with an rC:dT mismatch at position 10 receives a CFD score of 0.57 x 0.87 = 0.50.
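The multiplication rule can be sketched in a few lines. The lookup table below holds only the three values quoted in the examples above, since the full matrix is not given here; the table layout and function name are hypothetical.

```python
import math

# Partial table: only the three percent-activity values quoted in the text.
# Keys are (mismatch type, position in the guide).
ACTIVITY = {
    ("rG:dA", 6): 0.67,
    ("rG:dA", 7): 0.57,
    ("rC:dT", 10): 0.87,
}


def cfd_score(mismatches, table=ACTIVITY):
    """Multiply the per-mismatch activity values; a perfect match scores 1.0."""
    return math.prod(table[m] for m in mismatches)


single = cfd_score([("rG:dA", 6)])
double = cfd_score([("rG:dA", 7), ("rC:dT", 10)])
```

Here single is 0.67 and double rounds to 0.50, matching the worked examples; an empty mismatch list gives 1.0, which corresponds to Match Bin I.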
DHS scoring (CRISPRa/i only)
For CRISPRa/i annotation, we also take into account whether the target sequence lies within a known ENCODE-annotated DNase I Hypersensitive Site. This is represented as a score ranging from 0 to 1 (1 is highest). At the moment the score is effectively binary, as no values between 0 and 1 are assigned, though this may change in the future. The score is not used in ranking the sgRNA candidate sequences; rather, it is used as a filter during the cyclic picking algorithm described below.
sgRNA Picking
Once all candidate sgRNA sequences are fully annotated and ranked, the sgRNA Designer cycles through the list of candidates, attempting to pick sequences until the desired quota is reached. To pick sgRNAs for each transcript, we first choose the best-ranked sgRNA that satisfies two basic constraints: it targets within the 5 – 65% region of the protein-coding sequence of the target gene, and it has an on-target score ≥ 0.2. We then select additional sgRNAs per transcript that satisfy the same constraints and that also target a site at least 5% away (from a protein-coding standpoint) from every previously picked sgRNA. This ensures diversity in target space, which is especially useful because exons present in the reference transcript may not be included in the particular cellular model to which the library is applied. To meet the requested quota for some target genes, we may need to perform multiple rounds of picking, with each round relaxing some constraint, such as the 5 – 65% protein-coding region, the minimum Rule Set 2 score, or the 5% spacing criterion.
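The cyclic picking with relaxation can be sketched as below. The thresholds in the first round come from the text; the order in which constraints are relaxed, the number of rounds, and the candidate record fields ('id', 'cut_pct', 'score') are assumptions made for illustration only.

```python
# Each round relaxes one more constraint. The relaxation order is an assumption.
ROUNDS = [
    {"window": (5, 65), "min_score": 0.2, "spacing": 5},   # strict constraints
    {"window": (0, 100), "min_score": 0.2, "spacing": 5},  # relax coding window
    {"window": (0, 100), "min_score": 0.0, "spacing": 5},  # relax minimum score
    {"window": (0, 100), "min_score": 0.0, "spacing": 0},  # relax spacing
]


def pick_sgrnas(candidates, quota, rounds=ROUNDS):
    """Pick up to 'quota' sgRNAs from candidates sorted best-rank first.

    'cut_pct' is the target position as a percentage of the protein-coding
    region; 'score' is the on-target score.
    """
    picked = []
    for rule in rounds:
        low, high = rule["window"]
        for cand in candidates:
            if len(picked) >= quota:
                return picked
            if cand in picked:
                continue
            if not (low <= cand["cut_pct"] <= high):
                continue
            if cand["score"] < rule["min_score"]:
                continue
            # Reject candidates closer than the spacing to any earlier pick.
            if any(abs(cand["cut_pct"] - p["cut_pct"]) < rule["spacing"] for p in picked):
                continue
            picked.append(cand)
    return picked


# Hypothetical candidates, already sorted by combined rank.
candidates = [
    {"id": "g1", "cut_pct": 10, "score": 0.8},
    {"id": "g2", "cut_pct": 12, "score": 0.9},  # too close to g1
    {"id": "g3", "cut_pct": 40, "score": 0.1},  # on-target score below 0.2
    {"id": "g4", "cut_pct": 70, "score": 0.6},  # outside the 5 - 65% window
    {"id": "g5", "cut_pct": 30, "score": 0.5},
]
picked_ids = [c["id"] for c in pick_sgrnas(candidates, quota=3)]
```

With a quota of 3, the strict round picks g1 and g5, and the first relaxed round adds g4 once the coding-window constraint is lifted.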