How to use the sgRNA Designer (CRISPRko)

Inputs:

This tool accepts raw DNA sequences, FASTA files, RefSeq Transcript IDs, NCBI Gene IDs, or Gene Symbols. You may provide sgRNA targets either by pasting them directly into the text box or by uploading a file. If you provide both text box input and a file upload, the uploaded file will be used and the text input will be discarded. Input is limited to 10 gene or transcript IDs, or 10,000 bases of raw DNA sequence. An earlier sgRNA design tool supported Ensembl transcript IDs, but that version has now been decommissioned.

When providing raw DNA sequences, be aware that the Azimuth 2.0 on-target scoring model considers 4 bases of context on the 5' end and 6 bases of context on the 3' end of each target. That means the shortest possible input sequence that can contain a valid target is 30 bases: the target itself plus 4 bases of 5' context and 6 bases of 3' context.
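
As a rough illustration of the 30-base requirement, the following Python sketch lists every full-length context window in a raw sequence whose PAM matches NGG. It is not the tool's own code. It assumes a 20-base target (30 minus the 4 and 6 context bases) and assumes that the 6 bases of 3' context consist of the 3-base PAM followed by 3 more bases. The function names are hypothetical.

```python
CONTEXT_5P = 4    # bases of 5' context, as stated above
CONTEXT_3P = 6    # bases of 3' context (assumed here: 3-base PAM + 3 more bases)
SPACER_LEN = 20   # assumption: 30 - 4 - 6
CONTEXT_LEN = CONTEXT_5P + SPACER_LEN + CONTEXT_3P  # 30


def revcomp(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def context_windows(seq):
    """Yield (strand, spacer, pam, context) for each 30-base window with an NGG PAM."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for start in range(len(s) - CONTEXT_LEN + 1):
            context = s[start:start + CONTEXT_LEN]
            spacer_end = CONTEXT_5P + SPACER_LEN
            spacer = context[CONTEXT_5P:spacer_end]
            pam = context[spacer_end:spacer_end + 3]
            # Skip windows containing ambiguous bases such as N.
            if pam[1:] == "GG" and set(context) <= set("ACGT"):
                yield strand, spacer, pam, context


example = "ACGT" + "A" * 20 + "TGG" + "CCC"   # exactly 30 bases
hits = list(context_windows(example))
```

A 29-base input yields no windows at all, because no position leaves room for the full context.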

Output:

This tool produces two output files. The primary output is a downloadable .txt file containing sgRNAs designed against the coding sequences. The tool also produces a summary file giving statistics for on- and off-target scores by pick order.

Output File Column Descriptions:

Input
Target gene or transcript ID as originally requested.
Quota
Desired number of candidate sgRNA sequences to pick for this target.
Target Taxon
Taxon of the target gene.
Target Gene ID
ID of the target gene (usually numeric, e.g. "Entrez-Gene" ID).
Target Gene Symbol
Official symbol of target gene.
Target Transcript
ID of target transcript if any.
Target Alias
For a DNA sequence target in an uploaded FASTA file, this is the unique ID associated with that sequence in the file.
Target Mode
In what way is the actual target region defined? TRANSCRIPT = all exons of the target transcript, including non-coding regions; CDS = only the coding portions of the exons of the target transcript. The CDS Target Mode has two effects on the operation of the sgRNA Designer: (1) candidate sgRNA sequences are extracted from the target transcript's coding region only, rather than from its entire exonic sequence; (2) the picking algorithm prefers sgRNA sequences which cut nearer to the beginning of the target CDS sequence (first trying within the 5-65% region before successively relaxing to 0-80% and 0-100%). NOTE: currently all raw DNA sequence inputs are treated as CDS sequences during picking.
PAM Policy
Currently limited to NGG only.
Initial Spacing Requirement
When possible, pick sgRNA sequences which are separated by a certain distance (measured here as a percentage of the length of the entire target region). This is called "Initial" because this requirement is relaxed in subsequent picking "rounds" if the quota is not met after examining all candidate sgRNAs for the target.
Off-Target Match Rule Set Version ("CFD score")
Method of calculating an off-target match, or "CFD" (Cutting Frequency Determination) score; currently there is only one off-target rule set ("1").
Off-Target Tier Policy
Method used to categorize off-target matches into "Tiers"; currently there is only one such policy ("1"), which breaks down as follows: Tier I: protein-coding regions; Tier II: any position within the transcribed sequence of a coding gene (intron or exon); Tier III: any position within the transcribed sequence of a non-coding gene; Tier IV: positions not contained in any gene (i.e. not transcribed).
Off-Target Match Bin Policy
Thresholds used to categorize off-target matches into "Match Bins" according to CFD score. There are four bins delimited by three thresholds, listed in increasing numerical order and separated by periods. Threshold values are in hundredths. For example, "5.20.100" represents the following 4 bins: Bin I: CFD = 1.0, Bin II: 1.0 > CFD ≥ 0.2, Bin III: 0.2 > CFD ≥ 0.05, Bin IV: CFD < 0.05.
Reference Sequence
ID of the reference sequence (transcript or chromosome) containing the cut position. If this is a transcript ID, it means that the tool was unable to resolve the user's input to positions in genomic DNA, and has instead resorted to using the known RNA sequence of the transcript (possibly with exon boundary information if available). This is suboptimal because in these cases the tool cannot pick sgRNA target sequences which extend into intronic sequence. Furthermore, if no exon boundaries are known, the tool may pick an sgRNA sequence that spans an exon-exon boundary in the RNA, which is nonsensical from a biological standpoint.
sgRNA Cut Position (1-based)
1-based position of the base in the genomic sequence after the predicted DNA cut caused by this candidate sgRNA.
Strand of Target
If the Reference Sequence is a genomic reference sequence rather than a transcript ID, this column indicates the strand of the target gene as a whole (+ or -) along the chromosome.
Strand of sgRNA
If the Reference Sequence is a genomic reference sequence rather than a transcript ID, this column indicates the strand of the sgRNA sequence as it is projected (+ or -) onto the chromosome.
Orientation
The orientation (sense or antisense) of the sgRNA sequence with respect to the strand of the targeted sequence.
sgRNA Sequence
The sequence of the candidate sgRNA.
sgRNA Context Sequence
The longer context sequence (containing the sgRNA sequence) used for on-target efficacy scoring.
PAM Sequence
Sequence of the PAM for this match.
Exon Number
Exon within the target region containing this cut position (if CDS or TRANSCRIPT target mode only).
Target Cut Length
The number of bases in the target that come before the cut position, where the "target" is the total set of (possibly discontinuous) bases targeted. For protein-coding targets, this is the number of bases before the cut position in the transcribed CDS, so it can be used to calculate the percentage of protein truncation.
Target Total Length
The total combined length of all bases targeted. If the target transcript has multiple exons, then this may be a discontinuous set of sequences.
Target Cut %
The percentage of the target that comes before the cut, following the direction of transcription. For CDS targets, this is a suggestion of the resulting possible truncation of the protein product, assuming the cut results in a frameshift and early stop codon.
# Off-Target Tier I Match Bin I Matches
Count of all off-target matches for Tier I and Match Bin I.
# Off-Target Tier II Match Bin I Matches
Count of all off-target matches for Tier II and Match Bin I.
# Off-Target Tier III Match Bin I Matches
Count of all off-target matches for Tier III and Match Bin I.
# Off-Target Tier IV Match Bin I Matches
Count of all off-target matches for Tier IV and Match Bin I.
# Off-Target Tier I Match Bin II Matches
Count of all off-target matches for Tier I and Match Bin II.
# Off-Target Tier II Match Bin II Matches
Count of all off-target matches for Tier II and Match Bin II.
# Off-Target Tier III Match Bin II Matches
Count of all off-target matches for Tier III and Match Bin II.
# Off-Target Tier IV Match Bin II Matches
Count of all off-target matches for Tier IV and Match Bin II.
# Off-Target Tier I Match Bin III Matches
Count of all off-target matches for Tier I and Match Bin III.
# Off-Target Tier II Match Bin III Matches
Count of all off-target matches for Tier II and Match Bin III.
# Off-Target Tier III Match Bin III Matches
Count of all off-target matches for Tier III and Match Bin III.
# Off-Target Tier IV Match Bin III Matches
Count of all off-target matches for Tier IV and Match Bin III.
# Off-Target Tier I Match Bin IV Matches
Count of all off-target matches for Tier I and Match Bin IV.
# Off-Target Tier II Match Bin IV Matches
Count of all off-target matches for Tier II and Match Bin IV.
# Off-Target Tier III Match Bin IV Matches
Count of all off-target matches for Tier III and Match Bin IV.
# Off-Target Tier IV Match Bin IV Matches
Count of all off-target matches for Tier IV and Match Bin IV.
On-Target Rule Set
Model used for calculating "On-Target" efficacy score (currently the only supported version is "Azimuth_2.0", which is the updated version of "Rule Set 2" from Doench, Fusi et al., Nature Biotechnology 2016). See Azimuth 2.0.
On-Target Efficacy Score
Actual on-target score of the context sequence for this candidate sgRNA as calculated using the on-target rule set.
On-Target Rank
Numerical rank (1 is highest) of this candidate sgRNA's On-Target score in relation to all other candidates for this target.
Off-Target Rank
Numerical rank (1 is highest, i.e. most-specific) of this candidate sgRNA's Off-Target evaluation in relation to all other candidates for this target.
On-Target Rank Weight
When combining On-Target and Off-Target rankings into one Combined Rank, use this weight for the On-Target Rank.
Off-Target Rank Weight
When combining On-Target and Off-Target rankings into one Combined Rank, use this weight for the Off-Target Rank.
Combined Rank
Numerical rank (1 is highest) of this candidate sgRNA based on the weighted sum of On-Target and Off-Target ranks.
Pick Order
If this candidate was picked to fulfil the target's quota, the order in which it was picked.
Picking Round
Candidate picking is complex and may go through multiple rounds of relaxing constraints. This column indicates in which round the pick occurred.
Picking Notes
This column indicates reasons why the construct was skipped during picking. It is empty if the candidate was picked during round 1.
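
Several of the columns above are simple functions of one another. The Python sketch below shows one way to reproduce "Target Cut %", the CDS position windows described under Target Mode, and a weighted-sum Combined Rank. It is an illustration under stated assumptions rather than the tool's implementation: the window bounds are treated as inclusive, ties in the weighted sum keep their input order, and all function names are hypothetical.

```python
# CDS position windows tried in order, as percentages of the target (see Target Mode).
CDS_WINDOWS = [(5.0, 65.0), (0.0, 80.0), (0.0, 100.0)]


def target_cut_percent(cut_length, total_length):
    """Percentage of the target that precedes the cut."""
    return 100.0 * cut_length / total_length


def first_window(cut_pct):
    """Return the 1-based index of the first window containing cut_pct, else None.

    Assumption: window bounds are inclusive.
    """
    for index, (low, high) in enumerate(CDS_WINDOWS, start=1):
        if low <= cut_pct <= high:
            return index
    return None


def combined_ranks(candidates, on_weight=1.0, off_weight=1.0):
    """Rank candidates by the weighted sum of their on- and off-target ranks.

    candidates: list of (name, on_target_rank, off_target_rank) tuples.
    Returns {name: combined_rank}, where 1 is best. Ties keep input order
    (an assumption; the tool's tie-breaking is not documented here).
    """
    ordered = sorted(candidates, key=lambda c: on_weight * c[1] + off_weight * c[2])
    return {name: rank for rank, (name, _, _) in enumerate(ordered, start=1)}
```

For example, a cut 150 bases into a 1,000-base CDS gives a Target Cut % of 15.0, which falls inside the first (5-65%) window.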

Frequently Asked Questions:

Q: What does it mean when there are no "eligible exons" for my target?

Generally speaking, this indicates a transient omission or inconsistency in our reference data sources for the target organism: the NCBI gene data file and the GenBank transcript data file. Specifically, this may indicate (but is not necessarily limited to) one of the following cases:

  1. The transcript version in the GenBank file does not match the version referenced in the gene data file's chromosomal mapping information.
  2. Even if transcript versions do match, the GenBank exon annotation is not consistent (in length or number) with the genomic coordinates in the gene data file.
  3. The target transcript maps to an "unlocalized", "unplaced", or "alternate" region of the genome. While most genome assemblies provide this information, we are not currently utilizing it in our default sgRNA design process.

We update our mirror of NCBI data every 2-3 weeks, so in most cases these discrepancies tend to "fix themselves".

Q: Why doesn't your tool take my favorite input type (Ensembl ID/chromosomal locus information, etc.)?

Using NCBI transcript IDs allows us to use specific, up-to-date identifiers that match with information contained in our database and provide key information regarding gene structure for the majority of mouse and human genes. Future iterations of this tool may allow for more flexibility with regard to user input.

Q: I would like to relax one of your pick restrictions. How can I do that?

In the future we plan to give users more flexibility in adjusting some of the picking criteria. In the meantime, the default report shows all possible guides for a target, and users should be able to sort and filter results based on their own set of criteria using the information contained in the various annotation columns.
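
As one possible way to apply your own criteria, the Python sketch below reads a report and keeps candidates that pass two thresholds. It assumes the downloaded .txt file is tab-delimited with a header row, and that the "On-Target Efficacy Score" and "Target Cut %" columns hold plain numbers. The threshold values and function names are hypothetical and should be adapted to your experiment.

```python
import csv
import io


def load_guides(handle):
    """Read a tab-delimited report (assumed format) into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def filter_guides(rows, min_on_target=0.5, max_cut_pct=65.0):
    """Keep rows passing both thresholds, sorted by on-target score (best first)."""
    kept = []
    for row in rows:
        try:
            score = float(row["On-Target Efficacy Score"])
            cut_pct = float(row["Target Cut %"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        if score >= min_on_target and cut_pct <= max_cut_pct:
            kept.append(row)
    return sorted(kept, key=lambda r: float(r["On-Target Efficacy Score"]), reverse=True)


# Tiny made-up report used only to demonstrate the two functions.
EXAMPLE_REPORT = (
    "sgRNA Sequence\tOn-Target Efficacy Score\tTarget Cut %\n"
    "AAA\t0.7\t20.0\n"
    "CCC\t0.4\t10.0\n"
    "GGG\t0.9\t90.0\n"
    "TTT\t0.8\t5.0\n"
)
picked = filter_guides(load_guides(io.StringIO(EXAMPLE_REPORT)))
```

With a real report, replace the `io.StringIO` wrapper with an opened file, for example `open(path, newline="")`.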

Q: Your report contains 50+ columns! Which are the most important?

The "Target Gene ID" or "Target Gene Symbol", "Target Transcript" and "sgRNA Sequence" columns will presumably be necessary for all applications of the tool. If you are using this tool to fill a per-target quota (the current default is 5), then the "Pick Order" column reflects the final decision of the tool and incorporates all other rank, score and positional columns. Advanced users may want to ignore the "Pick Order" and instead use, for example, the "Target Cut Length", "On-Target Efficacy Score" and the 16 "Off-Target Match" columns directly.

Q: What are my options if I still want to design sgRNAs against an arbitrary sequence or a target that your tool rejects?

If you have confidence in a particular DNA sequence to design against, you can use that sequence as input. Please be advised that gene/transcript structure will not be inferred for a raw sequence. For instance, if you want to target specific exons or avoid intronic sequence for a transcript using this method, you may need to edit your sequence to avoid unintentionally targeting an unwanted sequence or feature.

Q: I have heard the term "Threat Matrix" used describing some of the columns in the results. What is the "Threat Matrix"?

This informal term is used internally to describe the 16 columns in the result report that summarize the number of off-target hits, arranged by Tier and by Match Bin (which is based on CFD score), starting with the column headed "# Off-Target Tier I Match Bin I Matches". See the information listed above in the Output section and on our sgRNA Scoring Help Page for more information about these columns and what they mean.
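
To make the Match Bin axis of this matrix concrete, the Python sketch below turns a bin policy string such as "5.20.100" into thresholds and assigns a CFD score to a bin, following the bin definitions given under "Off-Target Match Bin Policy". The function names are hypothetical, and the sketch is not taken from the tool itself.

```python
def parse_bin_policy(policy):
    """Convert a policy string such as '5.20.100' into thresholds [0.05, 0.2, 1.0]."""
    thresholds = [int(part) / 100 for part in policy.split(".")]
    if len(thresholds) != 3 or thresholds != sorted(thresholds):
        raise ValueError(f"expected three increasing thresholds, got {policy!r}")
    return thresholds


def match_bin(cfd, policy="5.20.100"):
    """Return the Match Bin ('I' to 'IV') for a CFD score under the given policy."""
    low, mid, high = parse_bin_policy(policy)
    if cfd >= high:
        return "I"      # CFD = 1.0 under the example policy
    if cfd >= mid:
        return "II"     # 1.0 > CFD >= 0.2
    if cfd >= low:
        return "III"    # 0.2 > CFD >= 0.05
    return "IV"         # CFD < 0.05
```

Under the example policy, a CFD score of 0.2 lands in Bin II and a score of 0.19 lands in Bin III.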

Q: The columns describing the number of off-target hits contain the value "MAX". What does this mean?

If the total number of discovered off-target hits for a particular sgRNA sequence exceeds 10,000, we abort the search and report the value "MAX" in all off-target count columns instead. NOTE: This does not mean that any given column has a count exceeding 10,000, merely that the total of all 16 columns exceeds 10,000 by an unknown amount.
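
When processing the report programmatically, the "MAX" marker needs separate handling because it is not a number. The Python sketch below builds the 16 column names in the order listed in the Output section and returns `None` for a candidate whose search was aborted. It assumes each row is a mapping from column name to string value. The function names are hypothetical.

```python
TIERS = ("I", "II", "III", "IV")
BINS = ("I", "II", "III", "IV")


def off_target_columns():
    """Return the 16 off-target column names, Tier I-IV within each Match Bin."""
    return [
        f"# Off-Target Tier {tier} Match Bin {match_bin} Matches"
        for match_bin in BINS
        for tier in TIERS
    ]


def read_off_target_counts(row):
    """Return {column: count} for one report row, or None if the row reports MAX.

    MAX means the total across all 16 columns exceeded the search limit,
    so no individual count is available.
    """
    columns = off_target_columns()
    values = [row[column].strip() for column in columns]
    if "MAX" in values:
        return None
    return {column: int(value) for column, value in zip(columns, values)}
```

Treating MAX rows as "unknown" rather than as a large number avoids implying a specific count for any single column.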

Q: Are your Rule Set 2 libraries available for purchase?

Both mouse (Brie) and human (Brunello) libraries will be available early 2016 via Addgene.

Q: How are transcripts chosen for gene inputs?

Whenever possible, we pick a single representative transcript for a protein-coding gene based on three kinds of tags used by Ensembl and GENCODE before defaulting to a transcript based on CDS length, in the following order:

  1. APPRIS annotations, which prioritize transcripts based on protein structural information, functionally important residues and evidence from cross-species alignments.
  2. The GENCODE "level" value, which indicates the level of confidence that a given locus is valid and not a pseudogene.
  3. The GENCODE "transcript support level", which indicates how well supported a transcript model is based on mRNA and EST alignments.
  4. Finally, in the absence of any APPRIS/GENCODE annotations listed above we rely on current NCBI transcript information for a given gene, and select the transcript with the longest predicted CDS.

If a gene input does not resolve to the user's preferred transcript, the user can instead provide that specific transcript as input.
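
The priority order above amounts to a lexicographic comparison, which the following Python sketch illustrates. It is a hypothetical model, not the tool's code: the `Transcript` class and its field names are invented for illustration, each annotation is encoded as an integer where a lower value means a better annotation, and a missing annotation is treated as worse than any present value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcript:
    transcript_id: str
    cds_length: int
    appris_rank: Optional[int] = None     # hypothetical encoding, lower is better
    gencode_level: Optional[int] = None   # lower is better
    support_level: Optional[int] = None   # transcript support level, lower is better


def _priority(transcript):
    """Sort key: APPRIS, then GENCODE level, then support level, then longest CDS."""
    def or_worst(value):
        return float("inf") if value is None else value

    return (
        or_worst(transcript.appris_rank),
        or_worst(transcript.gencode_level),
        or_worst(transcript.support_level),
        -transcript.cds_length,  # longer CDS wins when everything else is equal
    )


def pick_representative(transcripts):
    """Return the single transcript with the best priority key."""
    return min(transcripts, key=_priority)
```

When no transcript carries any annotation, the first three key elements are equal for all of them and the choice falls through to CDS length, matching step 4.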

For more details regarding the GENCODE system of annotations and data formats, please refer to the GENCODE documentation.
