The TRC shRNA Design Process
Overview
We design shRNA constructs ("clones") with an algorithm that uses several criteria to rank potential 21mer targets within each human and mouse RefSeq transcript. The algorithm applies a set of rules, including those derived from the siRNA literature, analysis of TRC library performance datasets, constraints on the synthesis and cloning of the oligonucleotides, and others. In applying the algorithm, we aim to balance two competing goals: make hairpins that effectively knock down the target transcript and, as far as possible, make hairpins that knock down only one gene and do not directly alter other genes (so-called 'off-target' effects). Each goal presents distinct challenges. The criteria for predicting effective knockdown with either siRNA or shRNA are not well understood and are still being developed and refined. Specificity is constrained by genome evolution: because many genes belong to extensive gene families, targeting a specific family member can be difficult. Furthermore, functionally distinct genes share many motifs with underlying nucleic acid sequence similarity. Our knowledge of transcript structure and variants is also still very incomplete. For all these reasons and more, we construct several shRNAs for each transcript, expecting a range of knockdown efficiencies across the set and at least a few that knock down effectively.
Users of this database should be aware that, for consistent and reliable annotation, the RNAi Consortium decided early on to use NCBI's RefSeq collection of transcripts as the definitive source of the primary target sequences for shRNA design.
As a general rule, we construct shRNAs targeting just one RefSeq transcript for each NCBI gene. Because of the high sequence identity among different transcripts from the same gene, the majority of the shRNAs target all known transcript variants.
A brief narrative of the candidate selection process
Get the Candidate Sequences
For one representative RefSeq transcript per human and mouse gene, we assess all 21mers, from those starting 25 bp after the beginning of the CDS up to those starting 150 bp from the end of the transcript. Each 21mer is called a "candidate".
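The window described above can be sketched in a few lines of Python. This is only an illustration, not the TRC implementation: the function name, its parameters, the 0-based coordinates, and the reading of "starting 150 bp from the end" as an inclusive upper bound on the start position are all assumptions.

```python
from typing import Iterator, Tuple

CANDIDATE_LENGTH = 21


def enumerate_candidates(transcript: str, cds_start: int,
                         skip_after_cds_start: int = 25,
                         min_distance_from_end: int = 150) -> Iterator[Tuple[int, str]]:
    """Yield (start, 21mer) pairs; `start` and `cds_start` are 0-based offsets."""
    first_start = cds_start + skip_after_cds_start
    # Assumption: the last allowed 21mer starts exactly 150 bases before
    # the end of the transcript (inclusive bound).
    last_start = len(transcript) - min_distance_from_end
    for start in range(first_start, last_start + 1):
        yield start, transcript[start:start + CANDIDATE_LENGTH]
```

A transcript shorter than the two margins combined simply yields no candidates, because the range is empty.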
Calculate the intrinsic score for each Candidate
Each candidate is assigned an intrinsic score by applying a set of rules that either penalize or reward features predicting successful knockdown and clone-design considerations. The intrinsic score is the product of all such penalties/rewards, as listed for each Rule Set in the tables below.
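Because the intrinsic score is a product, a single rule that returns 0 excludes a candidate no matter how well it does elsewhere. A minimal sketch of that behavior follows; the two toy rules and all names are hypothetical stand-ins, not the actual rule implementations.

```python
import math
from typing import Callable, Iterable

Rule = Callable[[str], float]


def intrinsic_score(candidate: str, rules: Iterable[Rule]) -> float:
    """Multiply the scores returned by every rule for one candidate."""
    return math.prod(rule(candidate) for rule in rules)


# Two toy rules (hypothetical), used only to show how the product behaves.
def starts_with_aa(candidate: str) -> float:
    return 0.0 if candidate.startswith("AA") else 1.0


def toy_gc_reward(candidate: str) -> float:
    gc_fraction = sum(base in "GC" for base in candidate) / len(candidate)
    return 3.0 if gc_fraction <= 0.55 else 1.0
```

Rewards above 1 raise the product, penalties between 0 and 1 lower it, and a score of exactly 1 leaves it unchanged.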
Calculate the miRNA seed factor for each Candidate
Candidates are also rewarded or penalized based on the frequency of predicted microRNA-like off-target effects via "seed" matches.
Score the Candidate Sequences for Specificity
The candidates are ranked by an intermediate score, defined as the product of the intrinsic score and the miRNA seed factor above, and the best 250 per target gene are identified. For each of these 250 candidates, we calculate a specificity factor to promote candidates without significant sequence similarity to other genes. Each candidate is compared by BLASTN to the targeted transcriptome (i.e., human or mouse) as defined by RefSeq. Any candidate that has more than three differences in the initial 19 bases of the 21mer (the "specificity-defining region", or SDR) with every (non-self) gene is considered unique for that reference set. The specificity factor is weighted such that a perfect match to another gene receives the worst penalty (spec. factor = 0.5), while a match of only 16/19 of the SDR receives a less severe penalty (spec. factor = 0.8).
Candidates with no off-target matches receive a specificity factor "bonus" value of 1.3. Furthermore, if the candidate is a perfect match to a majority of the current transcripts of the target gene, its specificity factor is 1.4.
NOTE: the concept of "off-target" is relative to a particular target gene, so the specificity factor of a candidate sequence is recorded separately for each of its potential target genes.
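The mapping from SDR differences to a specificity factor can be sketched as follows. Only the factors 0.5, 0.8, 1.3, and 1.4 come from the text. The values for one and two differences are assumed placeholders, the 1.4 bonus is assumed to apply only to candidates with no off-target matches, and the gap-free comparison stands in for the BLASTN alignment actually used. All function and variable names are hypothetical.

```python
from typing import Iterable

SDR_LENGTH = 19  # the first 19 bases of the 21mer

# Factors for 0 and 3 differences come from the text; the values for
# 1 and 2 differences are ASSUMED placeholders, not published numbers.
OFF_TARGET_FACTORS = {0: 0.5, 1: 0.6, 2: 0.7, 3: 0.8}


def sdr_differences(candidate: str, other: str) -> int:
    """Count mismatches between two aligned 21mers over the SDR."""
    return sum(a != b for a, b in zip(candidate[:SDR_LENGTH], other[:SDR_LENGTH]))


def specificity_factor(off_target_differences: Iterable[int],
                       matches_majority_of_target_transcripts: bool = False) -> float:
    """Return the factor given the SDR difference count to each off-target hit."""
    # With no hits at all, treat the candidate as maximally different.
    closest = min(off_target_differences, default=SDR_LENGTH)
    if closest > 3:  # unique: more than three differences to every other gene
        # Assumption: the 1.4 bonus applies only to unique candidates.
        return 1.4 if matches_majority_of_target_transcripts else 1.3
    return OFF_TARGET_FACTORS[closest]
```

Because "off-target" is defined relative to a target gene, such a function would be called once per candidate and target gene pair, with hits to the target gene itself left out of `off_target_differences`.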
Calculate a final adjusted score for each candidate relative to a target gene
The adjusted score for a candidate targeting a particular gene is the product of its intrinsic score, miRNA seed factor, and specificity factor above. Larger numbers are better, and for the latest rule set (9), the scores range from about 15 (best) down to 0 (worst).
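As a small illustration of the product and the "larger is better" ordering, the following sketch computes and ranks adjusted scores. The class and function names are hypothetical, and the example values are made up rather than taken from the library.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ScoredCandidate:
    sequence: str
    intrinsic_score: float
    seed_factor: float
    specificity_factor: float

    @property
    def adjusted_score(self) -> float:
        # Product of the three factors described in the text.
        return self.intrinsic_score * self.seed_factor * self.specificity_factor


def rank_by_adjusted_score(candidates: Iterable[ScoredCandidate]) -> List[ScoredCandidate]:
    """Sort best-first; larger adjusted scores are better."""
    return sorted(candidates, key=lambda c: c.adjusted_score, reverse=True)
```

Since the specificity factor is recorded per target gene, one candidate sequence may need a separate `ScoredCandidate` for each gene it could target.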
Avoid overlapping shRNAs
In order to create a library of many distinct shRNAs for every human and mouse gene, we consider the target region of the target gene (e.g. CDS, 3'UTR) as well as overlap of top-scoring candidates with existing library clones and with each other. Candidates are ranked by adjusted score and then assessed for target region and overlap with existing or ordered shRNAs until the desired number of candidates is selected per gene. We first attempt to select new candidates that have both a specificity factor and an adjusted score greater than 1, have no overlap with other TRC shRNAs, and are distributed in a 4:1 ratio between the CDS and 3'UTR. If too few candidates meet these criteria, we allow some (but not full) overlap and/or relax the CDS:3'UTR ratio before relaxing the score requirements.
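The first, strict selection pass can be sketched as a greedy loop over candidates sorted by adjusted score. This is a simplified illustration under several assumptions: the names, the region labels, the default of five picks per gene, and the way the 4:1 ratio is turned into per-region quotas are all hypothetical, and the relaxation steps are left out.

```python
from typing import List, NamedTuple

CANDIDATE_LENGTH = 21


class Candidate(NamedTuple):
    start: int                 # 0-based start on the target transcript
    region: str                # "CDS" or "3UTR" (labels are assumptions)
    adjusted_score: float
    specificity_factor: float


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True if the two 21mers share at least one transcript position."""
    return abs(a.start - b.start) < CANDIDATE_LENGTH


def select_candidates(candidates: List[Candidate], existing: List[Candidate],
                      n_wanted: int = 5) -> List[Candidate]:
    # Assumed quota: one 3'UTR pick for every four CDS picks (4:1).
    quota = {"3UTR": n_wanted // 5}
    quota["CDS"] = n_wanted - quota["3UTR"]
    chosen: List[Candidate] = []
    for cand in sorted(candidates, key=lambda c: c.adjusted_score, reverse=True):
        if len(chosen) == n_wanted:
            break
        if cand.adjusted_score <= 1 or cand.specificity_factor <= 1:
            continue  # strict pass: both values must exceed 1
        if quota.get(cand.region, 0) == 0:
            continue  # this region's share is already filled
        if any(overlaps(cand, other) for other in existing + chosen):
            continue  # no overlap with library clones or earlier picks
        chosen.append(cand)
        quota[cand.region] -= 1
    return chosen
```

If this pass returns fewer than `n_wanted` candidates, a later pass would loosen the overlap test or the quotas before touching the score thresholds, following the order given in the text.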
Current Rule Set
Rule Set 9
Rule | Name | Description |
---|---|---|
1 | aaStart9 | Exclude any candidate beginning with AA (score = 0) |
2 | fourRow9 | Exclude any candidate containing a run of four of the same base in a row (score = 0) |
3 | gcScore9 | Exclude candidates with extreme GC percentage (GC <= 25% or > 60%); promote candidates with GC between 25-55% (score = 3); if GC > 55% and <= 60% then score = 1 (neutral) |
4 | nonGATC9 | Exclude any candidate containing ambiguous bases (e.g. N) (score = 0) |
5 | restrictionSite9 | Exclude any candidate containing certain restriction sites: ...GGTACC..., ...GAATTC..., ...CTCGAG..., ...CATATG..., ...ACTAGT..., ...GGTAC, ...GAATT, GTACC..., TACC..., CTAGT... |
6 | sevenGC9 | Exclude any candidate with a run of 7 C/G bases (score = 0) |
7 | stemLoopStem | Penalize candidates that can form an internal stem-loop (score = 0.1) (minimum stem length = 5, minimum loop size = 4) |
8 | threePrimeClamp6 | Give precedence to candidates with weaker base-pairing at positions 15-20 (priority on pos. 17-19); score = 5 if all 6 positions are A or T, decreasing to 0.1 if all 6 are G/C. Score drops off steeply as the number of A/T bases decreases. |
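Several of the Rule Set 9 rules are simple enough to sketch directly. The functions below cover aaStart9, fourRow9, gcScore9, nonGATC9, and sevenGC9 only. The restriction-site, stem-loop, and 3' clamp rules are omitted because the table does not give enough detail to reproduce them. The function names are hypothetical, sequences are assumed to be uppercase DNA, and "exclude" is assumed to mean a score of 0 for gcScore9 as it does for the other rules.

```python
import re


def aa_start9(seq: str) -> float:
    """Rule 1: exclude candidates beginning with AA."""
    return 0.0 if seq.startswith("AA") else 1.0


def four_row9(seq: str) -> float:
    """Rule 2: exclude a run of four identical bases."""
    return 0.0 if re.search(r"(.)\1{3}", seq) else 1.0


def gc_score9(seq: str) -> float:
    """Rule 3: exclude GC <= 25% or > 60%; 3 up to 55%; 1 above that."""
    gc_percent = 100.0 * sum(base in "GC" for base in seq) / len(seq)
    if gc_percent <= 25 or gc_percent > 60:
        return 0.0  # assumption: "exclude" means score = 0 here too
    return 3.0 if gc_percent <= 55 else 1.0


def non_gatc9(seq: str) -> float:
    """Rule 4: exclude any base other than G, A, T, C (e.g. N)."""
    return 0.0 if re.search(r"[^GATC]", seq) else 1.0


def seven_gc9(seq: str) -> float:
    """Rule 6: exclude a run of 7 C/G bases."""
    return 0.0 if re.search(r"[GC]{7}", seq) else 1.0


def partial_rule_set9_scores(seq: str) -> dict:
    """Score one candidate against the rules sketched above."""
    rules = (aa_start9, four_row9, gc_score9, non_gatc9, seven_gc9)
    return {rule.__name__: rule(seq) for rule in rules}
```

Returning the per-rule scores as a dictionary keeps it visible which rule excluded a candidate before the scores are multiplied together.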
Previous Rule Sets
Rule Set 8
Rule | Name | Description |
---|---|---|
1 | aaStart | Penalize candidates beginning with AA (score = 0.000000000000001) |
2 | fourRow | Penalize candidates containing four of the same base in a row (score = 0.01) |
3 | gcScore8 | Penalize candidates with extreme GC percentage (GC <= 25% or > 60%; score = 0.01); promote candidates with GC between 25-55% (score = 3); if GC > 55% and <= 60% then score = 1 (neutral) |
4 | nonGATC | Penalize candidates containing an ambiguous base (e.g. N) (score = 0.000000000000001) |
5 | restrictionSite8 | Penalize any candidate containing certain restriction sites: ...GGTACC..., ...GAATTC..., ...CTCGAG..., ...CATATG..., ...ACTAGT..., ...GGTAC, ...GAATT (score = 0.0001) |
6 | sevenGC | Penalize candidates containing a run of 7 C or G (score = 0.01) |
7 | stemLoopStem | Penalize candidates that can form an internal stem-loop (score = 0.1) (minimum stem length = 5, minimum loop size = 4) |
8 | threePrimeClamp6 | Give precedence to candidates with weaker base-pairing at positions 15-20 (priority on pos. 17-19); score = 5 if all 6 positions are A or T, decreasing to 0.1 if all 6 are G/C. Score drops off steeply as the number of A/T bases decreases. |
Rule Set 7
Rule | Name | Description |
---|---|---|
1 | aaStart | Penalize candidates beginning with AA (score = 0.000000000000001) |
2 | fivePrimeClamp | Give precedence to candidates with stronger base-pairing at the 5' end; score = 0.01 if the first two positions are GG; 0.0001 if the first two are TT; 2.5 if the first four are all G/C; 2.4 if the first three are all G/C; 2.2 if the candidate begins with CC, CG, or GC followed by A/T and then G/C; 2 if it begins with CC, CG, or GC; 2 if it begins with GC; 1.25 if it begins with G/C; 1 if it begins with A/T followed by G/C; 0.5 if it begins with two A/T bases |
3 | fourRow | Penalize candidates containing four of the same base in a row (score = 0.01) |
4 | gcScore | Penalize candidates with extreme GC percentage (GC \< 30% or > 70%; score = 0.01); promote candidates with GC between 30-50% (score = 3); if GC > 60% and \< 70% then score = 1 |
5 | internalAT | Reward moderately AT-rich sequence at positions 7 through 10: score = 2.2 if all four are A/T; 2 if 3 of 4 are A/T; 1.5 if 2 of 4 are A/T; 0.7 if 1 of 4 is A/T; 0.5 if none of the four is A/T |
6 | internalATFlanking | Reward A/T at positions 6 and 11: score = 1.2 if both are A/T; 1 if exactly one is A/T; 0.85 if neither is A/T |
7 | internalLoop | Penalize candidates that can form an AAABBB loop (score = 0.7) |
8 | nonGATC | Penalize candidates containing an ambiguous base (e.g. N) (score = 0.000000000000001) |
9 | restrictionSite | GCCGGC, CCCGGG, CTCGAG, ...GCCGG |
10 | sevenGC | Penalize candidates containing a run of 7 C or G (score = 0.01) |
11 | threePrimeClamp6 | Give precedence to candidates with weaker base-pairing at positions 15-20 (priority on pos. 17-19); score = 5 if all 6 positions are A or T, decreasing to 0.1 if all 6 are G/C. Score drops off steeply as the number of A/T bases decreases. |
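The internalAT and internalATFlanking rules in the table above reduce to counting A/T bases at fixed positions and looking up a score. A minimal sketch follows, assuming that positions are 1-based from the 5' end of the 21mer and that sequences are uppercase. The function and table names are hypothetical.

```python
from typing import Iterable

# Score by number of A/T bases at positions 7-10 (internalAT).
INTERNAL_AT_SCORES = {4: 2.2, 3: 2.0, 2: 1.5, 1: 0.7, 0: 0.5}
# Score by number of A/T bases at positions 6 and 11 (internalATFlanking).
FLANKING_AT_SCORES = {2: 1.2, 1: 1.0, 0: 0.85}


def count_at(seq: str, positions: Iterable[int]) -> int:
    """Count A/T bases at the given 1-based positions (assumed numbering)."""
    return sum(seq[pos - 1] in "AT" for pos in positions)


def internal_at(seq: str) -> float:
    return INTERNAL_AT_SCORES[count_at(seq, range(7, 11))]


def internal_at_flanking(seq: str) -> float:
    return FLANKING_AT_SCORES[count_at(seq, (6, 11))]
```

Lookup tables keep the published scores in one place, so a different rule set would only need different dictionaries.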
Rule Set 4
Rule | Name | Description |
---|---|---|
1 | aaStart | Penalize candidates beginning with AA (score = 0.000000000000001) |
2 | fivePrimeClamp | Give precedence to candidates with stronger base-pairing at the 5' end; score = 0.01 if the first two positions are GG; 0.0001 if the first two are TT; 2.5 if the first four are all G/C; 2.4 if the first three are all G/C; 2.2 if the candidate begins with CC, CG, or GC followed by A/T and then G/C; 2 if it begins with CC, CG, or GC; 2 if it begins with GC; 1.25 if it begins with G/C; 1 if it begins with A/T followed by G/C; 0.5 if it begins with two A/T bases |
3 | fourRow | Penalize candidates containing four of the same base in a row (score = 0.01) |
4 | gcScore | Penalize candidates with extreme GC percentage (GC \< 30% or > 70%; score = 0.01); promote candidates with GC between 30-50% (score = 3); if GC > 60% and \< 70% then score = 1 |
5 | internalAT | Reward moderately AT-rich sequence at positions 7 through 10: score = 2.2 if all four are A/T; 2 if 3 of 4 are A/T; 1.5 if 2 of 4 are A/T; 0.7 if 1 of 4 is A/T; 0.5 if none of the four is A/T |
6 | internalATFlanking | Reward A/T at positions 6 and 11: score = 1.2 if both are A/T; 1 if exactly one is A/T; 0.85 if neither is A/T |
7 | internalLoop | Penalize candidates that can form an AAABBB loop (score = 0.7) |
8 | nonGATC | Penalize candidates containing an ambiguous base (e.g. N) (score = 0.000000000000001) |
9 | sevenGC | Penalize candidates containing a run of 7 C or G (score = 0.01) |
10 | threePrimeClamp | Give precedence to candidates with weaker base-pairing at the 3' end; score = 5 if the last three positions are A/T; 4.5 if the last two are A/T, the third from last is G/C, and the fourth from last is A/T; 4 if the last two are A/T; 2 if the last base is A/T; 0.2 if the last two positions are G/C; 0.5 if the last base is G/C; 0.8 if the last base is G/C and the previous two are A/T |