How to use the CRISPR Gene Scoring Tool
Tool Overview
This analysis tool can be used to rank genes for genetic perturbation screens. The tool takes a list of perturbations and associated numerical scores as input and computes a score using one of two statistical methods: negative binomial distribution (with replacement) or hypergeometric distribution (without replacement). These methods and their corresponding parameters are described in more detail in the Statistical Methods and Running Parameters sections below.
User Guide
The following section is intended as a tutorial for new users and includes sample inputs, outputs and running parameters for both kinds of analysis. For more detail on how the analysis works as well as the general formats of the inputs and outputs, see the related sections below.
- Sample input file: CRISPR Gene Scoring Sample Data Input File (a.k.a. "PoolQ scores matrix" file)
- Sample chip file: CRISPR Gene Scoring Sample Annotation .chip File
- Sample output files (.zip): CRISPR Gene Scoring Sample Output Files
After downloading the files listed above, perform the following steps and choose the following parameters for the given analysis in order to get matching output and ensure that you are using the tool properly. Note: the parameters listed are default or standard given the examples provided though you may choose to run with different parameters depending on the nature of your own experimental data.
Data and Annotation File Input
These input steps are common to either type of analysis used.
- Select the "Choose File" option next to the "Chip File" option and load the "crispr_gene_scoring_sample_annotation.chip" file from its location on your computer after downloading it from the link above. If you know the CP#, you may instead select a .chip file from the dropdown provided after first selecting either the "strict" or "lax" version of the file; these refer to the method used to match the individual sgRNA sequences to genes in the reference genome. A successful upload should result in the following appearing in the form (note that these values are specific to the .chip file given in the example and could differ from those for the .chip file used for your own experimental data):
- Total # guides: 86329
- Average # guides/gene: 4.15
- Total # one-intergenic-site control guides: 30
- Total # no-site control guides: 941
- Select the "Choose File" option next to the "Data File" field that appears after .chip file selection and upload the "crispr_gene_scoring_sample_input.txt" data file from its location on your computer after downloading it from the link above.
Negative Binomial Distribution (STARS): Instructions and Parameters
- After loading input files as indicated above, select "Negative Binomial (STARS)" from the resulting "Analysis Type" radio button.
- Select the following parameter options to match sample output (note that your own parameters will vary depending on your experimental data).
- Directionality of Scores: Both
- Threshold %: 10
- Include first barcode in calculation: No
- Click the "Submit" button. Analysis on this data set should take around 10 minutes to run, though this will likely differ for your own experimental data analysis.
- Download your results and compare the file that starts with "A375_STARSOutput_P" to the "sample_STARS_output_P.txt" file, and the file that starts with "A375_STARSOutput_N" to the "sample_STARS_output_N.txt" file. The sample files are both found in the "crispr_gene_scoring_sample_output.zip" downloaded from the link above. Note that there may be very subtle differences in the scores calculated, though rank order should be maintained between runs.
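The rank-order comparison in the last step can be automated. The following is a minimal sketch, not part of the tool: it assumes the output files are tab-delimited, start with a header row, list the gene identifier in the first column, and are already sorted by rank. The function names and the column names in the sample text are hypothetical.

```python
import csv
import io

def gene_order(tsv_text):
    """Return the first-column gene identifiers in file order.

    Assumes tab-delimited text whose first row is a header (an assumption,
    not something the tool documentation states).
    """
    rows = [row for row in csv.reader(io.StringIO(tsv_text), delimiter="\t") if row]
    return [row[0] for row in rows[1:]]

def same_rank_order(text_a, text_b):
    """True if both files list the same genes in the same order."""
    return gene_order(text_a) == gene_order(text_b)

# Hypothetical file contents: scores differ slightly, rank order does not.
run_1 = "Gene\tScore\nGENE_A\t5.21\nGENE_B\t3.10\nGENE_C\t1.02\n"
run_2 = "Gene\tScore\nGENE_A\t5.19\nGENE_B\t3.12\nGENE_C\t1.02\n"
print(same_rank_order(run_1, run_2))
```

To compare real files, read each one with `open(path).read()` and pass the resulting strings to `same_rank_order`.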
Hypergeometric Distribution Tool: Instructions and Parameters
- After loading input files as indicated above, select "Hypergeometric" from the resulting "Analysis Type" radio button.
- Select the following parameter options to match sample output (note that your own parameters will vary depending on your experimental data).
- Top n% guides per gene: 100 (default)
- Create negative control "dummy" genes w/ n guides per gene: 4 (autofilled, calculated based on avg. number guides/gene for selected pool)
- # of genes to label: 4
- Min # of guides required per gene: 4
- Max # of guides required per gene: 8
- Display no-site controls: leave checked (default)
- Display one-intergenic-site-controls: leave checked (default)
- Click the "Submit" button. Analysis on this data set should take 1-2 minutes to run, though this will likely differ for your own experimental data analysis.
- Download your results and compare the resulting .txt file to the "sample_hypergeom_output.txt" file and the resulting .pdf file with the "sample_hypergeom_output_plot.pdf". The sample files are found in the "crispr_gene_scoring_sample_output.zip" downloaded from the link above. Note that there may be very subtle differences in the scores calculated though rank order should be maintained between runs.
Statistical Methods
There are two statistical methods to choose from for your analysis. Their descriptions are as follows:
- Negative Binomial Distribution (STARS)
- The STARS score is calculated using the probability mass function of a binomial distribution. The calculation is performed for all perturbations that rank above a user-defined threshold, e.g. the top x% of perturbations from a ranked list. The value of the least probable perturbation for each gene is then assigned to the gene as the STARS score. Unless specified otherwise, STARS requires that at least two perturbations rank above the user-defined threshold for a gene to receive a STARS score. Permutation testing is also performed on the list of perturbations used in the experiment to generate a null distribution, allowing the calculation of p-values and false discovery rates (FDR) for hit genes. STARS also provides separate outputs for sgRNAs ranked in the ascending and descending directions.
- Hypergeometric Distribution
- In this method, the rank of sgRNAs is used to calculate gene p-values using the probability mass function of a hypergeometric distribution. The list of sgRNAs can be ranked in both ascending and descending directions and the resulting p-values will be different in each direction. We choose to resolve this by calculating the average -log10(p-value) in both directions and picking the more significant one. The top n% of sgRNAs per gene can be used to calculate the average p-value with this method. The average log-fold change per gene is also reported and this can be used to assess the magnitude of effect.
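The hypergeometric method described above can be illustrated with a small sketch. The exact parameterization used by the tool is not given here, so the following is an assumption: for the k-th best-ranked guide of a gene, sitting at overall rank r among N guides, the p-value is taken as the probability of drawing exactly k of the gene's guides in r draws without replacement. All function and variable names are hypothetical.

```python
from math import ceil, comb, log10

def hypergeom_pmf(k, population, successes, draws):
    """P(exactly k successes in `draws` draws without replacement)."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))

def mean_neg_log_p(scores, gene_of, gene, top_fraction=1.0, descending=False):
    """Average -log10(p) over the top fraction of a gene's guides in one direction."""
    ordered = sorted(scores, key=scores.get, reverse=descending)
    total = len(ordered)
    # 1-based overall ranks of this gene's guides, best-ranked first.
    ranks = [i + 1 for i, guide in enumerate(ordered) if gene_of[guide] == gene]
    n_guides = len(ranks)
    keep = max(1, ceil(top_fraction * n_guides))
    values = [-log10(hypergeom_pmf(k, total, n_guides, r))
              for k, r in enumerate(ranks[:keep], start=1)]
    return sum(values) / len(values)

def gene_score(scores, gene_of, gene, top_fraction=1.0):
    """Score both directions and keep the more significant average."""
    ascending = mean_neg_log_p(scores, gene_of, gene, top_fraction, descending=False)
    descending = mean_neg_log_p(scores, gene_of, gene, top_fraction, descending=True)
    return max(ascending, descending)

# Toy data: gene A is strongly depleted, gene C sits mid-list.
scores = {"a1": -3.0, "a2": -2.5, "b1": -0.1, "b2": 0.2,
          "c1": 0.0, "c2": 0.1, "d1": 0.3, "d2": -0.2}
gene_of = {guide: guide[0].upper() for guide in scores}
print(gene_score(scores, gene_of, "A"), gene_score(scores, gene_of, "C"))
```

In this toy data set gene A scores higher than gene C because both of its guides sit at the extreme of the ascending ranking. The sketch uses exact integer binomial coefficients, which is convenient for small examples but slow for a full library.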
Running Parameters
- Negative Binomial Distribution (STARS)
- Directionality of scores: Use "Positive" if the best perturbation has the highest/most positive value (enrichment) and "Negative" if the best perturbation has the lowest/most negative value (depletion). Use "Both" if you want the screen to be analyzed in both directions.
- Threshold: A number ranging from 0 to 100. This is the top x% of ranked sgRNAs for which a STARS score will be calculated. A value of 10 is standard, but a different value can be chosen based on the signal in the particular biological assay.
- Include first barcode in calculation: Specify whether the first ranking perturbation for each gene should be used in the calculation of STARS score.
- Hypergeometric distribution
- Top n% guides per gene: n% guides per gene to be used in the calculation of average p-value and average log-fold change. The default value is 100% which uses all guides to calculate average p-value and average log-fold change.
- Create negative control "dummy" genes with n guides per gene: Number of control guides to be grouped into each "dummy" control gene. The average number of guides per gene in the library is the suggested value.
- # of genes to label: The number of genes to be labeled on the enrichment and depletion sides of the volcano plot.
- Min. # of guides required per gene: Minimum number of guides required for a gene to be plotted on the volcano plot.
- Max. # of guides required per gene: Maximum number of guides a gene may have in order to be plotted on the volcano plot.
- Display no-site controls: Determines whether no-site controls will be displayed on the resulting volcano plot.
- Display one-intergenic-site controls: Determines whether one-intergenic-site controls will be displayed on the resulting volcano plot.
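To show how the three STARS parameters above (directionality, threshold, and inclusion of the first barcode) interact, here is a simplified sketch. It is not the STARS implementation: the binomial parameterization (success probability taken as overall rank divided by total guides, evaluated for the i-th ranked guide of a gene with n guides) is an assumption, permutation testing is omitted, and all names are hypothetical.

```python
from math import comb, log10

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def stars_like_score(scores, gene_of, gene, direction="Positive",
                     threshold_pct=10, include_first=False):
    """Return -log10 of the least probable value, or None if the gene gets no score."""
    # "Positive" ranks the highest values first; "Negative" ranks the lowest first.
    ordered = sorted(scores, key=scores.get, reverse=(direction == "Positive"))
    total = len(ordered)
    cutoff = total * threshold_pct / 100
    ranks = [i + 1 for i, guide in enumerate(ordered) if gene_of[guide] == gene]
    n_guides = len(ranks)
    first = 1 if include_first else 2  # skip the top guide unless requested
    values = [binom_pmf(i, n_guides, r / total)
              for i, r in enumerate(ranks, start=1)
              if i >= first and r <= cutoff]
    return -log10(min(values)) if values else None

# Toy data: 10 guides, 2 per gene; a 40% threshold keeps overall ranks 1 to 4.
scores = {"a1": 5.0, "a2": 4.0, "b1": 3.0, "b2": -1.0, "c1": 2.0,
          "c2": 0.5, "d1": 0.0, "d2": -0.5, "e1": -2.0, "e2": -3.0}
gene_of = {guide: guide[0].upper() for guide in scores}
print(stars_like_score(scores, gene_of, "A", threshold_pct=40))
print(stars_like_score(scores, gene_of, "B", threshold_pct=40))
print(stars_like_score(scores, gene_of, "B", threshold_pct=40, include_first=True))
```

Gene B has only one guide above the threshold, so it receives a score only when the first barcode is included. Choosing "Both" corresponds to running the sketch once with "Positive" and once with "Negative".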
Input Formats
- Chip File
- A .txt file with the first column listing the individual sgRNAs and the second column listing the gene identifiers of the sgRNA targets.
- Data File
- A .txt file with the first column listing the sgRNAs as specified in the first column of the chip file provided and the consecutive columns listing the numerical inputs for each condition.
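Before uploading, it can help to check that the two files agree with each other. The sketch below assumes both files are tab-delimited and begin with a header row, neither of which is stated above, and it keeps only one gene per sgRNA for simplicity. The summary values it computes are illustrative and are not necessarily calculated the same way as the numbers shown in the form. All names and sample contents are hypothetical.

```python
import csv
import io

def read_rows(text, has_header=True):
    """Parse tab-delimited text into rows, optionally dropping a header row."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    return rows[1:] if has_header else rows

def chip_summary(chip_text):
    """Return (sgRNA -> gene map, total guides, average guides per gene)."""
    guide_to_gene = {row[0]: row[1] for row in read_rows(chip_text)}
    genes = set(guide_to_gene.values())
    return guide_to_gene, len(guide_to_gene), len(guide_to_gene) / len(genes)

def read_data(data_text, guide_to_gene):
    """Return (condition names, sgRNA -> list of scores), checking sgRNAs against the chip."""
    header = read_rows(data_text, has_header=False)[0]
    scores = {}
    for row in read_rows(data_text):
        if row[0] not in guide_to_gene:
            raise ValueError(f"sgRNA {row[0]!r} is not in the chip file")
        scores[row[0]] = [float(value) for value in row[1:]]
    return header[1:], scores

# Hypothetical file contents.
chip_text = "sgRNA\tGene\nAAAC\tGENE_A\nAAAG\tGENE_A\nCCCA\tGENE_B\nCCCT\tGENE_B\n"
data_text = "sgRNA\tcond_1\tcond_2\nAAAC\t-1.5\t0.2\nAAAG\t-2.0\t0.1\nCCCA\t0.3\t0.0\nCCCT\t0.1\t-0.4\n"

guide_to_gene, total_guides, avg_per_gene = chip_summary(chip_text)
conditions, scores = read_data(data_text, guide_to_gene)
print(total_guides, avg_per_gene, conditions)
```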
Output File Details
Notes on target matching in CRISPR chip files and output files
In addition to the rows indicating target (gene) matches, a CRISPR chip file may also contain negative, or "non-target", information about a construct. There are two broad types of non-target indicators:
- Non-Target designations discovered via genome search
- NO_SITE
- No CFD100 matches anywhere in the target genome. (CFD100 means the sequence match site has a 100% CFD score (Doench, Fusi et al., Nature Biotechnology 2016). Because CFD scoring allows some base mismatches at certain positions and ignores the initial three bases of the target sequence (for SpCas9 and SaCas9 systems), a 100% CFD score does not necessarily indicate a full-length perfect match of the target sequence to a genomic location.)
- ONE_INTERGENIC_SITE
- Exactly one CFD100 match in the target genome, but in a "desert of function" region.
- MULTIPLE_INTERGENIC_SITES
- More than one CFD100 match in the target genome, but all in "desert of function" regions.
- POTENTIALLY_ACTIVE
- No perfect match to any gene, but either:
- At least one CFD100 match to a gene or very close by, or in an intron or non-coding region of a gene.
- Or, the low-level off-target search "MAXed out", i.e., aborted when it found too many results (usually this limit is set to 10,000). In these cases, we do not assess CFD scores for the results, but consider that MAXing out in itself is an indicator that the guide is potentially active.
- May be the result of annotation drift or skew; the guide may still hit a legitimate target.
- Non-Target designations intrinsic to the construct's own sequence
- INACTIVE_4T, INACTIVE_5T, INACTIVE_6T+
- Ineffective guide design (contains TTTT+).
- NOTE: this may co-occur with other hit or non-hit records in the chip file.
Numeric Counter Suffixes on Non-Target Codes
You will notice that the non-target codes mentioned above do not appear in chip files in their "bare" form. Instead you will find e.g. "NO_SITE_192" or "INACTIVE_6T+_32". The reason for these appended "counter" digits is simply to ensure uniqueness so that e.g. the top hit in your screen doesn't end up being "NO_SITE". The actual numeric suffix value is not significant, nor is it stable over time. That is, today a barcode may be associated with "ONE_INTERGENIC_SITE_120" and tomorrow, after an updated run of the chip file generator, the same barcode may instead be given the code "ONE_INTERGENIC_SITE_119", due to a change somewhere higher up in the file.
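Because the numeric suffix carries no meaning, downstream scripts may want to group identifiers by their base code. A minimal sketch follows, assuming the suffix is always an underscore followed by digits, as in the examples above. The function name is hypothetical.

```python
import re

# Base non-target codes listed in this section.
NON_TARGET_CODES = (
    "NO_SITE",
    "ONE_INTERGENIC_SITE",
    "MULTIPLE_INTERGENIC_SITES",
    "POTENTIALLY_ACTIVE",
    "INACTIVE_4T",
    "INACTIVE_5T",
    "INACTIVE_6T+",
)

# A known base code, then "_" and a run of digits, and nothing else.
_SUFFIXED = re.compile(
    r"^(" + "|".join(re.escape(code) for code in NON_TARGET_CODES) + r")_\d+$"
)

def base_non_target_code(identifier):
    """Return the base non-target code, or None for ordinary gene identifiers."""
    match = _SUFFIXED.match(identifier)
    return match.group(1) if match else None

for name in ("NO_SITE_192", "INACTIVE_6T+_32", "ONE_INTERGENIC_SITE_120", "BRAF"):
    print(name, "->", base_non_target_code(name))
```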
Notes on Output File columns
This tool generates a separate output file for each data column in your input file. The column name is included in the output file name.
- Negative Binomial Distribution (STARS)
- Only the genes with at least 2 perturbations ranking above the threshold will receive a STARS score and be reported in the output file. If the first perturbation is included in the calculation of the STARS score, all the genes with at least one perturbation ranking above the specified threshold will receive a STARS score and be reported in the output file. The output file contains 10 columns as follows:
- Gene identifier, from column 2 of the chip file
- Number of perturbations targeting the gene
- Ranks of perturbations targeting the gene
- Identity of perturbations
- Within-gene-rank of the least probable perturbation
- STARS score: -log10(value of the least probable perturbation)
- Average score: Average of negative log of the values of all perturbations ranking above threshold
- P-values calculated using the null distribution specified
- False Discovery Rate (FDR) calculated using permutation testing
- q-value
- Hypergeometric distribution
- All the genes in the library will be reported in the output file along with a .pdf of the volcano plot. The output file contains 10 columns as follows:
- Gene identifier, from column 2 of the chip file
- Average log-fold change of n% guides per gene
- Average -log10(p-value) of n% guides per gene
- Number of perturbations targeting the gene
- Identity of perturbations; perturbations listed according to individual rankings in ascending order
- Individual log-fold changes of the perturbations
- Ranks of the individual perturbations in the ascending direction
- -log10(p-values) of the individual perturbations in the ascending direction
- Ranks of the individual perturbations in the descending direction
- -log10(p-values) of the individual perturbations in the descending direction