How to use the CRISPR Gene Scoring Tool

Download: CRISPR Screen Analysis Tool README.


Tool Overview

This analysis tool ranks genes in genetic perturbation screens. It takes a list of perturbations and associated numerical scores as input and computes a gene-level score using one of two statistical methods: negative binomial distribution (with replacement) or hypergeometric distribution (without replacement). These methods and their corresponding parameters are described in more detail in the sections below.

User Guide

The following section is intended as a tutorial for new users and includes sample inputs, outputs and running parameters for both kinds of analysis. For more detail on how the analysis works as well as the general formats of the inputs and outputs, see the related sections below.

After downloading the files listed above, follow the steps and choose the parameters given for each analysis. If your output matches the sample output, you are using the tool properly. Note: the parameters listed are the default or standard choices for the examples provided, though you may choose different parameters depending on the nature of your own experimental data.

Data and Annotation File Input

These input steps are common to both types of analysis.

Negative Binomial Distribution (STARS): Instructions and Parameters

Hypergeometric Distribution Tool: Instructions and Parameters

Statistical Methods

There are two statistical methods to choose from for your analysis. Their descriptions are as follows:

Negative Binomial Distribution (STARS)
The STARS score is calculated using the probability mass function of a binomial distribution. The calculation is performed for all perturbations that rank above a user-defined threshold, e.g. the top x% of perturbations from a ranked list. The value of the least probable perturbation for each gene is then assigned to the gene as the STARS score. Unless otherwise specified, STARS requires that at least two perturbations rank above the user-defined threshold for a gene to receive a STARS score. Permutation testing is also performed on the list of perturbations used in the experiment to generate a null distribution, allowing the calculation of p-values and false discovery rates (FDR) for hit genes. STARS also provides separate outputs for sgRNAs ranked in the ascending and descending directions.
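
The scoring step described above can be sketched as follows. This is a minimal illustration, not the tool's actual code. The function names, the 10% default threshold, and especially the choice of binomial parameters (n = number of perturbations targeting the gene, p = overall rank divided by the total number of perturbations, k = within-gene rank) are assumptions made for the example. Permutation testing for p-values and FDR is not shown.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def stars_like_score(gene_ranks, total, threshold_frac=0.1, min_hits=2):
    """Hypothetical STARS-style score for one gene.

    gene_ranks: 1-based overall ranks of the gene's perturbations.
    total: number of perturbations in the ranked list.
    Returns None if fewer than `min_hits` perturbations pass the threshold.
    """
    cutoff = threshold_frac * total
    n = len(gene_ranks)
    hits = sorted(r for r in gene_ranks if r <= cutoff)
    if len(hits) < min_hits:
        return None
    # Assumed parameterization: the i-th best perturbation of the gene,
    # found at overall rank r, is scored as Binomial(n, r/total).pmf(i).
    values = [binom_pmf(i, n, r / total) for i, r in enumerate(hits, start=1)]
    # The least probable value is assigned to the gene, reported as -log10.
    return -math.log10(min(values))

# A gene with 4 perturbations, two of which fall in the top 10% of 1000.
score = stars_like_score([1, 2, 500, 900], total=1000)
```

Setting `min_hits=1` in this sketch mimics the option of scoring genes that have only a single perturbation above the threshold.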
Hypergeometric Distribution
In this method, the rank of sgRNAs is used to calculate gene p-values using the probability mass function of a hypergeometric distribution. The list of sgRNAs can be ranked in both ascending and descending directions and the resulting p-values will be different in each direction. We choose to resolve this by calculating the average -log10(p-value) in both directions and picking the more significant one. The top n% of sgRNAs per gene can be used to calculate the average p-value with this method. The average log-fold change per gene is also reported and this can be used to assess the magnitude of effect.
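
As a rough sketch of the two-direction calculation described above, assuming a particular parameterization that the text does not spell out: the population is the full list of sgRNAs, the "successes" are the sgRNAs targeting the gene, the number of draws is the overall rank of an sgRNA, and the observed count is its within-gene rank. The function names and the rule for how many guides make up the "top n%" are also hypothetical.

```python
import math

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k successes in n draws without replacement
    from N items, K of which are successes."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

def direction_score(ranks, total, top_frac=1.0):
    """Average -log10(p) over the best-ranked fraction of a gene's guides."""
    ranks = sorted(ranks)
    K = len(ranks)
    keep = max(1, math.ceil(top_frac * K))  # assumed rounding rule
    neg_logs = [
        -math.log10(hypergeom_pmf(i, total, K, r))
        for i, r in enumerate(ranks, start=1)
    ]
    return sum(neg_logs[:keep]) / keep

def gene_score(ranks, total, top_frac=1.0):
    """Score both ranking directions and keep the more significant one."""
    ascending = direction_score(ranks, total, top_frac)
    # In the descending direction, rank r becomes total + 1 - r.
    descending = direction_score([total + 1 - r for r in ranks], total, top_frac)
    return max(ascending, descending)

# Three guides at the very top of a 100-guide list.
top_gene = gene_score([1, 2, 3], total=100)
```

Because the larger of the two directional averages is kept, a gene whose guides sit at the bottom of the list scores the same as one whose guides sit at the top. The average log-fold change is what tells the two cases apart.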

Running Parameters

Input Formats

Chip File
A .txt file with the first column listing the individual sgRNAs and the second column listing the gene identifiers of the sgRNA targets.
Data File
A .txt file with the first column listing the sgRNAs as specified in the first column of the chip file provided and the consecutive columns listing the numerical inputs for each condition.
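
To make the relationship between the two files concrete, here is a small sketch that joins them on the sgRNA column. It assumes both files are tab-delimited and start with a header row, neither of which is stated above. The sgRNA names, gene names, and condition names in the example are invented.

```python
import csv
import io

def read_table(handle):
    """Parse tab-delimited text into (header, rows), skipping blank lines."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    return header, [row for row in reader if row]

def join_chip_and_data(chip_handle, data_handle):
    """Attach gene identifiers from the chip file to each data row."""
    _, chip_rows = read_table(chip_handle)
    guide_to_genes = {}
    for guide, gene, *_ in chip_rows:
        # A list is used in case an sgRNA appears on more than one chip row.
        guide_to_genes.setdefault(guide, []).append(gene)

    header, data_rows = read_table(data_handle)
    conditions = header[1:]  # every column after the sgRNA column
    records = []
    for guide, *values in data_rows:
        scores = dict(zip(conditions, map(float, values)))
        for gene in guide_to_genes.get(guide, []):
            records.append((guide, gene, scores))
    return conditions, records

# Invented example content standing in for the two .txt files.
chip_text = "sgRNA\tGene\nGUIDE_A\tGENE1\nGUIDE_B\tGENE1\nGUIDE_C\tGENE2\n"
data_text = "sgRNA\tcond1\tcond2\nGUIDE_A\t1.5\t-0.2\nGUIDE_C\t0.3\t2.0\n"
conditions, records = join_chip_and_data(
    io.StringIO(chip_text), io.StringIO(data_text)
)
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(path, newline="")`.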

Output File Details

Notes on target matching in CRISPR chip files and output files

In addition to the rows indicating target (gene) matches, a CRISPR chip file may also contain negative, or "non-target", information about a construct. There are two broad types of non-target indicators:

  1. Non-Target designations discovered via genome search
  2. Non-Target designations intrinsic to the construct's own sequence

Numeric Counter Suffixes on Non-Target Codes

You will notice that the non-target codes mentioned above do not appear in chip files in their "bare" form. Instead you will find e.g. "NO_SITE_192" or "INACTIVE_6T+_32". The reason for these appended "counter" digits is simply to ensure uniqueness so that e.g. the top hit in your screen doesn't end up being "NO_SITE". The actual numeric suffix value is not significant, nor is it stable over time. That is, today a barcode may be associated with "ONE_INTERGENIC_SITE_120" and tomorrow, after an updated run of the chip file generator, the same barcode may instead be given the code "ONE_INTERGENIC_SITE_119", due to a change somewhere higher up in the file.
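
Since the counter value is neither meaningful nor stable, downstream scripts may want to group rows by the bare code instead. The sketch below is one way to do that. The helper name is hypothetical, and the list of bare codes contains only the three that appear in the examples above, so it would need extending for any other codes in your chip file.

```python
import re

# Only the bare codes that appear in the examples above; not a complete list.
BARE_CODES = ("NO_SITE", "INACTIVE_6T+", "ONE_INTERGENIC_SITE")

# Anything followed by an underscore and a trailing run of digits.
_SUFFIXED = re.compile(r"^(?P<code>.+)_(?P<counter>\d+)$")

def bare_non_target_code(target):
    """Return the bare non-target code, or None for ordinary gene targets."""
    match = _SUFFIXED.match(target)
    if match and match.group("code") in BARE_CODES:
        return match.group("code")
    return None

examples = ["NO_SITE_192", "INACTIVE_6T+_32", "ONE_INTERGENIC_SITE_120", "TP53"]
stripped = [bare_non_target_code(t) for t in examples]
```

Checking against a list of known codes, rather than stripping any trailing digits, avoids mangling gene identifiers that happen to end in an underscore and a number.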

Notes on Output File columns

This tool generates a separate output file for each condition column in your data file. The column name is included in the output file name.

Negative Binomial Distribution (STARS)
By default, only genes with at least 2 perturbations ranking above the threshold receive a STARS score and are reported in the output file. If the first perturbation is used to calculate the STARS score, every gene with at least one perturbation ranking above the specified threshold receives a STARS score and is reported in the output file. The output file contains 10 columns as follows:
  1. Gene identifier, from column 2 of the chip file
  2. Number of perturbations targeting the gene
  3. Ranks of perturbations targeting the gene
  4. Identity of perturbations
  5. Within-gene-rank of the least probable perturbation
  6. STARS score: -log10(value of least probable perturbation)
  7. Average score: Average of negative log of the values of all perturbations ranking above threshold
  8. P-values calculated using the null distribution specified
  9. False Discovery Rate (FDR) calculated using permutation testing
  10. q-value
Hypergeometric Distribution
All the genes in the library will be reported in the output file along with a .pdf of the volcano plot. The output file contains 10 columns as follows:
  1. Gene identifier, from column 2 of the chip file
  2. Average log-fold change of the top n% of guides per gene
  3. Average -log10(p-value) of the top n% of guides per gene
  4. Number of perturbations targeting the gene
  5. Identity of perturbations; perturbations listed according to individual rankings in ascending order
  6. Individual log-fold changes of the perturbations
  7. Ranks of the individual perturbations in the ascending direction
  8. -log10(p-values) of the individual perturbations in the ascending direction
  9. Ranks of the individual perturbations in the descending direction
  10. -log10(p-values) of the individual perturbations in the descending direction