-a-

admin tab: A link at the top of all pages providing access to the sandbox page, the user profile page, the cmap website's privacy statement, and the cmap's terms and conditions of use.

amplitude: A measure of the extent of differential expression of a given probe set. Amplitude a is defined as follows:

... where t is the scaled and thresholded average difference value for the treatment and c is the thresholded average difference value for the control. Thus, a=0 indicates no differential expression, a>0 indicates increased expression upon treatment, and a<0 indicates decreased expression upon treatment. For example, an amplitude of 0.67 represents a two-fold induction. The amplitudes for each probe set in a pair of tag lists in a given instance are reported in the table popup, as "amplitude."
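
The formula itself does not appear in the text above. The following is a minimal sketch, assuming that amplitude is the treatment-control difference divided by the mean of the two values. That form is an assumption, chosen because it reproduces the properties stated above (a=0 for no change, and roughly 0.67 for a two-fold induction). The function name is hypothetical.

```python
def amplitude(t: float, c: float) -> float:
    """Assumed form: (t - c) divided by the mean of t and c.

    t and c are the thresholded average difference values for the
    treatment and the control. The formula is not confirmed by the text.
    """
    mean = (t + c) / 2.0
    if mean == 0:
        raise ValueError("t and c cannot both be zero")
    return (t - c) / mean


print(round(amplitude(200.0, 100.0), 2))  # two-fold induction
print(amplitude(100.0, 100.0))            # no differential expression
```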

ATC code: The Anatomical Therapeutic Chemical classification is a hierarchical coded annotation scheme for pharmaceutical substances maintained by The World Health Organization and used by cmap to aid interpretation of results. A full description of the ATC system can be found at http://www.whocc.no/atcddd/.

ATC codes are displayed on the detailed result page, where available, keyed on cmap name. The permutation p values for all sets of instances whose perturbagens share a level-4 ATC code (eg N05ACXX) are displayed under the by ATC code tab on the permuted result page.

The current build uses the 2008 version of ATC.

see how to interpret a result

ATC popup: A window displaying the text description of an ATC code or ATC code fragment and the list of cmap perturbagens matching that ATC code or fragment.

The ATC popup is spawned by clicking on an ATC code on the detailed result page or by clicking on an ATC code fragment displayed under the by ATC code tab on the permuted result page.

average difference: A measure of the relative level of a transcript monitored by a particular probe set.

see expression values

-b-

barview: A graphical representation of a result displayed on the detailed result page and the result detail window accessed from the permuted result page. It is especially useful for assessing the distribution of all instances of a particular perturbagen, batch or ATC code in an ordered list of instances.

The barview is constructed from a large number of colored horizontal lines, each representing one instance in the result. The lines are sorted in descending order of the connectivity score and up score of their corresponding instances and colored on the basis of the sign of those connectivity scores: green for positive, red for negative and gray for null, by default. Lines representing instances selected using the shading tool on the detailed result page, or the set of selected instances in the permuted result page, are colored black.

On the detailed result page, the position in the full table of the portion of the table displayed is shown by the arrow head to the left of the barview. One may jump to any location in the table by clicking on that location in the barview. Changing the order of instances with the sort buttons in the column headers does not affect the barview.

see how to interpret a result

barview icon: The icon displayed on each row of the tables on the permuted result page that provides access to the result detail window.

batch: The set of treatments and controls performed together. Each batch is uniquely identified by a batch number. Batches produced in six-well dishes contain four treatments, one calibrator compound and one vehicle control (and have batch numbers less than 500 or greater than 999). Batches produced in ninety-six-well dishes contain forty-one treatments, one calibrator compound and six vehicle controls.

build: A version of the cmap dataset and webtools. Each build is uniquely identified by the build number. New builds will be released intermittently as new data and tools become available. Data and tools may also be retired between builds. cmap names may change between builds as new INNs are adopted, or to prevent ambiguity. The current build (build_02) is used by default. Earlier builds can be selected at login.

by ATC code tab: One of three tabs found above the table on the permuted result page. It provides access to permutation p values for all sets of instances whose perturbagens share a level-4 ATC code (eg N05ACXX).

by name tab: One of three tabs found above the table on the permuted result page. It provides access to permutation p values for all sets of instances produced from the same perturbagen.

by name and cell line tab: One of three tabs found above the table on the permuted result page. It provides access to permutation p values for all sets of instances produced from the same perturbagen in the same cell line.

-c-

calibrator compound: A small molecule known to have strong connectivity with a particular query signature. One calibrator compound is included in every batch for quality control purposes.

.cel file: The raw-data file generated by scanning a single Affymetrix microarray. The cmap .cel files can be downloaded individually from the instance page or in batch mode (with the scan numbers from the instance page) with the .cel file download utility. Note that six .cel files provide the vehicle controls for instances in ninety-six-well batches. Their full scan numbers are not listed on the instance page but can be reconstructed by appending each of the six extensions provided (eg ".H01") to the twenty-two character number preceding the period from the corresponding perturbation scan number. The complete set of cmap .cel files can be downloaded in bulk from the download tab.
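
The reconstruction of vehicle scan numbers described above can be sketched as follows. This is an illustration only: the function name is hypothetical and the scan number in the example is made up, not a real cmap scan number.

```python
def vehicle_scan_numbers(perturbation_scan, extensions):
    """Rebuild full vehicle scan numbers from a perturbation scan number.

    Takes the twenty-two character number preceding the period and
    appends each extension (eg ".H01") to it.
    """
    base, sep, _ = perturbation_scan.partition(".")
    if not sep or len(base) != 22:
        raise ValueError("expected a 22-character number followed by a period")
    return [base + ext for ext in extensions]


# Made-up scan number, for illustration only (22 characters before the period).
example = "ABCDEFGHIJ0123456789XY.A05"
print(vehicle_scan_numbers(example, [".H01", ".H02"]))
```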

.cel file download utility: A utility for the batch-mode download of cmap .cel files.

Type or paste scan numbers into the text box and click the 'find files' button. The validity of the scan numbers will be checked and a report table displayed. Provide a file name and click the 'download zip' button. Use the download tab for bulk download of the cmap .cel files.

The .cel file download utility is accessed through the instance tab.

ChemBank: A database of small-molecule annotations, structures and synonyms maintained by the Broad Institute's Chemical Biology Program. ChemBank monographs can be accessed through the ChemBank options provided from the instance buttons on the detailed result page, the result detail window and the instance page. ChemBank can be accessed directly at http://chembank.broad.harvard.edu/.

cmap: Shorthand for 'Connectivity Map'.

cmap cell line: The panel of cell lines used by cmap includes MCF7, ssMCF7, HL60, PC3 and SKMEL5.

cmap name: The name given to a perturbagen (or group of closely related perturbagens) by cmap curators. For small molecules, the cmap name is always the recommended ('r') or provisional ('p') INN, if one is available. Otherwise a cmap name is selected from amongst the United States Adopted Name (USAN), the British Approved Name (BAN), and the monograph titles from Martindale: The Complete Drug Reference or The Merck Index. Different salts of the same compound are given the same cmap name, unless a specific salt is the INN. For example, the cmap name for both propiomazine hydrochloride and propiomazine maleate is "propiomazine" (which is the rINN) but the cmap name for isosorbide dinitrate is "isosorbide dinitrate" since this is the rINN.

cmap signature: see internal signature

confidence call: A measure of confidence in an average difference value. Confidence calls are 'P', 'M' and 'A' and are conventionally interpreted as 'present', 'marginal' and 'absent', respectively.

see expression values and .cel file.

connectivity score: A value between +1 and -1 representing the relative strength of a given signature in an instance from the total set of instances calculated upon execution of a query. The connectivity score is the "score" reported on the detailed result page and the result detail window. A high positive connectivity score indicates that the corresponding perturbagen induced the expression of the query signature. A high negative connectivity score indicates that the corresponding perturbagen reversed the expression of the query signature. A zero or 'null' connectivity score indicates that the corresponding perturbagen had no self-consistent effect upon expression of the query signature. Instances are rank ordered by descending connectivity score by default on the detailed result page. The top ranked instance (ie "score" = +1) is said to be the most positively connected with the query signature. The bottom ranked instance (ie "score" = -1) is the most negatively connected with the query signature. Note that the connectivity score is a relative value, and is a function of the composition of the set of instances upon which the query is executed. The connectivity score differs in this regard from the up scores and down scores, which are absolute values. The absolute strength of a given signature in a given instance can be gauged by the magnitude of the corresponding up score and down score. The connectivity score is a combination of the up score and down score.

see how connectivity score is calculated

contact us: A mailto link for the cmap user-support system found under the help tab.

-d-

data matrix: The matrix of ranks for every probe set in the feature set for each instance in cmap from which connectivity scores are calculated.

The data matrix can be downloaded from the download tab.

see how probe sets are ordered for an instance

detailed result page: A page displaying a selected result with tools to aid in its analysis and interpretation.

Load a result by making a selection from the 'select a result' dropdown menu. The full page will load automatically.

The number of instances upon which the query was executed is shown together with the signature link, providing access to the signature popup containing details for the signature used to generate the result, and the export link, allowing download of the full result accessed from this page. A link to the corresponding permuted result page is also provided.

The table displays the batch, the cmap name, the concentration, the cell line, the connectivity score, the up score, the down score, the ATC code for the perturbagen and the instance_id for all instances in the result, together with a fixed rank and a floating column containing a graphical representation of the result known as the barview. Source information, shading and a link to ChemBank can be accessed through the instance button. Clicking on an individual ATC code will spawn the ATC popup. Detailed analyses of the enrichment of the query signature in any one instance are provided with the plot popup and table popup, both also accessed through the instance button. Instances are ordered in descending order of connectivity score and up score, by default. The order of instances can be changed with the sort buttons in the column headers. One may scroll through the table using the scroll bar at the right of the page. The position in the full table of the portion of the table displayed is shown by the arrow head to the left of the barview. One may jump to any location in the table by clicking on that location in the barview. A table populated with only those instances selected with the shading tool can be displayed with the isolate shaded link located above the table.

The detailed result page is accessed through the result tab.

see how to interpret a result

down score: A value between +1 and -1 representing the absolute enrichment of a down tag list in a given instance. The down score is the "down" value reported on the detailed result page and the result detail window. A high positive down score indicates that the corresponding perturbagen induced the expression of the probe sets in the down tag list. A high negative down score indicates that the corresponding perturbagen repressed the expression of the probe sets in the down tag list. The connectivity score is a combination of the up score and down score.

see how connectivity score is calculated, up score

download tab: A link at the top of all pages providing access to bulk download of the cmap .cel files, the data matrix, the MSigDB tag lists used to calculate specificity and the instance inventory.

-e-

export link: A link on the detailed result page, permuted result page and instance page that allows download of the unabridged and summary views of the results, and the instance inventory, respectively.

expression values: The average difference values and confidence calls for each probe set derived from a single .cel file. These parameters are computed with Affymetrix Microarray Suite (MAS) version 5.0.

-f-

feature set: The universe of probe sets recognized by cmap, defined as the 22,283 probe sets on the HG-U133A array.

-g-

getting started page: A page containing a short and simple tutorial on cmap use. The getting started page is accessed through the help tab.

.grp file: see tag list, how to make a .grp file

-h-

help tab: A link at the top of all pages providing access to the getting started page, the topics page, the publication page and the contact us link.

HG-U133A: Affymetrix GeneChip Human Genome U133A Array (part number 510681). The probe sets on this array define the feature set.

HL60: Human promyelocytic cell line established by leukopheresis from promyelocytic leukemia (ATCC# CCL-240). HL60 cells are cultured in RPMI supplemented with 10% fetal bovine serum and 1% penicillin-streptomycin-glutamine. Further information can be found at http://www.atcc.org/common/catalog/numSearch/numResults.cfm?atccNum=CCL-240

HT_HG-U133A: Affymetrix GeneChip Human Genome HT U133A Array (part number 550002). Six probe sets on the HG-U133A are absent from this array. Data from these arrays are rendered compatible with the feature set by setting the average difference values and confidence calls for the missing probe sets to 0 and 'A', respectively.

HT_HG-U133A_EA: Affymetrix GeneChip Human Genome HT U133A Array, early access version (part number 520276). This array contains 676 probe sets not present on the HG-U133A array. Fifteen probe sets on the HG-U133A are absent. Data from these arrays are rendered compatible with the feature set by deleting expression values from the extra probe sets and setting the average difference values and confidence calls for the missing probe sets to 0 and 'A', respectively.

The library files for this array can be found at https://www.affymetrix.com/support/developer/tools/hta_tools.affx (where it is referred to as U133AAofAv2).
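
The way data from the HT arrays are made compatible with the feature set can be sketched as below. The function name, probe set identifiers and values are all hypothetical and serve only to illustrate the two rules described: extra probe sets are dropped, and missing probe sets receive an average difference of 0 and a confidence call of 'A'.

```python
def harmonize(expression, feature_set):
    """Map one array's expression values onto the feature set.

    expression: dict of probe set ID -> (average difference, confidence call).
    Probe sets outside the feature set are dropped; probe sets missing from
    the array are filled with (0, 'A').
    """
    return {ps: expression.get(ps, (0.0, "A")) for ps in feature_set}


# Hypothetical probe set IDs and values, for illustration only.
feature_set = ["1000_at", "1001_at", "1002_at"]
scan = {"1000_at": (512.3, "P"), "1002_at": (48.0, "M"), "extra_at": (75.1, "P")}
print(harmonize(scan, feature_set))
```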

-i-

INN: International Nonproprietary Name. The INNs are unique, distinctive and universal names for drug substances selected by The World Health Organization. More information can be found at http://www.who.int/medicines/services/inn/en/. The current build uses Lists 1-96 of Proposed INN and Lists 1-57 of Recommended INN Cumulative List No 12.

instance: A treatment and control pair and the list of probe sets ordered by their extent of differential expression between this treatment and control pair. The instance is the basic unit of data and metadata in cmap. Each instance is uniquely identified by an instance_id. Every instance has a number of attributes including a unique identifier (ie instance_id), the batch in which it was produced, the cmap name of the perturbagen, the source of that perturbagen, the concentration of that perturbagen, the cmap cell line used, and the scan numbers for the treatment and its control(s). All instances in the current build and their attributes are accessible from the instance page.

see how probe sets are ordered for an instance

instance button: The icon displayed on each row of the tables in the instance page, detailed result page and result detail window that provides access to additional information about, or the ability to provide shading of, the corresponding instance. Options from the instance buttons on the instance page are the source popup and a link to ChemBank. Options from the instance buttons on the detailed result page are shading, the plot popup, the table popup, the source popup and a link to ChemBank. Options from the instance buttons on the result detail window are the plot popup, the table popup, the source popup and a link to ChemBank.

instance page: A page displaying or providing access to all instances in the current build, their attributes and the corresponding raw expression data.

The total number of instances and an export link, allowing download of the instance inventory, are shown. All instances generated from a given perturbagen can be found by typing in the 'search' textbox and selecting a cmap name from the autocomplete menu that appears.

The table displays the batch, the cmap name, the concentration, the cell line, the instance_id and the scan numbers for the corresponding treatment and vehicle scans. Source information and a link to ChemBank can be accessed through the instance button. Download of individual .cel files is triggered by clicking on scan numbers. The order of instances can be changed with the sort buttons in the column headers.

Full scan numbers for vehicle scans from certain batches are not displayed. These may be reconstructed by appending each of the extensions shown (eg ".H01") in the 'vehicle scan' field to the twenty-two character number preceding the period from the corresponding perturbation scan number.

The instance page is accessed through the instance tab.

instance query page: A page that allows execution of a query with a signature produced on the fly from a number of selected instances. Such signatures are known as internal signatures or cmap signatures.

Use the textbox and dynamic tables to select up to three instances from which a signature is to be produced. Begin typing in the 'search' textbox, select a cmap name from the autocomplete menu and select an instance using the 'select this instance' button. Type an amplitude threshold value in the 'up threshold' textbox and the 'down threshold' textbox. The up tag list for the internal signature will be populated with probe sets that have amplitudes greater than or equal to the 'up tag threshold' in all selected instances. The down tag list will be populated with probe sets that have amplitudes less than or equal to the 'down tag threshold' in all selected instances. The number of probe sets selected with a given pair of tag threshold values can be seen by clicking the 'check tags' button. Execute the query by clicking the 'execute query' button.
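
The tag selection rule described above can be sketched as follows. The function name and the data layout (one dictionary of probe set to amplitude per selected instance) are assumptions made for illustration. The probe set names and amplitudes are invented.

```python
def internal_signature(instances, up_threshold, down_threshold):
    """Build up and down tag lists from the selected instances.

    instances: list of dicts mapping probe set ID -> amplitude.
    A probe set is an up tag if its amplitude is >= up_threshold in ALL
    instances, and a down tag if it is <= down_threshold in ALL instances.
    """
    if not instances:
        raise ValueError("select at least one instance")
    shared = set(instances[0]).intersection(*instances[1:])
    up = sorted(p for p in shared if all(i[p] >= up_threshold for i in instances))
    down = sorted(p for p in shared if all(i[p] <= down_threshold for i in instances))
    return up, down


# Invented amplitudes for two instances.
a = {"p1": 0.9, "p2": 0.7, "p3": -0.8, "p4": 0.1}
b = {"p1": 0.8, "p2": 0.2, "p3": -0.9, "p4": -0.7}
print(internal_signature([a, b], 0.67, -0.67))
```

Requiring the threshold to be met in every selected instance means that adding instances can only shrink the tag lists, never grow them.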

Successful execution will produce a page summarizing the query and a link labeled 'view results now' that is a shortcut to the permuted result page.

Note that the signature and the results generated with the instance query page are automatically named "temporary" and saved to the sandbox. However, these signatures and results are overwritten on subsequent execution of the quick query page or the instance query page.

Executing a query with signatures generated from fewer than three different instances, or with instances from the same batch, is not recommended.

The instance query page is accessed through the query tab.

see signature query page, quick query page

instance tab: A link at the top of all pages providing access to the instance page and the .cel file download utility.

internal signature: A signature derived from a number of instances. Also known as a cmap signature. Internal signatures are generated using the instance query page.

isolate shaded link: A link provided above the table in the detailed result page that spawns a window displaying only those instances already selected using the shading tool.

-j-

-k-

Kolmogorov-Smirnov statistic: The non-parametric rank statistic upon which the cmap analytic is based. A detailed description of the Kolmogorov-Smirnov statistic can be found in Nonparametric Statistical Methods (second edition) by Myles Hollander and Douglas Wolfe (1999).

see how connectivity score is calculated

-l-

load signature page: A page provided for the upload of a signature.

Type a name for the signature in the 'title' textbox and a description of the signature in the 'comment' textbox. Use the browse buttons on the 'up tags' and 'down tags' fields to locate the .grp files containing the up tag list and the down tag list, respectively. Use the browse buttons on the 'supporting documentation' fields to locate any supporting documentation, such as a .pdf of a paper or an .xls showing the probe set mapping scheme. The signature and its descriptions are uploaded with the 'load signature' button.

Successful execution will produce a page summarizing the signature and a link labeled 'execute query with this signature' that is a shortcut to the signature query page.

The tag lists are checked for invalid and duplicate probe sets, and to confirm that the total number of tags does not exceed one thousand. If any of these errors are found, uploading is halted and the errors are reported.
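
A check of this kind might look like the sketch below. It is not the cmap implementation. Counting the limit and the duplicates across the up and down tag lists combined is an assumption, as are the function name and the probe set identifiers.

```python
from collections import Counter

MAX_TAGS = 1000  # limit stated in the text


def check_tag_lists(up_tags, down_tags, feature_set):
    """Return a list of error messages; an empty list means the lists pass."""
    # Assumption: the up and down tag lists are checked together.
    tags = list(up_tags) + list(down_tags)
    errors = []
    invalid = sorted(set(tags) - set(feature_set))
    if invalid:
        errors.append("invalid probe sets: " + ", ".join(invalid))
    duplicates = sorted(t for t, n in Counter(tags).items() if n > 1)
    if duplicates:
        errors.append("duplicate probe sets: " + ", ".join(duplicates))
    if len(tags) > MAX_TAGS:
        errors.append(f"{len(tags)} tags exceeds the limit of {MAX_TAGS}")
    return errors


# Hypothetical probe set IDs, for illustration only.
features = {"1000_at", "1001_at", "1002_at"}
print(check_tag_lists(["1000_at", "1000_at"], ["9999_at"], features))
```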

The load signature page is accessed through the query tab.

see signature query page

-m-

MCF7: Human breast epithelial adenocarcinoma cell line derived from pleural effusion (ATCC# HTB-22). MCF7 cells are cultured in DMEM supplemented with 10% fetal bovine serum and 1% penicillin-streptomycin-glutamine. More information can be found at http://www.atcc.com/common/catalog/numSearch/numResults.cfm?atccNum=HTB-22.

MSigDB: The Molecular Signatures Database. A database of gene sets with descriptions maintained by the Broad Institute. MSigDB monographs for the gene sets used to generate the signatures used in the calculation of specificity can be accessed through the specificity popup. MSigDB can be accessed directly at http://www.broad.mit.edu/gsea/msigdb/.

-n-

NetAffx: The Affymetrix tools and annotation website (http://www.affymetrix.com/analysis/netaffx/index.affx). Registration is required.

non-null percentage: A measure of the support for the connection between a set of instances and a signature of interest based upon the behavior of the individual instances in that set. The non-null percentage is reported on the permuted result page.

The non-null percentage is defined as the percentage of all instances in a set of instances that share the majority non-null category of connectivity score. For example, if a perturbagen is represented by five instances and three of those instances have a positive connectivity score, one instance has a null connectivity score and one instance has a negative connectivity score, the non-null percentage for that perturbagen in that result is 60%.

The non-null percentage was introduced to remedy the failure of the measure of enrichment upon which permutation p is based to consider the sign of the connectivity scores of the component instances and the questionable meaning of the ranks of null instances. Sets of instances with less than a threshold proportion of instances with similarly-signed connectivity scores are considered to have insufficient support and their permutation p-values are therefore set to null. This threshold level is set to 50%.
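
The definition and the 50% threshold can be sketched as follows. The function names are hypothetical, and the scores and p-value in the example are invented to mirror the five-instance example above.

```python
def non_null_percentage(scores):
    """Percentage of instances sharing the majority non-null sign."""
    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    return 100.0 * max(positive, negative) / len(scores)


def supported_p(p_value, scores, threshold=50.0):
    """Return p_value, or None when the set lacks sufficient support."""
    return p_value if non_null_percentage(scores) >= threshold else None


# Three positive, one null and one negative connectivity score.
scores = [0.8, 0.5, 0.3, 0.0, -0.4]
print(non_null_percentage(scores))  # 60.0
print(supported_p(0.01, [0.5, 0.0, 0.0, -0.2, 0.0]))  # None (only 20% agree)
```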

see how permutation p is estimated

null instance: An instance from a given result with a connectivity score of zero. By convention, instances are set to null if the up score and down score have the same sign.
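
The null convention can be illustrated with the sketch below. Only the same-sign rule, the +1 to -1 range and the relative nature of the score come from this glossary. Taking the raw score as the up score minus the down score, and scaling positive and negative raw scores by the largest and smallest raw score in the result, are assumptions made so that the example is complete.

```python
def connectivity_scores(up_down_pairs):
    """Turn (up score, down score) pairs into connectivity scores.

    Assumed scheme: raw = up - down, set to 0 (null) when up and down
    share a sign; positive raw scores are divided by the maximum raw
    score and negative ones by the magnitude of the minimum.
    """
    raw = [0.0 if up * down > 0 else up - down for up, down in up_down_pairs]
    top, bottom = max(raw), min(raw)
    scores = []
    for s in raw:
        if s > 0:
            scores.append(s / top)
        elif s < 0:
            scores.append(-(s / bottom))
        else:
            scores.append(0.0)
    return scores


# Up and down scores for four invented instances; the last is null.
pairs = [(0.8, -0.8), (0.6, -0.6), (-0.6, 0.6), (0.3, 0.2)]
print(connectivity_scores(pairs))
```

Because the scaling depends on the extremes of the whole result, the same instance can receive a different connectivity score when the query is executed on a different set of instances.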

see how connectivity score is calculated

-o-

-p-

p-value: see permutation p

PC3: Epithelial cell line established from human prostate adenocarcinoma (ATCC# CRL-1435). PC3 cells are cultured in RPMI supplemented with 10% fetal bovine serum, 1% sodium pyruvate and 1% penicillin-streptomycin-glutamine. More information can be found at: http://www.atcc.org/common/catalog/numSearch/numResults.cfm?atccNum=CRL-1435

permutation p: An estimate of the likelihood that the enrichment of a set of instances in the list of all instances in a given result would be observed by chance. This value is determined empirically by computing the enrichment of one hundred thousand sets of instances selected at random from the set of all instances in the result.
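
An empirical estimate of this kind can be sketched as below. The enrichment measure is not defined in this entry, so a Kolmogorov-Smirnov-style statistic on the positions of the instances is used here as an assumption. The two-sided comparison, the function names and the reduced number of random sets in the example are also assumptions.

```python
import random


def enrichment(positions, n):
    """KS-style enrichment of 1-based positions in an ordered list of n items."""
    ordered = sorted(positions)
    t = len(ordered)
    a = max((j + 1) / t - v / n for j, v in enumerate(ordered))
    b = max(v / n - j / t for j, v in enumerate(ordered))
    return a if a > b else -b


def permutation_p(positions, n, trials=100_000, seed=0):
    """Fraction of random sets at least as enriched as the observed set."""
    rng = random.Random(seed)
    observed = abs(enrichment(positions, n))
    k = len(positions)
    hits = sum(
        abs(enrichment(rng.sample(range(1, n + 1), k), n)) >= observed
        for _ in range(trials)
    )
    return hits / trials


# Five instances clustered at the top of a 1000-instance result,
# compared with five instances spread evenly through it.
print(permutation_p([1, 2, 3, 4, 5], 1000, trials=2000))
print(permutation_p([100, 300, 500, 700, 900], 1000, trials=2000))
```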

see how permutation p is estimated, how to interpret a result

permuted result page: The primary cmap result-summary page. This page provides perturbagen- (rather than instance-) centric views of a selected result with rankings based upon estimates of the likelihood that the enrichment of a set of instances in the ordered list of all instances for that result would be observed by chance.

Load a result by making a selection from the 'select a result' dropdown menu. The full page will load automatically.

The number of instances upon which the query was executed is shown together with the signature link, providing access to the signature popup containing details for the signature used to generate the result, and the export link, allowing download of the summary views displayed on this page.

Three types of instance sets are considered: (1) all instances of the same perturbagen, (2) all instances of the same perturbagen made in the same cell line and (3) all instances whose perturbagens share a level-4 ATC code (eg N05ACXX). These are accessed through the by name tab, by name and cell line tab and by ATC code tab, respectively.

The table displays the cmap name, the cmap name and cell line, or the level-4 ATC code associated with the instance set, the arithmetic mean of the connectivity scores for those instances (labeled "mean"), the number of those instances (labeled "n"), a measure of the enrichment of those instances in the ordered list of all instances, a permutation p-value for that enrichment score (labeled "p"), the specificity of that enrichment (labeled "specificity") and the non-null percentage (labeled "% non-null"). Each row in the table has a barview icon to access the result detail window for that set of instances.

The rows in the table are ordered in ascending order of p-value, then ascending order of (absolute) enrichment. The one hundred top-ranked sets of instances are automatically loaded and displayed in groups of twenty. Click the navigation links beneath the table to move through the pages. To quickly navigate to the rows associated with a perturbagen of interest, begin typing in the 'search' textbox above the table and select a cmap name from the autocomplete menu. Return to the full-page view by clicking the 'show all' button.

Clicking any non-zero value of specificity will spawn the specificity popup. Clicking on an ATC code fragment on the table found under the by ATC code tab will spawn the ATC popup.

The permuted result page is accessed through the result tab.

see how to interpret a result, how permutation p is estimated

perturbagen: Any modality used as a treatment, such as addition of a small molecule or introduction of a genetic reagent (eg shRNA). Perturbagens (or groups of closely related perturbagens) are identified by their cmap name. A single perturbagen may be represented by multiple instances.

plot popup: A display accessed through the instance button on the detailed result page or the result detail window that provides a graphical representation of the enrichment of the current signature in the selected instance.

Comparison of the example plots in panel A and panel B, which represent the same signature in two different instances, illustrates how the up score and down score arise. These scores are defined by the maximum deviation from zero on the y-axis of the trajectory of the green (up tags) and red (down tags) lines, respectively. That deviation is a function of the position of the probe sets in the corresponding tag lists within the list of all probe sets in the feature set ordered by the extent of their differential expression in the selected instance, which is represented by the x-axis (see how probe sets are ordered for an instance). The location of a tag is marked by an up-tick in the trajectory of the colored lines. In panel A, the up tags are nearer the top of the rank-ordered list of probe sets than they are in panel B. Likewise, the down tags are nearer the bottom of the ordered list of probe sets in panel A than in panel B. Correspondingly, the up score and down score for the signature in the instance represented in panel A are approximately +0.8 and -0.8, respectively, but only +0.6 and -0.6 for the instance represented in panel B. Both instances will receive a positive connectivity score but the score associated with panel A will be greater than that for panel B. Panel C represents the same signature in a third instance and illustrates negative connectivity. The up tags are near the bottom of the ordered list of probe sets and the down tags are near the top. The up score and down score are approximately -0.6 and +0.6, respectively, translating to a negative connectivity score.

Note that the up score and down score are absolute measures, unlike connectivity score, which is a relative value. Thus, the plot popup is an excellent visual aid to assessing the strength of a signature in a specific instance. It is conveniently used in conjunction with the table popup.

see how the plot popup is generated, how to interpret a result

[Figure: example plot popups for the same signature in three instances (Panel A, Panel B, Panel C)]

probe set: The collection of match and mismatch oligonucleotides on an Affymetrix GeneChip microarray designed against a given transcript which together allow the relative level of that transcript to be estimated. Probe sets are uniquely identified with a code number that, by convention, ends with "_at" (eg 200800_s_at). Tag lists and instances are populated with probe sets from the feature set. Detailed descriptions of individual probe sets can be found through NetAffx.

publication page: A page listing and providing links to all publications relating to cmap and its use. The publication page can be accessed through the help tab.

-q-

query: Asking cmap to compute connectivity scores for all instances with respect to a specified signature or a number of selected instances. The product of a query is a result.

Three query pages are provided: the quick query page, the signature query page and the instance query page.

query signature: see signature

query tab: A link at the top of all pages providing access to the quick query page, the load signature page, the signature query page, and the instance query page.

quick query page: A page that allows execution of a query without prior upload of the signature.

Use the browse buttons on the 'up tag file' and 'down tag file' fields to locate the .grp files containing the up tag list and the down tag list, respectively. The query is executed with the 'execute query' button.

Successful execution will produce a page summarizing the query and a link labeled 'view this result now' that is a shortcut to the permuted result page.

The tag lists are checked for invalid and duplicate probe sets, and to confirm that the total number of tags does not exceed one thousand. If any of these errors are found, uploading is halted and the errors are reported.

Note that the signature used and the results generated with the quick query page are automatically named "temporary" and saved to the sandbox. However, these signatures and results are overwritten on subsequent execution of the quick query page or the instance query page.

The quick query page is accessed through the query tab.

see signature query page, instance query page

-r-

result: The set of instances and their associated connectivity score, up score and down score generated upon execution of a query. Results are automatically saved in the sandbox and may be displayed, sorted and analyzed using the detailed result page and the permuted result page.

Each result has a number of attributes. These include a unique identifier (ie result_id), a name, a creation date, the signature used and the total number of instances considered. Results generated with the quick query page and the instance query page are automatically named "temporary." However, temporary results are overwritten on subsequent execution of the quick query page or the instance query page. The attributes of a result are displayed at the top of the detailed result page and permuted result page.

see how to interpret a result

result detail window: A window spawned by clicking the barview icon displayed on each row of the permuted result page. The result detail window displays a barview and a table providing the batch, the cmap name, the concentration, the cell line, the connectivity score, the up score, the down score and the instance_id of all instances in the corresponding set of instances.

result tab: A link at the top of all pages providing access to the detailed result page and the permuted result page.

-s-

sandbox: The private work space provided to each user in which signatures and results are saved. Up to ten signatures and ten results may be stored. The contents of the sandbox populate the 'select a signature' dropdown menu on the signature query page and the 'select a result' dropdown menu on the detailed result page and permuted result page. The sandbox page provides a list of signatures and results in the sandbox, and allows unwanted signatures and results to be deleted.

sandbox page: A page providing access to the list of signatures and results in the sandbox, together with their attributes. Unwanted signatures and results may be deleted by checking the corresponding 'delete' checkbox and pressing the 'update' button.

The sandbox page is accessed through the admin tab.

scan number: A unique identifier given to a .cel file.

score: see connectivity score

shading: A utility accessed through the instance button on the detailed result page that allows instances or groups of instances in a result to be selected, shaded in a chosen color in the table on the detailed result page and highlighted in black on the barview.

Clicking on the instance button and then 'shading' reveals a menu with the options for selecting all instances of the same cmap name as the current instance ('name'), the current instance only ('instance'), all instances from the batch of the current instance ('batch'), all instances with an ATC code similar to the current instance ('ATC'), or 'reset', which cancels all current shading. Clicking 'name', 'instance', or 'batch' spawns a submenu allowing the selection of the shading color or an option to cancel any shading currently applied to that instance or set of instances. Clicking 'ATC' spawns an intermediary menu that controls the level of ATC code similarity to be applied. Clicking on a level then spawns the color selection or cancel submenu. Shading is applied upon clicking a color option.

see how to interpret a result

signature: The list of genes up- and down-regulated in a biological distinction of interest derived from a transcriptional profiling experiment. For example, the genes induced and repressed following the exposure of a cell line to a small molecule relative to a vehicle control treatment would constitute a signature. Likewise, genetic manipulations can be used as the perturbagen, and treatments can be applied in vitro or in vivo. Organic phenotypes (eg drug-resistant versus drug-sensitive) may also provide the distinction of interest. Any marker-selection algorithm or heuristic can be used to produce a signature.

No specific advice on experimental design for the construction of signatures is given. Rather, investigators should use their domain expertise and experience with the biology of interest to guide the collection of data for signature derivation.

Signatures are used to query cmap. Signatures of between ten genes and five hundred genes have been found to perform well. Signatures are represented by a pair of tag lists. Signatures may be uploaded into the sandbox, together with optional descriptions and supporting information files, using the load signature page. Queries may then be executed using the signature query page. Queries may also be executed without prior upload of the signature using the quick query page. Alternatively, a signature may be generated on the fly from a number of selected instances and executed using the instance query page. These are known as internal signatures.

Each signature has a number of attributes. These include a unique identifier (ie signature_id), a name, a description, and a creation date. Note that signatures used to execute a query with the quick query page and the instance query page are automatically uploaded to the sandbox and named "temporary" and given "temporary signature" as a description. However, temporary signatures are overwritten on subsequent execution of the quick query page or the instance query page.

The attributes of the signature used to produce a given result can be viewed, and the component .grp files and any supporting information files downloaded from the signature popup.

signature popup: A window accessed through the signature link on the detailed result page and permuted result page that displays the attributes of the signature used to generate that result and allows the corresponding .grp files and any supporting information files to be downloaded.

signature query page: A page that allows execution of a query with a signature preloaded using the load signature page.

Type a name for the result in the 'result name' textbox and select a signature from the 'signature' dropdown menu. The query is executed with the 'execute query' button.

Successful execution will produce a page summarizing the query and a link labeled 'view this result now' that is a shortcut to the permuted result page.

The signature query page is accessed through the query tab.

see quick query page, instance query page

signature link: A link on the signature name shown on the detailed result page and the permuted result page that spawns the signature popup.

SKMEL5: Human malignant melanoma cell line derived from a metastatic axillary node (ATCC# HTB-70). SKMEL5 cells are cultured in DMEM supplemented with 10% fetal bovine serum, 1% penicillin-streptomycin-glutamate, 1% non-essential amino acids and 1% sodium pyruvate. More information can be found at http://www.atcc.org/common/catalog/numSearch/numResults.cfm?atccNum=HTB-70

specificity: An estimate of the uniqueness of the connectivity between a set of instances and a signature of interest based upon the behavior of that set of instances in a large number of results produced from a collection of diverse signatures. Specificity is reported on the permuted result page.

Specificity is defined as the frequency at which the enrichment of a set of instances in the ordered list of all instances in a given result is equaled or exceeded by the enrichment of that same set of instances in the set of results produced from queries executed with 312 published, experimentally-derived signatures extracted from MSigDB. The frequency distribution is computed separately for positively and negatively enriched signatures. High specificity scores indicate that the extent of connectivity found between the set of instances and the signature of interest is unexceptional and/or the perturbagens involved have multiple biological effects.

Clicking on a non-zero value of specificity on the permuted result page will spawn the specificity popup.

The MSigDB tag lists used in the specificity evaluation can be downloaded from the download tab.

see how permutation p is estimated, how to interpret a result

specificity popup: A window spawned by clicking any non-zero value of specificity displayed on the permuted result page. It displays an ordered list of MSigDB-derived signatures more strongly connected with the set of instances selected than the signature used to produce the result, the enrichment of that set of instances in the list of all instances ordered with respect to that MSigDB-derived signature and links to the corresponding MSigDB monographs.

sort button: Buttons displayed in column headers of tables on some pages that allow control over the order in which rows are displayed. The up button resorts the rows in ascending numeric or alphabetic order. The down button resorts the rows in descending numeric or alphabetic order.

source: Detailed information about the specific perturbagen used to generate an instance. Every source has a number of attributes. These include a unique identifier (source_id), a manufacturer (or the third party supplier if the manufacturer is unknown), the manufacturer's or supplier's name for the perturbagen and the manufacturer's or supplier's reference number for the perturbagen.

source popup: A display accessed through the instance button on the detailed result page, the result detail window and the instance page that provides source information for the selected instance. If cmap performed identity and purity checking of the perturbagen in question, the raw UV and mass spectrograms may be viewed with the 'identity/purity: click to read' link.

ssMCF7: MCF7 cells cultured in phenol red free DMEM supplemented with 5% charcoal/dextran stripped fetal bovine serum, 1% penicillin-streptomycin-glutamine and 3.3 mg/ml bovine serum albumin immediately before and during treatments.

-t-

table popup: A display accessed through the instance button on the detailed result page and result detail window that provides, for each probe set in both tag lists in the signature in the selected instance, its rank, the running sum at that tag (see how the plot popup is generated), and its amplitude. These values are displayed as "rank", "score" and "amplitude", respectively. The table popup is a convenient complement to the plot popup when assessing the strength of a signature in a specific instance.

see how running sum at a tag is computed for the table popup, how to interpret a result

tag: A probe set in a tag list.

tag list: A list of probe sets from the feature set representing either the 'up' genes in a signature (the 'up tag list') or the 'down' genes in a signature (the 'down tag list'). One up tag list and one down tag list together represent a signature and both are required to execute a query using the quick query page or to upload a signature using the load signature page. Note that the total number of probe sets in the up tag list and down tag list may not together exceed 1000. Text files with the .grp extension are used to contain tag lists. A .grp file has one probe set on each line.

see how to make a .grp file

Tag lists may be conveniently constructed from a signature populated with UniGene identifiers, RefSeq identifiers, Genbank identifiers, gene symbols, etc using the 'Batch Query' tool on NetAffx and selecting "HG-U133A" from the 'Select a GeneChip Array' menu. A variety of tools are available for the generation of tag lists from signatures from non-human species. These include GeneCruiser (http://www.genecruiser.org). Our preferred method uses UniGene and HomoloGene from NCBI (http://www.ncbi.nlm.nih.gov/) to obtain human UniGene identifiers for signature genes for submission to NetAffx, as described above.

topics page: This page.

-u-

up score: A value between +1 and -1 representing the absolute enrichment of an up tag list in a given instance. The up score is the "up" value reported on the detailed result page and the result detail window. A high positive up score indicates that the corresponding perturbagen induced the expression of the probe sets in the up tag list. A high negative up score indicates that the corresponding perturbagen repressed the expression of the probe sets in the up tag list. The connectivity score is a combination of the up score and the down score.

see how connectivity score is calculated, down score

user profile page: A page allowing update of your user profile. The user profile page is accessed from the admin tab.

-v-

-w-

-x-

-y-

-z-

-how to-

how to make a .grp file

The simplest way to make a .grp file is with Microsoft Excel. Open a new worksheet and put each HG-U133A probe set (eg 200800_s_at) from the tag list on its own row. From the 'File' menu, select 'Save As...', and select 'Text (Tab delimited) (*.txt)' from the 'Save as type:' dropdown in the dialog box that appears. Type "filename.grp" in the 'File name:' textbox, including the double quotes (unless you are using a Mac, in which case the quotes can be omitted). Choose where you wish your .grp file to be saved and click the 'Save' button.
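A .grp file can also be written with a short script, since it is just a text file with one probe set on each line. The following is a minimal Python sketch, not part of cmap. The function names are hypothetical, and the probe set name pattern is a rough assumption for illustration only (cmap checks tags against the feature set). The checks mirror those described for tag lists: no duplicates and no more than 1000 tags in total.

```python
import re
from collections import Counter
from pathlib import Path

MAX_TAGS = 1000  # limit on up + down tags stated in the tag list entry

# Assumption: a rough pattern for probe set names such as 200800_s_at.
# It is NOT the check performed by cmap, which validates against the feature set.
PROBE_SET = re.compile(r"^\S+_at$")


def write_grp(path, probe_sets):
    """Write one probe set per line to a plain-text .grp file."""
    Path(path).write_text("".join(f"{p}\n" for p in probe_sets))


def read_grp(path):
    """Read a .grp file, ignoring blank lines and surrounding whitespace."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def check_tag_lists(up, down):
    """Return a list of problems found in a pair of tag lists (empty if none)."""
    problems = []
    tags = list(up) + list(down)
    counts = Counter(tags)
    for tag in sorted(t for t, c in counts.items() if c > 1):
        problems.append(f"duplicate probe set: {tag}")
    for tag in sorted(counts):
        if not PROBE_SET.match(tag):
            problems.append(f"unexpected probe set name: {tag}")
    if len(tags) > MAX_TAGS:
        problems.append(f"{len(tags)} tags exceeds the limit of {MAX_TAGS}")
    return problems
```

For example, write_grp("up.grp", ["200800_s_at"]) produces a one-line file, and check_tag_lists(read_grp("up.grp"), read_grp("down.grp")) returns an empty list when no problems are found.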

how probe sets are ordered for an instance

All probe sets in the feature set (22,283) are rank ordered by the extent of their differential expression between the treatment and control pair as follows.

All average difference values from the treatment scan are scaled relative to those from its corresponding control using a linear-fit-on-P-call algorithm, giving scaled average difference values. In six-well batches, average difference values for the control are from a single scan. In ninety-six-well batches, the control expression values are the composite of the five or six control scans. Average difference values for each probe set are the arithmetic mean of the five or six individual average difference values for that probe set. A probe set is given a 'P' call if more than three of the individual confidence calls are 'P' for that probe set. Average difference (control) and scaled average difference (treatment) values less than a primary threshold value are set to that threshold value. Average difference (control) and scaled average difference (treatment) values less than a second, lower threshold value are set to that secondary threshold value.

A ratio value d and a secondary ratio value d′ are then computed for each probe set p:

...where t is the thresholded scaled average difference value (treatment), t′ is the secondary thresholded scaled average difference value (treatment), c is the thresholded average difference value (control), and c′ is the secondary thresholded average difference value (control).

Probe sets are given a rank r in descending order of d, with the population of probe sets with d=1 subsorted in descending order of d′, where r = 1, 2, ..., 22,283.

The rank r of every probe set in all instances in the current build is contained in the data matrix.
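The ordering procedure can be illustrated with a small Python sketch. It assumes that the ratio values are d = t / c and d′ = t′ / c′ (an assumption based on the word "ratio"), and the threshold values used in the example are hypothetical, since no thresholds are stated here. The function name is also hypothetical. Ties in d are subsorted by d′, which covers the d = 1 population described above.

```python
def rank_probe_sets(treatment, control, primary, secondary):
    """Return {probe_set: rank}, where rank 1 is the most induced probe set.

    treatment: scaled average difference values, keyed by probe set
    control:   average difference values, keyed by probe set
    primary, secondary: threshold values (hypothetical; secondary < primary)
    """
    def ratios(p):
        t, c = treatment[p], control[p]
        # Assumption: d = t / c using the primary threshold,
        # and d' = t' / c' using the secondary, lower threshold.
        d = max(t, primary) / max(c, primary)
        d_prime = max(t, secondary) / max(c, secondary)
        return d, d_prime

    # Descending order of d, with ties subsorted in descending order of d'.
    ordered = sorted(treatment, key=lambda p: tuple(-x for x in ratios(p)))
    return {p: r for r, p in enumerate(ordered, start=1)}
```

Probe sets whose treatment and control values both fall below the primary threshold receive d = 1, and the secondary threshold is what separates them.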

how connectivity score is calculated

For each instance i in a collection of c total instances, the Kolmogorov-Smirnov statistic is computed for each tag list in the signature, giving ksiup and ksidown.

Let n be the number of probe sets in the feature set (22,283) and t be the number of probe sets (or tags) in the tag list. Order all n probe sets by the extent of their differential expression for the current instance i (see how probe sets are ordered for an instance). Construct a vector V of the position (1... n) of each probe set in the tag list in the ordered list of all probe sets and sort these components in ascending order such that V(j) is the position of tag j, where j = 1, 2, ..., t.

Compute the following two values:

... and set ksi = a, if a > b. Set ksi = -b if b > a. The up scores and down scores are ksiup and ksidown, respectively. These values are reported on the detailed result page and result detail window as "up" and "down", respectively.

The connectivity score Si is set to zero where ksiup and ksidown have the same sign. Otherwise, set si to be ksiup - ksidown, p to be max(si) and q to be min(si) across all instances in the collection c. The connectivity score Si for the remaining instances is defined as si / p, where si > 0, or -(si / q), where si < 0. The "score" values reported on the detailed result page and the result detail window are Si.
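The calculation can be sketched in Python as follows. This is an illustration rather than the cmap implementation. It assumes that the two values take the usual Kolmogorov-Smirnov form, a = max over j of [j / t - V(j) / n] and b = max over j of [V(j) / n - (j - 1) / t], and that the score for negative si carries a minus sign so that it falls between 0 and -1. The function names are hypothetical.

```python
def ks_score(positions, n):
    """Signed Kolmogorov-Smirnov style score for tag positions (1..n).

    Assumption: a and b take the usual form
      a = max_j [ j/t - V(j)/n ],  b = max_j [ V(j)/n - (j-1)/t ].
    """
    v = sorted(positions)
    t = len(v)
    a = max(j / t - v[j - 1] / n for j in range(1, t + 1))
    b = max(v[j - 1] / n - (j - 1) / t for j in range(1, t + 1))
    # The case a == b is not specified in the text; this sketch returns -b.
    return a if a > b else -b


def connectivity_scores(ks_up, ks_down):
    """Combine per-instance up and down scores into connectivity scores."""
    # s is zero where the up and down scores share a sign, else ks_up - ks_down.
    s = [0.0 if u * d > 0 else u - d for u, d in zip(ks_up, ks_down)]
    p, q = max(s), min(s)
    scores = []
    for x in s:
        if x > 0:
            scores.append(x / p)       # scaled into (0, +1]
        elif x < 0:
            scores.append(-(x / q))    # scaled into [-1, 0)
        else:
            scores.append(0.0)
    return scores
```

Because p and q are taken across the whole collection, the connectivity score of an instance depends on the other instances in the result, whereas the up score and down score do not.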

how the plot popup is generated

The graphic is a plot of the running sum of consecutive values of a vector V (y-axis) at each probe set i in the feature set ordered by the extent of their differential expression (see how probe sets are ordered for an instance) (x-axis). Let n be the number of probe sets in the feature set (22,283), t be the number of probe sets (or tags) in the tag list, z be n - t, and V(i) be the component of the vector V corresponding to probe set i. Set V(i) = 1 / t if i is in the tag list and V(i) = -1 / z if not. This representation is based on a conventional Kolmogorov-Smirnov statistic plot.
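A minimal Python sketch of the running sum follows, taking z to be n - t. The function name is hypothetical, and the sketch assumes that every tag appears in the ordered list and that 0 < t < n.

```python
from itertools import accumulate


def running_sum(ordered_probe_sets, tag_list):
    """Running sum of V over probe sets ordered by differential expression.

    V(i) = 1/t for probe sets in the tag list and -1/z otherwise, with z = n - t.
    """
    tags = set(tag_list)
    n, t = len(ordered_probe_sets), len(tags)
    z = n - t
    steps = [1 / t if p in tags else -1 / z for p in ordered_probe_sets]
    return list(accumulate(steps))
```

The steps sum to zero over the whole list, so the trace returns to (approximately) zero at the right-hand end of the x-axis. A tag list concentrated near the top of the ordering produces an early peak.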

how running sum at a tag is computed for the table popup

Let n be the number of probe sets in the feature set (22,283) and t be the number of probe sets (or tags) in the tag list. Order all n probe sets by the extent of their differential expression for the current instance i (see how probe sets are ordered for an instance). Construct a vector V of the position (1... n) of each probe set in the tag list in the ordered list of all probe sets and sort these components in ascending order such that V(j) is the position of tag j, where j = 1, 2, ..., t. The value reported as "score" for tag j in the table popup is defined as:

how permutation p is estimated

The Kolmogorov-Smirnov statistic is computed for the set of t instances in the list of all n instances in a result ordered in descending order of connectivity score and up score (see how connectivity score is calculated), giving an enrichment score ks0.

Construct a vector V of the position (1... n) of each instance in the set of t instances in the ordered list of all n instances and sort these components in ascending order such that V(j) is the position of instance j, where j = 1, 2, ..., t.

Compute the following two values:

... and set ks0 = a, if a > b. Set ks0 = -b if b > a. These values are reported as "enrichment" on the permuted result page and the specificity popup.

For each of d trials, select t instances at random from the set of n instances and compute ksm, where m = 1, 2, ..., d. Count the number of times q that...

... is true. The frequency of this event (q / d) can be taken as a (two-sided) p-value.

Null p-values are assigned to sets of instances where t = 1, where the mean of the connectivity scores for the set of t instances is zero, or where the non-null percentage for the set of t instances is <50%. Otherwise, the evaluation of (q / d), where d = 100,000, is reported as "p" on the permuted result page.
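The permutation procedure can be sketched in Python as follows. This is an illustration only. It assumes the same Kolmogorov-Smirnov form for the two values as in the connectivity score calculation, and it assumes that the event counted is |ksm| >= |ks0|, which is consistent with a two-sided p-value. The function names and the seed parameter are hypothetical, and the sketch omits the rules for assigning null p-values.

```python
import random


def ks_score(positions, n):
    """Signed enrichment of a set of positions (1..n) in an ordered list.

    Assumption: a = max_j [ j/t - V(j)/n ],  b = max_j [ V(j)/n - (j-1)/t ].
    """
    v = sorted(positions)
    t = len(v)
    a = max(j / t - v[j - 1] / n for j in range(1, t + 1))
    b = max(v[j - 1] / n - (j - 1) / t for j in range(1, t + 1))
    return a if a > b else -b


def permutation_p(positions, n, trials=100_000, seed=None):
    """Return (ks0, q / d) for a set of instance positions in a result."""
    rng = random.Random(seed)
    t = len(positions)
    ks0 = ks_score(positions, n)
    q = 0
    for _ in range(trials):
        # Select t instances at random from the n instances.
        sample = rng.sample(range(1, n + 1), t)
        # Assumption: count trials at least as extreme as ks0 in either direction.
        if abs(ks_score(sample, n)) >= abs(ks0):
            q += 1
    return ks0, q / trials
```

With a finite number of trials the smallest non-zero value that can be returned is 1 / d, so a reported value of zero should be read as less than 1 / d.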

how to interpret a result

Note that cmap is, at best, a hypothesis generating tool. Only after these hypotheses have been confirmed by biological experimentation can definitive conclusions be made.

Hypotheses are generated by searching for multiple independent instances of the same perturbagen, or group of functionally or structurally similar perturbagens, with consistently high (positive or negative) connectivity scores. A variety of tools are provided to identify such perturbagens.

This process begins by loading the result of interest into the permuted result page. The default view presents the twenty perturbagens best connected (positively and negatively) with the signature of interest, according to permutation p and pre-filtered with respect to non-null percentage. Clicking the by name and cell line tab presents the twenty best connected perturbagens when cell line is taken into account. Clicking the by ATC code tab displays the twenty best connected sets of perturbagens sharing level-4 ATC codes. Use your knowledge of the signature and your ultimate objective to decide which ranking to focus on. For example, if you know that your signature is likely to be context dependent (eg produced in estrogen-receptor-positive tissue specimens) the table found under the by name and cell line tab might be most appropriate. If you are attempting to find an existing drug for a new indication the ATC code view might be best. Only therapeutic small molecules have ATC codes.

Positive enrichment scores are of interest if perturbagens inducing the biological state represented by the signature used to produce the result are sought. Likewise, if reversal or repression of the biological state encoded in the query signature is required, perturbagens with negative enrichment scores are of interest.

Inspect the specificity values associated with each perturbagen. These provide a measure of the uniqueness of the connection between a perturbagen and the signature of interest. High values mean that many signatures show good connectivity with these instances. This may indicate that the connectivity is unexceptional. However, dismissing hypotheses solely on the basis of high specificity scores is not recommended. They may simply indicate that the perturbagen in question has many biological activities in addition to the one encoded in the query signature. High specificity scores may even provide additional support for a hypothesis. Click the specificity value to spawn the specificity popup and inspect the list of better-scoring signatures displayed there. For example, if your query signature derives from a set of leukemia specimens and the top-scoring gene sets displayed in the popup are also leukemia related, confidence in your hypothesis would be enhanced.

Next, examine the individual instances that represent the perturbagen or ATC code. Click the barview icon to spawn the result detail window. Barviews that resemble panels A and B (see below) show clear, strong positive connectivity. Barviews resembling panel D represent strong negative connectivity. Barviews resembling panels C and E show more modest positive connectivity and require additional consideration. Look at the attributes of each instance displayed in the result detail window. Should the null instances in panel C-like and panel E-like barviews come from treatments made at a lower concentration or in different cell lines than the positive-scoring instances, this concentration or context dependency, respectively, would enhance the confidence in these connections. Likewise, if considering a barview from the by ATC code tab and the positively-connected instances come from one perturbagen and the null instances from another, a perturbagen-specific rather than an ATC code-wide hypothesis would be suggested.

Generating hypotheses solely on the relative ranking of instances is perhaps warranted where the connectivity is strong, as exemplified in panels A, B and D. But more detailed analyses of individual instances of the candidate perturbagen are always recommended. These should include study of the up score and down score, the plot popup and the table popup.

The magnitude of the corresponding up score and down score should be considered. These values are displayed on the detailed result page and the result detail window and, unlike connectivity score, are absolute values between +1 and -1. They have a number of uses. First, on occasions when some instances of a perturbagen of interest have a high connectivity score but others have null scores, confidence in a hypothesis may be enhanced by finding either the up score or down score of those null instances to be as high as those of the other non-null instances. Second, and much more important, these values provide an absolute measure of the connectivity of the instance in question with the signature of interest. However, one must first consider the nature and origin of the signature. When derived from the treatment of cultured cells with a small-molecule perturbagen, for example, it is beyond doubt that it is possible for a small molecule to effect the biological state of interest. In other words, the signature is known to be druggable, and high up scores and down scores are therefore required for high confidence. At the other extreme, if the signature derives from two very different organic biological states, such as tissue specimens from normal human breast and human breast cancer, for example, there is no reason to believe that any single perturbagen could ever induce such a large and complex state shift. Considerably lower up scores and down scores would be expected. However, these may still be meaningful.

Examination of the trajectories of the up and down components of the signature on the plot popup is a useful complement to the up scores and down scores in developing confidence in hypotheses, particularly those for complex organic biological states and where there is no evidence that the signature is directly druggable. This tool is of little additional value where up scores and down scores are high. The plot popup for a given instance is accessed through the instance button by selecting 'plot.' Good separation of the two trajectories and maximum deviation from zero near opposite extremes of the x-axis is desirable when attempting to build confidence in particular hypotheses for challenging signatures. Alternatively, a minor peak at either end of the x-axis implies a strong effect on a portion of the signature. This may be sufficient evidence to call a hit for a complex signature, provided the effect is seen in multiple instances of the same perturbagen.

Finally, since all of the analyses described above are based upon a non-parametric rank statistic, perturbagens that induce only modest changes in global gene expression and those that have profound effects can give rise to instances that cannot be immediately distinguished. However, one may reasonably have higher confidence in a hypothesis based upon the ranking of instances from a perturbagen that causes large changes in the expression of the probe sets in the tag lists representing the query signature than one from instances where the extent of induced differential expression is lower. Consideration should therefore be given to the amplitudes of individual probe sets in the relevant tag lists for all instances of a perturbagen of interest. These can be found through the instance button by selecting 'table.' However, remember that a fundamental principle of the signature-based methods used by cmap is that consideration of a set of genes as a whole provides much more power for the discovery of connections than the behavior of individual genes. So, while the study of the amplitudes of individual probe sets is recommended, it should only be used to modulate confidence in existing hypotheses.

Note that when using internal signatures the instances used to derive the signature are likely to be, almost by definition, the highest ranked instances overall. These should not be considered in assessments of the enrichment of instances derived from these same perturbagens, as they will unfairly bias the distribution. Similarly, high connectivity scores for instances from the same batch as signature instances should be treated with caution, due to batch effects. This is especially true for internal signatures derived from instances from the same batch, and is why it is strongly recommended that internal signatures not be derived from such instances. The absence of batch biases should be confirmed in results produced by all signatures, not just internal signatures. This can be done by inspecting the contents of the 'batch' column in the result detail window for all instances of the perturbagen of interest or by shading by batch on the detailed result page.

[Panels A, B, C, D and E: example barviews]