Use this to run the Harmony algorithm on gene expression or PCA matrix.

HarmonyMatrix(data_mat, meta_data, vars_use, do_pca = TRUE, npcs = 20,
  theta = NULL, lambda = NULL, sigma = 0.1, nclust = NULL,
  tau = 0, block.size = 0.05, max.iter.harmony = 10,
  max.iter.cluster = 200, epsilon.cluster = 1e-05,
  epsilon.harmony = 1e-04, plot_convergence = FALSE,
  return_object = FALSE, verbose = TRUE, reference_values = NULL,
  cluster_prior = NULL)

Arguments

data_mat

Matrix of normalized gene expession (default) or PCA embeddings (see do_pca). Cells can be rows or columns.

meta_data

Either (1) Dataframe with variables to integrate or (2) vector with labels.

vars_use

If meta_data is dataframe, this defined which variable(s) to remove (character vector).

do_pca

Whether to perform PCA on input matrix.

npcs

If doing PCA on input matrix, number of PCs to compute.

theta

Diversity clustering penalty parameter. Specify for each variable in vars_use Default theta=2. theta=0 does not encourage any diversity. Larger values of theta result in more diverse clusters.

lambda

Ridge regression penalty parameter. Specify for each variable in vars_use. Default lambda=1. Lambda must be strictly positive. Smaller values result in more aggressive correction.

sigma

Width of soft kmeans clusters. Default sigma=0.1. Sigma scales the distance from a cell to cluster centroids. Larger values of sigma result in cells assigned to more clusters. Smaller values of sigma make soft kmeans cluster approach hard clustering.

nclust

Number of clusters in model. nclust=1 equivalent to simple linear regression.

tau

Protection against overclustering small datasets with large ones. tau is the expected number of cells per cluster.

block.size

What proportion of cells to update during clustering. Between 0 to 1, default 0.05. Larger values may be faster but less accurate

max.iter.harmony

Maximum number of rounds to run Harmony. One round of Harmony involves one clustering and one correction step.

max.iter.cluster

Maximum number of rounds to run clustering at each round of Harmony.

epsilon.cluster

Convergence tolerance for clustering round of Harmony. Set to -Inf to never stop early.

epsilon.harmony

Convergence tolerance for Harmony. Set to -Inf to never stop early.

plot_convergence

Whether to print the convergence plot of the clustering objective function. TRUE to plot, FALSE to suppress. This can be useful for debugging.

return_object

(Advanced Usage) Whether to return the Harmony object or only the corrected PCA embeddings.

verbose

Whether to print progress messages. TRUE to print, FALSE to suppress.

reference_values

(Advanced Usage) Defines reference dataset(s). Cells that have batch variables values matching reference_values will not be moved.

cluster_prior

(Advanced Usage) Provides user defined clusters for cluster initialization. If the number of provided clusters C is less than K, Harmony will initialize K-C clusters with kmeans. C cannot exceed K.

Value

By default, matrix with corrected PCA embeddings. If return_object is TRUE, returns the full Harmony object (R6 reference class type).

Examples

## By default, Harmony inputs a normalized gene expression matrix
# NOT RUN { harmony_embeddings <- HarmonyMatrix(exprs_matrix, meta_data, 'dataset') # }
## Harmony can also take a PCA embeddings matrix data(cell_lines_small) pca_matrix <- cell_lines_small$scaled_pcs meta_data <- cell_lines_small$meta_data harmony_embeddings <- HarmonyMatrix(pca_matrix, meta_data, 'dataset', do_pca=FALSE)
#> Harmony 1/10
#> Harmony 2/10
#> Harmony 3/10
#> Harmony 4/10
#> Harmony 5/10
#> Harmony 6/10
#> Harmony 7/10
#> Harmony 8/10
#> Harmony converged after 8 iterations
## Output is a matrix of corrected PC embeddings dim(harmony_embeddings)
#> [1] 300 20
harmony_embeddings[seq_len(5), seq_len(5)]
#> X1 X2 X3 X4 X5 #> [1,] 0.008448274 -0.0018134832 -1.552202e-03 -0.0077926057 0.0001466930 #> [2,] -0.011292669 0.0003581516 -1.989622e-05 -0.0014619744 0.0034635288 #> [3,] 0.010883023 -0.0050407453 1.579927e-03 -0.0043506676 -0.0006763035 #> [4,] 0.009927078 -0.0014537998 1.159716e-03 -0.0080219594 -0.0014029851 #> [5,] -0.010156191 0.0027586043 -3.074632e-03 -0.0000108491 0.0009655249
## Finally, we can return an object with all the underlying data structures harmony_object <- HarmonyMatrix(pca_matrix, meta_data, 'dataset', do_pca=FALSE, return_object=TRUE)
#> Harmony 1/10
#> Harmony 2/10
#> Harmony 3/10
#> Harmony 4/10
#> Harmony 5/10
#> Harmony 6/10
#> Harmony 7/10
#> Harmony 8/10
#> Harmony 9/10
#> Harmony 10/10
dim(harmony_object$Y) ## cluster centroids
#> [1] 20 10
dim(harmony_object$R) ## soft cluster assignment
#> [1] 10 300
dim(harmony_object$Z_corr) ## corrected PCA embeddings
#> [1] 20 300
head(harmony_object$O) ## batch by cluster co-occurence matrix
#> [,1] [,2] [,3] #> [1,] 6.197592 9.983329e+00 8.725975e-05 #> [2,] 11.953170 7.462688e-06 2.536973e+01 #> [3,] 7.495340 1.563652e+01 6.057042e-07 #> [4,] 10.170492 1.752684e+01 7.977053e-07 #> [5,] 10.999486 1.901373e-05 2.325515e+01 #> [6,] 10.060248 1.894319e+01 1.766110e-06