Main Harmony interface — HarmonyMatrix • harmony

Use this to run the Harmony algorithm on gene expression or PCA matrix.

HarmonyMatrix(data_mat, meta_data, vars_use, do_pca = TRUE, npcs = 20,
  theta = NULL, lambda = NULL, sigma = 0.1, nclust = NULL,
  tau = 0, block.size = 0.05, max.iter.harmony = 10,
  max.iter.cluster = 200, epsilon.cluster = 1e-05,
  epsilon.harmony = 1e-04, plot_convergence = FALSE,
  return_object = FALSE, verbose = TRUE, reference_values = NULL,
  cluster_prior = NULL)

Arguments

data_mat: Matrix of normalized gene expession (default) or PCA embeddings (see do_pca). Cells can be rows or columns.
meta_data: Either (1) Dataframe with variables to integrate or (2) vector with labels.
vars_use: If meta_data is dataframe, this defined which variable(s) to remove (character vector).
do_pca: Whether to perform PCA on input matrix.
npcs: If doing PCA on input matrix, number of PCs to compute.
theta: Diversity clustering penalty parameter. Specify for each variable in vars_use Default theta=2. theta=0 does not encourage any diversity. Larger values of theta result in more diverse clusters.
lambda: Ridge regression penalty parameter. Specify for each variable in vars_use. Default lambda=1. Lambda must be strictly positive. Smaller values result in more aggressive correction.
sigma: Width of soft kmeans clusters. Default sigma=0.1. Sigma scales the distance from a cell to cluster centroids. Larger values of sigma result in cells assigned to more clusters. Smaller values of sigma make soft kmeans cluster approach hard clustering.
nclust: Number of clusters in model. nclust=1 equivalent to simple linear regression.
tau: Protection against overclustering small datasets with large ones. tau is the expected number of cells per cluster.
block.size: What proportion of cells to update during clustering. Between 0 to 1, default 0.05. Larger values may be faster but less accurate
max.iter.harmony: Maximum number of rounds to run Harmony. One round of Harmony involves one clustering and one correction step.
max.iter.cluster: Maximum number of rounds to run clustering at each round of Harmony.
epsilon.cluster: Convergence tolerance for clustering round of Harmony. Set to -Inf to never stop early.
epsilon.harmony: Convergence tolerance for Harmony. Set to -Inf to never stop early.
plot_convergence: Whether to print the convergence plot of the clustering objective function. TRUE to plot, FALSE to suppress. This can be useful for debugging.
return_object: (Advanced Usage) Whether to return the Harmony object or only the corrected PCA embeddings.
verbose: Whether to print progress messages. TRUE to print, FALSE to suppress.
reference_values: (Advanced Usage) Defines reference dataset(s). Cells that have batch variables values matching reference_values will not be moved.
cluster_prior: (Advanced Usage) Provides user defined clusters for cluster initialization. If the number of provided clusters C is less than K, Harmony will initialize K-C clusters with kmeans. C cannot exceed K.

Value

By default, matrix with corrected PCA embeddings. If return_object is TRUE, returns the full Harmony object (R6 reference class type).

Examples



## By default, Harmony inputs a normalized gene expression matrix
# NOT RUN {
harmony_embeddings <- HarmonyMatrix(exprs_matrix, meta_data, 'dataset')
# }
## Harmony can also take a PCA embeddings matrix
data(cell_lines_small)
pca_matrix <- cell_lines_small$scaled_pcs
meta_data <- cell_lines_small$meta_data
harmony_embeddings <- HarmonyMatrix(pca_matrix, meta_data, 'dataset',
                                    do_pca=FALSE)
#> Harmony 1/10
#> Harmony 2/10
#> Harmony 3/10
#> Harmony 4/10
#> Harmony 5/10
#> Harmony 6/10
#> Harmony 7/10
#> Harmony 8/10
#> Harmony converged after 8 iterations

## Output is a matrix of corrected PC embeddings
dim(harmony_embeddings)
#> [1] 300  20
harmony_embeddings[seq_len(5), seq_len(5)]
#>                X1            X2            X3            X4            X5
#> [1,]  0.008448274 -0.0018134832 -1.552202e-03 -0.0077926057  0.0001466930
#> [2,] -0.011292669  0.0003581516 -1.989622e-05 -0.0014619744  0.0034635288
#> [3,]  0.010883023 -0.0050407453  1.579927e-03 -0.0043506676 -0.0006763035
#> [4,]  0.009927078 -0.0014537998  1.159716e-03 -0.0080219594 -0.0014029851
#> [5,] -0.010156191  0.0027586043 -3.074632e-03 -0.0000108491  0.0009655249

## Finally, we can return an object with all the underlying data structures
harmony_object <- HarmonyMatrix(pca_matrix, meta_data, 'dataset',
                                    do_pca=FALSE, return_object=TRUE)
#> Harmony 1/10
#> Harmony 2/10
#> Harmony 3/10
#> Harmony 4/10
#> Harmony 5/10
#> Harmony 6/10
#> Harmony 7/10
#> Harmony 8/10
#> Harmony 9/10
#> Harmony 10/10
dim(harmony_object$Y) ## cluster centroids
#> [1] 20 10
dim(harmony_object$R) ## soft cluster assignment
#> [1]  10 300
dim(harmony_object$Z_corr) ## corrected PCA embeddings
#> [1]  20 300
head(harmony_object$O) ## batch by cluster co-occurence matrix
#>           [,1]         [,2]         [,3]
#> [1,]  6.197592 9.983329e+00 8.725975e-05
#> [2,] 11.953170 7.462688e-06 2.536973e+01
#> [3,]  7.495340 1.563652e+01 6.057042e-07
#> [4,] 10.170492 1.752684e+01 7.977053e-07
#> [5,] 10.999486 1.901373e-05 2.325515e+01
#> [6,] 10.060248 1.894319e+01 1.766110e-06