batch
. If TRUE
, then results are reproducible when n_sgd_threads > 1
(as long as you use set.seed
). The price to be paid is that the optimization is slightly less efficient (because coordinates are not updated as quickly and hence gradients are staler for longer), so it is highly recommended to set n_epochs = 500
or higher. Thank you to Aaron Lun who not only came up with a way to implement this feature, but also wrote an entire C++ implementation of UMAP which does it (https://github.com/jlmelville/uwot/issues/83).
opt_args
. The default optimization method when batch = TRUE
is Adam. You can control its parameters by passing them in the opt_args
list. As Adam is a momentum-based method it requires extra storage of previous gradient data. To avoid the extra memory overhead you can also use opt_args = list(method = "sgd")
to use a stochastic gradient descent method like that used when batch = FALSE
.
epoch_callback
. You may now pass a function which will be invoked at the end of each epoch. Mainly useful for producing an image of the state of the embedding at different points during the optimization. This is another feature taken from umappp.
pca_method
, used when the pca
parameter is supplied to reduce the initial dimensionality of the data. This controls which method is used to carry out the PCA and can be set to one of:
"irlba"
which uses irlba::irlba
to calculate a truncated SVD. If this routine deems that you are trying to extract 50% or more of the singular vectors, you will see a warning to that effect logged to the console.
"rsvd"
, which uses irlba::svdr
for truncated SVD. This method uses a small number of iterations which should give an accuracy/speed up trade-off similar to that of the scikit-learn TruncatedSVD method. This can be much faster than using "irlba"
but potentially at a cost in accuracy. However, for the purposes of dimensionality reduction as input to nearest neighbor search, this doesn’t seem to matter much.
"bigstatsr"
, which uses the bigstatsr package. Note that this is not a dependency of uwot
. If you want to use bigstatsr
, you must install it yourself. On platforms without easy access to fast linear algebra libraries (e.g. Windows), using bigstatsr
may give a speed up to PCA calculations.
"svd"
, which uses base::svd
. Warning: this is likely to be very slow for most datasets and exists as a fallback for small datasets where the "irlba"
method would print a warning.
"auto"
(the default) which uses "irlba"
to calculate a truncated SVD, unless you are attempting to extract 50% or more of the singular vectors, in which case "svd"
is used.
ret_nn = TRUE
. If the names exist in more than one of the input data parameters listed above, but are inconsistent, no guarantees are made about which names will be used. Thank you jwijffels for reporting this.
umap_transform
, the learning rate is now down-scaled by a factor of 4, consistent with the Python implementation of UMAP. If you need the old behavior back, use the (newly added) learning_rate
parameter in umap_transform
to set it explicitly. If you used the default value in umap
when creating the model, the correct setting in umap_transform
is learning_rate = 1.0
.
nn_method = "annoy"
and verbose = TRUE
would lead to an error with datasets with fewer than 50 items in them.
umap_transform
(this was incorrectly documented to work).
umap_transform
was wrong in other ways: it has now been corrected to indicate that there should be neighbor data for each item in the test data, but the neighbors and distances should refer to items in training data (i.e. the data used to build the model).
n_neighbors
parameter is now correctly ignored in model generation if pre-calculated nearest neighbor data is provided.
grain_size
didn’t do anything.
This release is mainly to allow for some internal changes to keep compatibility with RcppAnnoy, used for the nearest neighbor calculations.
umap
and tumap
now note that the contents of the model
list are subject to change and not intended to be part of the uwot public API. I recommend not relying on the structure of the model
, especially if your package is intended to appear on CRAN or Bioconductor, as any breakages will delay future releases of uwot to CRAN.
metric = "correlation"
a distance based on the Pearson correlation (https://github.com/jlmelville/uwot/issues/22). Supporting this required a change to the internals of how nearest neighbor data is stored. Backwards compatibility with models generated by previous versions using ret_model = TRUE
should have been preserved.
nn_method
, for umap_transform
: pass a list containing pre-computed nearest neighbor data (identical to that used in the umap
function). You should not pass anything to the X
parameter in this case. This extends the functionality for transforming new points to the case where nearest neighbor data between the original data and new data can be calculated external to uwot
. Thanks to Yuhan Hao for contributing the PR (https://github.com/jlmelville/uwot/issues/63 and https://github.com/jlmelville/uwot/issues/64).
init
, for umap_transform
: provides a variety of options for initializing the output coordinates, analogously to the same parameter in the umap
function (but without as many options currently). This is intended to replace init_weighted
, which should be considered deprecated, but won’t be removed until uwot 1.0 (whenever that is). Instead of init_weighted = TRUE
, use init = "weighted"
; replace init_weighted = FALSE
with init = "average"
. Additionally, you can pass a matrix to init
to act as the initial coordinates.
umap_transform
: previously, setting n_epochs = 0
was ignored: at least one iteration of optimization was applied. Now, n_epochs = 0
is respected, and will return the initialized coordinates without any further optimization.
verbose = TRUE
: the progress bar calculations were taking up a detectable amount of time; this has now been fixed. With very small data sets (< 50 items) the progress bar will no longer appear when building the index.
n_threads
is now NULL
to provide a bit more protection from changing dependencies.
grain_size
parameter has been undeprecated. As the version that deprecated this never made it to CRAN, this is unlikely to have affected many people.
grain_size
parameter is now ignored and remains to avoid breaking backwards compatibility only.
ret_extra
, a vector which can contain any combination of: "model"
(same as ret_model = TRUE
), "nn"
(same as ret_nn = TRUE
) and "fgraph"
(see below).
ret_extra
vector contains "fgraph"
, the returned list will contain an fgraph
item representing the fuzzy simplicial input graph as a sparse N x N matrix. For lvish
, use "P"
instead of "fgraph"
(https://github.com/jlmelville/uwot/issues/47). Note that there is a further sparsifying step where edges with a very low membership are removed if there is no prospect of the edge being sampled during optimization. This is controlled by n_epochs
: the smaller the value, the more sparsifying will occur. If you are only interested in the fuzzy graph and not the embedded coordinates, set n_epochs = 0
.
unload_uwot
, to unload the Annoy nearest neighbor indices in a model. This prevents the model from being used in umap_transform
, but allows for the temporary working directory created by both save_uwot
and load_uwot
to be deleted. Previously, both load_uwot
and save_uwot
were attempting to delete the temporary working directories they used, but would always silently fail because Annoy was still making use of files in those directories.
init = "spca"
, fixed values of a
and b
(rather than allowing them to be calculated through setting min_dist
and spread
) and approx_pow = TRUE
. Using the tumap
method with init = "spca"
is probably the most robust approach.
n_epochs = 0
. This used to behave like (n_epochs = NULL
) and gave a default number of epochs (dependent on the number of vertices in the dataset). Now it more usefully carries out all calculations except optimization, so the returned coordinates are those specified by the init
parameter, so this is an easy way to access e.g. the spectral or PCA initialization coordinates. If you want the input fuzzy graph (ret_extra
vector contains "fgraph"
), this will also prevent the graph having edges with very low membership being removed. You still get the old default epochs behavior by setting n_epochs = NULL
or to a negative value.
save_uwot
and load_uwot
have been updated with a verbose
parameter so it’s easier to see what temporary files are being created.
save_uwot
has a new parameter, unload
, which if set to TRUE
will delete the working directory for you, at the cost of unloading the model, i.e. it can’t be used with umap_transform
until you reload it with load_uwot
.
save_uwot
now returns the saved model with an extra field, mod_dir
, which points to the location of the temporary working directory, so you should now assign the result of calling save_uwot
to the model you saved, e.g. model <- save_uwot(model, "my_model_file")
. This field is intended for use with unload_uwot
.
load_uwot
also returns the model with a mod_dir
item for use with unload_uwot
.
save_uwot
and load_uwot
were not correctly handling relative paths.
load_uwot
in uwot 0.1.4 to work with newer versions of RcppAnnoy (https://github.com/jlmelville/uwot/issues/31) failed in the typical case of a single metric for the nearest neighbor search using all available columns, giving an error message along the lines of: Error: index size <size> is not a multiple of vector size <size>
. This has now been fixed, but required changes to both save_uwot
and load_uwot
, so existing saved models must be regenerated. Thank you to reporter OuNao.
n_threads
caused a crash. This was particularly insidious if running with a system with only one default thread available as the default n_threads
becomes 0.5
. Now n_threads
(and n_sgd_threads
) are rounded to the nearest integer.
ERROR: there is already an InterruptableProgressMonitor instance defined
.
verbose = TRUE
, the a
, b
curve parameters are now logged.
Even with a fix for the bug mentioned above, if the nearest neighbor index file is larger than 2GB in size, Annoy may not be able to read the data back in. This should only occur with very large or high-dimensional datasets. The nearest neighbor search will fail under these conditions. A work-around is to set n_threads = 0
, because the index will not be written to disk and re-loaded under these circumstances, at the cost of a longer search time. Alternatively, set the pca
parameter to reduce the dimensionality or lower n_trees
, both of which will reduce the size of the index on disk. However, either may lower the accuracy of the nearest neighbor results.
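The two work-arounds above might look like the following in practice. This is a minimal sketch, assuming uwot is installed; the random matrix X is a placeholder for a real dataset, and the values pca = 50 and n_trees = 25 are arbitrary illustrations rather than recommendations.

```r
library(uwot)

set.seed(42)
# Placeholder data: 500 items, 100 columns
X <- matrix(rnorm(500 * 100), nrow = 500)

# Work-around 1: n_threads = 0, so (per the note above) the index
# is not written to disk and re-loaded, at the cost of a slower search
emb_single <- umap(X, nn_method = "annoy", n_threads = 0)

# Work-around 2: shrink the index by reducing the input dimensionality
# with PCA and building fewer trees (may lower neighbor accuracy)
emb_small <- umap(X, nn_method = "annoy", pca = 50, n_trees = 25)
```

Both calls return an embedding with one row per item in X.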
Initial CRAN release.
tmpdir
, which allows the user to specify the temporary directory where nearest neighbor indexes will be written during Annoy nearest neighbor search. The default is base::tempdir()
. Only used if n_threads > 1
and nn_method = "annoy"
.
Fixed an issue with lvish
where there was an off-by-one error when calculating input probabilities.
Added a safe-guard to lvish
to prevent the gaussian precision, beta, becoming overly large when the binary search fails during perplexity calibration.
The lvish
perplexity calibration uses the log-sum-exp trick to avoid numeric underflow if beta becomes large.
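To illustrate why the log-sum-exp trick helps, here is a small base R sketch. It is not the code used inside uwot: the function names entropy_naive and entropy_lse are hypothetical, and it assumes the calibration needs the Shannon entropy of probabilities proportional to exp(-beta * d2), where d2 holds squared distances to the neighbors.

```r
# Naive version: every weight underflows to zero once beta is large,
# so the normalization becomes 0 / 0
entropy_naive <- function(d2, beta) {
  w <- exp(-beta * d2)
  p <- w / sum(w)
  -sum(p * log(p))
}

# Log-sum-exp version: subtract the largest exponent before
# exponentiating, so at least one term is exactly exp(0) = 1
entropy_lse <- function(d2, beta) {
  a <- -beta * d2
  m <- max(a)
  log_z <- m + log(sum(exp(a - m)))  # log of the normalization constant
  log_p <- a - log_z
  p <- exp(log_p)
  -sum(p * log_p)
}

d2 <- c(1, 2, 3, 4)
h_naive_small <- entropy_naive(d2, beta = 1)
h_lse_small <- entropy_lse(d2, beta = 1)    # agrees with the naive value
h_naive_large <- entropy_naive(d2, beta = 1e4)  # NaN: all weights underflow
h_lse_large <- entropy_lse(d2, beta = 1e4)      # finite
```

For moderate beta the two functions agree; for large beta only the log-sum-exp version returns a usable number.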
pcg_rand
. If TRUE
(the default), then a random number generator from the PCG family is used during the stochastic optimization phase. The old PRNG, a direct translation of an implementation of the Tausworthe “taus88” PRNG used in the Python version of UMAP, can be obtained by setting pcg_rand = FALSE
. The new PRNG is slower, but is likely superior in its statistical randomness. This change in behavior will break backwards compatibility: you will now get slightly different results even with the same seed.
fast_sgd
. If TRUE
, then the following combination of parameters is set: n_sgd_threads = "auto"
, pcg_rand = FALSE
and approx_pow = TRUE
. These will result in a substantially faster optimization phase, at the cost of being slightly less accurate and results not being exactly repeatable. fast_sgd = FALSE
by default but if you are only interested in visualization, then fast_sgd
gives perfectly good results. For more generic dimensionality reduction and reproducibility, keep fast_sgd = FALSE
.
init_sdev
which specifies how large the standard deviation of each column of the initial coordinates should be. This will scale any input coordinates (including user-provided matrix coordinates). init = "spca"
can now be thought of as an alias of init = "pca", init_sdev = 1e-4
. This may be too aggressive scaling for some datasets. The typical UMAP spectral initializations tend to result in standard deviations of around 2
to 5
, so this might be more appropriate in some cases. If spectral initialization detects multiple components in the affinity graph and falls back to scaled PCA, it uses init_sdev = 1
.
init_sdev
, the init
options sspectral
, slaplacian
and snormlaplacian
have been removed (they weren’t around for very long anyway). You can get the same behavior by e.g. init = "spectral", init_sdev = 1e-4
. init = "spca"
is sticking around because I use it a lot.
init = "spca"
.
<random>
header. This breaks backwards compatibility even if you set pcg_rand = FALSE
.
metric = "cosine"
results were incorrectly using the unmodified Annoy angular distance.
categorical
metric (fixes https://github.com/jlmelville/uwot/issues/20).
n_components
(e.g. approximately 50% faster optimization time with MNIST and n_components = 50
).
pca_center
, which controls whether to center the data before applying PCA. It would be typical to set this to FALSE
if you are applying PCA to binary data (although note you can’t use this setting with metric = "hamming"
).
metric
is "manhattan"
and "cosine"
. It’s still not applied when using "hamming"
(data still needs to be in binary format, not real-valued).
pca
and pca_center
parameter values for a given data block by using a list for the value of the metric, with the column ids/names as an unnamed item and the overriding values as named items, e.g. instead of manhattan = 1:100
, use manhattan = list(1:100, pca_center = FALSE)
to turn off PCA centering for just that block. This functionality exists mainly for the case where you have mixed binary and real-valued data and want to apply PCA to both data types. It’s normal to apply centering to real-valued data but not to binary data.
umap_transform
, where negative sampling was over the size of the test data (should be the training data).
verbose = TRUE
, log the Annoy recall accuracy, which may help tune values of n_trees
and search_k
.
n_sgd_threads
, which controls the number of threads used in the stochastic gradient descent. By default this is now single-threaded and should result in reproducible results when using set.seed
. To get back the old, less consistent, but faster settings, set n_sgd_threads = "auto"
.
alpha
is now learning_rate
.
gamma
is now repulsion_strength
.
laplacian
and normlaplacian
).
init
options: sspectral
, snormlaplacian
and slaplacian
. These are like spectral
, normlaplacian
, laplacian
respectively, but scaled so that each dimension has a standard deviation of 1e-4. This is like the difference between the pca
and spca
options.
pca
: set this to a positive integer to reduce a matrix or data frame to that number of columns using PCA. Only works if metric = "euclidean"
. If you have > 100 columns, this can substantially improve the speed of the nearest neighbor search. t-SNE implementations often set this value to 50.
metric
: instead of specifying a single metric name (e.g. metric = "euclidean"
), you can pass a list, where the name of each item is the metric to use and the value is a vector of the names of the columns to use with that metric, e.g. metric = list("euclidean" = c("A1", "A2"), "cosine" = c("B1", "B2", "B3"))
treats columns A1
and A2
as one block, using the Euclidean distance to find nearest neighbors, whereas B1
, B2
and B3
are treated as a second block, using the cosine distance.
categorical
.
y
may now be a data frame or matrix if multiple target data is available.
target_metric
, to specify the distance metric to use with numerical y
. This has the same capabilities as metric
.
scale = "Z"
to Z-scale each column of input (synonym for scale = TRUE
or scale = "scale"
).
scale = "colrange"
to scale columns in the range (0, 1).
y
, you may pass nearest neighbor data directly, in the same format as that supported by X
-related nearest neighbor data. This may be useful if you don’t want to use Euclidean distances for the y
data, or if you have missing data (and have a way to assign nearest neighbors for those cases, obviously). See the Nearest Neighbor Data Format section for details.
ret_nn
: when TRUE
returns nearest neighbor matrices as a nn
list: indices in item idx
and distances in item dist
. Embedded coordinates are in embedding
. Both ret_nn
and ret_model
can be TRUE
, and should not cause any compatibility issues with supervised embeddings.
nn_method
can now take precomputed nearest neighbor data. Must be a list of two matrices: idx
, containing integer indexes, and dist
containing distances. By no coincidence, this is the format returned by ret_nn
.
n_components = 1
was broken (https://github.com/jlmelville/uwot/issues/6).
init
parameter were being modified, in defiance of basic R pass-by-copy semantics.
metric = "cosine"
is working again for n_threads
greater than 0
(https://github.com/jlmelville/uwot/issues/5).
August 5 2018. You can now use an existing embedding to add new points via umap_transform
. See the example section below.
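A minimal sketch of adding new points to an existing embedding, assuming uwot is installed. The built-in iris data stands in for real training and test sets, and the 100/50 split is an arbitrary choice for illustration.

```r
library(uwot)

set.seed(42)
# Hold out 50 of the 150 iris rows to act as "new" points
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, 1:4]
test <- iris[-train_idx, 1:4]

# ret_model = TRUE keeps what umap_transform needs to embed new data
model <- umap(train, ret_model = TRUE)

# Place the held-out points into the existing embedding;
# the result has one row per test item
test_emb <- umap_transform(test, model)
```

The training coordinates are available as model$embedding, so the two sets of points can be plotted together.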
August 1 2018. Numerical vectors are now supported for supervised dimension reduction.
July 31 2018. (Very) initial support for supervised dimension reduction: categorical data only at the moment. Pass in a factor vector (use NA
for unknown labels) as the y
parameter and edges with bad (or unknown) labels are down-weighted, hopefully leading to better separation of classes. This works remarkably well for the Fashion MNIST dataset.
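A short sketch of the supervised usage described above, assuming uwot is installed. The iris species labels are used as the factor, and 15 of them are blanked out at random purely to illustrate how unknown labels are passed.

```r
library(uwot)

set.seed(42)
labels <- iris$Species
# Mark some labels as unknown by setting them to NA
labels[sample(length(labels), 15)] <- NA

# Pass the factor (with NAs) as y to get a supervised embedding
emb_supervised <- umap(iris[, 1:4], y = labels)
```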
July 22 2018. You can now use the cosine and Manhattan distances with the Annoy nearest neighbor search, via metric = "cosine"
and metric = "manhattan"
, respectively. Hamming distance is not supported because RcppAnnoy doesn’t yet support it.
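For example, the two metrics might be selected as follows. This is a sketch assuming uwot is installed, with iris as placeholder data; nn_method = "annoy" is set explicitly because the note concerns the Annoy search.

```r
library(uwot)

set.seed(42)
X <- iris[, 1:4]

# Nearest neighbors found with the cosine distance
emb_cosine <- umap(X, metric = "cosine", nn_method = "annoy")

# Nearest neighbors found with the Manhattan distance
emb_manhattan <- umap(X, metric = "manhattan", nn_method = "annoy")
```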