Known issues: https://github.com/PredictiveEcology/reproducible/issues
lwgeom now a suggested package
terra class objects can now be correctly saved and recovered by Cache
fixErrors can now distinguish testValidity = NA, meaning don’t fix errors; testValidity = FALSE, meaning run buffering, which fixes many errors, but don’t test whether there are any invalid polygons first (maybe slow); and testValidity = TRUE, meaning test for validity, then run buffer if some polygons are invalid.
reproducible.useNewDigestAlgorithm = 2 which will have user visible changes. To keep old behaviour, set options(reproducible.useNewDigestAlgorithm = 1)
options(reproducible.showSimilar) is set. It is now more compact e.g., 3 lines instead of 5.
sf methods to studyAreaName
Cache returns; i.e., a 2nd time through a Cache would return a cached copy, when some of the arguments were different. It occurred when the differences were in unnamed arguments only.
reproducible will be slowly changing the defaults for vector GIS datasets from the sp package to the sf package. There is a large user-visible change that will come (in the next release), which will cause prepInputs to read .shp files with sf::st_read instead of raster::shapefile, as it is much faster. To change now, set options("reproducible.shapefileRead" = "sf::st_read")
fun in prepInputs for shapefiles (.shp) is now sf::st_read if the system has sf installed. This can be overridden with options("reproducible.shapefileRead" = "raster::shapefile"), and this is indicated with a message at the moment this is occurring, as it will cause different behaviour.
quick argument in Cache can now be a character vector, allowing individual character arguments to be digested as character vectors and others to be digested as files located at the specified path as represented by the character vector.
objSize previously included objects in namespaces, baseenv and emptyenv, so it was generally too large. Now uses the same criteria as pryr::object_size
unzip missing (thanks to @CeresBarros #202)
7z.exe on Windows if the object is larger than 2GB, if can’t find unzip.
fun argument in prepInputs and family can now be a quoted expression.
archive argument in prepInputs can now be NA which means to treat the file downloaded not as an archive, even if it has a .zip file extension
prepInputs
postProcess especially for very large objects (>5GB tested). Previously, it was running many fixErrors calls; now only calls fixErrors on fail of the proximate call (e.g., st_crop or whatever)
retry now has a new argument exprBetween to allow for doing something after the fail (for example, if an operation fails, e.g., st_crop, then run fixErrors, then return back to st_crop for the retry)
Cache now has MUCH better nested levels detection, with messaging… and control of how deep the Caching goes seems good, via useCache = 2 will only Cache 2 levels in…
archive argument in prepInputs family can now be NA … meaning do not try to unzip even if it is a .zip file or other standard archive extension
gdb.zip files (e.g., a file with a .zip extension, but that should not be opened with an unzip-type program) can now be opened with prepInputs(url = "whateverUrl", archive = NA, fun = "sf::st_read")
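The retry entry above describes running a repair step between a failed attempt and the next try. A minimal base-R sketch of that pattern follows. It is not the package's retry() implementation: the helper name retry_with_fix, its arguments, and the toy "geometry" state are all assumptions made for illustration.

```r
# Hypothetical helper (not reproducible::retry): evaluate `expr`; on failure,
# evaluate `expr_between` (the repair step) and try again, up to `retries` times.
retry_with_fix <- function(expr, expr_between = NULL, retries = 3,
                           envir = parent.frame()) {
  for (attempt in seq_len(retries)) {
    result <- try(eval(expr, envir = envir), silent = TRUE)
    if (!inherits(result, "try-error")) {
      return(result)
    }
    # The attempt failed: run the repair step, if any, before the next try.
    if (!is.null(expr_between)) {
      eval(expr_between, envir = envir)
    }
  }
  stop("all ", retries, " attempts failed: ",
       conditionMessage(attr(result, "condition")))
}

# Toy stand-in for "the crop fails until the geometry has been fixed".
state <- new.env()
state$valid <- FALSE
out <- retry_with_fix(
  quote(if (state$valid) "cropped" else stop("invalid geometry")),
  expr_between = quote(state$valid <- TRUE)
)
```

Here the first attempt fails, the repair expression marks the toy object as valid, and the second attempt succeeds. Passing quoted expressions keeps both steps lazy, so the repair only runs when an attempt has actually failed.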
fun argument in prepInputs can now be a quoted function call.
preProcess now does a better job with large archives that can’t be correctly handled with the default zip and unzip with R, by trying system2 calls to possible 7z.exe or other options on Linux-alikes.
Copy generic no longer has fileBackedDir argument. It is now passed through with the ... argument. This was creating a bug with some cases where fileBackedDir was not being correctly executed.
fixErrors() now better handles sf polygons with mixed geometries that include points.
Cache
writeOutputs.Raster attempted to change datatype of Raster class objects using the setReplacement dataType<-, without subsequently writing to disk via writeRaster. This created bad values in the Raster* object. This now performs a writeRaster if there is a datatype passed to writeOutputs e.g., through prepInputs or postProcess.
updateSlotFilename has many more tests.
prepInputs(..., fun = NA) now is the correct specification for “do not load object into R”. This essentially replicates preProcess with same arguments.
Copy did not correctly copy RasterStacks when some of the RasterLayer objects were in memory, some on disk; raster::fromDisk returned FALSE in those cases, so Copy didn’t occur on the file-backed layer files. Using Filenames instead to determine if there are any files that need copying.
options("reproducible.useNewDigestAlgorithm" = 2)
options("reproducible.polygonShortcut" = FALSE) as there were still too many edge cases that were not covered.
RasterStack objects with a single file (thus acting like a RasterBrick) are now handled correctly by Cache and prepInputs families, especially with new options("reproducible.useNewDigestAlgorithm" = 2), though in tests, it worked with default also
RSQLite now uses a RNG during dbAppend; this affected 2 tests (#185).
rgeos
paddedFloatToChar to reproducible from SpaDES.core.
%>% code from magrittr to allow the cached alternative, %C%. With new magrittr pipe now in compiled source code, more of the legacy code is required here.
reproducible.messageColourPrepInputs for all prepInputs functions; reproducible.messageColourCache for all Cache functions; and reproducible.messageColourQuestion for questions that require user input. Defaults are cyan, blue and green respectively. These are user-visible colour changes.
Cache cases where a file.link is used instead of saving.
options(reproducible.verbose = 0) will turn off almost all messaging.
postProcess and family now have filename2 = NULL as the default, so not saved to disk. This is a change.
verbose is now an argument throughout, whose default is getOption("reproducible.verbose"), which is set by default to 1. Thus, individual function calls can be more or less verbose, or the whole session via option.
RasterStack objects were not correctly saved to disk under some conditions in postProcess - fixed
postProcess now uses a simpler single call to gdalwarp, if available, for RasterLayer class to accomplish cropInputs, projectInputs, maskInputs, and writeOutputs all at once. This should be faster, simpler and, perhaps, more stable. It will only be invoked if the RasterLayer is too large to fit into RAM. To force it to be used the user must set useGDAL = "force" in prepInputs or postProcess or globally with options("reproducible.useGDAL" = "force")
postProcess when using the new gdalwarp, has better persistence of colour table, and NA values as these are kept with better reliability
Cache now works as expected (e.g., with parallel processing, it will avoid collisions) with SQLite thanks to suggestion here: https://stackoverflow.com/a/44445010
Raster class objects to account for more of the metadata (including the colortable). This will change the digest value of all Raster layers, causing re-run of Cache
Require, pkgDep, trimVersionNumber, normPath, checkPath that were moved to Require package. For backwards compatibility, these are imported and reexported
file.move used to rename/copy files across disks (a situation where file.rename would fail)
DBI type functions now have default cachePath of getOption("reproducible.cachePath")
Cache(prepInputs, ... on a file-backed Raster* class object now gives the non-Cache repository folder as the filename(returnRaster). Previously, the return object would contain the cache repository as the folder for the file-backed Raster*
backports, memoise, quickPlot, R.utils, remotes, tools, and versions; moved to Suggests: fastdigest, gdalUtils, googledrive, httr, qs, rgdal, sf, testthat; added: Require. Now there are 12 non-base packages listed in Imports. This is down from 31 prior to Ver 1.0.0.
file.link not file.symlink for saveToCache. This would have resulted in C Stack overflow errors due to missing original file in the file.symlink
unzip when extracting large (>= 4GB) files (#145, @tati-micheletti)
projectInputs when converting to longlat projections, setMinMax for gdalwarp results
Filenames now consistently returns a character vector (#149)
file.link does not succeed.
raster) are updated.
options('reproducible.cacheSaveFormat') on the fly; cache will look for the file by cacheId and write it using options('reproducible.cacheSaveFormat'). If it is in another format, Cache will load it and resave it with the new format. Experimental still.
Copy methods for refClass objects, SQLite and moved environment method into ANY as it would be dispatched for unknown classes that inherit from environment, of which there are many and this should be intercepted
Require can now handle minimum version numbers, e.g., Require("bit (>=1.1-15.2)"); this can be worked into downstream tools. Still experimental.
file.link or file.symlink if an existing Cache entry with identical output exists and it is large (currently 1e6 bytes); this will save disk space.
elapsedTimeDigest, elapsedTimeFirstRun, and elapsedTimeLoad, respectively.
preProcess). Includes 2 new functions, tempdir2 and tempfile2 for use with reproducible package
reproducible.tempPath, which is used for the new control of temporary files. Defaults to file.path(tempdir(), "reproducible"). This feature was requested to help manage large amounts of temporary objects that were not being easily and automatically cleaned
drv and conn; user may need to manually call movedCache if cache is not responding correctly. File-backed Rasters are automatically updated with new paths.
Raster* will have their filenames updated on the fly during a Cache recovery. User doesn’t need to do anything.
postProcess now will perform simple tests and skip cropInputs and projectInputs with a message if it can, rather than using Cache to “skip”. This should speed up postProcess in many cases.
Cache has changed. Now, cacheId is shown in all cases, making it easier to identify specific items in the cache.
Copy only creates a temporary directory for filebacked rasters; previously any Copy command was creating a temporary directory, regardless of whether it was needed
cropInputs.spatialObjects had a bug when object was a large non-Raster class.
cropInputs may have failed due to “self intersection” error when x was a SpatialPolygons* object; now catches error, runs fixErrors and retries crop. Great reprex by @tati-micheletti. Fixed in commit 89e652ef111af7de91a17a613c66312c1b848847.
Filenames bugfix related to RasterBrick
prepInputs does a better job of keeping all temporary files in a temporary folder; and cleans up after itself better.
prepInputs now will not show message that it is loading object into R if fun = NULL (#135).
options("reproducible.useDBI" = FALSE)
DBI package directly, without archivist. This has much improved speed.
options("reproducible.cacheSaveFormat"). This can be either rds (default) or qs. All cached objects will be saved with this format. Previously it was rda.
qs::qsave. In many cases, this has much improved speed and file sizes compared to rds; however, testing across a wide range of conditions will occur before it becomes the default.
... because Cache is now much faster, the default is to turn memoising off, via options("reproducible.useMemoise" = FALSE). In cases of large objects, memoising should still be faster, so user can still activate it, setting the option to TRUE.
postProcess arg useGDAL can now take "force" as the default behaviour is to not use GDAL if the problem can fit into RAM and sf or raster tools will be faster than GDAL tools
useCloud argument in Cache and family has slightly modified functionality (see ?Cache new section useCloud) and now has more tests including edge cases, such as useCloud = TRUE, useCache = 'overwrite'. The cloud version now will also follow the "overwrite" command.
archivist; moved to Suggests.
bitops, dplyr, fasterize, flock, git2r, lubridate, RcppArmadillo, RCurl and tidyselect. Some of these went to Suggests.
postProcess calls that use GDAL made more robust (including #93).
dplyr as a direct dependency. It is still an indirect dependency through DiagrammeR
reproducible.showSimilarDepth allows for a deeper assessment of nested lists for differences between the nearest cached object and the present object. This greater depth may allow more fine tuned understanding of why an object is not correctly caching
options("reproducible.futurePlan") to something other than FALSE, then it will show download progress if the file is “large”.
googledrive v 1.0.0 (#119)
pkgDep2, a new convenience function to get the dependencies of the “first order” dependencies.
useCache, used in many functions (incl Cache, postProcess) can now be numeric, a qualitative indicator of “how deep” nested Cache calls should set useCache = TRUE – implemented as 1 or 2 in postProcess currently. See ?Cache
pkgDep was becoming unreliable for unknown reasons. It has been reimplemented, much faster, without memoising. The speed gains should be immediately noticeable (6 seconds to 0.1 seconds for pkgDep("reproducible"))
retry to use exponential backoff when attempting to access online resources (#121)
useCloud and cloudFolderID. This is a new approach to cloud caching. It has been tested with file backed RasterLayer, RasterStack and RasterBrick and all normal R objects. It will not work for any other class of disk-backed files, e.g., ff or bigmatrix, nor is it likely to work for R6 class objects.
Cache, i.e., useCache and cloudFolderID
downloadData from Google Drive now protects against HTTP2 error by capturing error and retrying. This is a curl issue for interrupted connections.
rcnst errors on R-devel, tested using devtools::check(env_vars = list("R_COMPILE_PKGS"=1, "R_JIT_STRATEGY"=4, "R_CHECK_CONSTANTS"=5))
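The entries above mention retrying with exponential backoff when accessing online resources. The base-R sketch below shows the general technique only. It is not the package's retry() code: the function name retry_backoff, the doubling schedule, the default delays, and the injectable sleep argument are all assumptions for illustration.

```r
# Hypothetical sketch of exponential backoff (not reproducible::retry).
# `fun` is called until it succeeds; the wait doubles after each failure.
retry_backoff <- function(fun, retries = 5, base_delay = 0.1,
                          sleep = Sys.sleep) {
  for (attempt in seq_len(retries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    if (attempt < retries) {
      sleep(base_delay * 2^(attempt - 1))
    }
  }
  stop(result)  # re-signal the last error once all attempts are used up
}

# Toy flaky function that fails three times, then succeeds.
calls <- 0
delays <- numeric(0)
flaky <- function() {
  calls <<- calls + 1
  if (calls < 4) stop("connection reset")
  "downloaded"
}
# Record the requested delays instead of actually sleeping.
res <- retry_backoff(flaky, sleep = function(s) delays <<- c(delays, s))
```

Making the sleep function an argument is a design choice for the sketch: it lets the schedule be inspected without waiting, while the default still calls Sys.sleep.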
cacheRepo: getArtifact, getCacheId, getUserTags
retry, a new function, wraps try with an explicit attempt to retry the same code upon error. Useful for flaky functions, such as googledrive::drive_download which sometimes fails due to curl HTTP2 error.
Rcpp functionality as the functions were no longer faster than their R base alternatives.
prepInputs was not correctly passing useCache
cropInputs was reprojecting extent of y as a time saving approach, but this was incorrect if studyArea is a SpatialPolygon that is not close to filling the extent. It now reprojects studyArea directly which will be slower, but correct. (#93)
CHECKSUMS.txt should now be ordered consistently across operating systems (note: base::order will not succeed in doing this –> now using .orderDotsUnderscoreFirst)
cloudSyncCache has a new argument: cacheIds. Now user can control entries by cacheId, so can delete/upload individual objects by cacheId
postProcess family for sf class objects
cloudCache bugfixes for more cases
tibble from Imports as it’s no longer being used
%>% pipe that was long ago deprecated. User should use %C% if they want a pipe that is Cache-aware. See examples.
options descriptions now in reproducible, see ?reproducibleOptions
cacheRepo and options("reproducible.cachePath") can take a vector of paths. Similar to how .libPaths() works for libraries, Cache will search first in the first entry in the cacheRepo, then the second etc. until it finds an entry. It will only write to the first entry.
options("reproducible.useCache" = "devMode"). The point of this mode is to facilitate using the Cache when functions and datasets are continually in flux, and old Cache entries are likely stale very often. In devMode, the cache mechanism will work as normal if the Cache call is the first time for a function OR if it successfully finds a copy in the cache based on the normal Cache mechanism. It differs from the normal Cache if the Cache call does not find a copy in the cacheRepo, but it does find an entry that matches based on userTags. In this case, it will delete the old entry in the cacheRepo (identified based on matching userTags), then continue with normal Cache. For this to work correctly, userTags must be unique for each function call. This should be used with caution as it is still experimental.
options("reproducible.useNewDigestAlgorithm" = FALSE). There is a message of this change on package load.
cloud* functions, especially cloudCache which allows sharing of Cache among collaborators. Currently only works with googledrive
assessDataType to consolidate assessDataTypeGDAL and assessDataType into single function (#71, @ianmseddy)
cc: new function – a shortcut for some commonly used options for clearCache()
prepInputs to handle .rar archives, on systems with correct binaries to deal with them (#86, @tati-micheletti)
fastdigest::fastdigest as it does not return the identical hash across operating systems
prepInputs on GIS objects that don’t use raster::raster to load object were skipping postProcess. Fixed.
prepInputs would cause virtually all entries in CHECKSUMS.txt to be deleted. 2 cases where this happened were identified and corrected.
data.table class objects would give an error sometimes due to use of attr(DT). Internally, attributes are now added with data.table::setattr to deal with this.
gdalwarp from postProcess now correctly matches extent (#73, @tati-micheletti)
preProcess (#92, @tati-micheletti)
remotes to Imports and removed devtools
New value possible for options(reproducible.useCache = 'overwrite'), which allows use of Cache in cases where the function call has an entry in the cacheRepo: it will purge that entry and add the output of the current call instead.
New option reproducible.inputPaths (default NULL) and reproducible.inputPathsRecursive (default FALSE), which will be used in prepInputs as possible directory sources (searched recursively or not) for files being downloaded/extracted/prepared. This allows the use of local copies of files in (an)other location(s) instead of downloading them. If the local location does not have the required files, it will proceed to download, so there is little cost in setting this option. If files do exist on the local system, the function will attempt to use a hardlink before making a copy.
dlGoogle() now sets options(httr_oob_default = TRUE) if using Rstudio Server.
Files in CHECKSUMS now sorted alphabetically.
Checksums can now have a CHECKSUMS.txt file located in a different place than the destinationPath
Attempt to select raster resampling method based on raster type if no method supplied (#63, @ianmseddy)
projectInputs
new function assessDataTypeGDAL, used in postProcess, to identify smallest datatype for large Raster* objects passed to GDAL system call
Raster objects, enact gdalwarp system call if raster::canProcessInMemory(x,4) = FALSE for faster and memory-safe processing
Raster objects, including factor rasters
extractFromArchive for large (>2GB) zip files. In the R help manual, unzip fails for zip files >2GB. This uses a system call if the zip file is too large and fails using base::unzip.
raster::getData issues.
Cache() when deeply nested, due to grep(sys.calls(), ...) that would take long and hang.
preProcess(url = NULL) (#65, @tati-micheletti)
clearCache (#67), especially for large Raster objects that are stored as binary R files (i.e., .rda)
raster package changes in development version of raster package
raster::projectRaster.
robustDigest now does not include Cache-added attributes
preProcess() (#68, @tati-micheletti)
future to Suggests.
future for Cache saving to SQLite database, via options("reproducible.futurePlan"), if the future package is installed. This is FALSE by default.
do.call function is Cached, previously, it would be labelled in the database as do.call. Now it attempts to extract the actual function being called by the do.call. Messaging is similarly changed.
reproducible.ask, logical, indicating whether clearCache should ask for deletions when in an interactive session
prepInputs, preProcess and downloadFile now have dlFun, to pass a custom function for downloading (e.g., "raster::getData")
prepInputs will automatically use readRDS if the file is a .rds.
prepInputs will return a list if fun = "base::load", with a message; can still pass an envir to obtain standard behaviour of base::load.
clearCache - new argument ask.
assessDataType, used in postProcess, to identify smallest datatype for Raster* objects, if user does not pass an explicit datatype in prepInputs or postProcess (#39, @CeresBarros).
git2r update (@stewid, #36).
.prepareRasterBackedFile – now will postpend an incremented numeric to a cached copy of a file-backed Raster object, if it already exists. This mirrors the behaviour of the .rda file. Previously, if two Cache events returned the same file name backing a Raster object, even if the content was different, it would allow the same file name. If either cached object was deleted, therefore, it would cause the other one to break as its file-backing would be missing.
spades.XXX and should have been reproducible.XXX.
copyFile did not perform correctly under all cases; now better handling of these cases, often sending to file.copy (slower, but more reliable)
extractFromArchive needed a new Checksum function call under some circumstances
extractFromArchive – when dealing with nested zips, not all args were passed in recursively (#37, @CeresBarros)
prepInputs – arguments that were same as Cache were not being correctly passed internally to Cache, and if wrapped in Cache, it was not passed into prepInputs. Fixed.
.prepareFileBackedRaster was failing in some cases (specifically if it was inside a do.call) (#40, @CeresBarros).
Cache was failing under some cases of Cache(do.call, ...). Fixed.
Cache – when arguments to Cache were the same as the arguments in FUN, Cache would “take” them. Now, they are correctly passed to the FUN.
preProcess – writing to checksums may have produced a warning if CHECKSUMS.txt was not present. Now it does not.
new functions:
convertPaths and convertRasterPaths to assist with renaming moved files.
prepInputs – new features
alsoExtract now has more options (NULL, NA, "similar") and defaults to extracting all files in an archive (NULL).
postProcess altogether if no studyArea or rasterToMatch. Previously, this would invoke Cache even if there was nothing to postProcess.
copyFile correctly handles directory names containing spaces.
makeMemoisable fixed to handle additional edge cases.
new functions:
prepInputs to aid in data downloading and preparation problems, solved in a reproducible, Cache-aware way.
postProcess which is a wrapper for sequences of several other new functions (cropInputs, fixErrors, projectInputs, maskInputs, writeOutputs, and determineFilename)
downloadFile can handle Google Drive and ftp/http(s) files
zipCache and mergeCache
compareNA does comparisons with NA as a possible value e.g., compareNA(c(1,NA), c(2, NA)) returns FALSE, TRUE
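The compareNA entry above describes element-wise comparison in which NA is treated as an ordinary value. A minimal base-R sketch of that behaviour follows. It is not the package's implementation, and the name compare_na_sketch is hypothetical.

```r
# Hypothetical re-implementation of NA-aware equality, for illustration only.
# Two elements match if they are equal, or if both are NA.
compare_na_sketch <- function(a, b) {
  same <- a == b                 # NA wherever either side is NA
  same[is.na(same)] <- FALSE     # treat "unknown" as "not equal" for now
  same | (is.na(a) & is.na(b))   # ...except where both sides are NA
}

res <- compare_na_sketch(c(1, NA), c(2, NA))
```

For the changelog's example inputs, the first pair (1 vs 2) differs and the second pair (NA vs NA) is treated as equal, whereas plain `==` would return NA for the second element.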
Cache – new features:
showSimilar, verbose which can help with debugging
useCache which allows turning caching on and off at a high level (e.g., options("useCache"))
cacheId which allows user to hard code a result from a Cache
digestPathContent –> quick, compareRasterFileLength –> length
Cache function calls, unless explicitly set on the inner functions
userTags added automatically to cache entries so much more powerful searching via showCache(userTags="something")
checksums now returns a data.table with the same columns whether write = TRUE or write = FALSE.
clearCache and showCache now give messages and require user intervention if a request to clearCache would delete large quantities of data
memoise::memoise now used on 3rd run through an identical Cache call, dramatically speeding up in most cases
new options: reproducible.cachePath, reproducible.quick, reproducible.useMemoise, reproducible.useCache, reproducible.useragent, reproducible.verbose
asPath has a new argument indicating how deep should the path be considered when included in caching (only relevant when quick = TRUE)
New vignette on using Cache
Cache is parallel-safe, meaning there are tryCatch around every attempt at writing to SQLite database so it can be used safely on multi-threaded machines
bug fixes, unit tests, more imports for packages e.g., stats
updates for R 3.6.0 compact storage of sequence vectors
experimental pipes (%>%, %C%) and assign %<%
several performance enhancements
mergeCache: a new function to merge two different Cache repositories
memoise::memoise is now used on loadFromLocalRepo, meaning that the 3rd time Cache() is run on the same arguments (and the 2nd time in a session), the returned Cache will be from a RAM object via memoise. To stop this behaviour and use only disk-based Caching, set options(reproducible.useMemoise = FALSE).
Cache assign – %<% can be used instead of normal assign, equivalent to lhs <- Cache(rhs).
new option: reproducible.verbose, set to FALSE by default, but if set to TRUE may help understand caching behaviour, especially for complex highly nested code.
all options now described in ?reproducible.
All Cache arguments other than FUN and … will now propagate to internal, nested Cache calls, if they are not specified explicitly in each of the inner Cache calls.
Cached pipe operator %C% – use to begin a pipe sequence, e.g., Cache() %C% ...
Cache arg sideEffect can now be a path
Cache arg digestPathContent default changed from FALSE (was for speed) to TRUE (for content accuracy)
New function, searchFull, which shows the full search path, known alternatively as “scope”, or “binding environments”. It is where R will search for a function when requested by a user.
Uses memoise::memoise for several functions (loadFromLocalRepo, pkgDep, package_dependencies, available.packages) for speed – will increase memory use in exchange for speed.
New Require function
require on those 20 packages, but require does not check for dependencies and deal with them if missing: it just errors. This speed should be fast enough for many purposes.
remove dplyr from Imports
Add RCurl to Imports
change name of digestRaster to .digestRaster
digestRaster affecting in-memory rasters
rgdal to Suggests
SpaDES package.