Summary
This version introduces collapse_groups()
and friends, as well as summarize_balances()
and ranked_balances()
. It also improves numerical balancing in fold()
which breaks reproducibility.
Changes
Breaking: The numerical balancing (num_col
) in fold()
gets multiple improvements. This breaks reproducibility in some contexts.
Fixes bug with selection of groups to redistribute when extreme_pairing_levels > 1
. The groupings were likely to be fine, but the fix should give better groupings on average.
When possible, it redistributes the smallest and/or largest group if they are 1 standard deviation from the second smallest/largest group to avoid imbalances due to very small/large scores.
Adds use of extreme triplet grouping when too few grouping columns are created with extreme pairing. This can lead to an increase in the number of created fold columns. In some cases, these groupings may be more balanced than with extreme pairing, but on average extreme pairing leads to more balanced groupings. See rearrr::triplet_extremes()
for more on extreme triplet grouping.
Adds argument use_of_triplets
in fold()
to allow using extreme triplet grouping instead of extreme pairing or disabling it completely.
Adds collapse_groups()
for collapsing a set of existing groups into a smaller set of groups. Can balance the new groups by size and by numeric, categorical and ID columns. The more of these you balance at a time, the less balanced each will tend to be. Compare settings by summarizing the balances with summarize_balances()
afterwards. For creating the most balanced groups, enable auto_tune
.
Adds collapse_groups_by_size()
, collapse_groups_by_numeric()
, collapse_groups_by_levels()
, and collapse_groups_by_ids()
. These are wrappers of collapse_groups()
for a simplified interface.
Adds summarize_balances()
for inspecting the balance of numeric, categorical, and ID columns in-and-between groups.
Adds ranked_balances()
for extracting the across-group standard deviations of balances from the output of summarize_balances()
. The standard deviations are a measure of how balanced a split is.
Adds "every"
method to grouping functions. Groups every n
data points together.
Prepares package’s tests for checkmate 2.1.0
.
Breaking: Rewrites large parts of the numerical balancing engine used in fold()
and partition()
. This produces different groups in some cases. Outsources extreme pairing to rearrr::pair_extremes()
. Now uses hierarchical shuffling (rearrr::shuffle_hierarchy()
) in partition()
and some cases of fold()
(relevant when extreme_pairing_levels
> 1). If you need reproducibility, the last version prior to this breaking change can be installed with devtools::install_github("ludvigolsen/groupdata2@v1.4.2")
.
Imports rearrr
for use in numerical balancing.
Minor improvements to vignettes.
Adds summarize_group_cols()
for finding the number of groups per fold column along with statistics about the number of rows per group.
Breaking: Fixes internal sorting of fold columns. This sometimes changes the order of fold columns, compared to the previous version.
Adds tidyr
as a required dependency. Previously, it was suggested.
Breaking: In fold()
, the k
argument can now be a multi-element vector with one k
(number of folds) per fold column. This functionality required a minor rewrite, why you might see interchanged fold column names in comparison to the previous versions.
Bug fix: In fold()
and partition()
, when specifying multiple cat_col
columns and num_col
in the same call, it would fail. This now works.
data.frames
(meaning that they are applied group-wise): fold()
, partition()
, group()
, group_factor()
, splt()
, balance()
, upsample()
, downsample()
, differs_from_previous()
, and find_missing_starts()
. A message is generated once per session, when the input is grouped, to help users understand why their code is breaking.checkmate
compatibility.
Small speed up of n_dist
grouping method.
Adds Zenodo DOI for easier citation.
Adds lifecycle
badges to function documentation.
Adds argument handle_na
to differs_from_previous()
and find_starts()
.
Bug fix: In grouping functions with method l_starts
and n = "auto"
, NA
s are now replaced by a unique value before finding group starts. E.g. c(1,1,1,2,2,NA,NA,4,4)
yields 4 groups.
More explicit: the data
argument in fold()
and participant
takes a data frame, not a vector.
Possibly breaking change: Adds checkmate
input checks. Improves error messages but also restricts behavior.
Adds xpectr
as suggested package. Doubles number of unit tests.
Adds all_groups_identical()
for testing if two grouping factors contain the same groups, looking only at the group members, allowing for different group names / identifiers.
Unit tests were made compatible with R versions lower than 3.6.
Adds badges to README, including travis-ci status, AppVeyor status, Codecov, min. required R version, CRAN version and monthly CRAN downloads. Note: Zenodo badge will be added post release.
Bug fix: fold()
ungroups dataset before removing existing fold columns.
Unit tests are skipped on R versions lower than 3.6.
New main function: balance()
used for up- and downsampling of data to balance sample size within categories and IDs. Thanks for the request from @jjesusfilho (#3).
New wrapper function: upsample()
wraps balance()
with size="max"
.
New wrapper function: downsample()
wraps balance()
with size="min"
.
Adds parameter num_col
to fold()
and partition()
for balancing on a numeric column.
Adds parameter id_aggregation_fn
to fold()
and partition()
. Used when balancing on both id_col
and num_col
.
Adds helper tool differs_from_previous()
. Finds values in a vector that differs from the previous value by some threshold. Similar to find_starts()
.
Adds parameter num_fold_cols
to fold()
. Useful for creating multiple fold columns for repeated cross-validation.
Adds parameter unique_fold_cols_only
to fold()
. Whether to only include unique fold columns or not.
Adds parameter max_iters
to fold()
. How many times to attempt creating unique fold columns. Note that it is possible to get fewer fold columns than specified in num_fold_cols
.
Adds parameter parallel
to fold()
. When creating multiple unique fold columns, we can run the column comparisons in parallel. Requires registered parallel backend.
Adds parameter handle_existing_fold_cols
to fold()
. When calling fold()
on a data frame that already contains columns with names starting with ".folds"
, we can either keep them and add more, or replace them.
Fixed behavior in fold()
when k is a percentage (between 0-1). It is now interpreted as the approximate size of each fold and used to calculate the number of folds. E.g. k=0.2
will lead to 5 folds.
New main function: partition()
- used for creating balanced partitions by partition sizes.
New method category: l_
methods - n is passed as a list.
New method: l_sizes
- Uses list of group sizes to create grouping factor. Can be used for partitioning (e.g. n = c(0.2, 0.3)
returns 3 groups with 0.2 (20%), 0.3 (30%) and the exceeding 0.5 (50%) of the data points).
New method: l_starts
- Uses list of start positions to create groups. Define which values from a vector to start a new group at. Skip to later appearances of a value. Use n = ‘auto’ to automatically find starts using find_starts()
.
New helper tool: find_starts()
- Finds start positions in a vector. I.e. values that differ from the previous value. Get the values or indices of the values. Output can be used as n
in l_starts
method.
New helper tool: find_missing_starts()
- Returns the start positions that would be recursively removed when using the l_starts
method with remove_missing_starts set to TRUE.
Added argument remove_missing_starts
to grouping functions. Recursively remove the starting positions not found with l_starts
method.
New method: primes
- similar to staircase
but with primes as steps (e.g. group sizes 2,3,5,7..).
New remainder tool: %primes%
- similar to %staircase%
but for the new primes method.
Submitted package to CRAN.
Main functions and tools of this version is group_factor()
, group()
, splt()
, fold()
, and %staircase%
.