Former `kgram_freqs` class is now called `sbo_kgram_freqs`. The constructor `kgram_freqs()` is still available as an alias to `sbo_kgram_freqs()`.
Former `sbo_preds` class is now replaced by two classes:
- `sbo_predictor`: for interactive use
- `sbo_predtable`: for storing text predictors out of memory (e.g. `save()` to file)
`sbo_predictor` and `sbo_predtable` objects are obtained through the homonymous constructors, which are now S3 generics accepting character input, as well as `sbo_kgram_freqs` and `sbo_predtable` (for the `sbo_predictor()` constructor) class objects. In particular, these allow training a text predictor directly, without storing the intermediate `sbo_dictionary` and `kgram_freqs` objects.
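The two-class workflow can be sketched as follows. This is a minimal example, assuming the `sbo` package is installed and using its bundled `twitter_train` and `twitter_dict` example data; the argument values (`N = 3`, the `EOS` characters) and the temporary file are illustrative choices, not recommendations.

```r
library(sbo)

# Train an interactive predictor directly from a character vector,
# without keeping the intermediate dictionary or k-gram frequency objects.
p <- sbo_predictor(twitter_train,
                   N = 3,                    # 3-gram model (illustrative)
                   dict = twitter_dict,      # example dictionary shipped with sbo
                   .preprocess = preprocess, # default preprocessing utility
                   EOS = ".?!:;")            # end-of-sentence characters
predict(p, "i love")

# For storage, build an sbo_predtable instead and save it to file.
t <- sbo_predtable(twitter_train,
                   N = 3,
                   dict = twitter_dict,
                   .preprocess = preprocess,
                   EOS = ".?!:;")
path <- tempfile(fileext = ".rda")  # hypothetical location, for illustration only
save(t, file = path)

# Later (e.g. in a new session): reload the table and rebuild the predictor.
load(path)
p2 <- sbo_predictor(t)
predict(p2, "i love")
```

Keeping the stored object (`sbo_predtable`) separate from the in-memory one (`sbo_predictor`) means the table can be saved once and turned back into a predictor whenever it is needed.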
The behaviour of the `dict` argument in `kgram_freqs()` and `kgram_freqs_fast()` has changed: it now accepts either an `sbo_dictionary`, a `character` or a `formula` (see also ‘New features’).
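The three accepted forms of `dict` might look like this in practice. This is a sketch assuming the `sbo` package is installed; the formula spellings `max_size ~ V` and `target ~ f` follow the package documentation for `kgram_freqs()`, while the corpus subset, the word list and the numeric values are illustrative assumptions.

```r
library(sbo)

corpus <- head(twitter_train, 1000)  # small subset, only to keep the example fast

# 1. An existing sbo_dictionary object (here the one shipped with sbo).
f_dict <- kgram_freqs(corpus, N = 2, dict = twitter_dict,
                      .preprocess = preprocess, EOS = ".?!:;")

# 2. A plain character vector of words (illustrative word list).
f_chr <- kgram_freqs(corpus, N = 2, dict = c("i", "love", "you"),
                     .preprocess = preprocess, EOS = ".?!:;")

# 3. A formula: build the dictionary from the training corpus itself,
#    either with a maximum size or with a target coverage fraction.
f_size   <- kgram_freqs(corpus, N = 2, dict = max_size ~ 500,
                        .preprocess = preprocess, EOS = ".?!:;")
f_target <- kgram_freqs(corpus, N = 2, dict = target ~ 0.75,
                        .preprocess = preprocess, EOS = ".?!:;")
```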
The `sbo_predictor` implementation dramatically improves the speed of `predict()` (by a factor of 10). A single call to `predict()` now allocates a few kB of RAM, whereas it previously allocated a few MB (cf. issue #10).
Metadata of `sbo_kgram_freqs` and `sbo_pred*` objects is now stored via attributes (#11).
- `sbo_dictionary`.
- `word_coverage` with generic constructors and a preconfigured `plot()` method.
- `kgram_freqs()` and `sbo_pred*()` can now also be built with a fixed target coverage fraction of the training corpus.
- `prune()` generic function for reducing the k-gram order of `kgram_freqs` and `sbo_predtable` objects.
- `summary()` methods for `sbo_kgram_freqs` and `sbo_pred*` objects; correspondingly, the output of `print()` has been simplified considerably (#5).
- `sbo_kgram_freqs`, `sbo_dictionary`, `sbo_predictor` and `sbo_predtable` can be constructed either through the homonymous constructors, or through the aliases `kgram_freqs()`, `dictionary()`, `predictor()`, `predtable()`.

`sbo` now has `SystemRequirements: C++11`, for correct integration with C++11 code (in particular `std::unordered_map`).
Model training (with `sbo_predictor()`) is now considerably faster, due to optimizations in the algorithm for building Stupid Back-Off prediction tables.
The Stupid Back-Off algorithm is now thoroughly tested, and small inconsistencies between the `predict.kgram_freqs()` and `predict.sbo_predictor()` methods have been fixed, including:
- Proper handling of unknown words
- Consistent handling of ties in prediction probabilities.
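For readers unfamiliar with Stupid Back-Off, the scoring rule can be illustrated in a few lines of base R. This is a simplified sketch and not the package's C++ implementation: the toy corpus, the helper names (`count_kgrams`, `get_count`, `sbo_score`) and the back-off penalty `lambda = 0.4` are all assumptions made for illustration.

```r
# Toy corpus: one sentence per element, already lower-cased.
sentences <- c("i love you", "i love r", "you love r")
tokens <- strsplit(sentences, " ", fixed = TRUE)

# Count every k-gram of order k; returns a named integer vector.
count_kgrams <- function(tokens, k) {
  grams <- unlist(lapply(tokens, function(x) {
    if (length(x) < k) return(character(0))
    vapply(seq_len(length(x) - k + 1L),
           function(i) paste(x[i:(i + k - 1L)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}
counts <- lapply(1:3, function(k) count_kgrams(tokens, k))

# Look up the count of a k-gram given as a character vector of words.
# Unseen k-grams (including unknown words) simply have count zero.
get_count <- function(gram) {
  key <- paste(gram, collapse = " ")
  tab <- counts[[length(gram)]]
  if (key %in% names(tab)) tab[[key]] else 0L
}

# Stupid Back-Off score of `word` following `context`:
# use the relative frequency at the highest order where the k-gram was seen,
# otherwise drop the oldest context word and multiply by `lambda`.
sbo_score <- function(word, context, lambda = 0.4) {
  if (length(context) == 0) {
    return(get_count(word) / sum(counts[[1]]))  # unigram relative frequency
  }
  num <- get_count(c(context, word))
  if (num > 0) {
    return(num / get_count(context))
  }
  lambda * sbo_score(word, context[-1], lambda)
}

# Score every known word as a continuation of "i love" and rank them.
candidates <- names(counts[[1]])
scores <- vapply(candidates, sbo_score, numeric(1), context = c("i", "love"))
sort(scores, decreasing = TRUE)
```

In this toy corpus, "r" and "you" receive the same score after "i love", which is the kind of tie that a predictor has to resolve consistently.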
Model evaluation in `eval_sbo_predictor()` is now carried out by sampling a single sentence from each document in the test corpus.
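Because evaluation involves random sampling, fixing the seed makes results reproducible. The sketch below assumes the `sbo` package is installed, uses its bundled `twitter_predtable` and `twitter_test` example data, and assumes that the returned table has a logical `correct` column, as described in the package documentation; the seed and the subset size are arbitrary.

```r
library(sbo)

p <- sbo_predictor(twitter_predtable)  # predictor from the example table shipped with sbo

set.seed(840)  # arbitrary seed: sentence sampling is random
evaluation <- eval_sbo_predictor(p, test = head(twitter_test, 500))

# Fraction of test inputs for which one of the predictions was correct
# (assumes a logical `correct` column in the result).
mean(evaluation$correct)
```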
Removed unnecessary dependencies from the `Depends` and `Imports` package fields.
- `erase` argument in `preprocess()` and `kgram_freqs_fast()`, cf. issue #17.
- `kgramFreqs` class, as per §1.6.4 of the “Writing R Extensions” guide.
- `kgram_freqs_fast()` for fast and memory efficient k-gram tokenization using the default text preprocessing utility.
- `kgram_freqs()`, `get_word_freqs()`, `preprocess()`, and `predict.sbo_preds()` have been entirely rewritten in C++.
- `tokenize_sentences()` function for sentence level tokenization.
- `kgram_freqs()` now accepts any user defined single character EOS token, through the `EOS` argument.
- `preproc` argument to `kgram_freqs()` and `get_word_freqs()`, for custom training corpus preprocessing.
- `dict` argument of `kgram_freqs()` now also accepts numeric values, allowing a dictionary to be built directly from the training corpus.
- `predict` method for `sbo_kgram_freqs` class.