- `textLBAM()` returns the L-BAM library as a data frame.
- `textPredict()` detects `model_type`.
- Instead of having to specify the URL, one can now specify the model name from the Language-Based Assessment Model (L-BAM) library.
- Including default option to download an updated version of the L-BAM file
- fixing bugs related to text prediction functions
- adding `method_typ = "texttrained"` and `"finetuned"`
- streamlining code for implicit motives output
- adding `textFindNonASCII()` function and a feature in `textEmbed()` to warn about and clean non-ASCII characters. This may change results slightly.
- removed the `type` parameter in `textPredict()`; it now gives both probability and class.
- `textClassify()` is now called `textClassifyPipe()`.
- `textPredict()` is now called `textPredictR()`.
- Making `textAssess()`, `textPredict()` and `textClassify()` work the same, now taking the parameter `method` with the string "text" to use textPredict(), and "huggingface" to use textClassifyPipe().
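A minimal sketch of the harmonized interface described above (the example texts and any argument names other than `method` are illustrative assumptions, not the exact API):

```r
library(text)

texts <- c("I feel calm and connected.", "I am very worried.")

# method = "text" routes the call through textPredict() (text-trained models)
assess_text <- textAssess(texts = texts, method = "text")

# method = "huggingface" routes the call through textClassifyPipe()
assess_hf <- textAssess(texts = texts, method = "huggingface")
```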
- updating python code, including adding parameters `hg_gated`, `hg_token`, and `trust_remote_code`.
- changed parameter name from `return_incorrect_results` to `force_return_results`.
- changed default of `function_to_apply` to NULL instead of "none"; this is to mimic the huggingface default.
- removed `textWordPrediction` since it is under development and not tested.
- updating security issues with python packages.
- updating the default range of penalties in textTrain() functions.
- updating textPredict() functionality
- Improving `textTrainN()`, including `subsets` sampling (new: default change from `random` to `subsets`), `use_same_penalty_mixture` (new: default change from `FALSE` to `TRUE`) and `std_err` (new output).
- Improving `textTrainPlot()`.
- Improving `textPredict()` functionality.
- Implementing experimental features related to `textTopics()`.
- `textTopics()` trains a BERTopic model with different modules and returns the model, data, and topic-document distributions based on c-tf-idf.
- `textTopicsTest()` can perform multiple tests (correlation, t-test, regression) between a BERTopic model from `textTopics()` and data.
- `textTopicsWordcloud()` can plot word clouds of topics tested with `textTopicsTest()`.
- `textTopicsTree()` prints out a tree structure of the hierarchical topic structure.
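A hedged sketch of how these experimental topic functions might be chained together (the data frame, column names, and argument names such as `data`, `variable_name`, and `pred_var` are assumptions for illustration, not the confirmed API):

```r
library(text)

# Hypothetical data: open-ended responses plus a numeric outcome
dat <- data.frame(
  harmony_text = c("I live in harmony with others", "Life feels conflicted"),
  hils_score   = c(6, 3)
)

# Train a BERTopic model; returns the model, data, and
# topic-document distributions (based on c-tf-idf)
topic_model <- textTopics(data = dat, variable_name = "harmony_text")

# Test topics against the outcome (e.g., regression)
topic_tests <- textTopicsTest(model = topic_model, pred_var = "hils_score")

# Plot word clouds of the tested topics
textTopicsWordcloud(model = topic_model, test = topic_tests)
```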
- `textEmbed()` now fully embeds one column at a time and reduces word_types for each column. This can break some code and produce different results in plots where word_types are based on several embedded columns.
- `textTrainN()` and `textTrainNPlot()` evaluate prediction accuracy across the number of cases.
- `textTrainRegression()` and `textTrainRandomForest()` now take a tibble as input in `strata`.
- multinomial regression in `textTrainRegression()`.
- `textPredictTest()` can handle `auc`.
- `textEmbed()` is faster (thanks to faster handling of aggregating layers).
- Added `sort` parameter in `textEmbedRawLayers()`.
- Tests using training with random forest were updated since outcomes changed when updating from R 4.2 to R 4.3.1 (see test_2_textTrain.R in the tests/testthat folder).
- Possibility to use the GPU on macOS M1 and M2 chips using `device = "mps"` in `textEmbed()`.
- `textFineTune()` is implemented as an experimental function.
- `max_length` implemented in `textTranslate()`.
- `textEmbedReduce()` implemented.
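On Apple silicon, the `device = "mps"` entry above suggests usage along these lines (a sketch only; running it requires the package's python backend and a downloaded language model):

```r
library(text)

# Request Apple's Metal Performance Shaders backend on M1/M2 chips;
# per the changelog, the CPU is used when no GPU is found.
embeddings <- textEmbed(
  texts = c("happy", "harmony"),
  device = "mps"
)
```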
- Fixing `textEmbed()` error for many non-BERT models.
- fixed `textEmbed(decontextualize=TRUE)`, which gave an error.
- Removing `textSimilarityTest()` for version 1.0 because it needs more evaluation.
- changed hard-coded "bert-base-uncased" to `model`, so that `layers = -2` works in `textEmbed()`.
- Update logging level critical using integer 50 with `set_verbosity`.
- changed in `sorting_xs_and_x_append` from Dim to Dim0 when renaming x_appended variables.
- changed `first` to `append_first` and made it an option in `textTrainRegression()` and `textTrainRandomForest()`.
- The default setting of textEmbed() now provides token-level embeddings and text-level embeddings; word_type embeddings are optional.
- In `textEmbed()`, `layers = 11:12` is now `second_to_last`.
- In `textEmbedRawLayers()`, the default is now `second_to_last`.
- In `textEmbedLayerAggregation()`, `layers = 11:12` is now `layers = "all"`.
- In `textEmbed()` and `textEmbedRawLayers()`, `x` is now called `texts`.
- `textEmbedLayerAggregation()` now uses `layers = "all"`, `aggregation_from_layers_to_tokens`, and `aggregation_from_tokens_to_texts`.
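A short sketch combining the renamed arguments listed above (the aggregation values "concatenate" and "mean" are illustrative choices, not confirmed defaults):

```r
library(text)

embeddings <- textEmbed(
  texts = c("happy", "sad"),                          # formerly `x`
  layers = "second_to_last",                          # formerly layers = 11:12
  aggregation_from_layers_to_tokens = "concatenate",  # formerly aggregate_layers
  aggregation_from_tokens_to_texts = "mean"           # formerly aggregate_tokens
)
```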
- `textZeroShot()` is implemented.
- `textDistanceNorm()` and `textDistanceMatrix()`.
- `textDistance()` can compute cosine `distance`.
- `textModelLayers()` provides the number of layers for a given model.
- `max_token_to_sentence` in `textEmbed()`.
- `aggregate_layers` is now called `aggregation_from_layers_to_tokens`.
- `aggregate_tokens` is now called `aggregation_from_tokens_to_texts`.
- `single_word_embeddings` is now called `word_types_embeddings`.
- `textEmbedLayersOutput()` is now called `textEmbedRawLayers()`.
- adding `textDimName()`.
- DEFAULT CHANGE in `textEmbed()`: `dim_name = TRUE`.
- DEFAULT CHANGE in `textEmbed()`: `single_context_embeddings = TRUE`.
- DEFAULT CHANGE in `textEmbed()`: `device = "gpu"`.
- Adding specific layer aggregations for `explore_words` in `textPlot()`.
- Adding `x_append_target` in the `textPredict()` function.
- updating `textClassify()`, `textGeneration()`, `textNER()`, `textSum()`, `textQA()`, and `textTranslate()`.
- harmonizing `x_add` to `x_append` across functions.
- adding `set_seed` to language analysis tasks.
- abstracting a function for sorting out `x`'s in training and prediction.
- `textPredict()` now takes `word_embeddings` and `x_append` (not `new_data`).
- `textClassify()` (under development)
- `textGeneration()` (under development)
- `textNER()` (under development)
- `textSum()` (under development)
- `textQA()` (under development)
- `textTranslate()` (under development)
- New function: `textSentiment()`, using huggingface transformers models.
- adding progress indicators for time-consuming functions, including `textEmbed()`, `textTrainRegression()`, `textTrainRandomForest()` and `textProjection()`.
- Option `dim_names` to set unique dimension names in `textEmbed()` and `textEmbedStatic()`.
- `textPredictAll()` function that can take several models, word embeddings, and variables as input to provide multiple outputs.
- option to add variables to the embeddings in `textTrain()` functions with `x_append`.
- text version is printed from the DESCRIPTION file (rather than updated manually).
- `textPredict`-related functions are located in their own file.
- textEmbed comments include the `text_version` number.
- `textEmbedLayersOutput` and `textEmbed` can provide `single_context_embeddings`.
- Removed the `return_tokens` option from textEmbed (since it is only relevant for textEmbedLayersOutput).
- removed the empty list `$single_we` when `decontexts` is `FALSE`.
- Visualization of the download process of language models.
- Can set the error level from python.
- Logistic regression is the default for classification in textTrain.
- Megatron language model functionality.
- When a GPU is not found, the CPU is used.
- Option to set `model_max_length` in `textEmbed()`.
- `textModels()` shows downloaded models.
- `textModelsRemove()` deletes specified models.
- Fixed error for unpaired `textSimilarityTest()` when an uneven number of cases is tested.
- Inclusion of the `textDistance()` function with distance measures.
- Adding more measures to `textSimilarity()`.
- Adding functionality from `textSimilarity()` in `textSimilarityTest()`, `textProjection()` and `textCentrality()` for plotting.
- Adding information about how `textTrainRegression()` concatenates word embeddings when provided with a list of several word embeddings.
- Adding two word embedding dimensions to the example data of single word embeddings to match the 10 dimensions of the contextualized embeddings in `word_embeddings_4$singlewords_we`.
- In `textCentrality()`, words to be plotted are selected with `word_data1_all$extremes_all_x >= 1` (rather than `== 1`).
- `textSimilarityMatrix()` computes semantic similarity among all combinations in a given word embedding.
- `textDescriptives()` gets options to remove NA and compute total scores.
- inclusion of `textDescriptives()`.
- prompt option added to `textrpp_initiate()`.
- tokenization is made with `NLTK` from python.
- Code has been cleaned up and prepared for CRAN
- New functions being tested: `textWordPredictions()` (which has a trial period/is not fully developed and might be removed in future versions); p-values are not yet implemented.
- Possibility to use `textPlot()` for objects from both `textProjection()` and `textWordPredictions()`.
- Changed wordembeddings to word_embeddings throughout the code/package.
- Warnings about seed when using multi-cores on Mac are addressed.
- `textrpp_initiate()` runs automatically in `library(text)` when the default environment exists.
- Python warnings are captured in embedding comments.
- Option to print python options to the console.
- Updated the permutation test for plotting and `textSimilarityTest()`.
- Changed from `stringr` to `stringi` (and removed tokenizer) as the imported package.
- `textrpp_install()` installs a `conda` environment with text-required python packages.
- `textrpp_install_virtualenv()` installs a virtual environment with text-required python packages.
- `textrpp_initialize()` initializes the installed environment.
- `textrpp_uninstall()` uninstalls the `conda` environment.
- `textEmbed()` and `textEmbedLayersOutput()` support the use of a GPU using the `device` setting.
- `remove_words` makes it possible to remove specific words from `textProjectionPlot()`.
- In `textProjection()` and `textProjectionPlot()` it is now possible to add points of the aggregated word embeddings in the plot.
- In `textProjection()` it is now possible to manually add words to the plot in order to explore them in the word embedding space.
- In `textProjection()` it is possible to add color to or remove words that are more frequent on the opposite "side" of their dot product projection.
- In `textProjection()` with `split == quartile`, the comparison distribution is now based on the quartile data (rather than the data for the mean).
- If any of the tokens to remove is "[CLS]", subtract 1 from token_id so that it works with layer_aggregation_helper. (0.9.11)
- Can now submit one word to `textEmbed()` with `decontexts=TRUE`. (0.9.11)
- `textSimilarityTest()` no longer gives an error when using method = unpaired with an unequal number of participants in each group.
- `textPredictTest()` function to significance test correlations of different models. (0.9.11)
This version is now on CRAN.
- Adding option to deselect the `step_centre` and `step_scale` in training.
- The cross-validation method in `textTrainRegression()` and `textTrainRandomForest()` has two options: `cv_folds` and `validation_split`. (0.9.02)
- Better handling of `NA` in `step_naomit` in training.
- The `DistilBert` model works. (0.9.03)
- `textProjectionPlot()` plots words extreme in more than just one feature (i.e., words are now plotted that satisfy, for example, both `plot_n_word_extreme` and `plot_n_word_frequency`). (0.9.01)
- `textTrainRegression()` and `textTrainRandomForest()` also have a function that selects the maximum evaluation measure results (before, only the minimum was selected all the time, which, e.g., was correct for rmse but not for r). (0.9.02)
- removed `id_nr` in training and predict by using workflows. (0.9.02)