Tools to create and manage CWB-indexed corpora

PolMine/cwbtools

License: GPL v3

Tools to Create, Modify and Manage Corpora for the Corpus Workbench (CWB)

The Corpus Workbench (CWB) is a classic indexing and query engine for working efficiently with large, linguistically annotated corpora. The cwbtools package offers a set of tools to conveniently create, modify and manage CWB-indexed corpora from within R. It complements R packages that use the CWB as a backend for text mining, namely the RcppCWB package for low-level access to CWB-indexed corpora, and polmineR as a toolset to implement common text mining workflows.

Installation

The package is available via CRAN and can be installed as follows on Windows, macOS and Linux.

install.packages("cwbtools")

To install the development version of the package, use the installation mechanism offered by the remotes package. On Windows, an installation of Rtools may be necessary.

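# Install the remotes package if it is not yet present
if (!"remotes" %in% installed.packages()[,"Package"]) install.packages("remotes")
# Make remotes rely on a minimal set of additional packages (see the explanatory note)
Sys.setenv(R_REMOTES_STANDALONE = "true")
# Install the development version from the dev branch of the GitHub repository
remotes::install_github("PolMine/cwbtools", ref = "dev", force = TRUE)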

Explanatory note:

The default approach to installing the development version of cwbtools from GitHub would be devtools::install_github("PolMine/cwbtools", ref = "dev"). However, devtools and cwbtools both depend on the curl package, which may cause nerve-wracking problems when a newer version of curl is available: the user is asked whether curl should be updated, and most users will agree. This update will fail, however, because curl is already loaded by devtools, and parts of a loaded package (its dynamic library) cannot be deleted or replaced.

To avoid having to perform manual updates in the correct order, calling the install_github() function of the remotes package directly is recommended. When the environment variable R_REMOTES_STANDALONE is set to "true", the remotes package relies on a minimal set of additional packages. The situation described above, which may make the installation of cwbtools difficult, is thereby avoided.
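
Before starting the installation, it can help to check the state of the current R session. The following is a minimal sketch using base R only. The variable names standalone_mode and curl_loaded are chosen for illustration, and the suggestion to restart R is an assumption about a sensible remedy, not a documented requirement of cwbtools or remotes.

```r
# Sketch: inspect the session before installing the development version.

# Set the environment variable, as in the installation snippet above
Sys.setenv(R_REMOTES_STANDALONE = "true")

# TRUE if the variable has the value described in the explanatory note
standalone_mode <- identical(Sys.getenv("R_REMOTES_STANDALONE"), "true")

# TRUE if the curl namespace is already loaded in this session,
# for example because devtools has been loaded earlier
curl_loaded <- "curl" %in% loadedNamespaces()

if (curl_loaded) {
  # Assumption: a fresh session avoids the locked dynamic library
  message("curl is loaded in this session; consider restarting R first.")
}
```

If curl_loaded is TRUE, an update of curl triggered during the installation may run into the problem described above.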

Acknowledgements

The CWB is a classic indexing and query engine. Its character as an open source project is of great value for the community working with corpora. The enduring effort of the developers of the CWB is gratefully acknowledged!