Merge branch 'dev'
MartinHammarstedt committed Dec 7, 2023
2 parents d3eb0db + be14322 commit 25c3ba1
Showing 137 changed files with 4,233 additions and 1,808 deletions.
53 changes: 53 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,53 @@
name: CI
on: [push, pull_request]

jobs:
  checks:
    name: ${{ matrix.task.name }} py-${{ matrix.python-version }} on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}

    strategy:
      max-parallel: 4
      fail-fast: false
      matrix:
        python-version: ["3.8", "3.9", "3.10", "3.11"]
        os: [ubuntu-latest]
        task:
          - name: Run tests
            run: |
              source venv/bin/activate
              pytest -m noexternal
    steps:
      - name: Checkout code
        uses: nschloe/action-cached-lfs-checkout@v1

      - name: Set up Python ${{ matrix.python-version }}
        id: setup-python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      # Load cached venv if cache exists
      - name: Load cached venv
        id: cached-dependencies
        uses: actions/cache@v3.2.3
        with:
          path: venv
          key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/pyproject.toml') }}-${{ hashFiles('.github/workflows/ci.yml') }}

      # Create virtual environment and install dependencies if cache does not exist
      - name: Create venv and install dependencies
        if: steps.cached-dependencies.outputs.cache-hit != 'true'
        run: |
          python3 -m venv venv
          source venv/bin/activate
          pip install -e .[dev]
      - name: Setup Sparv
        run: |
          source venv/bin/activate
          sparv setup -d $PWD
      - name: ${{ matrix.task.name }}
        run: ${{ matrix.task.run }}
1 change: 0 additions & 1 deletion .gitignore
@@ -30,7 +30,6 @@ build/
dist/
*.egg-info/
*.egg
MANIFEST*

# Unit test / coverage reports
.pytest_cache/
88 changes: 86 additions & 2 deletions CHANGELOG.md
@@ -1,5 +1,82 @@
# Changelog

## [5.2.0] - 2023-12-07

### Added

- Added support for tab autocompletion in bash.
- Added importer for PDF files.
- Added new `misc:inherit` annotator for inheriting attributes.
- Added `korp.wordpicture_no_sentences` setting to disable generation of Word Picture sentences table.
- `util.mysql_wrapper` can now execute SQL queries remotely over SSH.
- Added several uninstallers:
  - `cwb:uninstall_corpus`
  - `korp:uninstall_config`
  - `korp:uninstall_lemgrams`
  - `korp:uninstall_timespan`
  - `korp:uninstall_wordpicture`
  - `stats_export:uninstall_freq_list`
  - `stats_export:uninstall_sbx_freq_list`
  - `stats_export:uninstall_sbx_freq_list_date`
  - `xml_export:uninstall`
  - `xml_export:uninstall`
- Added `MarkerOptional` class.
- Added stats export for Swedish from the 1800s.
- `korp:wordpicture` table name is now configurable using `korp.wordpicture_table`.
- Added utility function `util.system.gpus()` which returns a list of GPUs, ordered by free memory in descending order.
- Sparv will automatically order the GPUs in the environment variable `CUDA_VISIBLE_DEVICES` by the amount of free
memory that was available when Sparv started.
- Stanza now always selects the GPU with the most free memory.
- The preloader can now be gracefully stopped by sending an interrupt signal to the Sparv process.
- Added `HeaderAnnotations` and `HeaderAnnotationsAllSourceFiles` classes.
- Added `korp.keep_undefined_annotations` setting, to include even undefined annotations in the Korp config.
- Added `dateformat.pre_regex` setting.
- Added `--json-log` flag to enable JSON format for logging.
- Added support for restricting a whole module to one or more languages by using the `__language__` variable.
- Running `sparv schema` will now generate a JSON schema which can be used to validate corpus config files.
- Stricter config validation, including validation of config values and data types.
- Most Sparv decorators now have a `priority` parameter, to control the order in which functions are run.
- Added `util.misc.dump_yaml()` utility function for exporting YAML.
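
The GPU ordering described in the entries above can be illustrated with a minimal sketch. It assumes the free memory per GPU is already known and passed in as a dictionary; the function names and the numbers are hypothetical, and this is not Sparv's `util.system.gpus()` implementation.

```python
from typing import Dict, List


def order_gpus_by_free_memory(free_memory_mib: Dict[int, int]) -> List[int]:
    """Return GPU indices sorted by free memory, largest first.

    `free_memory_mib` maps a GPU index to its free memory in MiB. How the
    numbers are collected is outside the scope of this sketch.
    """
    # Sort by free memory in descending order; ties go to the lower index.
    return sorted(free_memory_mib, key=lambda gpu: (-free_memory_mib[gpu], gpu))


def cuda_visible_devices(free_memory_mib: Dict[int, int]) -> str:
    """Build a CUDA_VISIBLE_DEVICES value with the freest GPU listed first."""
    return ",".join(str(gpu) for gpu in order_gpus_by_free_memory(free_memory_mib))


# Hypothetical snapshot of three GPUs taken at start-up.
snapshot = {0: 2048, 1: 10240, 2: 6144}
ordered = order_gpus_by_free_memory(snapshot)
devices = cuda_visible_devices(snapshot)
```

With an ordering like this, a library that simply picks the first visible device ends up on the GPU that had the most free memory when the snapshot was taken.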

### Changed

- Added support for Python 3.10 and 3.11.
- Dropped support for Python 3.6 and 3.7.
- `AnnotationAllSourceFiles` now has the same methods as `Annotation`.
- The util function `install_mysql` can now install locally as well as to a remote server.
- Pre-built SALDO models are now downloaded instead of being built on demand.
- `xml_export:install` and `xml_export:install_scrambled` can now install locally.
- `korp:relations`, `korp:relations_sql` and `korp:install_relations` have been renamed to `korp:wordpicture`,
  `korp:wordpicture_sql` and `korp:install_wordpicture` respectively.
- Target path is no longer optional for the utility functions `install_path` and `rsync`.
- The classes `SourceAnnotations` and `SourceAnnotationsAllSourceFiles` are now pre-parsed, immutable iterables instead
of lists that need parsing and expanding.
- The classes `AllSourceFilenames`, `ExportAnnotations`, `ExportAnnotationsAllSourceFiles` and `ExportAnnotationNames`
are now immutable iterables instead of lists.
- Removed the flags `--rerun-incomplete` and `--mark-complete`, as Sparv will now always rerun incomplete files.
- Sparv will now recognize when source files have been deleted and trigger the necessary reruns. Previously, only
additions and modifications were recognized.
- Illegal characters are now replaced with underscore in XML element and attribute names during XML export. This also
applies to CWB and Korp config exports.
- Not specifying a corpus language now excludes all language-specific annotators.
- When an unhandled exception occurs, the relevant source document will be displayed in the log.
- `localhost` as an installation target is no longer handled as if host was omitted.
- Removed `critical` log level.
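
As an illustration of the underscore replacement mentioned above, here is a minimal sketch. The set of characters treated as illegal is an assumption (only ASCII letters, digits, `_`, `-` and `.` are kept), and `sanitize_xml_name` is a hypothetical helper, not part of Sparv.

```python
import re

# Assumed rule: anything outside ASCII letters, digits, "_", "-" and "." is
# replaced. The real XML name grammar is wider, and Sparv's rule may differ.
_ILLEGAL = re.compile(r"[^A-Za-z0-9_.\-]")


def sanitize_xml_name(name: str) -> str:
    """Replace characters that are unsafe in an XML name with underscores."""
    cleaned = _ILLEGAL.sub("_", name)
    # An XML name may not start with a digit, a hyphen or a period.
    if cleaned and not (cleaned[0].isalpha() or cleaned[0] == "_"):
        cleaned = "_" + cleaned
    return cleaned


print(sanitize_xml_name("word/pos"))
print(sanitize_xml_name("1st"))
```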

### Fixed

- Several bugs fixed in `korp:config`.
- Fixed bug where Sparv would hang if an error occurred in a preloaded annotator.
- Fixed occasional crash in `cwb:encode` when old CWB export hadn't been removed first.
- Fixed bug when using relative socket path while also using `--dir`.
- Fixed quoting of paths in `util.system.rsync`.
- It's no longer possible to create an infinite loop of classes referring to each other.
- Elapsed time exceeding 24 hours no longer gets cut off in the `--stats` output.
- Fixed bug where error messages were not getting written to the log file when the `--log debug`
flag was used.
- Fixed bug that prevented Stanza from using GPU.
- Fixed crash when exporting scrambled XML without any text.

## [5.1.0] - 2022-11-03

### Added
@@ -89,7 +166,7 @@
- `korp.remote_cwb_registry` is now called `cwb.remote_registry_dir`
- `korp.remote_host` has been split into `korp.remote_host` (host for SQL files) and `cwb.remote_host` (host for CWB
files)
- install target `korp:install_corpus` has been renamed and split into `cwb:install_corpus` and
- install target `korp:install_corpus` has been renamed and split into `cwb:install_corpus` and
`cwb:install_corpus_scrambled`
- Renamed the following stats exports:
`stats_export:freq_list` is now called `stats_export:sbx_freq_list`
@@ -200,7 +277,7 @@
- New plugin system facilitates installation of Sparv plugins (like FreeLing).

- New format for corpus config files
- The new format is yaml which is easier to write and more human readable than makefiles.
- The new format is yaml which is easier to write and more human-readable than makefiles.
- There is a command-line wizard which helps you create corpus config files.
- You no longer have to specify XML elements and attributes that should be kept from the original files. The XML
parser now parses all existing elements and their attributes by default. Their original names will be kept and
@@ -229,3 +306,10 @@
- Improved code modularity
- Increased independence between modules and language models
- This facilitates adding new annotation modules and import/export formats.

[5.2.0]: https://github.com/spraakbanken/sparv-pipeline/releases/tag/v5.2.0
[5.1.0]: https://github.com/spraakbanken/sparv-pipeline/releases/tag/v5.1.0
[5.0.0]: https://github.com/spraakbanken/sparv-pipeline/releases/tag/v5.0.0
[4.1.1]: https://github.com/spraakbanken/sparv-pipeline/releases/tag/v4.1.1
[4.1.0]: https://github.com/spraakbanken/sparv-pipeline/releases/tag/v4.1.0
[4.0.0]: https://github.com/spraakbanken/sparv-pipeline/releases/tag/v4.0.0
5 changes: 2 additions & 3 deletions README.md
@@ -15,7 +15,7 @@ If you have any questions, problems or suggestions please contact <sb-sparv@sven
* A Unix-like environment (e.g. Linux, OS X or [Windows Subsystem for
Linux](https://docs.microsoft.com/en-us/windows/wsl/about)) *Note:* Most of Sparv's features should work in a Windows
environment as well, but since we don't do any testing on Windows we cannot guarantee anything.
* [Python 3.6.2](http://python.org/) or newer
* [Python 3.8](https://python.org/) or newer.

## Installation

@@ -44,8 +44,7 @@ Before cloning the repository with [git](https://git-scm.com/downloads) make sur
Storage](https://git-lfs.github.com/) installed (`apt install git-lfs`). Some files will not be downloaded correctly
otherwise.

We recommend that you set up a virtual environment and install the dependencies (including the dev dependencies) listed
in `setup.py`:
Install the dependencies, including the dev dependencies. We recommend that you first set up a virtual environment:

```
python3 -m venv venv
4 changes: 2 additions & 2 deletions docs/README.md
@@ -53,8 +53,8 @@ cd md2pdf
./make_pdf.sh
```

<!--
## MISC
### URLs that may have to be updated regularly

- Java download: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
-->
6 changes: 3 additions & 3 deletions docs/developers-guide/config-parameters.md
@@ -38,7 +38,7 @@ the decorator belonging to that same function, but the declaration may be done i
Sparv function, or even a different module.

Please note that it is mandatory to set a description for each declared config parameter. These descriptions are
displayed to the user when lising modules with the `sparv modules` command.
displayed to the user when listing modules with the `sparv modules` command.


## Config hierarchy
@@ -47,8 +47,8 @@ When Sparv processes the corpus configuration it will look for config values in
priority order:
1. the corpus configuration file
2. a parent corpus configuration file
2. the default configuration file in the [Sparv data directory](user-manual/installation-and-setup.md#setting-up-sparv)
3. config default values defined in the Sparv decorators (as shown above)
3. the default configuration file in the [Sparv data directory](user-manual/installation-and-setup.md#setting-up-sparv)
4. config default values defined in the Sparv decorators (as shown above)

This means that if a config parameter is given a default value in a Sparv decorator it can be overridden by the default
configuration file which in turn can be overridden by the user's corpus config file.
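
The priority order above can be sketched with `collections.ChainMap`, which returns the value from the first mapping that contains a key. This is only an illustration under simplifying assumptions: the config is flattened to dotted keys, and all keys and values (`mymodule.model` and so on) are hypothetical. It does not show how Sparv itself merges config files.

```python
from collections import ChainMap

# Hypothetical layers, listed from highest to lowest priority.
corpus_config = {"mymodule.model": "corpus-specific.pickle"}
parent_config = {"mymodule.model": "parent.pickle", "mymodule.threshold": 0.8}
default_config = {"mymodule.threshold": 0.6, "mymodule.lang": "swe"}
decorator_defaults = {
    "mymodule.threshold": 0.5,
    "mymodule.lang": "eng",
    "mymodule.verbose": False,
}

# Lookups walk the mappings in order and stop at the first match.
config = ChainMap(corpus_config, parent_config, default_config, decorator_defaults)

print(config["mymodule.model"])      # supplied by the corpus config
print(config["mymodule.threshold"])  # supplied by the parent corpus config
print(config["mymodule.lang"])       # supplied by the default config file
print(config["mymodule.verbose"])    # falls back to the decorator default
```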
8 changes: 4 additions & 4 deletions docs/developers-guide/general-concepts.md
@@ -2,13 +2,13 @@
This section will give a brief overview of how Sparv modules work and introduce some general concepts. More details are
provided in the following chapters.

The Sparv Pipeline is comprised of some core functionality and many different modules containing Sparv functions that
The Sparv Pipeline is made up of a core and different modules. The modules contain Sparv functions that
serve different purposes like reading and parsing source files, building or downloading models, producing
annotations and producing output files that contain the source text and annotations. All of these modules (i.e. the
code inside the `sparv/modules` directory) are replacable. A Sparv function is decorated with a special
code inside the `sparv/modules` directory) are replaceable. A Sparv function is decorated with a special
[decorator](developers-guide/sparv-decorators) that tells Sparv what purpose it serves. A function's parameters hold
information about what input is needed in order to run the function and what output is produced by it. The Sparv core
automatically finds all decorated functions, scans their parameters and builds a registry for what modules are available
automatically finds all decorated functions, scans their parameters and builds a registry of what modules are available
and how they depend on each other.
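
The idea of a decorator that scans a function's parameters and fills a registry can be sketched as follows. Everything here is a hypothetical stand-in written for illustration: `toy_annotator`, `Needs`, `Produces` and `REGISTRY` are not Sparv's decorators or classes, and the real registry is more elaborate.

```python
import inspect
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Needs:
    """Hypothetical marker: the function reads this annotation."""
    name: str


@dataclass(frozen=True)
class Produces:
    """Hypothetical marker: the function writes this annotation."""
    name: str


# Toy registry: function name -> description, inputs and outputs.
REGISTRY: Dict[str, dict] = {}


def toy_annotator(description: str) -> Callable:
    """Toy decorator that scans default arguments and records them."""
    def decorator(func: Callable) -> Callable:
        defaults = [p.default for p in inspect.signature(func).parameters.values()]
        REGISTRY[func.__name__] = {
            "description": description,
            "needs": [d.name for d in defaults if isinstance(d, Needs)],
            "produces": [d.name for d in defaults if isinstance(d, Produces)],
        }
        return func
    return decorator


@toy_annotator("Split text into tokens")
def tokenize(text=Needs("text"), out=Produces("token")):
    pass


@toy_annotator("Tag tokens with part of speech")
def tag(token=Needs("token"), out=Produces("token.pos")):
    pass


def producer_of(annotation: str) -> str:
    """Return the name of the registered function that produces `annotation`."""
    for name, entry in REGISTRY.items():
        if annotation in entry["produces"]:
            return name
    raise KeyError(annotation)
```

Because `tag` needs `token` and `producer_of("token")` resolves to `tokenize`, a scheduler built on such a registry could work out that `tokenize` has to run first.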


@@ -34,7 +34,7 @@ Some Sparv functions may require annotations from other functions before they ca
expressed in the function arguments. By using special [Sparv classes](developers-guide/sparv-classes) as default
arguments in a function's signature the central Sparv registry can automatically keep track of what annotations can be
produced by what function and in what order things need to be run. These dependencies can either be described in a
module specific manner or in a more abstact way. For example, an annotator producing word base forms (or lemmas) may
module specific manner or in a more abstract way. For example, an annotator producing word base forms (or lemmas) may
depend on a part-of-speech annotation with a specific tagset and therefore this annotator might define that its input
needs to be an annotation produced by a specific module. A part-of-speech tagger on the other hand usually needs word
segments as input, and it probably does not matter exactly what module produces these segments. In this case the