Merge branch 'release-4.0.0.rc1'
mpenkov committed Mar 19, 2021
2 parents 8624aa2 + a8c0001 commit 4a241f0
Showing 164 changed files with 3,005 additions and 233,489 deletions.
13 changes: 13 additions & 0 deletions .github/FUNDING.yml
@@ -0,0 +1,13 @@
# These are supported funding model platforms

github: [piskvorky] # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
patreon: # Replace with a single Patreon username
open_collective: # Replace with a single Open Collective username
ko_fi: # Replace with a single Ko-fi username
tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel
community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry
liberapay: # Replace with a single Liberapay username
issuehunt: # Replace with a single IssueHunt username
otechie: # Replace with a single Otechie username
custom: # Replace with up to 4 custom sponsorship URLs e.g., ['link1', 'link2']

86 changes: 86 additions & 0 deletions .github/workflows/build-wheels.yml
@@ -0,0 +1,86 @@
name: Build wheels

on:
  push:
    branches: [ develop ]
  pull_request:
    branches: [ develop ]
  schedule:
    - cron: '0 0 * * sun,wed'

jobs:
  build:
    runs-on: ${{ matrix.os }}
    defaults:
      run:
        shell: bash
    strategy:
      fail-fast: false
      matrix:
        python-version: [3.6, 3.7, 3.8]
        os: [ubuntu-latest, macos-latest]
        platform: [x64]
        include:
          - os: ubuntu-latest
            python-version: 3.7
            skip-network-tests: 1
          - os: ubuntu-latest
            python-version: 3.8
            skip-network-tests: 1
          - os: macos-latest
            travis-os-name: osx  # For multibuild
            skip-network-tests: 1
    env:
      PKG_NAME: gensim
      REPO_DIR: gensim
      BUILD_COMMIT: HEAD
      PLAT: x86_64
      UNICODE_WIDTH: 32
      MB_PYTHON_VERSION: ${{ matrix.python-version }}  # MB_PYTHON_VERSION is needed by Multibuild
      TEST_DEPENDS: Morfessor==2.0.2a4 python-levenshtein==0.12.0 visdom==0.1.8.9 pytest mock cython nmslib pyemd testfixtures scikit-learn
      DOCKER_TEST_IMAGE: multibuild/xenial_x86_64
      TRAVIS_OS_NAME: ${{ matrix.travis-os-name }}
      SKIP_NETWORK_TESTS: ${{ matrix.skip-network-tests }}

    steps:
      - uses: actions/checkout@v2
        with:
          submodules: recursive
          fetch-depth: 0
      - name: Print environment variables
        run: |
          echo "PLAT: ${PLAT}"
          echo "DOCKER_TEST_IMAGE: ${DOCKER_TEST_IMAGE}"
          echo "TEST_DEPENDS: ${TEST_DEPENDS}"
          echo "TRAVIS_OS_NAME: ${TRAVIS_OS_NAME}"
          echo "SKIP_NETWORK_TESTS: ${SKIP_NETWORK_TESTS}"
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install virtualenv
      - name: Build and Install Wheels
        run: |
          echo ::group::Set up Multibuild
          source multibuild/common_utils.sh
          source multibuild/travis_steps.sh
          source config.sh
          echo ::endgroup::
          echo ::group::Before install
          before_install
          echo ::endgroup::
          echo ::group::Build wheel
          build_wheel $REPO_DIR $PLAT
          echo ::endgroup::
          echo ::group::Install run
          install_run $PLAT
          echo ::endgroup::
      - name: Upload wheels to s3://gensim-wheels
        if: always()
        run: |
          pip install wheelhouse-uploader
          ls wheelhouse/*.whl
          python -m wheelhouse_uploader upload --local-folder wheelhouse/ --no-ssl-check gensim-wheels --provider S3 --no-enable-cdn
60 changes: 60 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,60 @@
name: Tests
on:
  push:
    branches: [ develop ]
  pull_request:
    branches: [ develop ]

jobs:
  tests:
    name: ${{ matrix.name }}
    runs-on: ${{ matrix.os }}
    defaults:
      run:
        shell: bash
    strategy:
      fail-fast: false
      matrix:
        include:
          - {name: Linux, python: 3.6, os: ubuntu-20.04, tox: 'flake8,flake8-docs'}
          - {name: Linux, python: 3.6, os: ubuntu-20.04, tox: 'py36-linux'}
          - {name: Linux, python: 3.7, os: ubuntu-20.04, tox: 'py37-linux'}
          - {name: Linux, python: 3.8, os: ubuntu-20.04, tox: 'py38-linux'}
    env:
      TOX_PARALLEL_NO_SPINNER: 1

    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python }}
      - name: Update pip
        run: python -m pip install -U pip

      #
      # Work-around mysterious build problem
      # https://github.com/RaRe-Technologies/gensim/pull/3078/checks?check_run_id=2117914443
      # https://www.scala-sbt.org/1.x/docs/Installing-sbt-on-Linux.html
      #
      - name: Update sbt
        run: |
          echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
          curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | sudo apt-key add
          sudo apt-get update -y
          sudo apt-get install -y sbt
      - name: Install tox, gdb
        run: |
          pip install tox
          sudo apt-get update -y
          sudo apt-get install -y gdb
      - name: Enable core dumps
        run: ulimit -c unlimited -S  # enable core dumps
      - name: Run tox tests
        run: tox -e ${{ matrix.tox }}
      - name: Collect corefile
        if: ${{ failure() }}
        run: |
          pwd
          COREFILE=$(find . -maxdepth 1 -name "core*" | head -n 1)
          if [[ -f "$COREFILE" ]]; then EXECFILE=$(gdb -c "$COREFILE" -batch | grep "Core was generated" | tr -d "\`" | cut -d' ' -f5); file "$COREFILE"; gdb -c "$COREFILE" "$EXECFILE" -x continuous_integration/debug.gdb -batch; fi
15 changes: 15 additions & 0 deletions .gitignore
@@ -76,3 +76,18 @@ data
*.inv
*.js
docs/_images/

#
# Generated by Cython
#
gensim/_matutils.c
gensim/corpora/_mmreader.c
gensim/models/doc2vec_corpusfile.cpp
gensim/models/doc2vec_inner.cpp
gensim/models/fasttext_corpusfile.cpp
gensim/models/fasttext_inner.c
gensim/models/nmf_pgd.c
gensim/models/word2vec_corpusfile.cpp
gensim/models/word2vec_inner.c

.ipynb_checkpoints
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "multibuild"]
path = multibuild
url = https://github.com/matthew-brett/multibuild.git
91 changes: 45 additions & 46 deletions .travis.yml
@@ -1,53 +1,52 @@
sudo: false

cache:
apt: true
directories:
- $HOME/.cache/pip
- $HOME/.ccache
- $HOME/.pip-cache
dist: trusty
branches:
only:
- /v\d+\.\d+\.\d+/
language: python
arch: arm64-graviton2
dist: focal
virt: vm
group: edge
services: docker
env:
TOX_PARALLEL_NO_SPINNER: 1

global:
- REPO_DIR=gensim
- BUILD_COMMIT=HEAD
- UNICODE_WIDTH=32
- PLAT=aarch64
- MB_ML_VER=2014
- SKIP_NETWORK_TESTS=1
- DOCKER_TEST_IMAGE=multibuild/xenial_arm64v8
- BUILD_DEPENDS="numpy==1.19.2 scipy==1.5.3"
- TEST_DEPENDS="pytest mock cython nmslib pyemd testfixtures Morfessor==2.0.2a4 python-levenshtein==0.12.0 visdom==0.1.8.9 scikit-learn"

matrix:
include:
- python: '3.6'
env: TOXENV="flake8,flake8-docs"

- python: '3.8'
- os: linux
env:
- TOXENV="py38-linux"
dist: bionic

- python: '3.7'
- MB_PYTHON_VERSION=3.6
- os: linux
env:
- TOXENV="py37-linux"
# The following two lines used to be necessary because Travis left files lying around in ~/.aws/,
# messing up our tests. Now fixed since https://github.com/travis-ci/travis-ci/issues/7940
# - BOTO_CONFIG="/dev/null"
#sudo: true
dist: xenial

- python: '3.6'
env: TOXENV="py36-linux"


- MB_PYTHON_VERSION=3.7
- os: linux
env:
- MB_PYTHON_VERSION=3.8
- os: linux
env:
- MB_PYTHON_VERSION=3.9
before_install:
- source multibuild/common_utils.sh
- source multibuild/travis_steps.sh
- before_install
install:
- pip install tox
- sudo apt-get install -y gdb


before_script:
- ulimit -c unlimited -S # enable core dumps


script: tox -vv


after_failure:
- pwd
- COREFILE=$(find . -maxdepth 1 -name "core*" | head -n 1)
- if [[ -f "$COREFILE" ]]; then EXECFILE=$(gdb -c "$COREFILE" -batch | grep "Core was generated" | tr -d "\`" | cut -d' ' -f5); file "$COREFILE"; gdb -c "$COREFILE" "$EXECFILE" -x continuous_integration/debug.gdb -batch; fi
- build_wheel $REPO_DIR $PLAT
script:
- install_run $PLAT
after_script:
- ls -laht ${TRAVIS_BUILD_DIR}/wheelhouse/
- pip install wheelhouse-uploader
- python -m wheelhouse_uploader upload --local-folder ${TRAVIS_BUILD_DIR}/wheelhouse/ --no-ssl-check gensim-wheels --provider S3 --no-enable-cdn

notifications:
email:
- penkov+gensimwheels@pm.me
on_success: always
on_failure: always
82 changes: 82 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,87 @@
Changes
=======

## 4.0.0.rc1, 2021-03-19

**⚠️ Gensim 4.0 contains breaking API changes! See the [Migration guide](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4) to update your existing Gensim 3.x code and models.**

Gensim 4.0 is a major release with lots of performance & robustness improvements and a new website.

### Main highlights (see also *👍 Improvements* below)

* Massively optimized popular algorithms the community has grown to love: [fastText](https://radimrehurek.com/gensim/models/fasttext.html), [word2vec](https://radimrehurek.com/gensim/models/word2vec.html), [doc2vec](https://radimrehurek.com/gensim/models/doc2vec.html), [phrases](https://radimrehurek.com/gensim/models/phrases.html):

a. **Efficiency**

| model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
|----------|------------|--------|
| fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / **1.26 GB** / 914k words/s |
| word2vec | 1.7h / 0.36 GB / 1685k words/s | **1.2h** / 0.33 GB / 1762k words/s |

In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. ([4.0 benchmarks](https://github.com/RaRe-Technologies/gensim/issues/2887#issuecomment-711097334))

b. **Robustness**. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

c. **Simplified OOP model** for easier model exports and integration with TensorFlow, PyTorch & co.

These improvements come to you transparently aka "for free", but see [Migration guide](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4) for some changes that break the old Gensim 3.x API. **Update your code accordingly**.

* Dropped a bunch of externally contributed modules: summarization, pivoted TFIDF normalization, FIXME.
- Their code quality was not up to our standards, and there was no one to maintain these modules, answer user questions about them, or support them.

So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork them into your own repo; they can live happily outside of Gensim.

* Dropped Python 2. Gensim 4.0 is Py3.6+. Read our [Python version support policy](https://github.com/RaRe-Technologies/gensim/wiki/Gensim-And-Compatibility).
- If you still need Python 2 for some reason, stay at [Gensim 3.8.3](https://github.com/RaRe-Technologies/gensim/releases/tag/3.8.3).

* A new [Gensim website](https://radimrehurek.com/gensim_4.0.0) – finally! 🙃

So, a major clean-up release overall. We're happy with this **tighter, leaner and faster Gensim**.

This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting common concrete NLP & document similarity use-cases.
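
As a rough cross-check of the efficiency table in the highlights above, the ratios can be recomputed from the table's own numbers. This is only a sketch: the `BENCHMARKS` dict and the `improvement_ratios` helper are illustrative names, not part of Gensim, and the figures are copied verbatim from the table.

```python
# Figures copied from the 3.8.3 vs 4.0.0 efficiency table.
# Each tuple is (wall time in hours, peak RAM in GB, throughput in thousand words/s).
BENCHMARKS = {
    "fastText": {"3.8.3": (2.9, 4.11, 822), "4.0.0": (2.3, 1.26, 914)},
    "word2vec": {"3.8.3": (1.7, 0.36, 1685), "4.0.0": (1.2, 0.33, 1762)},
}


def improvement_ratios(old, new):
    """Return (time speedup, RAM reduction, throughput gain) of new relative to old."""
    old_time, old_ram, old_rate = old
    new_time, new_ram, new_rate = new
    return old_time / new_time, old_ram / new_ram, new_rate / old_rate


for model, runs in BENCHMARKS.items():
    speedup, ram_cut, rate_gain = improvement_ratios(runs["3.8.3"], runs["4.0.0"])
    print(f"{model}: {speedup:.2f}x faster, {ram_cut:.2f}x less RAM, {rate_gain:.2f}x throughput")
```

The fastText RAM ratio comes out a little above 3, which matches the "3x less RAM" claim. The "2x faster init" and phrase-detection figures are not derivable from this table; they come from the linked benchmark issue.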

### :star2: New Features

* Default to pickle protocol 4 when saving models (__[piskvorky](https://github.com/piskvorky)__, [#3065](https://github.com/RaRe-Technologies/gensim/pull/3065))
* Record lifecycle events in Gensim models (__[piskvorky](https://github.com/piskvorky)__, [#3060](https://github.com/RaRe-Technologies/gensim/pull/3060))
* Make WMD normalization optional (__[piskvorky](https://github.com/piskvorky)__, [#3073](https://github.com/RaRe-Technologies/gensim/pull/3073))
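
To make the pickle protocol item concrete, here is a minimal standard-library sketch of what "protocol 4" means at the byte level. It does not use Gensim itself; the `state` dict is a hypothetical stand-in for model state.

```python
import pickle

# Hypothetical stand-in for model state; a real model holds arrays and vocab mappings.
state = {"vector_size": 100, "vocab": ["alpha", "beta"], "weights": [0.1] * 1000}

# Serialize explicitly with protocol 4 (available since Python 3.4).
blob = pickle.dumps(state, protocol=4)

# Pickles written with protocol 2 or higher start with the PROTO opcode (0x80)
# followed by one byte holding the protocol number.
protocol_used = blob[1]

# Round-trip to confirm nothing is lost.
restored = pickle.loads(blob)
```

Protocol 4 supports very large objects and cannot be read by Python 2, which fits with this release being Python 3.6+ only.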

### :red_circle: Bug fixes

* fix RuntimeError in export_phrases (change defaultdict to dict) (__[thalishsajeed](https://github.com/thalishsajeed)__, [#3041](https://github.com/RaRe-Technologies/gensim/pull/3041))
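
The parenthetical "change defaultdict to dict" hints at a well-known Python pitfall. The sketch below shows that general mechanism only; it is an assumption that this is what happened in `export_phrases`, and `export_pairs` plus the sample vocabulary are hypothetical.

```python
from collections import defaultdict

# Hypothetical phrase-count vocabulary backed by a defaultdict.
counts = defaultdict(int, {"new": 3, "york": 5, "new_york": 2})


def export_pairs(vocab):
    """Iterate over the vocab while looking up keys that may be missing."""
    found = []
    for token in vocab:
        # On a defaultdict, reading a missing key inserts it, which changes
        # the dict's size in the middle of the iteration.
        if vocab[token + "_city"] > 0:
            found.append(token)
    return found


try:
    export_pairs(counts)
    outcome = "ok"
except RuntimeError as err:
    outcome = f"RuntimeError: {err}"

# A plain dict queried with .get() never inserts on lookup, so iteration is safe.
plain = dict(counts)
safe = [token for token in plain if plain.get(token + "_city", 0) > 0]
```

With a plain `dict`, a missing key either raises `KeyError` or is handled through `.get()`, so a lookup can no longer mutate the container being iterated.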

### :books: Tutorial and doc improvements

* fix various documentation warnings (__[mpenkov](https://github.com/mpenkov)__, [#3077](https://github.com/RaRe-Technologies/gensim/pull/3077))
* Fix broken link in run_doc how-to (__[sezanzeb](https://github.com/sezanzeb)__, [#2991](https://github.com/RaRe-Technologies/gensim/pull/2991))
* Point WordEmbeddingSimilarityIndex documentation to gensim.similarities (__[Witiko](https://github.com/Witiko)__, [#3003](https://github.com/RaRe-Technologies/gensim/pull/3003))
* Make the link to the Gensim 3.8.3 documentation dynamic (__[Witiko](https://github.com/Witiko)__, [#2996](https://github.com/RaRe-Technologies/gensim/pull/2996))

### :+1: Improvements

### :warning: Removed functionality

* remove on_batch_begin and on_batch_end callbacks (__[mpenkov](https://github.com/mpenkov)__, [#3078](https://github.com/RaRe-Technologies/gensim/pull/3078))
* remove pattern dependency (__[mpenkov](https://github.com/mpenkov)__, [#3012](https://github.com/RaRe-Technologies/gensim/pull/3012))
* rm gensim.viz submodule (__[mpenkov](https://github.com/mpenkov)__, [#3055](https://github.com/RaRe-Technologies/gensim/pull/3055))

### :warning: Deprecations (will be removed in the next major release)

### ??? Misc

**FIXME** This is a list of PRs that I couldn't find an appropriate section for.
We could make some other section for them or remove them from the changelog entirely.
This is probably OK as-is for the release candidate, but we should clean this up for the proper, final release.

* [MRG] Add Github sponsor + donation nags (__[piskvorky](https://github.com/piskvorky)__, [#3069](https://github.com/RaRe-Technologies/gensim/pull/3069))
* Update URLs (__[jonaschn](https://github.com/jonaschn)__, [#3063](https://github.com/RaRe-Technologies/gensim/pull/3063))
* Fix race condition in FastText tests (__[sleepy-owl](https://github.com/sleepy-owl)__, [#3059](https://github.com/RaRe-Technologies/gensim/pull/3059))
* Add py39 wheels to travis/azure (__[FredHappyface](https://github.com/FredHappyface)__, [#3058](https://github.com/RaRe-Technologies/gensim/pull/3058))
* Update repos before trying to install gdb (__[janaknat](https://github.com/janaknat)__, [#3035](https://github.com/RaRe-Technologies/gensim/pull/3035))
* transformed camelCase to snake_case test names (__[sezanzeb](https://github.com/sezanzeb)__, [#3033](https://github.com/RaRe-Technologies/gensim/pull/3033))
* move x86 tests from Travis to GHA, add aarch64 wheel build to Travis (__[janaknat](https://github.com/janaknat)__, [#3026](https://github.com/RaRe-Technologies/gensim/pull/3026))
* Add Github Actions x86 and mac jobs to build python wheels (__[janaknat](https://github.com/janaknat)__, [#3024](https://github.com/RaRe-Technologies/gensim/pull/3024))

## 4.0.0beta, 2020-10-31

**⚠️ Gensim 4.0 contains breaking API changes! See the [Migration guide](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4) to update your existing Gensim 3.x code and models.**
@@ -104,6 +185,7 @@ Production stability is important to Gensim, so we're improving the process of *
* [#2926](https://github.com/RaRe-Technologies/gensim/pull/2926): Rename `num_words` to `topn` in dtm_coherence, by [@MeganStodel](https://github.com/MeganStodel)
* [#2937](https://github.com/RaRe-Technologies/gensim/pull/2937): Remove Keras dependency, by [@piskvorky](https://github.com/piskvorky)
* Removed all code, methods, attributes and functions marked as deprecated in [Gensim 3.8.3](https://github.com/RaRe-Technologies/gensim/releases/tag/3.8.3).
* Removed pattern dependency (PR [#3012](https://github.com/RaRe-Technologies/gensim/pull/3012), [@mpenkov](https://github.com/mpenkov)). If you need to lemmatize, do it prior to passing the corpus to gensim.
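
A minimal sketch of "lemmatize before passing the corpus to gensim", assuming a toy lookup table in place of a real lemmatizer. `LEMMAS` and `lemmatize` are hypothetical names; in practice an external NLP library would do this step.

```python
# Hypothetical toy lemma table; a real pipeline would call an external lemmatizer.
LEMMAS = {"models": "model", "trained": "train", "vectors": "vector", "was": "be"}


def lemmatize(tokens):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [LEMMAS.get(token, token) for token in tokens]


raw_docs = ["Models trained vectors", "The model was trained"]

# Tokenize and lemmatize up front; the result is a list of token lists,
# which is the shape Gensim models typically take as training input.
corpus = [lemmatize(doc.lower().split()) for doc in raw_docs]
```

The point is only that preprocessing now happens outside Gensim, so the choice of lemmatizer is up to the caller.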

---

6 changes: 6 additions & 0 deletions ISSUE_TEMPLATE.md
@@ -15,6 +15,12 @@ What are you trying to achieve? What is the expected result? What are you seeing

Include full tracebacks, logs and datasets if necessary. Please keep the examples minimal ("minimal reproducible example").

If your problem is with a specific Gensim model (word2vec, lsimodel, doc2vec, fasttext, ldamodel, etc.), include the following:

```python
print(my_model.lifecycle_events)
```

#### Versions

Please provide the output of: