Develop v3 0 0 RC (#332)

* argschecker updated #178

* Reverted to latest pmdarima (#212)

* Removed erroneous 2nd arima fit (#212)

ARIMA fits on construction; no need to explicitly call 'fit'

* Reverted to original test specification (#212)

* Added version debug code (#212)

As requested on pmdarima bug reporting page

* Report warnings and errors (#212)

May have accidentally been suppressing errors that could reveal why the test fails

* resolves #217

* test

* changed the tests with real data to check if random numbers were confusing the models, hence the big discrepancies

* Updated Arima to use pmdarima rather than pyramid-arima (#212)

* Test pmdarima 1.0.0 to test windows (#212)

Seeing if an earlier version of pmdarima works in Windows

* emtech report to file!

* analyzer ngrams processing was not stopping unigrams :)

* adjusted tests to reflect bug fixes in stoplists processing

* added a check on the returned tuple for stopwords. That will enable users to optimize the list without having to re-compute tf-idf

* pmdarima>=1.1.0

* rid of vectorizer. Only vocabulary needed

* 225 ridof pmdarima (#226)

* rid of vectorizer. Only vocabulary needed

* rid of pmd. Also realized that two of our test series were identical. No need to test them twice :)

* pmd left.

* just to check why one excepts and other doesn't

* scipy was the problem, in the end. Has to be >=1.2.1

* 223 pipeline bug (#224)

* rid of vectorizer. Only vocabulary needed

* pickle-depickle tfidf test now represents different executions (#223)
WordAnalyser reset between calls to main() - will catch if stopwords
etc. not populated

* Travis now reports python packages in use

Added `pip freeze` to travis.yml

* Corrected pip listing of packages

* 228 data path (#229)

* Removed override to 'data' path and added date info #228
Now reports date range of patents in use
* Removed 2nd construction of WordAnalyser #228

* 230 arima failing (#231)

* Alternative method to annoy ARIMA #230

* 227 bug csv date (#233)

* Testing python 3.7.3 via pip and *correctly* switch to Xenial Linux (#227)

* Checks if DF date column is a string and converts to datetime #227

* Oops. Test failing as date_column not always corrected to datetime #227

* csv dates come as strings. Type-check to see what's going on and conv… (#232)

* moved things around a bit. type check after df creation, inside the not-read-from-pickle clause. If read from pickle, that should have been taken care of already. A sketch of the check follows.

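A minimal sketch of the date type-check described above, assuming a pandas DataFrame and a configurable date column (names are illustrative, not the actual pygrams code):

```python
import pandas as pd

def ensure_datetime(df: pd.DataFrame, date_column: str) -> pd.DataFrame:
    # CSV-sourced frames arrive with string dates; pickled frames should
    # already carry datetime64 values, so only convert when needed.
    if not pd.api.types.is_datetime64_any_dtype(df[date_column]):
        df[date_column] = pd.to_datetime(df[date_column])
    return df
```
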
* Remove leading zero trimming (#235) (#239)

* added argument for embeddings threshold

* resolves #250 (#251)

* scipy==1.2.1 else breaks

* new gensim breaks Windows! Force 3.4.0

* filtering rows now gets rid of corresponding rows in df (#249)

* filtering rows now gets rid of corresponding rows in df
* gensim & scipy versions limited due to instability introduced in current versions

* Update pygrams.py

Co-Authored-By: emily-tew <38726410+emily-tew@users.noreply.github.com>

* 248 tfidf filter (#254)

* Added prefilter of terms (#248)

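A plausible reading of the term prefilter, keeping only the highest-scoring columns before the costlier downstream steps (function and variable names are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

def prefilter_terms(tfidf: csr_matrix, vocabulary: list, keep: int):
    # Rank terms by summed tf-idf across all documents and retain the
    # top 'keep' columns, shrinking the matrix before masking/emergence.
    scores = np.asarray(tfidf.sum(axis=0)).ravel()
    top_columns = np.sort(np.argsort(scores)[::-1][:keep])
    return tfidf[:, top_columns], [vocabulary[i] for i in top_columns]
```
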
* del

* Update README.md

Missing `.` on `pip install -e .`

* Corrected check for empty CPC list (#261)

* cache 2 initial commit! (#269)

* cache 2 initial commit!

* fix-imports was calling the properties and populating tfidf_mat. Disabled it. Plus some cosmetics

* helper function to safeguard from None idf or tfidf (see the sketch below)

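The guard described might look like this (a sketch; the helper's actual name and wording in pygrams may differ):

```python
def checked(matrix, name):
    # Fail fast with a clear message when a cached idf/tf-idf structure
    # has not been computed or loaded from cache yet.
    if matrix is None:
        raise ValueError(f"{name} is not available; run the pipeline or load a cache first")
    return matrix
```
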
* 257 add nmf code (#271)

Added NMF output

* resolves #272 (#275)

* Dictionary used to store CPC rather than list inside data frame

* 273 dates as ints (#277)

* Dates now pickled as integer array to save space (#273)
Tidied up date related utilities - added to date_utils from utils
Renamed 'iso dates' to 'year_week' dates to avoid confusion with 'real' ISO dates (see the year_week sketch below)
Column filter removed from DocumentsFilter
Removed time and CPC document weighting

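A guess at the 'year_week' integer encoding, assuming ISO calendar week numbers packed into a single integer; cached folder names such as `all-mdf-0.05-200501-201841` have this shape:

```python
from datetime import date

def date_to_year_week(d: date) -> int:
    # Pack ISO year and ISO week into a compact integer, e.g.
    # date(2018, 10, 8) -> 201841, cheap to store as an integer array.
    iso_year, iso_week, _ = d.isocalendar()
    return iso_year * 100 + iso_week
```
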
* Update README.md

* 279 small adjustments (#280)

* Dates now pickled as integer array to save space (#273)
* Tidied up date related utilities - added to date_utils from utils
* Renamed 'iso dates' to 'year_week' dates to avoid confusion with 'real' ISO dates
* Column filter removed from DocumentsFilter
* Removed time and CPC document weighting
* Removed unused parameters and synchronised variable names (#273)
* Added timing report and progress reports

* 278 move mask (#283)

* resolves #278

* Changed folders for cached outputs (#281) (#284)

* 285 data uspto (#286)

* error checks change...
* resolves #286

* 287 update system requirements section (#288)

* Updated System Performance section (System Requirements)

* minor mods

* Small bug (#289)

* threshold not a list

* save time series to file (#270)

* Update README.md

-it option was outdated

* 291 bug (#292)

resolves #291

* 294 fb (#295)

* resolves #294

* 296 emtech facelift (#297)

* resolves #296

* 298 nltk installation (#299)

NLTK data now downloaded during execution of `pip install` (fixes #298)

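One way to wire that download into `pip install` is a setuptools post-install hook; this is a sketch only, and the repository's actual setup.py may differ. The corpora names match the appveyor.yml step shown later in this diff:

```python
from setuptools import setup
from setuptools.command.install import install

NLTK_PACKAGES = ('punkt', 'averaged_perceptron_tagger', 'wordnet')

class InstallWithNLTKData(install):
    def run(self):
        install.run(self)
        import nltk  # imported late, after dependencies are installed
        for package in NLTK_PACKAGES:
            nltk.download(package)

setup(cmdclass={'install': InstallWithNLTKData})
```
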
* 256 tech report 2 (#301)

resolves #256

* Ch comments (#304)

* ch comments

* Checking changes were propagated correctly #256 (#305)

* Checking changes were propagated correctly #256

* Checking changes were propagated correctly - more missing #256

* Few American spellings caught #256

* Exponential emergence (#306)

* add exponential emergence

* #255 convert r scripts (#308)

state space model resolves #308 #255

* General facelift

* Refactoring for readability
* Corrected issue with calculation of Porter (was using head not tail of dataset)

* State space (#317)

* cache state-space data!

* two-stage grid search

* Corrected test with duplicated args (good spot...)
Now copes if min/max time series dates are not defined

* If smoothing not requested, ensure None is returned for smoothed dictionary

* Default predictor set now excludes LSTMs

* 319 cache (#320)

* #319 updated code and tests to reflect new cache usage

* 321 test stopwords (#322)

* #321 added stopwords to test folder for test specific variant

* #319 consistent cmd line args, GloVe can now be placed anywhere

* 315 clamp redo (#323)

* #315 clamp smoothed values at 0 (see the sketch after this list)
* cast smoothed data back to lists (from numpy arrays) for consistency
* command line args now restricted to available smoothing and emergence
* added simple test for holt-winters to confirm -ve values not handled

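A minimal sketch of the clamp-and-cast step, assuming the smoother returns numpy arrays (illustrative names):

```python
import numpy as np

def clamp_smoothed(smoothed):
    # Term counts cannot be negative, and Holt-Winters cannot handle
    # negative inputs, so clamp at zero; cast back to a plain list so
    # smoothed and raw series share a consistent type.
    return np.maximum(np.asarray(smoothed, dtype=float), 0.0).tolist()
```
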
* 326 mpq (#327)

* mpq tweak and cached data

* #328 added tests for example command line (#329)

* #328 added tests for example command line
* fixed: date not defined when not required causes failure
* #328 corrected execution folder for README tests

* Corrected merge

* Whitespace changes ready for merge to master

* Cleanup state space modelling

* Whoops. Now checks tests again and only runs on Travis... and not win32

* 324 state space predictions (#325)

* #324 create table from state space results - work in progress
* tests TBA

* first commit

* #324 create table from state space results - with tests
* Trimmed SD not implemented

* #324 Trimmed SD implemented

* #324 report window size to HTML

* #324 WIP - needs refinement, but works for non-test. Test may blow graph generation.

* #328 multiplot added as option

* Merge issue with SSM

IanGrimstead authored and thanasions committed Sep 25, 2019
1 parent f35f452 commit 267a9ce
Showing 73 changed files with 2,247 additions and 3,787 deletions.
2 changes: 1 addition & 1 deletion .coveragerc
@@ -1,2 +1,2 @@
 [run]
-omit = tests/*
+omit = algorithms/* test* utils/* vanv/* vvcode/* support.py __init__.py
8 changes: 4 additions & 4 deletions .travis.yml
@@ -45,17 +45,17 @@ install:
 - python --version
 - python -m pip install -U pip
 - python -m easy_install -U setuptools
-# command to install dependencies
-# - python setup.py install
-- pip install -e .
+# command to install dependencies; includes extra 'test' specific dependencies
+- pip install -e .[test]

 script:
 # for codecov support
 - pip install pytest pytest-cov
 # to report installed packages
 - pip freeze
 # command to run tests
-- pytest --cov-config .coveragerc --cov=./ tests/
+- cd tests
+- pytest --cov-config ../.coveragerc --cov=../ ./

 after_success:
 - bash <(curl -s https://codecov.io/bash)
37 changes: 25 additions & 12 deletions README.md
@@ -20,7 +20,7 @@ The app pipeline (more details in the user option section):
2. **[Term Filters](#term-filters)** These filters work on term level. Examples are: search terms list (eg. pharmacy, medicine, chemist)
5. **Mask the TFIDF Matrix** Apply the filters to the TFIDF matrix
6. **[Emergence](#emergence-calculations)**
1. **[Emergence Calculations](#emergence-calculations)** Options include [Porter 2018](https://www.researchgate.net/publication/324777916_Emergence_scoring_to_identify_frontier_RD_topics_and_key_players) emergence calculations or curve fitting.
1. **[Emergence Calculations](#emergence-calculations)** Options include [Porter 2018](https://www.researchgate.net/publication/324777916_Emergence_scoring_to_identify_frontier_RD_topics_and_key_players) emergence calculations, curve fitting, or calculations designed to favour exponential-like emergence.
2. **[Emergence Forecasts](#emergence-forecasts)** Options include ARIMA, linear and quadratic regression, Holt-Winters, LSTMs.
8. **[Outputs](#outputs)** The default 'report' output is a ranked and scored list of 'popular' ngrams or emergent ones if selected. Other outputs include a 'graph summary', word cloud and an html document as emergence report.

@@ -107,12 +107,12 @@ For example, for a corpus of book blurbs you could use:
python pygrams.py -th='blurb' -dh='published_date'
```

#### Using a pre-pickled TFIDF file (-it)
#### Using cached files to speed up processing (-uc)

In order to save processing time, a pre-pickled TFIDF output file may be loaded instead of creating the TFIDF by processing a document source. These files are cached automatically upon the first run with data, and the directory hosting them inherits the output name given. Running pygrams with a cached tfidf matrix:
In order to save processing time, at various stages of the pipeline we cache data structures that are costly and slow to compute, such as the compressed tf-idf matrix, the timeseries matrix, and the smoothed series and its derivatives from the Kalman filter:

```
python pygrams.py -it USPTO-mdf-0.05
python pygrams.py -uc all-mdf-0.05-200501-201841
```
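
The caching pattern described here can be sketched as a small helper (illustrative only; the actual cache layout and file names differ by pipeline stage):

```python
import bz2
import os
import pickle

def cached(path, compute):
    # Reload a costly structure (e.g. the compressed tf-idf matrix) from
    # a bz2-compressed pickle if present; otherwise compute and store it.
    if os.path.exists(path):
        with bz2.open(path, 'rb') as file:
            return pickle.load(file)
    obj = compute()
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with bz2.open(path, 'wb') as file:
        pickle.dump(obj, file)
    return obj
```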

### TFIDF Dictionary
@@ -180,13 +180,13 @@ unbias results to avoid double or triple counting contained n-grams.
This argument can be used to filter documents to a certain timeframe. For example, the below will restrict the document cohort to only those from 20 Feb 2000 up to now (the default start date being 1 Jan 1900).

```
python pygrams.py -df=2000/02/20
python pygrams.py -dh publication_date -df=2000/02/20
```

The following will restrict the document cohort to only those between 1 March 2000 and 31 July 2016.

```
python pygrams.py -df=2000/03/01 -dt=2016/07/31
python pygrams.py -dh publication_date -df=2000/03/01 -dt=2016/07/31
```

#### Column features filters (-fh, -fb)
@@ -208,7 +208,7 @@ This filter assumes that values are '0'/'1', or 'Yes'/'No'.
This subsets the chosen patents dataset to a particular Cooperative Patent Classification (CPC) class, for example Y02. The Y02 classification is for "technologies or applications for mitigation or adaptation against climate change". An example script is:

```
python pygrams.py -cpc=Y02 -ps=USPTO-random-10000.pkl.bz2
python pygrams.py -cpc=Y02 -ds=USPTO-random-10000.pkl.bz2
```

In the console the number of subset patents will be stated. For example, for `python pygrams.py -cpc=Y02 -ds=USPTO-random-10000.pkl.bz2` the number of Y02 patents is 197. Thus, the TFIDF will be run for 197 patents.
@@ -232,12 +232,20 @@ An option to choose between popular or emergent terminology outputs. Popular ter
python pygrams.py -ts
```

#### Curve Fitting (-cf)
#### Emergence Index (-ei)

An option to choose between curve fitting or [Porter 2018](https://www.researchgate.net/publication/324777916_Emergence_scoring_to_identify_frontier_RD_topics_and_key_players) emergence calculations. Porter is used by default; curve fitting can be used instead, for example:
An option to choose between emergence indexes: quadratic fitting, [Porter 2018](https://www.researchgate.net/publication/324777916_Emergence_scoring_to_identify_frontier_RD_topics_and_key_players), or gradients from a state-space model with Kalman filter smoothing. Porter is used by default; quadratic fitting can be used instead, for example:

```
python pygrams.py -ts -cf
python pygrams.py -ts -ei quadratic
```
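
For the gradients option, a stripped-down sketch of the idea: smooth each term's series with a local-level Kalman filter and use the mean gradient of the smoothed series as the emergence index (the model in pygrams is more elaborate):

```python
import numpy as np

def gradient_emergence(series, process_var=0.1, measurement_var=1.0):
    # Local-level Kalman filter: track a latent level through the noisy
    # counts, then score emergence by the smoothed series' mean slope.
    level, p = float(series[0]), 1.0
    smoothed = []
    for observation in series:
        p += process_var                       # predict
        gain = p / (p + measurement_var)       # Kalman gain
        level += gain * (observation - level)  # update
        p *= 1.0 - gain
        smoothed.append(level)
    return float(np.mean(np.diff(smoothed)))
```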

#### Exponential (-exp)

An option designed to favour exponential-like emergence, based on a yearly weighting function that increases linearly from zero, for example:

```
python pygrams.py -ts -exp
```
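
A guess at the weighting scheme this describes, with weights rising linearly from zero so that late (exponential-like) growth dominates the score; the exact normalisation in pygrams may differ:

```python
import numpy as np

def exponential_emergence(yearly_counts):
    # Weight each year's count by a factor that rises linearly from
    # zero; recent growth dominates, flat series score near the middle.
    counts = np.asarray(yearly_counts, dtype=float)
    weights = np.linspace(0.0, 1.0, num=len(counts))
    total = counts.sum()
    return float(np.dot(weights, counts) / total) if total else 0.0
```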

### Timeseries Forecasts
@@ -306,8 +314,13 @@ Python pygrams.py -nrm=False
Pygrams outputs a report of top ranked terms (popular or emergent). Additional command line arguments provide alternative options, for example a word cloud or 'graph summary'.

```
python pygrams.py -o='wordcloud'
python pygrams.py -o='graph'
python pygrams.py -o wordcloud
python pygrams.py -o graph
```

Time series analysis also supports a multiplot to present up to 30 terms' time series (emergent and declining), output in the `outputs/emergence` folder:
```
python pygrams.py -ts -dh 'publication_date' -o multiplot
```

The output options generate:
5 changes: 3 additions & 2 deletions appveyor.yml
@@ -17,15 +17,16 @@ install:
 - python -m pip install -U pip
 - python -m easy_install -U setuptools
 # command to install dependencies
-- python setup.py install
+- pip install -e .[test]
 # also need to download punkt tokeniser data
 - python -m nltk.downloader punkt averaged_perceptron_tagger wordnet

 test_script:
 # for codecov support
 - pip install pytest pytest-cov
 # command to run tests
-- pytest --cov-report term --cov-report xml --cov=./ tests/
+- cd tests
+- pytest --cov-report term --cov-report xml --cov=../ ./

after_test:
- ps: |
2 changes: 2 additions & 0 deletions cached/all-mdf-0.05-200501-201841/USPTO-all.sh
@@ -0,0 +1,2 @@
+#!/usr/bin/env bash
+python pygrams.py -ts -ei gradients -nts 5 -mpq 50 -sma kalman -dt 2018/05/31 -tsdf 2012/06/01 -tsdt 2016/06/01 --test -pns 1 2 3 4 5 6 -dh publication_date -ds USPTO-granted-lite-all.pkl.bz2
(2 files renamed without changes; 6 binary files not shown)
7 changes: 6 additions & 1 deletion config/stopwords_n.txt
@@ -2,4 +2,9 @@ situation
 consist
 first
 plurality
-second
+second
+example apparatus
+generally describe
+determination unit
+determination unit determine
+perform operation
