Develop v3 0 0 RC (#332)

* argschecker updated #178

* Reverted to latest pmdarima (#212)

* Removed erroneous 2nd arima fit (#212)

ARIMA fits on construction; no need to explicitly call 'fit'

* Reverted to original test specification (#212)

* Added version debug code (#212)

As requested on pmdarima bug reporting page

* Report warnings and errors (#212)

May have accidentally been suppressing errors that could reveal why the test fails

* resolves #217

* test

* changed the tests with real data to check if random numbers were confusing the models, hence the big discrepancies

* Updated Arima to use pmdarima rather than pyramid-arima (#212)

* Test pmdarima 1.0.0 to test windows (#212)

Seeing if an earlier version of pmdarima works in Windows

* emtech report to file!

* analyzer ngrams processing was not stopping unigrams :)

* adjusted tests to reflect bug fixes in stoplists processing

* added a check on the returned tuple for stopwords. That will enable users to optimize the list without having to re-compute tf-idf

* pmdarima>=1.1.0

* rid of vectorizer. Only vocabulary needed

* 225 ridof pmdarima (#226)

* rid of vectorizer. Only vocabulary needed

* rid of pmd. Also realized that two of our test series were identical. No need to test them twice :)

* pmd left.

* just to check why one excepts and other doesn't

* scipy was the problem, in the end. Has to be >=1.2.1

* 223 pipeline bug (#224)

* rid of vectorizer. Only vocabulary needed

* pickle-depickle tfidf test now represents different executions (#223)
WordAnalyser reset between calls to main() - will catch if stopwords
etc. not populated

* Travis now reports python packages in use

Added `pip freeze` to travis.yml

* Corrected pip listing of packages

* 228 data path (#229)

* Removed override to 'data' path and added date info #228
Now reports date range of patents in use
* Removed 2nd construction of WordAnalyser #228

* 230 arima failing (#231)

* Alternative method to annoy ARIMA #230

* 227 bug csv date (#233)

* Testing python 3.7.3 via pip and *correctly* switch to Xenial Linux (#227)

* Checks if DF date column is a string and converts to datetime #227

* Oops. Test failing as date_column not always corrected to datetime #227

* csv dates come as strings. Type-check to see what's going on and conv… (#232)

* moved things around a bit. type check after df creation, inside the not-read-from-pickle clause. If read from pickle, that should have been taken care of already. A sketch of the check follows.

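A minimal sketch of the date type-check described above, assuming a pandas DataFrame and a configurable date column (names are illustrative, not the actual pygrams code):

```python
import pandas as pd

def ensure_datetime(df: pd.DataFrame, date_column: str) -> pd.DataFrame:
    # CSV-sourced frames arrive with string dates; pickled frames should
    # already carry datetime64 values, so only convert when needed.
    if not pd.api.types.is_datetime64_any_dtype(df[date_column]):
        df[date_column] = pd.to_datetime(df[date_column])
    return df
```
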
* Remove leading zero trimming (#235) (#239)

* added argument for embeddings threshold

* resolves #250 (#251)

* scipy==1.2.1 else breaks

* new gensim breaks Windows! Force 3.4.0

* filtering rows now gets rid of corresponding rows in df (#249)

* filtering rows now gets rid of corresponding rows in df
* gensim & scipy versions limited due to instability introduced in current versions

* Update pygrams.py

Co-Authored-By: emily-tew <38726410+emily-tew@users.noreply.github.com>

* 248 tfidf filter (#254)

* Added prefilter of terms (#248)

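A plausible reading of the term prefilter, keeping only the highest-scoring columns before the costlier downstream steps (function and variable names are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

def prefilter_terms(tfidf: csr_matrix, vocabulary: list, keep: int):
    # Rank terms by summed tf-idf across all documents and retain the
    # top 'keep' columns, shrinking the matrix before masking/emergence.
    scores = np.asarray(tfidf.sum(axis=0)).ravel()
    top_columns = np.sort(np.argsort(scores)[::-1][:keep])
    return tfidf[:, top_columns], [vocabulary[i] for i in top_columns]
```
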
* del

* Update README.md

Missing `.` on `pip install -e .`

* Corrected check for empty CPC list (#261)

* cache 2 initial commit! (#269)

* cache 2 initial commit!

* fix-imports was calling the properties and populating tfidf_mat. Disabled it. Plus some cosmetics

* helper function to safeguard from None idf or tfidf (see the sketch below)

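The guard described might look like this (a sketch; the helper's actual name and wording in pygrams may differ):

```python
def checked(matrix, name):
    # Fail fast with a clear message when a cached idf/tf-idf structure
    # has not been computed or loaded from cache yet.
    if matrix is None:
        raise ValueError(f"{name} is not available; run the pipeline or load a cache first")
    return matrix
```
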
* 257 add nmf code (#271)

Added NMF output

* resolves #272 (#275)

* Dictionary used to store CPC rather than list inside data frame

* 273 dates as ints (#277)

* Dates now pickled as integer array to save space (#273)
Tidied up date related utilities - added to date_utils from utils
Renamed 'iso dates' to 'year_week' dates to avoid confusion with 'real' ISO dates (see the year_week sketch below)
Column filter removed from DocumentsFilter
Removed time and CPC document weighting

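A guess at the 'year_week' integer encoding, assuming ISO calendar week numbers packed into a single integer; cached folder names such as `all-mdf-0.05-200501-201841` have this shape:

```python
from datetime import date

def date_to_year_week(d: date) -> int:
    # Pack ISO year and ISO week into a compact integer, e.g.
    # date(2018, 10, 8) -> 201841, cheap to store as an integer array.
    iso_year, iso_week, _ = d.isocalendar()
    return iso_year * 100 + iso_week
```
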
* Update README.md

* 279 small adjustments (#280)

* Dates now pickled as integer array to save space (#273)
* Tidied up date related utilities - added to date_utils from utils
* Renamed 'iso dates' to 'year_week' dates to avoid confusion with 'real' ISO dates
* Column filter removed from DocumentsFilter
* Removed time and CPC document weighting
* Removed unused parameters and synchronised variable names (#273)
* Added timing report and progress reports

* 278 move mask (#283)

* resolves #278

* Changed folders for cached outputs (#281) (#284)

* 285 data uspto (#286)

* error checks change...
* resolves #286

* 287 update system requirements section (#288)

* Updated System Performance section (System Requirements)

* minor mods

* Small bug (#289)

* threshold not a list

* save time series to file (#270)

* Update README.md

-it option was outdated

* 291 bug (#292)

resolves #291

* 294 fb (#295)

* resolves #294

* 296 emtech facelift (#297)

* resolves #296

* 298 nltk installation (#299)

NLTK data now downloaded during execution of `pip install` (fixes #298)

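One way to wire that download into `pip install` is a setuptools post-install hook; this is a sketch only, and the repository's actual setup.py may differ. The corpora names match the appveyor.yml step shown later in this diff:

```python
from setuptools import setup
from setuptools.command.install import install

NLTK_PACKAGES = ('punkt', 'averaged_perceptron_tagger', 'wordnet')

class InstallWithNLTKData(install):
    def run(self):
        install.run(self)
        import nltk  # imported late, after dependencies are installed
        for package in NLTK_PACKAGES:
            nltk.download(package)

setup(cmdclass={'install': InstallWithNLTKData})
```
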
* 256 tech report 2 (#301)

resolves #256

* Ch comments (#304)

* ch comments

* Checking changes were propagated correctly #256 (#305)

* Checking changes were propagated correctly #256

* Checking changes were propagated correctly - more missing #256

* Few American spellings caught #256

* Exponential emergence (#306)

* add exponential emergence

* #255 convert r scripts (#308)

state space model resolves #308 #255

* General facelift

* Refactoring for readability
* Corrected issue with calculation of Porter (was using head not tail of dataset)

* State space (#317)

* cache state-space data!

* two-stage grid search

* Corrected test with duplicated args (good spot...)
Now copes if min/max time series dates are not defined

* If smoothing not requested, ensure None is returned for smoothed dictionary

* Default predictor set now excludes LSTMs

* 319 cache (#320)

* #319 updated code and tests to reflect new cache usage

* 321 test stopwords (#322)

* #321 added stopwords to test folder for test specific variant

* #319 consistent cmd line args, GloVe can now be placed anywhere

* 315 clamp redo (#323)

* #315 clamp smoothed values at 0 (see the sketch after this list)
* cast smoothed data back to lists (from numpy arrays) for consistency
* command line args now restricted to available smoothing and emergence
* added simple test for holt-winters to confirm -ve values not handled

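A minimal sketch of the clamp-and-cast step, assuming the smoother returns numpy arrays (illustrative names):

```python
import numpy as np

def clamp_smoothed(smoothed):
    # Term counts cannot be negative, and Holt-Winters cannot handle
    # negative inputs, so clamp at zero; cast back to a plain list so
    # smoothed and raw series share a consistent type.
    return np.maximum(np.asarray(smoothed, dtype=float), 0.0).tolist()
```
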
* 326 mpq (#327)

* mpq tweak and cached data

* #328 added tests for example command line (#329)

* #328 added tests for example command line
* fixed: date not defined when not required causes failure
* #328 corrected execution folder for README tests

* Corrected merge

* Whitespace changes ready for merge to master

* Cleanup state space modelling

* Whoops. Now checks tests again and only runs on Travis... and not win32

* 324 state space predictions (#325)

* #324 create table from state space results - work in progress
* tests TBA

* first commit

* #324 create table from state space results - with tests
* Trimmed SD not implemented

* #324 Trimmed SD implemented

* #324 report window size to HTML

* #324 WIP - needs refinement, but works for non-test. Test may blow graph generation.

* #328 multiplot added as option

* Merge issue with SSM

IanGrimstead authored and thanasions committed Sep 25, 2019
1 parent f35f452 commit 267a9ce
Showing 73 changed files with 2,247 additions and 3,787 deletions.
2 changes: 1 addition & 1 deletion .coveragerc
@@ -1,2 +1,2 @@
 [run]
-omit = tests/*
+omit = algorithms/* test* utils/* vanv/* vvcode/* support.py __init__.py
8 changes: 4 additions & 4 deletions .travis.yml
@@ -45,17 +45,17 @@ install:
 - python --version
 - python -m pip install -U pip
 - python -m easy_install -U setuptools
-# command to install dependencies
-# - python setup.py install
-- pip install -e .
+# command to install dependencies; includes extra 'test' specific dependencies
+- pip install -e .[test]

 script:
 # for codecov support
 - pip install pytest pytest-cov
 # to report installed packages
 - pip freeze
 # command to run tests
-- pytest --cov-config .coveragerc --cov=./ tests/
+- cd tests
+- pytest --cov-config ../.coveragerc --cov=../ ./

 after_success:
 - bash <(curl -s https://codecov.io/bash)
37 changes: 25 additions & 12 deletions README.md
@@ -20,7 +20,7 @@ The app pipeline (more details in the user option section):
2. **[Term Filters](#term-filters)** These filters work on term level. Examples are: search terms list (eg. pharmacy, medicine, chemist)
5. **Mask the TFIDF Matrix** Apply the filters to the TFIDF matrix
6. **[Emergence](#emergence-calculations)**
1. **[Emergence Calculations](#emergence-calculations)** Options include [Porter 2018](https://www.researchgate.net/publication/324777916_Emergence_scoring_to_identify_frontier_RD_topics_and_key_players) emergence calculations or curve fitting.
1. **[Emergence Calculations](#emergence-calculations)** Options include [Porter 2018](https://www.researchgate.net/publication/324777916_Emergence_scoring_to_identify_frontier_RD_topics_and_key_players) emergence calculations, curve fitting, or calculations designed to favour exponential-like emergence.
2. **[Emergence Forecasts](#emergence-forecasts)** Options include ARIMA, linear and quadratic regression, Holt-Winters, LSTMs.
8. **[Outputs](#outputs)** The default 'report' output is a ranked and scored list of 'popular' ngrams or emergent ones if selected. Other outputs include a 'graph summary', word cloud and an html document as emergence report.

@@ -107,12 +107,12 @@ For example, for a corpus of book blurbs you could use:
python pygrams.py -th='blurb' -dh='published_date'
```

#### Using a pre-pickled TFIDF file (-it)
#### Using cached files to speed up processing (-uc)

In order to save processing time, a pre-pickled TFIDF output file may be loaded instead of creating the TFIDF by processing a document source. These files are cached automatically upon the first run with data, and the directory hosting them inherits the output name given. Running pygrams with a cached tfidf matrix:
In order to save processing time, at various stages of the pipeline we cache data structures that are costly and slow to compute, such as the compressed tf-idf matrix, the timeseries matrix, and the smoothed series and its derivatives from the Kalman filter:

```
python pygrams.py -it USPTO-mdf-0.05
python pygrams.py -uc all-mdf-0.05-200501-201841
```
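
The caching pattern described here can be sketched as a small helper (illustrative only; the actual cache layout and file names differ by pipeline stage):

```python
import bz2
import os
import pickle

def cached(path, compute):
    # Reload a costly structure (e.g. the compressed tf-idf matrix) from
    # a bz2-compressed pickle if present; otherwise compute and store it.
    if os.path.exists(path):
        with bz2.open(path, 'rb') as file:
            return pickle.load(file)
    obj = compute()
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with bz2.open(path, 'wb') as file:
        pickle.dump(obj, file)
    return obj
```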

### TFIDF Dictionary
@@ -180,13 +180,13 @@ unbias results to avoid double or triple counting contained n-grams.
This argument can be used to filter documents to a certain timeframe. For example, the below will restrict the document cohort to only those from 20 Feb 2000 up to now (the default start date being 1 Jan 1900).

```
python pygrams.py -df=2000/02/20
python pygrams.py -dh publication_date -df=2000/02/20
```

The following will restrict the document cohort to only those between 1 March 2000 and 31 July 2016.

```
python pygrams.py -df=2000/03/01 -dt=2016/07/31
python pygrams.py -dh publication_date -df=2000/03/01 -dt=2016/07/31
```

#### Column features filters (-fh, -fb)
@@ -208,7 +208,7 @@ This filter assumes that values are '0'/'1', or 'Yes'/'No'.
This subsets the chosen patents dataset to a particular Cooperative Patent Classification (CPC) class, for example Y02. The Y02 classification is for "technologies or applications for mitigation or adaptation against climate change". An example script is:

```
python pygrams.py -cpc=Y02 -ps=USPTO-random-10000.pkl.bz2
python pygrams.py -cpc=Y02 -ds=USPTO-random-10000.pkl.bz2
```

In the console the number of subset patents will be stated. For example, for `python pygrams.py -cpc=Y02 -ds=USPTO-random-10000.pkl.bz2` the number of Y02 patents is 197. Thus, the TFIDF will be run for 197 patents.
@@ -232,12 +232,20 @@ An option to choose between popular or emergent terminology outputs. Popular ter
python pygrams.py -ts
```

#### Curve Fitting (-cf)
#### Emergence Index (-ei)

An option to choose between curve fitting or [Porter 2018](https://www.researchgate.net/publication/324777916_Emergence_scoring_to_identify_frontier_RD_topics_and_key_players) emergence calculations. Porter is used by default; curve fitting can be used instead, for example:
An option to choose between emergence indexes: quadratic fitting, [Porter 2018](https://www.researchgate.net/publication/324777916_Emergence_scoring_to_identify_frontier_RD_topics_and_key_players), or gradients from a state-space model with Kalman filter smoothing. Porter is used by default; quadratic fitting can be used instead, for example:

```
python pygrams.py -ts -cf
python pygrams.py -ts -ei quadratic
```
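
For the gradients option, a stripped-down sketch of the idea: smooth each term's series with a local-level Kalman filter and use the mean gradient of the smoothed series as the emergence index (the model in pygrams is more elaborate):

```python
import numpy as np

def gradient_emergence(series, process_var=0.1, measurement_var=1.0):
    # Local-level Kalman filter: track a latent level through the noisy
    # counts, then score emergence by the smoothed series' mean slope.
    level, p = float(series[0]), 1.0
    smoothed = []
    for observation in series:
        p += process_var                       # predict
        gain = p / (p + measurement_var)       # Kalman gain
        level += gain * (observation - level)  # update
        p *= 1.0 - gain
        smoothed.append(level)
    return float(np.mean(np.diff(smoothed)))
```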

#### Exponential (-exp)

An option designed to favour exponential-like emergence, based on a yearly weighting function that increases linearly from zero, for example:

```
python pygrams.py -ts -exp
```
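
A guess at the weighting scheme this describes, with weights rising linearly from zero so that late (exponential-like) growth dominates the score; the exact normalisation in pygrams may differ:

```python
import numpy as np

def exponential_emergence(yearly_counts):
    # Weight each year's count by a factor that rises linearly from
    # zero; recent growth dominates, flat series score near the middle.
    counts = np.asarray(yearly_counts, dtype=float)
    weights = np.linspace(0.0, 1.0, num=len(counts))
    total = counts.sum()
    return float(np.dot(weights, counts) / total) if total else 0.0
```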

### Timeseries Forecasts
@@ -306,8 +314,13 @@ Python pygrams.py -nrm=False
Pygrams outputs a report of top ranked terms (popular or emergent). Additional command line arguments provide alternative options, for example a word cloud or 'graph summary'.

```
python pygrams.py -o='wordcloud'
python pygrams.py -o='graph'
python pygrams.py -o wordcloud
python pygrams.py -o graph
```

Time series analysis also supports a multiplot to present up to 30 terms' time series (emergent and declining), output in the `outputs/emergence` folder:
```
python pygrams.py -ts -dh 'publication_date' -o multiplot
```

The output options generate:
5 changes: 3 additions & 2 deletions appveyor.yml
@@ -17,15 +17,16 @@ install:
 - python -m pip install -U pip
 - python -m easy_install -U setuptools
 # command to install dependencies
-- python setup.py install
+- pip install -e .[test]
 # also need to download punkt tokeniser data
 - python -m nltk.downloader punkt averaged_perceptron_tagger wordnet

 test_script:
 # for codecov support
 - pip install pytest pytest-cov
 # command to run tests
-- pytest --cov-report term --cov-report xml --cov=./ tests/
+- cd tests
+- pytest --cov-report term --cov-report xml --cov=../ ./

after_test:
- ps: |
2 changes: 2 additions & 0 deletions cached/all-mdf-0.05-200501-201841/USPTO-all.sh
@@ -0,0 +1,2 @@
+#!/usr/bin/env bash
+python pygrams.py -ts -ei gradients -nts 5 -mpq 50 -sma kalman -dt 2018/05/31 -tsdf 2012/06/01 -tsdt 2016/06/01 --test -pns 1 2 3 4 5 6 -dh publication_date -ds USPTO-granted-lite-all.pkl.bz2
(2 files renamed without changes; 6 binary files not shown)
7 changes: 6 additions & 1 deletion config/stopwords_n.txt
@@ -2,4 +2,9 @@ situation
 consist
 first
 plurality
-second
+second
+example apparatus
+generally describe
+determination unit
+determination unit determine
+perform operation
