Commit

Merge branch 'release-1.3'

rebeccabilbro committed Feb 9, 2021
2 parents d58ab34 + ad81093, commit 38e6b31
Showing 102 changed files with 1,442 additions and 452 deletions.
12 changes: 6 additions & 6 deletions .travis.yml
@@ -2,18 +2,18 @@ dist: xenial
language: python
matrix:
  include:
-    - name: "Python 3.6 on Xenial Linux"
-      python: '3.6'

    - name: "Python 3.7 on Xenial Linux"
      python: '3.7'

-    - name: "Miniconda 3.6 on Xenial Linux"
-      env: ANACONDA="3.6"
+    - name: "Python 3.8 on Xenial Linux"
+      python: '3.8'

    - name: "Miniconda 3.7 on Xenial Linux"
      env: ANACONDA="3.7"

+    - name: "Miniconda 3.8 on Xenial Linux"
+      env: ANACONDA="3.8"

before_install:
  - sudo apt-get update;
  - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then
@@ -23,8 +23,8 @@ before_install:

install:
  - if [[ -z ${ANACONDA} ]]; then
-      pip install -r requirements.txt;
      pip install -r tests/requirements.txt;
+      pip install -r requirements.txt;
      pip install coveralls;
    else
      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-$MINICONDA_OS-x86_64.sh -O miniconda.sh;
2 changes: 1 addition & 1 deletion README.md
@@ -66,8 +66,8 @@ from sklearn.svm import LinearSVC
from yellowbrick.classifier import ROCAUC

model = LinearSVC()
-model.fit(X,y)
visualizer = ROCAUC(model)
+visualizer.fit(X,y)
visualizer.score(X,y)
visualizer.show()
```
44 changes: 44 additions & 0 deletions docs/api/model_selection/importances.rst
@@ -111,6 +111,50 @@ Taking the mean of the importances may be undesirable for several reasons. For e
viz.fit(X, y)
viz.show()

Top and Bottom Feature Importances
----------------------------------

It may be more illuminating to the feature engineering process to identify the most or least informative features. To view only the N most informative features, specify the ``topn`` argument to the visualizer. Like slicing a ranked list by importance, if ``topn`` is a positive integer, the N most highly ranked features are displayed; if ``topn`` is a negative integer, the N lowest ranked features are displayed instead.

.. plot::
:context: close-figs
:alt: Coefficient importances for LASSO regression

from sklearn.linear_model import Lasso
from yellowbrick.datasets import load_concrete
from yellowbrick.model_selection import FeatureImportances

# Load the regression dataset
dataset = load_concrete(return_dataset=True)
X, y = dataset.to_data()

# Title case the features for better display and create the visualizer
labels = list(map(lambda s: s.title(), dataset.meta['features']))
viz = FeatureImportances(Lasso(), labels=labels, relative=False, topn=3)

# Fit and show the feature importances
viz.fit(X, y)
viz.show()

Using ``topn=3``, we can identify the three most informative features in the concrete dataset as ``splast``, ``cement``, and ``water``. This approach to visualization may assist with *factor analysis*, the study of how variables contribute to an overall model. Note that although ``water`` has a negative coefficient, it is the magnitude (absolute value) of the feature that determines its rank; the sign simply indicates that ``water`` is negatively correlated with the strength of concrete. Alternatively, ``topn=-3`` would reveal the three least informative features in the model. This approach is useful for model tuning, similar to :doc:`rfecv`, but instead of automatically removing features it allows you to identify the lowest-ranked features as they change across model instantiations. In either case, if you have many features, using ``topn`` can significantly increase the visual and analytical clarity of your analysis.

The ``topn`` parameter can also be used when ``stack=True``. In the context of stacked feature importance graphs, the information of a feature is the width of the entire bar, or the sum of the absolute values of all the coefficients contained therein.

.. plot::
:context: close-figs
:alt: Stacked per-class importances with Logistic Regression

from yellowbrick.model_selection import FeatureImportances
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

data = load_iris()
X, y = data.data, data.target

model = LogisticRegression(multi_class="auto", solver="liblinear")
viz = FeatureImportances(model, stack=True, relative=False, topn=-3)
viz.fit(X, y)
viz.show()

Discussion
----------
31 changes: 29 additions & 2 deletions docs/api/text/dispersion.rst
@@ -3,7 +3,10 @@
Dispersion Plot
===============

-A word's importance can be weighed by its dispersion in a corpus. Lexical dispersion is a measure of a word's homogeneity across the parts of a corpus. This plot notes the occurrences of a word and how many words from the beginning of the corpus it appears.
+A word's importance can be weighed by its dispersion in a corpus. Lexical dispersion is a measure of a word's homogeneity across the parts of a corpus.
+
+Lexical dispersion illustrates the homogeneity of a word (or set of words) across the documents of a corpus. ``DispersionPlot`` visualizes the lexical dispersion of words in a corpus, using vertical lines to mark the occurrences of one or more search terms throughout the corpus and noting how many words from the beginning of the corpus each occurrence appears.

================= ==============================
Visualizer :class:`~yellowbrick.text.dispersion.DispersionPlot`
@@ -33,6 +36,30 @@ Workflow Feature Engineering
visualizer.fit(text)
visualizer.show()

If the target vector of the corpus documents is provided, the points will be colored with respect to their document category, which allows for additional analysis of search term homogeneity within and across document categories.

.. plot::
:context: close-figs
:alt: Dispersion Plot with Classes

from yellowbrick.text import DispersionPlot
from yellowbrick.datasets import load_hobbies

corpus = load_hobbies()
text = [doc.split() for doc in corpus.data]
y = corpus.target

target_words = ['points', 'money', 'score', 'win', 'reduce']

visualizer = DispersionPlot(
target_words,
colormap="Accent",
title="Lexical Dispersion Plot, Broken Down by Class"
)
visualizer.fit(text, y)
visualizer.show()


Quick Method
------------

@@ -55,7 +82,7 @@ The same functionality above can be achieved with the associated quick method `d
target_words = ['features', 'mobile', 'cooperative', 'competitive', 'combat', 'online']

# Create the visualizer and draw the plot
-dispersion(target_words, text)
+dispersion(target_words, text, colors=['olive'])


API Reference
29 changes: 29 additions & 0 deletions docs/changelog.rst
@@ -3,6 +3,33 @@
Changelog
=========

Version 1.3
-----------

* Tag: v1.3_
* Deployed Tuesday, February 9, 2021
* Current Contributors: Benjamin Bengfort, Rebecca Bilbro, Paul Johnson, Philippe Billet, Prema Roman, Patrick Deziel

This version primarily repairs the dependency issues we faced with scipy 1.6, scikit-learn 0.24 and Python 3.6 (or earlier). As part of the rapidly changing Python library landscape, we've been forced to react quickly to dependency changes, even where those libraries have been responsibly issuing future and deprecation warnings.

Major Changes:
- Implement new ``get_params`` and ``set_params`` methods on ModelVisualizers to ensure the wrapped estimator is correctly accessed via the standard estimator methods (see the sketch below).
- Freeze the test dependencies to prevent variability in CI (we must periodically review dependencies to ensure we're testing what our users are experiencing).
- Change the ``model`` param to ``estimator`` so that Visualizer arguments match their property names, allowing ``inspect`` to work with ``get_params``, ``set_params``, and other scikit-learn utility functions.
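
For example, the new behavior looks roughly like the following (a minimal sketch, assuming the standard scikit-learn nested-parameter convention; ``Lasso`` stands in for any wrapped estimator):

.. code-block:: python

    from sklearn.linear_model import Lasso
    from yellowbrick.model_selection import FeatureImportances

    viz = FeatureImportances(Lasso())

    # get_params reports the wrapped estimator and its nested parameters
    # using the scikit-learn "estimator__<param>" convention
    params = viz.get_params()
    assert "estimator" in params and "estimator__alpha" in params

    # set_params forwards nested parameters to the wrapped estimator
    viz.set_params(estimator__alpha=0.5)
    assert viz.estimator.alpha == 0.5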

Minor Changes:
- Import the scikit-learn private API ``_safe_indexing`` without error.
- Remove any calls to ``set_params`` in Visualizer ``__init__`` methods.
- Modify test fixtures and baseline images to accommodate the new scikit-learn implementation.
- Set the numpy dependency to be less than 1.20 because it causes pickling issues with joblib and umap.
- Add a ``shuffle=True`` argument to any CV class that uses a random seed (see the sketch after this list).
- Set our CI matrix to Python and Miniconda 3.7 and 3.8.
- Correct the README regarding the ModelVisualizer API.
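
The CV change follows the pattern sketched below (``KFold`` stands in for any of the splitters used in the docs and tests; the argument values are illustrative):

.. code-block:: python

    from sklearn.model_selection import KFold

    # Recent scikit-learn releases reject a random_state supplied without
    # shuffle=True, so splitters now enable shuffling explicitly
    cv = KFold(n_splits=12, shuffle=True, random_state=42)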


.. _v1.3: https://github.com/DistrictDataLabs/yellowbrick/releases/tag/v1.3


Hotfix 1.2.1
------------

@@ -12,6 +39,8 @@ Hotfix 1.2.1

On December 22, 2020, scikit-learn released version 0.24, which deprecated the external use of scikit-learn's internal utilities such as ``safe_indexing``. Unfortunately, Yellowbrick depends on a few of these utilities, so we must refactor our internal code base to port this functionality or work around it. To ensure that Yellowbrick continues to work when installed via ``pip``, we have temporarily changed our scikit-learn dependency to be less than 0.24. We will update our dependencies in the v1.3 release once we have made the associated fixes.
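
The eventual port in v1.3 follows the familiar guarded-import pattern, sketched here (the general approach rather than the exact Yellowbrick code):

.. code-block:: python

    try:
        # scikit-learn >= 0.24 moved the helper to the private API
        from sklearn.utils import _safe_indexing as safe_indexing
    except ImportError:
        # scikit-learn < 0.24
        from sklearn.utils import safe_indexing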

.. _v1.2.1: https://github.com/DistrictDataLabs/yellowbrick/releases/tag/v1.2.1


Version 1.2
-----------
Expand Down
2 changes: 2 additions & 0 deletions docs/governance/index.rst
@@ -218,3 +218,5 @@ Board of Advisors Minutes
minutes/2019-05-15.rst
minutes/2019-09-09.rst
minutes/2020-01-07.rst
minutes/2020-05-13.rst
minutes/2020-10-06.rst