Commit

Merge branch 'main' into doc/WHO_Examples
rosecers authored Jan 18, 2023
2 parents 28e75f4 + 7ef05eb commit 8dfcc83
Showing 10 changed files with 177 additions and 8 deletions.
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -9,4 +9,4 @@ sphinx_rtd_theme
tqdm
traitlets>=5.0
jinja2 < 3.1
pandas
pandas
116 changes: 116 additions & 0 deletions docs/source/contributing.rst
@@ -39,6 +39,122 @@ You may want to setup your editor to automatically apply the
files, there are plugins to do this with `all major
editors <https://black.readthedocs.io/en/stable/editor_integration.html>`_.


Issues and Pull Requests
########################

Having a problem with scikit-COSMO? Please let us know by `submitting an issue <https://github.com/lab-cosmo/scikit-cosmo/issues>`_.

Submit new features or bug fixes through a `pull request <https://github.com/lab-cosmo/scikit-cosmo/pulls>`_.


Contributing Datasets
#####################

Have an example dataset that would fit into scikit-COSMO?

Contributing a dataset is easy. First, copy your numpy file into
``skcosmo/datasets/data/`` with an informative name. Here, we'll call it ``my-dataset.npz``.
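
If your data is not already stored as an ``.npz`` archive, a minimal sketch of creating one — assuming
your input vectors and target properties live in NumPy arrays ``X`` and ``y``, the same keys the loader
shown below expects — is:

.. code-block:: python

    import numpy as np

    # Hypothetical data: 100 samples, 20 features, one target property per sample.
    X = np.random.rand(100, 20)
    y = np.random.rand(100)

    # Save both arrays under the keys "X" and "y"; np.load will return them by key.
    np.savez("my-dataset.npz", X=X, y=y)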

Next, create a documentation file in ``skcosmo/datasets/descr/my-dataset.rst``.
This file should look like this:

.. code-block:: rst

    .. _my-dataset:

    My Dataset
    ##########

    This is a summary of my dataset. My dataset was originally published in My Paper.

    Function Call
    -------------

    .. function:: skcosmo.datasets.load_my_dataset

    Data Set Characteristics
    ------------------------

    :Number of Instances: ______
    :Number of Features: ______

    The representations were computed with the _____ package using the hyperparameters:

    +------------------------+------------+
    | key                    | value      |
    +------------------------+------------+
    | hyperparameter 1       | _____      |
    +------------------------+------------+
    | hyperparameter 2       | _____      |
    +------------------------+------------+

    Of the ____ resulting features, ____ were selected via _____.

    References
    ----------

    Reference Code
    --------------

Then, show ``scikit-cosmo`` how to load your data by adding a loader function to
``skcosmo/datasets/_base.py``. It should look like this:

.. code-block:: python

    def load_my_dataset():
        """Load and return my dataset.

        Returns
        -------
        my_data : sklearn.utils.Bunch
            Dictionary-like object, with the following attributes:

            data : `sklearn.utils.Bunch` --
                contains the keys ``X`` and ``y``,
                my input vectors and properties, respectively.

            DESCR : `str` --
                The full description of the dataset.
        """
        module_path = dirname(__file__)
        target_filename = join(module_path, "data", "my-dataset.npz")
        raw_data = np.load(target_filename)
        data = Bunch(
            X=raw_data["X"],
            y=raw_data["y"],
        )
        with open(join(module_path, "descr", "my-dataset.rst")) as rst_file:
            fdescr = rst_file.read()

        return Bunch(data=data, DESCR=fdescr)

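Once the loader is in place and exported (next step), using it should look roughly like this
sketch — the attribute layout simply mirrors the ``Bunch`` returned above:

.. code-block:: python

    from skcosmo.datasets import load_my_dataset

    my_data = load_my_dataset()
    X = my_data.data.X      # input vectors
    y = my_data.data.y      # target properties
    print(my_data.DESCR)    # full dataset description
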
Add this function to ``skcosmo/datasets/__init__.py``.
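A minimal sketch of that addition — the placeholder comment stands in for whatever loaders
``__init__.py`` already exposes, so the exact surrounding lines are an assumption:

.. code-block:: python

    # skcosmo/datasets/__init__.py (sketch)
    from ._base import load_my_dataset  # alongside the existing loaders

    __all__ = [
        # ... existing loader names ...
        "load_my_dataset",
    ]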

Finally, add a test to ``skcosmo/tests/test_datasets.py`` to see that your dataset
loads properly. It should look something like this:

.. code-block:: python

    class MyDatasetTests(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            cls.my_data = load_my_dataset()

        def test_load_my_data(self):
            # test if representations and properties have commensurate shape
            self.assertTrue(self.my_data.data.X.shape[0] == self.my_data.data.y.shape[0])

        def test_load_my_data_descr(self):
            self.my_data.DESCR

You're good to go! Time to submit a `pull request <https://github.com/lab-cosmo/scikit-cosmo/pulls>`_.


License
#######

3 changes: 2 additions & 1 deletion docs/source/datasets.rst
@@ -1,6 +1,7 @@
Datasets
================
========

.. include:: ../../skcosmo/datasets/descr/degenerate_CH4_manifold.rst

.. include:: ../../skcosmo/datasets/descr/csd-1000r.rst

6 changes: 6 additions & 0 deletions docs/source/gfrm.rst
@@ -6,18 +6,24 @@ Reconstruction Measures
.. currentmodule:: skcosmo.metrics


.. _GRE-api:

Global Reconstruction Error
###########################

.. autofunction:: pointwise_global_reconstruction_error
.. autofunction:: global_reconstruction_error

.. _GRD-api:

Global Reconstruction Distortion
################################

.. autofunction:: pointwise_global_reconstruction_distortion
.. autofunction:: global_reconstruction_distortion

.. _LRE-api:

Local Reconstruction Error
##########################

14 changes: 14 additions & 0 deletions docs/source/index.rst
@@ -5,6 +5,20 @@ scikit-cosmo documentation
compatible utilities that implement methods developed in the `COSMO laboratory
<https://cosmo.epfl.ch>`_.

Convenient-to-use libraries such as scikit-learn have accelerated the adoption and application
of machine learning (ML) workflows and data-driven methods. Such libraries have gained great
popularity partly because the implemented methods are generally applicable in multiple domains.
While developments in the atomistic learning community have put forward general-use machine
learning methods, their deployment is commonly entangled with domain-specific functionalities,
limiting their accessibility to a wider audience.

scikit-COSMO targets domain-agnostic implementations of methods developed in the
computational chemical and materials science community, following the
scikit-learn API and coding guidelines to promote usability and interoperability
with existing workflows. scikit-COSMO contains a toolbox of methods for
unsupervised and supervised analysis of ML datasets, including the comparison,
decomposition, and selection of features and samples.

.. toctree::
:maxdepth: 1
:caption: Contents:
27 changes: 27 additions & 0 deletions docs/source/intro.rst
@@ -12,4 +12,31 @@ Currently, scikit-COSMO contains models described in [Imbalzano2018]_, [Helfrech
as some modifications to sklearn functionalities and minimal datasets that are useful within the field
of computational materials science and chemistry.



- Fingerprint Selection:
Multiple data sub-selection modules, for selecting the most relevant features and samples out of a large set of candidates [Imbalzano2018]_, [Helfrecht2020]_ and [Cersonsky2021]_. A usage sketch follows the list below.

* :ref:`CUR-api` decomposition: an iterative feature selection method based upon the singular value decomposition.
* :ref:`PCov-CUR-api` decomposition extends upon CUR by using augmented right or left singular vectors inspired by Principal Covariates Regression.
* :ref:`FPS-api`: a common selection technique intended to exploit the diversity of the input space. The selection of the first point is made at random or by a separate metric.
* :ref:`PCov-FPS-api` extends upon FPS much like PCov-CUR does to CUR.
* :ref:`Voronoi-FPS-api`: conducts FPS selection, taking advantage of Voronoi tessellations to accelerate the selection.
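
As a usage sketch (the ``n_to_select`` argument and the exact import path are assumptions based on the
selection documentation; the selectors follow the usual sklearn ``fit``/``transform`` pattern):

.. code-block:: python

    import numpy as np
    from skcosmo.feature_selection import FPS

    X = np.random.rand(100, 50)      # hypothetical feature matrix

    selector = FPS(n_to_select=10)   # keep the 10 most diverse features
    selector.fit(X)
    Xr = selector.transform(X)       # reduced feature matrix, shape (100, 10)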

- Reconstruction Measures:
A set of easily interpretable error measures of the relative information capacity of feature space `F` with respect to feature space `F'`.
The methods return a value between 0 and 1; since each measure is a reconstruction error, 0 means that `F'` is fully linearly decodable from `F`, while 1 means that `F` and `F'` are completely distinct in terms of linearly-decodable information.
All methods are implemented as the root-mean-square error for the regression of the feature matrix `X_F'` (sometimes called `Y` in the docs) from `X_F` (sometimes called `X` in the docs), for transformations with different constraints (linear, orthogonal, locally linear).
By default, a custom 2-fold cross-validation, :py:class:`skcosmo.linear_model.RidgeRegression2FoldCV`, is used to ensure both the generalization of the transformation and the efficiency of the computation, since this is a multi-target regression problem.
These measures were applied to compare different featurizations across hyperparameters and induced metrics and kernels in [Goscinski2021]_. A usage sketch follows the list below.

* :ref:`GRE-api` (GRE) computes the amount of linearly-decodable information recovered through a global linear reconstruction.
* :ref:`GRD-api` (GRD) computes the amount of distortion contained in a global linear reconstruction.
* :ref:`LRE-api` (LRE) computes the amount of decodable information recovered through a local linear reconstruction for the k-nearest neighborhood of each sample.
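
As an illustrative sketch (the two-matrix call signature is an assumption based on the :ref:`GRE-api`
documentation), comparing two featurizations of the same 100 samples:

.. code-block:: python

    import numpy as np
    from skcosmo.metrics import global_reconstruction_error

    X = np.random.rand(100, 30)   # feature space F
    Y = np.random.rand(100, 10)   # feature space F'

    # Values near 0 mean F' is linearly decodable from F;
    # values near 1 mean the two feature spaces are essentially unrelated.
    gre = global_reconstruction_error(X, Y)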

- Principal Covariates Regression

* PCovR: the standard Principal Covariates Regression [deJong1992]_. It utilises a combination of a PCA-like and an LR-like loss, and therefore attempts to find a low-dimensional projection of the feature vectors that simultaneously minimises information loss and the error in predicting the target properties using only the latent-space vectors :math:`\mathbf{T}` (see :ref:`PCovR-api`). A usage sketch follows this list.
* Kernel Principal Covariates Regression (KPCovR): a kernel-based variation on the original PCovR method, proposed in [Helfrecht2020]_ (see :ref:`KPCovR-api`).
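
As a sketch of the PCovR interface (the ``mixing`` and ``n_components`` arguments follow the sklearn-style
API described in :ref:`PCovR-api`; treat the exact names here as assumptions):

.. code-block:: python

    import numpy as np
    from skcosmo.decomposition import PCovR

    X = np.random.rand(100, 30)   # features (should be centered and scaled in practice)
    y = np.random.rand(100)       # target property

    pcovr = PCovR(mixing=0.5, n_components=2)   # mixing balances the PCA-like and regression-like losses
    pcovr.fit(X, y)
    T = pcovr.transform(X)        # latent-space projection, shape (100, 2)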

If you would like to contribute to scikit-COSMO, check out our :ref:`contributing` page!
5 changes: 5 additions & 0 deletions docs/source/selection.rst
@@ -112,6 +112,7 @@ They are instantiated using
Xr = selector.transform(X)
.. _PCov-CUR-api:

PCov-CUR
########
@@ -204,6 +205,8 @@ These selectors can be instantiated using
Xr = selector.transform(X)
.. _PCov-FPS-api:

PCov-FPS
########
PCov-FPS extends upon FPS much like PCov-CUR does to CUR. Instead of using the
@@ -247,6 +250,8 @@ be instantiated using
Xr = selector.transform(X)
.. _Voronoi-FPS-api:

Voronoi FPS
###########

1 change: 1 addition & 0 deletions docs/source/tutorials.rst
@@ -29,4 +29,5 @@ check out the pedagogic notebooks in our companion project `kernel-tutorials <ht
:Caption: Feature Reconstruction Measures

read-only-examples/PlotGFRE
read-only-examples/PlotPointwiseGFRE.ipynb
read-only-examples/PlotLFRE
6 changes: 3 additions & 3 deletions setup.cfg
@@ -21,6 +21,6 @@ classifiers = Development Status :: 3 - Alpha
include_package_data = True
zip_safe = True
packages = find:
install_requires = numpy
scikit-learn>="0.24.0"

install_requires =
numpy
scikit-learn>=0.24.0
5 changes: 2 additions & 3 deletions skcosmo/linear_model/_ridge.py
@@ -26,9 +26,8 @@ class RidgeRegression2FoldCV(MultiOutputMixin, RegressorMixin):
and in general more accurate; see issue #40. However, it is constrained to an SVD
solver for the matrix inversion.
It offers additional functionalities in comparison to :obj:`sklearn.linear_model.Ridge`:
The regularization parameters can be chosen to be relative to the largest eigenvalue
of the inverted matrix, and a cutoff regularization method is offered, which is explained
in detail under `Parameters`.
The regularization parameters can be chosen relative to the largest eigenvalue of the feature matrix,
as can the regularization method. Details are explained in the `Parameters` section.
Parameters
----------
