docs: update to version 2.2.0

f-aguzzi committed Jun 7, 2024
1 parent f0d0d74 commit ebff95d
Showing 53 changed files with 1,946 additions and 8 deletions.
2 changes: 1 addition & 1 deletion docs/cookbook/case-study-classifier.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 3
+sidebar_position: 4
 ---
 
 # Case study: training a classifier from lab data
2 changes: 1 addition & 1 deletion docs/cookbook/case-study-hybrid.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 4
+sidebar_position: 5
 ---
 
 # Case study: hybrid workflow
2 changes: 1 addition & 1 deletion docs/cookbook/case-study-realtime.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 5
+sidebar_position: 6
 ---
 
 # Case study: real-time data classification
238 changes: 238 additions & 0 deletions docs/cookbook/data-operations.md
@@ -0,0 +1,238 @@
---
sidebar_position: 3
---

# 2. Data operations: import, export and data fusion

`ChemFuseKit` can process datasets in many different ways.

For now, let's take a look at basic input/output operations on datasets. As stated in the previous chapter of this cookbook, every loaded dataset is contained in a `BaseDataModel` object or in one of its derived classes:

```mermaid
classDiagram
    class BaseDataModel {
        +x_data: DataFrame
        +x_train: DataFrame
        +y: ndarray
        __init__(x_data, x_train, y)
    }
    class LLDFDataModel {
        ...
        __init__(...)
    }
    class PCADataModel {
        +array_scores: ndarray
        +components: int
        __init__(..., array_scores)
    }
    BaseDataModel <|-- LLDFDataModel
    BaseDataModel <|-- PCADataModel
```
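
Because `LLDFDataModel` and `PCADataModel` both derive from `BaseDataModel`, any code written against the base class also accepts the derived models. Here is a minimal sketch (the `describe` helper is hypothetical, not part of the library):

```python
from chemfusekit.__base import BaseDataModel

# Hypothetical helper: anything written against BaseDataModel also accepts
# LLDFDataModel and PCADataModel, since both derive from it.
def describe(model: BaseDataModel) -> None:
    print(type(model).__name__, model.x_data.shape, model.y.shape)
```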

With the help of these classes, we can perform three fundamental data operations:

1. dataset loading
2. data fusion
3. dataset saving

## Dataset loading

`ChemFuseKit` can import Excel tables into `BaseDataModel` and its derived classes. All of them can import a single datasheet, while only the `LLDF` module (the name stands for *Low-Level Data Fusion*) can perform data fusion, importing multiple tables at once into a single, unified dataset:

```mermaid
flowchart TD
    A[Import data] --> B{How many tables?}
    B --> |One| C[BaseDataModel\nand its\nderived classes]
    B --> |Many| D[LLDF only]
```

Let's say we have a single table, stored in a file called `spectrometer_data.xlsx`, and we want to load it into our project so that we can later feed it to one of our classifiers.

This is the schema of the table, called `Spectral samples`, within `spectrometer_data.xlsx`:

| Sample number | Class | 8 | 8.1 | 8.2 | 8.3 | ... | 9.9 | 10 |
|---------------|-----------------|-------|-------|-------|--------|-----|--------|--------|
| 1 | Dichloromethane | 2.341 | 3.866 | 1.430 | 5.843 | | 0.032 | 1.128 |
| 2 | N-hexane | 5.745 | 8.346 | 2.985 | 6.842 | | 1.832 | 3.543 |
| 3 | Dioxane | 0.003 | 0.002 | 0.006 | 0.0013 | | 11.483 | 10.445 |
| ... | | | | | | | | |

As we can observe, there's an index column called *Sample number*, a class column called *Class*, and then a number of columns containing the spectral responses of the samples over the 8 nm to 10 nm range.

<br />

Now, let's import this table into a `BaseDataModel`:

```python
from chemfusekit.__base import BaseDataModel

data = BaseDataModel.load_from_file(
    import_path='spectrometer_data.xlsx',
    sheet_name='Spectral samples',
    class_column='Class',
    index_column='Sample number'
)
```

Now our `data` variable is loaded, and, as an instance of `BaseDataModel`, it contains three fields:

- `x_data`
- `x_train`
- `y`

<br />

`x_data` is a Pandas DataFrame with the following content:

| 8 | 8.1 | 8.2 | 8.3 | ... | 9.9 | 10 |
|-------|-------|-------|--------|-----|--------|--------|
| 2.341 | 3.866 | 1.430 | 5.843 | | 0.032 | 1.128 |
| 5.745 | 8.346 | 2.985 | 6.842 | | 1.832 | 3.543 |
| 0.003 | 0.002 | 0.006 | 0.0013 | | 11.483 | 10.445 |
| ... | | | | | | |

As we can see, it only contains the spectral data.

<br />

`x_train`, also a Pandas DataFrame, contains both the classes (with the column header renamed to *Substance*) and the spectral data:

| Substance | 8 | 8.1 | 8.2 | 8.3 | ... | 9.9 | 10 |
|-----------------|-------|-------|-------|--------|-----|--------|--------|
| Dichloromethane | 2.341 | 3.866 | 1.430 | 5.843 | | 0.032 | 1.128 |
| N-hexane | 5.745 | 8.346 | 2.985 | 6.842 | | 1.832 | 3.543 |
| Dioxane | 0.003 | 0.002 | 0.006 | 0.0013 | | 11.483 | 10.445 |
| ... | | | | | | | |

<br />

`y`, a NumPy ndarray, only contains the classes (with the column header renamed to *Substance*):

| Substance |
|-----------------|
| Dichloromethane |
| N-hexane |
| Dioxane |
| ... |
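
A quick way to check what was loaded is to inspect these three fields directly. The following snippet is illustrative, reusing the `data` variable from above:

```python
print(data.x_data.shape)     # (number of samples, number of spectral channels)
print(data.x_train.head())   # class column plus spectral data
print(data.y[:3])            # first few class labels
```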



## Data fusion

Let's build on the previous example. Our file, `spectrometer_data.xlsx`, also contains a second sheet, called `Gas Chromatography samples`, and we want to import it alongside our previous `Spectral samples` table.

This is the schema of the second table:

| Sample | class | Retention time |
|---------|-----------------|----------------|
| 1 | Dichloromethane | 123.78 |
| 2 | N-hexane | 44.19 |
| 3 | Dioxane | 22.34 |
| ... | | |

Even though the header names are slightly different, the content of the first two columns corresponds to the first two columns of the previous table. The third column contains gas chromatography retention times in milliseconds.

<br />

The `LLDF` module allows us to join these two tables (this one and the one from the previous example) to form a single dataset that contains both spectral data and retention times. Let's see how.

```python
from chemfusekit.lldf import LLDFSettings, LLDF, GraphMode, Table

settings = LLDFSettings()  # Initialize the default settings

# Set up the import settings for the first table (spectral data)
table1 = Table(
    file_path='spectrometer_data.xlsx',
    sheet_name='Spectral samples',
    preprocessing='snv',
    class_column='Class',
    index_column='Sample number'
)

# Set up the import settings for the second table (chromatography data)
table2 = Table(
    file_path='spectrometer_data.xlsx',
    sheet_name='Gas Chromatography samples',
    preprocessing='none',
    class_column='class',
    index_column='Sample'
)

# Now, let's make an array of the two tables
tables = [table1, table2]

# Let's pass the settings and the tables to the LLDF constructor
lldf = LLDF(settings, tables)

# Let's finally perform data fusion with the lldf() method!
lldf.lldf()
```

Once these operations are complete, we can find our fused data inside the `fused_data` property of our low-level data fusion object:

```python
lldf.fused_data
```

The `fused_data` field is of class `LLDFDataModel`, which is derived from `BaseDataModel`, and contains the same fields (`x_data`, `x_train`, `y`).

<br />

This is the content of `x_data` (a Pandas DataFrame):

| 8 | 8.1 | 8.2 | 8.3 | ... | 9.9 | 10 |Retention time |
|-------|-------|-------|--------|-----|--------|--------|---------------|
| 2.341 | 3.866 | 1.430 | 5.843 | | 0.032 | 1.128 |123.78 |
| 5.745 | 8.346 | 2.985 | 6.842 | | 1.832 | 3.543 |44.19 |
| 0.003 | 0.002 | 0.006 | 0.0013 | | 11.483 | 10.445 |22.34 |
| ... | | | | | | | |

<br />

This is `x_train` (a Pandas DataFrame):

| Substance | 8 | 8.1 | 8.2 | 8.3 | ... | 9.9 | 10 |Retention time |
|-----------------|-------|-------|-------|--------|-----|--------|--------|---------------|
| Dichloromethane | 2.341 | 3.866 | 1.430 | 5.843 | | 0.032 | 1.128 |123.78 |
| N-hexane | 5.745 | 8.346 | 2.985 | 6.842 | | 1.832 | 3.543 |44.19 |
| Dioxane | 0.003 | 0.002 | 0.006 | 0.0013 | | 11.483 | 10.445 |22.34 |
| ... | ... | | | | | | | |

<br />

This is the content of `y` (a NumPy ndarray):

| Substance |
|-----------------|
| Dichloromethane |
| N-hexane |
| Dioxane |
| ... |
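
To verify that the fusion worked as expected, we can compare the fused table with the original spectral-only one: `x_data` should have gained exactly one column, the retention time. A small illustrative check, reusing the `data` and `lldf` variables from the previous examples:

```python
fused = lldf.fused_data

print(data.x_data.shape)   # spectral channels only
print(fused.x_data.shape)  # spectral channels plus the retention time column
```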


## Dataset export

`BaseDataModel` and its derived classes have an `export_to_file` method that exports the complete table (class names and data columns) to an Excel file.

Let's say we want to export the fused dataset from the previous example into a file called `fused dataset.xlsx`. Here's how to do it, using the `lldf` variable (the instance of the `LLDF` class with which we joined the two tables):

```python
lldf.export_to_file(export_path='fused dataset.xlsx', sheet_name="Sheet 1")
```

Et voilà! Now we have a new file called `fused dataset.xlsx`, containing a sheet called "Sheet 1" with the following content:

| Substance | 8 | 8.1 | 8.2 | 8.3 | ... | 9.9 | 10 |Retention time |
|-----------------|-------|-------|-------|--------|-----|--------|--------|---------------|
| Dichloromethane | 2.341 | 3.866 | 1.430 | 5.843 | | 0.032 | 1.128 |123.78 |
| N-hexane | 5.745 | 8.346 | 2.985 | 6.842 | | 1.832 | 3.543 |44.19 |
| Dioxane | 0.003 | 0.002 | 0.006 | 0.0013 | | 11.483 | 10.445 |22.34 |
| ... | ... | | | | | | | |

<br />
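
Since `export_to_file` is defined on `BaseDataModel` itself, the same call also works on the plain data model we loaded at the start of this chapter, for example to save the spectral-only dataset (the file name here is just an illustration):

```python
data.export_to_file(export_path='spectral dataset.xlsx', sheet_name="Sheet 1")
```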

With this, you now know all the basics of data handling within `ChemFuseKit`.
2 changes: 1 addition & 1 deletion docs/cookbook/structure.md
@@ -2,7 +2,7 @@
 sidebar_position: 2
 ---
 
-# Project structure
+# 1. Project structure
 
 In this cookbook page, you will be shown how the project is structured, and the purpose of each module.
@@ -0,0 +1,9 @@
---
sidebar_position: 4
---

# Case study: training a classifier from lab data

:::note
This case study is still **under construction**.
:::
@@ -0,0 +1,9 @@
---
sidebar_position: 5
---

# Case study: hybrid workflow

:::note
This case study is still **under construction**.
:::
@@ -0,0 +1,9 @@
---
sidebar_position: 6
---

# Case study: real-time data classification

:::note
This case study is still **under construction**.
:::