03_analysis-har.qmd

---
image: figs/chapter3/chap3.png
description: Analysis on how the amount of training data affects the classification performance on several models and data sources.
listing: 
  - id: contents
    type: grid
    contents: 
        - "03.1_training-data.ipynb"
        - "03.2_data-sources.ipynb"
        - "03.3_models.ipynb"
    sort: false
    grid-columns: 3
filters:
  - add-code-files
format:
  html:
    code-links:
      - text: 01_data-processing.py
        icon: file-code
        href: https://github.com/matey97/thesis/blob/main/libs/chapter3/pipeline/01_data-processing.py
        target: blank
      - text: 02_hyperparameter-optimization.py
        icon: file-code
        href: https://github.com/matey97/thesis/blob/main/libs/chapter3/pipeline/02_hyperparameter-optimization.py
        target: blank
      - text: 03_incremental-loso.py
        icon: file-code
        href: https://github.com/matey97/thesis/blob/main/libs/chapter3/pipeline/03_incremental-loso.py
        target: blank
    other-links:  
      - text: CIARP paper
        icon: journal-bookmark-fill
        href: https://doi.org/10.1007/978-3-031-49018-7_28
        target: blank
---

# Multidimensional Analysis of ML and DL on HAR

The ML and DL techniques are powerful tools to solve a wide variety of problems and develop applications. The usage of these tools requires bearing in mind several dimensions for the success of the analysis. 

The first dimension is the amount of data required to train the ML or DL model. Researchers usually agree that "the more data, the better", however, the data collection procedures are time- and resource-expensive. Researchers hereby face the challenge of deciding how much data they will need to meet the requirements of their study, i.e., to not collect less or more data than required. 

The second dimension involves the type of data that will be employed, i.e., data source. Researchers need to determine which device among all the possibilities they are going to use and how they are going to use it (e.g., sensor placement) in their research.

The third dimension is the choice of a ML or DL model. This choice can be a paramount step in a research, since some models could perform significantly better or worse than others depending on the nature and quantity of training data, leading to the success or failure of the research. 

In this chapter, we analyse these dimensions using the dataset and the variety of \gls{ml} and \gls{dl} methods described in the previous section. The aim is, in the context of HAR, 1) to determine how the amount of data impacts the performance of the models, while also investigating 2) which data source from the dataset (i.e., smartphone, smartwatch or fused) yields the best results across models and 3) which model provides the best results given a specific data source.

::: {.callout-note}
The contents on this section correspond with the Chapter 3 of the dissertation document and constitute an extension of the work "Analysis and Impact of Training Set Size in Cross-Subject Human Activity Recognition" [@matey2024analysis] presented in the $27^{th}$ Iberoamerican Congress on Pattern Recognition (CIARP).
:::

## Methodology

The collected dataset in @sec-dataset and the ML and DL techniques described in @sec-ml_methods have been used to carry out the specified analysis.  The following sections describe specific procedures for data preparation, the process to optimize the hyperparameters of the selected models, and the procedure to evaluate the impact of the amount of training data.

### Data preparation

#### Min-Max scaling
A Min-Max scaling was applied to the smartphone and smartwatch data to rescale the data into the $[-1, 1]$ range. This rescaling is defined as: 
$$
    v' = \frac{v - min}{max - min}*(new\_max-new\_min) + new\_min
$$

#### Data fusion
A fused dataset was generated to evaluate the impact of the training set size when using smartphone and smartwatch data together. Due to variable sampling rates ($102$ Hz and $104$ Hz for smartphone and smartwatch), the following procedure (depicted in @fig-fusion) was employed to fusion both data sources:

- Since the data collection is started and finalized in each device independently, exceeding samples at the beginning and end of the execution are removed (red dots in @fig-fusion, step 1).
- Samples are grouped in batches of $1$ second.
- In each batch, the $i^{th}$ smartphone sample is matched with the $i^{th}$ smartwatch sample. A sample is discarded if it cannot be matched with another sample (red dots in @fig-fusion, step 3).

![Data fusion procedure](figs/chapter3/fusion.png){#fig-fusion .lightbox}

#### Data windowing
The smartphone, smartwatch and fused dataset were split using the sliding window technique. A window size of $50$ samples was used, corresponding to approximately $0.5$ seconds with a $100$ Hz sampling rate, and an overlap of $50$ %. These values have been chosen since they are proven to obtain successful results in HAR [@sansano2020study;@jaen2022effects]. @tbl-windows contains the resulting windowns in each datasets, ready to be used by the CNN, LSTM and CNN-LSTM models.

| Dataset | SEATED | STANDING_UP | WALKING | TURNING | SITTING_DOWN | Total |
|---------|:------:|:-----------:|:-------:|:-------:|:------------:|------:|
| **Smartphone** | 1033 | 1081 | 4606 | 2087 | 1235 | 10042 |
| **Smartwatch** | 998  | 1105 | 4691 | 2123 | 1253 | 10170 |
| **Fused**      | 871  | 1083 | 4458 | 1887 | 1066 | 9365  |
: Number of resulting data windows in each dataset {#tbl-windows .striped }


#### Feature extraction
A feature extraction process is executed to use the collected data in MLP models. The extracted features can be classified in **mathematical/statistical** or **angular** features:

- **Mathematical/Statistical** features are extracted by applying mathematical/statistical functions to the collected data. Although simple, these features are beneficial for discriminating between static and dynamic activities, postures, etc. [@figo2010preprocessing]. The extracted features are the **mean**, **median**, **maximum**, **minimum**, **sd**, **range** and **rms**. Each feature is extracted from each sensor (i.e., accelerometer and gyroscope) and axis (i.e., *x*, *y* and *z*).
- **Angular** features are useful to determine orientation changes of the devices and in combination with the mathematical/statistical features improve the classification accuracy of the models [@coskun2015]. The angular features extracted are the **Pitch**, the **Roll** and the **Rotational angle** for each axis.

For the smartphone and smartwatch datasets, a total of $47$ (i.e., $7*6+1+1+3$) features were extracted from each window in each dataset. In the case of the fused dataset, $94$ (i.e., $47*2$) features were extracted from each window.

::: {.callout-note}
The script employed to execute this process is `01_data-processing.py`.

:::: {add-from=libs/chapter3/pipeline/01_data-processing.py code-fold='true' code-filename='01_data-processing.py'}
```{.python}
```
::::

:::


### Models' hyperparameters optimization

Before training the models, their hyperparameters have to be selected, i.e., the models have to be tuned. The selected options for the hyperparameters, which have been chosen based on other works using these models, are the following:

- Number of units (all models): $128$, $256$ and $512$.
- Number of filters (CNN and CNN-LSTM): $32$, $64$ and $128$.
- LSTM cells (LSTM and CNN-LSTM): $32$, $64$, $128$.

Other hyperparamters (e.g., learning rate, filter sizes, number of layers) were selected based on previous experience.

The best hyperparameters were obtained using the **Grid Search** technique, where every possible combination of hyperparameters is evaluated. The process was configured to train and evaluate each combination five times using the Adam optimizer during $50$ epochs with a batch size of $64$ windows. To reduce the computational cost of the optimization, the process was carried out in two phases: 1) optimization of layers and learning hyperparameters and 2) optimization of the number of layers. [@tbl-phase2-hyperparameters] shows the best combination of hyperparameters per model and dataset.

{{< embed 03.1.1_grid-search.ipynb#tbl-phase2-hyperparameters >}}

::: {.callout-note}
The script employed to execute this process is `02_hyperparameter-optimization.py`.

:::: {add-from=libs/chapter3/pipeline/02_hyperparameter-optimization.py code-fold='true' code-filename='02_hyperparameter-optimization.py'}
```{.python}
```
::::

:::


### Evaluation procedure

#### Incremental Leaving-One-Subject-Out (ILOSO)
As pointed out by @gholamiangonabadi2020deep, the right approach to evaluate HAR systems is the LOSO (cross-subject evaluation) strategy. To study the effect of the amount of training data on the performance of models, we employ an ILOSO.

In the ILOSO strategy, given $S$ subjects, each $s$ subject is individually considered as a test subject. Then, $n$ subjects are randomly selected as training subjects from $S\setminus s$, where $n \in [1,\ldots,|S|-1]$. The data from training subjects are employed to train a model and the data from the test subject is used to evaluate it. This procedure is repeated $R$ times for every value of $n$ with different random initialization of the model. 

The described algorithm trains and evaluates $|S|*(|S|-1)*R$ models. Since we use $R=10$ and there are three datasets and four types of models, a total of $60720$ combinations models are trained ($23*22*10*3*4$). 


::: {.callout-note}
The script employed to execute this process is `03_incremental-loso.py`.

:::: {add-from=libs/chapter3/pipeline/03_incremental-loso.py code-fold='true' code-filename='03_incremental-loso.py'}
```{.python}
```
::::

:::


#### Evaluation metrics
We analyse the effect of the increasing size of training data by observing the overall and activity-wise classification performance -- measured using the **accuracy** and **F1-score** metrics (defined in @sec-eval_metrics) -- of all model combinations grouped by the amount of training data ($n$), dataset and model type.

Then, based on the **accuracy** and **F1-score** metrics, descriptive statistics are computed for each value of $n$ (i.e., the amount of training subjects), type of models and datasets to determine the effect of the increase in the training data. Finally, MWU are executed to find significant differences between the amount of data employed and the model performance, and KWH are executed to determine the best performant models and datasets (see @sec-stats_methods).

## Results

:::{#contents}
:::