---
title: "Readme for DLCAnalyzer"
output:
  html_document:
    keep_md: true
    theme: united
    toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
fig.path = "README_figs/README-"
)
```
## DLCAnalyzer
DLCAnalyzer is a code collection that allows loading and processing of DeepLabCut (DLC) .csv files (https://github.com/DeepLabCut/DeepLabCut). It can be used for simple analyses such as zone visits, distance moved, etc., but can also be integrated with supervised machine learning and unsupervised clustering methods to extract complex behaviors from point data. DLCAnalyzer is intended for interactive use in R. The following document highlights the most common types of analyses that can be performed with this collection of functions. The function glossary lists all main and auxiliary functions with a short description of their action. This code package has been written for a publication that is currently in the revision process. A pre-print version can be found at: https://www.biorxiv.org/content/10.1101/2020.01.21.913624v1
## Getting started
This collection of code is not available as a package, since certain dependencies rely on installations that are independent of R. To ensure smooth operation, please follow the steps described here.
The following libraries are used by this package (with information about the tested versions) and should be installed and loaded before executing any commands:
```{r results='hide', message=FALSE, warning=FALSE}
library(sp) #tested with v1.3-2
library(imputeTS) #tested with v3.0
library(ggplot2) #tested with v3.1.0
library(ggmap) #tested with v3.0.0
library(data.table) #tested with v1.12.8
library(cowplot) #tested with v0.9.4
library(corrplot) #tested with v0.84
library(keras) #REQUIRES TENSORFLOW INSTALL. tested with v2.2.5.0
```
Additionally, this package requires a working installation of tensorflow for R, which itself requires a working installation of Anaconda. Please follow the installation protocol in the following link for all steps:
https://tensorflow.rstudio.com/installation/
While tensorflow can be a bit harder to install, it is only required for the machine learning functions. Each section of this document indicates when tensorflow is required; everything else works without the install.
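Once the installation is complete, a quick way to check that the R `keras` package finds a working TensorFlow backend is shown below (a minimal sketch; `install_keras()` and `is_keras_available()` are helpers from the keras R package, and the commented install line only needs to be run once):
```{r, eval=FALSE}
library(keras)
# install_keras()     # one-time setup: installs TensorFlow into a conda/virtual environment
is_keras_available()  # returns TRUE if a working TensorFlow backend was found
```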
## Loading and processing a single file
Download the contents of this repository and keep the folder structure unchanged. First, set your working directory to this folder and then make sure that the file containing all the code is sourced:
```{r, include=FALSE}
#source('R/DLCAnalyzer_Functions_final.R')
source("R/DLCAnalyzer_Functions_final.R")
```
```{r, eval=FALSE}
setwd("PathToDLCFolder")
source('R/DLCAnalyzer_Functions_final.R')
```
To load a DLC .csv file (here an example open field test (OFT) tracking), insert the path of the file (relative to the working directory):
```{r}
Tracking <- ReadDLCDataFromCSV(file = "example/OFT/DLC_Data/OFT_3.csv", fps = 25)
```
This command loads the DLC data and organizes it in an object that allows easy access and manipulation. It is crucial to set the correct frames per second (fps), otherwise many downstream metrics will be distorted. If you do not set fps, it defaults to 1 frame per second!
Let's inspect the contents of the tracking object:
```{r}
names(Tracking$data)
```
As we can see, the object contains a sub-object `$data` that holds one further sub-object per tracked point. The points keep the names they had in the DLC network that was used for tracking. Since these names depend heavily on each user, most of the DLCAnalyzer code is compatible with custom point names. Wherever functions were written for our specific network, we indicate this in the document.
Let's have a look at the point `bodycentre`:
```{r}
head(Tracking$data$bodycentre)
```
As we can see, the information from DLC (frame, x, y and likelihood) of the point `bodycentre` can be accessed easily.
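For instance, individual columns can be used directly in base R (a small sketch using the columns shown by `head()` above):
```{r, eval=FALSE}
# Mean DLC likelihood and x-range of the bodycentre point
mean(Tracking$data$bodycentre$likelihood)
range(Tracking$data$bodycentre$x)
```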
We can also plot one or multiple points using:
```{r}
PlotPointData(Tracking, points = c("nose","bodycentre","tailbase","neck"))
```
As we can see, the tracking was not perfect and all points had occasional tracking problems. If we want to remove and interpolate outliers based on low DLC likelihood, we can do this by:
```{r}
Tracking <- CleanTrackingData(Tracking, likelihoodcutoff = 0.95)
PlotPointData(Tracking, points = c("nose","bodycentre","tailbase","neck"))
```
This already looks better, but there are still some points that were tracked incorrectly with high likelihood. When inspecting these OFT videos it becomes apparent that the video keeps running after the trial is over, which leads to mislabeling when the light in the chamber turns on and the mouse is picked up. Let's remove the last 250 frames (= 10 seconds) and see if this solves the issue:
```{r}
Tracking <- CutTrackingData(Tracking, end = 250)
PlotPointData(Tracking, points = c("nose","bodycentre","tailbase","neck"))
```
As we can see, this sufficiently removed the artefacts from the data.
Next, we want to calibrate our tracking data to transform it from pixel coordinates into metric coordinates. In this case we measured the physical size of the area in cm (42 x 42), which is spanned by the points `tl`, `tr`, `br` and `bl`. For this, we use:
```{r}
Tracking <- CalibrateTrackingData(Tracking, method = "area",in.metric = 42*42, points = c("tl","tr","br","bl"))
Tracking$px.to.cm
PlotPointData(Tracking, points = c("nose","bodycentre","tailbase","neck"))
```
As you can see, the `px.to.cm` ratio was calculated and the data was automatically transformed into metric coordinates. The same process is possible with a pre-defined ratio or with a distance measurement between two points (see the sketch below).
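For example, distance-based calibration is used in the EPM and FST sections later in this document; a minimal sketch, using the 65.5 cm `tl`-`br` distance from the EPM example, looks like this:
```{r, eval=FALSE}
# Distance-based calibration: the physical distance between two tracked points is known
Tracking <- CalibrateTrackingData(Tracking, method = "distance",
                                  in.metric = 65.5, points = c("tl","br"))
```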
## OFT analysis of a single file
Here we will explore how to perform an OFT (open field test) analysis on a single file.
Let's start by loading and pre-processing the data:
```{r, results='hide'}
Tracking <- ReadDLCDataFromCSV(file = "example/OFT/DLC_Data/OFT_3.csv", fps = 25)
Tracking <- CleanTrackingData(Tracking, likelihoodcutoff = 0.95)
Tracking <- CutTrackingData(Tracking,start = 100, end = 250)
Tracking <- CalibrateTrackingData(Tracking, method = "area",in.metric = 42*42, points = c("tl","tr","br","bl"))
```
In an OFT analysis we want to quantify how much and how fast an animal moves and how much time it spends in different zones.
Let's start by defining the zones. DLCAnalyzer has a built-in function that creates a set of OFT zones based on the median tracking data from the 4 corners of the arena.
To create the OFT zones we use:
```{r}
Tracking <- AddOFTZones(Tracking, scale_center = 0.5,scale_periphery = 0.8 ,scale_corners = 0.4, points = c("tl","tr","br","bl"))
PlotZones(Tracking)
```
As we can see, the function created the OFT zones and stored them as part of the `Tracking` object.
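The created zones live under `Tracking$zones`; listing them by name is a quick way to check what is available (an optional check, not required for the analysis):
```{r, eval=FALSE}
names(Tracking$zones)
```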
Now we can resolve, for every frame, whether a body point is in a certain zone:
```{r}
PlotZoneVisits(Tracking,point = c("bodycentre","nose","tailbase"))
```
However, we might also be interested in adding new zones that are independent of any preset method. If we want to add 2 new triangular zones and only get a readout for them, we can do this with:
```{r}
Tracking <- AddZones(Tracking,z = data.frame(my.zone.1 = c("tl","tr","centre"),
my.zone.2 = c("bl","br","centre")))
PlotZones(Tracking, zones=c("my.zone.1","my.zone.2", "arena"))
PlotZoneVisits(Tracking,point = c("bodycentre","nose","tailbase"),zones = c("my.zone.1","my.zone.2"))
```
In order to get metrics such as speed or movement we use:
```{r}
Tracking <- CalculateMovement(Tracking, movement_cutoff = 5, integration_period = 5)
head(Tracking$data$bodycentre)
```
The `movement_cutoff` defines the cutoff (units/s, here cm/s) above which we consider a point to be moving. The `integration_period` is important for detecting transitions: it defines over how many frames (+- `integration_period`) transitions between zones are analyzed, and for how long an animal has to be moving to be considered doing so (at 25 fps, +- 5 frames corresponds to +- 0.2 s). This helps to remove noisy interpretations, e.g. where a point jumps over a zone border multiple times in short succession.
Now that we have calculated all metrics which are important for our analysis, we can create a density plot that encapsulates speed, time spent and position for selected points (here we only want to plot `bodycentre`, `nose` and `tailbase`):
```{r}
plots <- PlotDensityPaths(Tracking,points = c("bodycentre","nose","tailbase"))
plots$bodycentre
plots$nose
```
We can add our zones to the plot:
```{r}
plots <- AddZonesToPlots(plots,Tracking$zones)
plots$bodycentre
```
We are interested in what happens in the center over the whole recording. For this, we can generate a zone report:
```{r}
Report <- ZoneReport(Tracking, point = "bodycentre", zones = "center")
t(data.frame(Report))
```
We can also generate a report based on a combination of zones. In the following example we create a report for whenever the `bodycentre` is neither in `my.zone.1` nor in `my.zone.2` (by setting `invert = TRUE` we look at the inverse of the combined zone):
```{r}
Report <- ZoneReport(Tracking, point = "bodycentre", zones = c("my.zone.1","my.zone.2"), zone.name = "OutsideMyZones", invert = TRUE)
t(data.frame(Report))
```
Now we get a report for what happens when the animal is neither in `my.zone.1` nor in `my.zone.2`. To quickly check whether the zone selection was set correctly, we can use:
```{r}
PlotZoneSelection(Tracking, point = "bodycentre", zones = c("my.zone.1","my.zone.2"), invert = TRUE)
```
For an easy, preset OFT analysis of one or multiple points, you can also use the command:
```{r, warning=FALSE}
Tracking <- OFTAnalysis(Tracking, points = "bodycentre" ,movement_cutoff = 5, integration_period = 5)
t(data.frame(Tracking$Report))
```
## OFT analysis of multiple files
In order to run multiple files the best practice is to first define a pipeline of commands, and then execute it for all files in a folder.
First, we define the path to our folder of interest (`input_folder`) and list all files:
```{r, warning=FALSE}
input_folder <- "example/OFT/DLC_Data/"
files <- list.files(input_folder)
files
```
Now, we define our processing pipeline. To check that it runs appropriately, it is advisable to first run it interactively for a single file, as shown after the pipeline definition below.
```{r, warning=FALSE}
pipeline <- function(path){
Tracking <- ReadDLCDataFromCSV(path, fps = 25)
Tracking <- CutTrackingData(Tracking,start = 100, end = 300)
Tracking <- CalibrateTrackingData(Tracking, "area",in.metric = 42*42, points = c("tr","tl","bl","br"))
Tracking <- CleanTrackingData(Tracking, likelihoodcutoff = 0.95)
Tracking <- AddOFTZones(Tracking, scale_center = 0.5,scale_periphery = 0.8 ,scale_corners = 0.4)
Tracking <- OFTAnalysis(Tracking, movement_cutoff = 5, integration_period = 5, points = "bodycentre")
return(Tracking)
}
```
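Such an interactive check on a single file could look like this (a small sketch; `files[1]` is simply the first file found above, and the plot is just one way of eyeballing the result):
```{r, eval=FALSE}
Tracking <- pipeline(paste(input_folder, files[1], sep = ""))
PlotZoneVisits(Tracking, point = "bodycentre")
```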
Now that the pipeline is defined, we can execute it for all files and combine them into a list of Tracking objects:
```{r, results='hide', message=FALSE, warning=FALSE}
TrackingAll <- RunPipeline(files,input_folder,FUN = pipeline)
```
From the list we can access individual results with the `$` operator. For example, if we want to plot the zone visits for the first file we can simply use:
```{r}
Tracking <- TrackingAll$`OFT_1.csv`
PlotZoneVisits(Tracking, point = "bodycentre")
```
To get a combined report of all files we can use the following command (here, we only display the first 6 columns):
```{r}
Report <- MultiFileReport(TrackingAll)
Report[,1:6]
```
Additionally, we can create PDF files with multiplots for all analyses. They will appear in the working directory:
```{r, eval=FALSE}
PlotDensityPaths.Multi.PDF(TrackingAll,points = c("bodycentre"), add_zones = TRUE)
PlotZoneVisits.Multi.PDF(TrackingAll,points = c("bodycentre","nose","tailbase"))
```
## EPM analysis of a single file
Here we will describe how an EPM (elevated plus maze) analysis for a single file can be performed using DLCAnalyzer.
Similar to previous sections, we start by loading and cleaning up our data.
```{r}
Tracking <- ReadDLCDataFromCSV(file = "example/EPM/DLC_Data/EPM_1.csv", fps = 25)
names(Tracking$data)
```
As we can see, the EPM network contains additional/different points compared to the OFT network. Many of these additional points are used to track the maze, which allows automatic reconstruction of the zones for each file.
```{r}
Tracking <- CutTrackingData(Tracking,start = 100, end = 250)
Tracking <- CalibrateTrackingData(Tracking, method = "distance",in.metric = 65.5, points = c("tl","br"))
Tracking <- CleanTrackingData(Tracking, likelihoodcutoff = 0.95)
PlotPointData(Tracking,points = c("nose","bodycentre","tailbase","neck"))
```
As we can see, the data still looks messy even after the likelihood cutoff; we will get back to that with a further trick. For the EPM analysis we also need a number of zones that describe the maze (open arms, closed arms, etc.). Rather than constructing them all independently, there is the option to use a template file that describes the zones and adds them to the object automatically.
First, we load the template .csv file:
```{r}
zoneinfo <- read.table("example/EPM/EPM_zoneinfo.csv", sep = ";", header = T)
zoneinfo
```
As we can see, this file contains a number of columns that each hold a list of point names from our DLC network. Each column of points describes the zone that these points span. We can add this info to the tracking data:
```{r}
Tracking <- AddZones(Tracking,zoneinfo)
PlotZones(Tracking)
```
Our zone file also contains one zone `arena` that describes the whole area:
```{r}
Tracking$zones$arena
```
We can use this to further clean up our data. We scale it by a factor of 1.8 to define an inclusion zone:
```{r}
inclusion.zone <- ScalePolygon(Tracking$zones$arena, factor = 1.8)
```
Every point that falls outside of this inclusion zone will be removed and interpolated:
```{r}
Tracking <- CleanTrackingData(Tracking, likelihoodcutoff = 0.95,existence.pol = inclusion.zone)
PlotPointData(Tracking,points = c("nose","bodycentre","tailbase","neck"))
```
We can now perform an EPM analysis. This will record time in zones and other metrics. If `nosedips` is enabled, nose dips will be detected automatically. This only works if the correctly named points and zones are present in the DLC network and zoneinfo file; otherwise the nose dip analysis is omitted and a warning is reported.
```{r}
Tracking <- EPMAnalysis(Tracking, movement_cutoff = 5,integration_period = 5,points = "bodycentre", nosedips = TRUE)
t(data.frame(Tracking$Report[1:6]))
```
We can create a time-resolved plot of all nose dips. For this, we use:
```{r}
PlotLabels(Tracking)
```
## EPM analysis of multiple files
Here we will cover an EPM (elevated plus maze) analysis for multiple files.
Similar to the OFT multi-file analysis, we first define a pipeline and then execute it for all EPM files:
```{r, warning=FALSE}
input_folder <- "example/EPM/DLC_Data/"
files <- list.files(input_folder)
files
```
```{r, warning=FALSE,results='hide'}
pipeline <- function(path){
Tracking <- ReadDLCDataFromCSV(file = path, fps = 25)
Tracking <- CutTrackingData(Tracking,start = 100, end = 250)
Tracking <- CalibrateTrackingData(Tracking, method = "distance",in.metric = 65.5, points = c("tl","br"))
zoneinfo <- read.table("example/EPM/EPM_zoneinfo.csv", sep = ";", header = T)
Tracking <- AddZones(Tracking,zoneinfo)
Tracking <- CleanTrackingData(Tracking, likelihoodcutoff = 0.95,existence.pol = ScalePolygon(Tracking$zones$arena, 1.8))
Tracking <- EPMAnalysis(Tracking, movement_cutoff = 5,integration_period = 5,points = "bodycentre", nosedips = TRUE)
return(Tracking)
}
TrackingAll <- RunPipeline(files,input_folder,FUN = pipeline)
```
Again, we can create a report for all files using the following command (here we only show the first 6 columns):
```{r, warning=FALSE}
Report <- MultiFileReport(TrackingAll)
Report[,1:6]
```
We can see that animal 3 has very few nose dips, whereas animal 4 seems to have a lot. We can quickly compare both with the following commands to create overview plots for them:
```{r figures-side, fig.asp = 1.5, fig.show="hold", out.width="50%"}
OverviewPlot(TrackingAll$EPM_4.csv,"bodycentre")
OverviewPlot(TrackingAll$EPM_3.csv,"bodycentre")
```
Indeed, it becomes apparent that animal 3 was hiding in the bottom closed arm for most of the test, whereas animal 4 showed a rich explorative behavior.
Additionally, we can create PDF files with multiplots for all analyses. They will appear in your working directory.
```{r, eval=FALSE}
PlotDensityPaths.Multi.PDF(TrackingAll,points = c("bodycentre"), add_zones = TRUE)
PlotZoneVisits.Multi.PDF(TrackingAll,points = c("bodycentre","nose","tailbase"))
PlotLabels.Multi.PDF(TrackingAll)
```
## FST analysis of a single file
Here we will describe how an FST (forced swim test) analysis for a single file can be performed.
Similar to previous sections, we start by loading and cleaning up our data:
```{r,results='hide'}
Tracking <- ReadDLCDataFromCSV(file = "example/FST/DLC_Data/FST_1.csv", fps = 25)
Tracking <- CutTrackingData(Tracking,start = 300, end = 100)
Tracking <- CalibrateTrackingData(Tracking, "distance",in.metric = 20, points = c("t","b"))
Tracking <- CleanTrackingData(Tracking, likelihoodcutoff = 0.95)
```
We inspect the data to check its integrity:
```{r}
PlotPointData(Tracking,points = c("nose","bodycentre","tailbase","neck"))
```
As we can see, the data already looks good, so we can proceed to the analysis. However, an FST analysis first requires some further information: we have to measure the movement of all points that are part of the mouse and analyze them together in order to detect floating behavior.
If we inspect the `point.info` of our object
```{r}
Tracking$point.info
```
We can see that there is not yet any info describing which points belong to the mouse. Here, we load this info from a .csv file and add it to the object (it could also be set manually by directly changing `Tracking$point.info`; see the sketch after the next chunk):
```{r}
pointinfo <- read.table("example/FST/FST_pointinfo.csv", sep = ";", header = T)
Tracking <- AddPointInfo(Tracking, pointinfo)
Tracking$point.info
```
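A manual alternative could look like the following sketch. Note that the column names used here (`PointName`, `PointType`) are hypothetical; inspect `Tracking$point.info` first and adapt the code to the column names it actually contains:
```{r, eval=FALSE}
# Hypothetical column names -- check str(Tracking$point.info) before adapting this sketch
mouse_points <- c("nose", "neck", "bodycentre", "tailbase")
Tracking$point.info$PointType[Tracking$point.info$PointName %in% mouse_points] <- "Mouse"
```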
Now that the `Tracking` object has the additional information about point type, we can run an FST analysis. We specify a `cutoff_floating` of 0.03, for which we found the algorithm to detect floating behavior optimally. We also specify that the `Object` we want to track is of type `"Mouse"`.
```{r}
Tracking <- FSTAnalysis(Tracking,cutoff_floating = 0.03,integration_period = 10, Object = "Mouse", points = "bodycentre")
Tracking$Report
PlotLabels(Tracking)
```
As we can see, in this case the floating behavior of the mouse increased in the later part of the swim test.
## FST analysis of multiple files
Here we will describe how multiple FST (forced swim test) files can be analyzed together.
Similar to multiple OFT files, we first set the path, define a pipeline and then execute it for all FST files
```{r, warning=FALSE}
input_folder <- "example/FST/DLC_Data/"
files <- list.files(input_folder)
files[1:5]
```
We define the processing pipeline
```{r, warning=FALSE}
pipeline <- function(path){
Tracking <- ReadDLCDataFromCSV(file = path, fps = 25)
Tracking <- CutTrackingData(Tracking,start = 300, end = 100)
Tracking <- CalibrateTrackingData(Tracking, "distance",in.metric = 20, points = c("t","b"))
Tracking <- CleanTrackingData(Tracking, likelihoodcutoff = 0.95)
pointinfo <- read.table("example/FST/FST_pointinfo.csv", sep = ";", header = T)
Tracking <- AddPointInfo(Tracking, pointinfo)
Tracking <- FSTAnalysis(Tracking,cutoff_floating = 0.03,integration_period = 10, Object = "Mouse", points = "bodycentre")
return(Tracking)
}
```
We execute the pipeline for all files and combine them into a list of Tracking objects. We can then get a final report:
```{r, warning=FALSE,results='hide'}
TrackingAll <- RunPipeline(files,input_folder,FUN = pipeline)
Report <- MultiFileReport(TrackingAll)
```
```{r, warning=FALSE}
Report[1:5,1:5]
```
Additionally, we can create a PDF with all the label plots using the following command. By default it will be placed in the working directory
```{r, eval=FALSE}
PlotLabels.Multi.PDF(TrackingAll)
```
## Running a bin analysis
Often, in behavioral research readouts are required in time bins. DLCAnalyzer has a fully integrated approach that allows analyses within bins. Here, we will do a bin analysis for floating behavior to see if it increases in animals at later time-points.
```{r, warning=FALSE, results='hide'}
Tracking <- ReadDLCDataFromCSV(file = "example/FST/DLC_Data/FST_1.csv", fps = 25)
Tracking <- CutTrackingData(Tracking,start = 300, end = 100)
Tracking <- CalibrateTrackingData(Tracking, "distance",in.metric = 20, points = c("t","b"))
Tracking <- CleanTrackingData(Tracking, likelihoodcutoff = 0.95)
pointinfo <- read.table("example/FST/FST_pointinfo.csv", sep = ";", header = T)
Tracking <- AddPointInfo(Tracking, pointinfo)
```
We have not run any analysis yet, since we first need to add the information about the bins. Here we add 1-minute-long bins.
```{r, warning=FALSE}
Tracking <- AddBinData(Tracking,unit = "minute", binlength = 1)
Tracking$bins
```
Optionally, we can also load bins from a `data.frame`. This is especially interesting if we have unequally sized bins. In this example, the first and last bins are 1.5 minutes long and the intermediate bin is 3 minutes long:
```{r, warning=FALSE}
my.bins <- read.table("example/FST/FST_BinData.csv", sep = ";", header = T)
my.bins
Tracking <- AddBinData(Tracking,bindat = my.bins, unit = "minute")
Tracking$bins
```
Now, we can perform a bin analysis using:
```{r, warning=FALSE}
BinReport <- BinAnalysis(Tracking, FUN = FSTAnalysis ,cutoff_floating = 0.03,integration_period = 10, Object = "Mouse", points = "bodycentre")
BinReport[,1:5]
```
It is important to note that `BinAnalysis()` takes another function as an argument (here `FSTAnalysis`). It is therefore crucial to also pass along all the arguments that `FSTAnalysis()` requires whenever `BinAnalysis()` is used; in this case these were `cutoff_floating`, `integration_period`, `Object` and `points`. The same applies to other analysis functions such as `OFTAnalysis()` or `EPMAnalysis()` for the respective tests, which are also compatible with `BinAnalysis()` (see the sketch below).
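For example, a bin-wise OFT analysis could look like the following sketch, assuming `Tracking` is an OFT object that already has OFT zones and bin data added (the `OFTAnalysis()` arguments are the same ones used in the OFT section above):
```{r, eval=FALSE}
# Hedged sketch: bin-wise OFT analysis; Tracking must already contain OFT zones and bins
BinReportOFT <- BinAnalysis(Tracking, FUN = OFTAnalysis,
                            movement_cutoff = 5, integration_period = 5,
                            points = "bodycentre")
```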
We can do the same for multiple files (here again in our custom bins):
```{r, warning=FALSE, results='hide'}
input_folder <- "example/FST/DLC_Data/"
files <- list.files(input_folder)
pipeline <- function(path){
Tracking <- ReadDLCDataFromCSV(file = path, fps = 25)
Tracking <- CutTrackingData(Tracking,start = 300, end = 100)
Tracking <- CalibrateTrackingData(Tracking, "distance",in.metric = 20, points = c("t","b"))
Tracking <- CleanTrackingData(Tracking, likelihoodcutoff = 0.95)
pointinfo <- read.table("example/FST/FST_pointinfo.csv", sep = ";", header = T)
Tracking <- AddPointInfo(Tracking, pointinfo)
my.bins <- read.table("example/FST/FST_BinData.csv", sep = ";", header = T)
Tracking <- AddBinData(Tracking,bindat = my.bins, unit = "minute")
return(Tracking)
}
TrackingAll <- RunPipeline(files,input_folder,FUN = pipeline)
BinReportAll <- MultiFileBinanalysis(TrackingAll, FUN = FSTAnalysis ,cutoff_floating = 0.03,integration_period = 10, Object = "Mouse", points = "bodycentre")
```
We can see that the bins for each file are now combined into one `data.frame`:
```{r, warning=FALSE}
BinReportAll[1:6,1:5]
```
We can now answer our question of whether floating increases in the later bins of the test:
```{r, warning=FALSE}
ggplot(data = BinReportAll, aes(bin,percentage.floating, color = bin)) + geom_boxplot() + geom_point(position = position_jitterdodge())
```
As we can see, the time spent floating seems to increase in the later bins compared to the first bin.
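As a quick numerical complement to the plot, the mean floating percentage per bin can be computed directly from the combined report (the column names `bin` and `percentage.floating` are the ones used in the ggplot call above):
```{r, eval=FALSE}
aggregate(percentage.floating ~ bin, data = BinReportAll, FUN = mean)
```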
## Training a machine learning classifier on a single file
This section requires a working install of the `keras` library, which itself requires a working Anaconda and tensorflow install!
Here we will explore how a neural network can be trained to recognize complex behaviors. We will work with the forced swim test data (FST).
First, we will see how we can generate features from one single file, and add the labeling data from a human experimenter to it. We start with our standard pipeline to load, clean and calibrate the data:
```{r, warning=FALSE}
file <- "FST_3.csv"
path <- "example/FST/DLC_Data/"
Tracking <- ReadDLCDataFromCSV(paste(path,file,sep = ""), fps = 25)
Tracking <- CutTrackingData(Tracking,start = 300,end = 300)
Tracking <- CalibrateTrackingData(Tracking, "distance",in.metric = 20, c("t","b"))
Tracking <- CleanTrackingData(Tracking, likelihoodcutoff = 0.95)
PlotPointData(Tracking,points = c("nose","bodycentre","tailbase","neck"))
```
As we can see, the data looks good. Now, let's add the labeling data. For this, we first load it from a file which contains the labeling data of a single experimenter:
```{r, warning=FALSE}
labeling.data <- read.table("example/FST/Lables/FSTLables_Oliver.csv",sep = ";", header = T)
head(labeling.data)
```
As you can see, the data is prepared in a specific way that links onset and offset of behaviors to a specific file. To get the data from this one file and add it to our object we use:
```{r, warning=FALSE}
labeling.data <- labeling.data[labeling.data$CSVname == Tracking$file,]
Tracking <- AddLabelingData(Tracking, labeling.data)
PlotLabels(Tracking)
```
Now that we have our labeling data, we have to think about how to create our feature data from our tracking data. Here we will use the information about point acceleration to train a floating classifier. For this, we first have to calculate accelerations for each of our points using:
```{r, warning=FALSE}
Tracking <- CalculateAccelerations(Tracking)
```
Now we use a preset function that extracts our features from the point data and stitches them into a single feature `data.frame`
```{r, warning=FALSE}
Tracking <- CreateAccelerationFeatures(Tracking)
dim(Tracking$features)
head(Tracking$features)
```
As we can see, our feature data now contains acceleration data for 11 points, in this case all points except the tail points. It is important to note that the function `CreateAccelerationFeatures()` is very specific to a certain DLC network, so any network that produces different points, or points with different names, will not work properly!
Additionally, we want to incorporate temporal information. For this, we use the following command. In the process, data from previous and following frames is added to the features at each frame, depending on a specified integration period, here set to +- 20 frames.
```{r, warning=FALSE}
Tracking <- CreateTrainingSet(Tracking, integration_period = 20)
dim(Tracking$train_x)
head(Tracking$train_y)
```
As we can see, the `train_x` data is now much bigger, having 451 features (= 41 x 11) that represent the acceleration of 11 points over a window of +- 20 frames. We can now use the following command to prepare the data so it can be used directly with keras. We set `shuffle` to `TRUE` so the data gets randomly shuffled before training a network with it:
```{r, warning=FALSE, eval = FALSE}
MLData <- CombineTrainingsData(Tracking, shuffle = TRUE)
```
Now, we define the architecture of the network that we want to train. We will not go into the details of the model used here; see the documentation on tensorflow.org for specifics.
```{r, warning=FALSE, eval = FALSE}
model <- keras_model_sequential()
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(MLData$parameters$N_input),kernel_regularizer = regularizer_l2(l = 0)) %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 128, activation = 'relu',kernel_regularizer = regularizer_l2(l = 0)) %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = MLData$parameters$N_features, activation = 'softmax')
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_rmsprop(),
metrics = c('accuracy')
)
```
We train the model with our data using the following command:
```{r, warning=FALSE, eval = FALSE}
history <- model %>% fit(
MLData$train_x, MLData$train_y,
epochs = 5, batch_size = 32,
validation_split = 0
)
```
We can evaluate the model with the same data we trained on using the following function:
```{r, include = FALSE}
#Here we are reading in pre-processed labeling data. set eval to TRUE for the above chunks and disable this command to interactively run ML training
Tracking$labels <- readRDS("example/knitdat.ML1.Rds")
```
```{r, warning=FALSE, eval = FALSE}
Tracking <- ClassifyBehaviors(Tracking,model,MLData$parameters)
```
```{r}
PlotLabels(Tracking)
```
As we can see, the classifier performs close to the original manual labeling. However, we have to be very careful here: we just trained on the same data that we tested on, and we used data from one single video. The results are therefore not solid. In the next section, we will see how we can use a one-vs-all approach to create multiple training and testing sets, which can be used to gain a fairly accurate readout of our model's performance.
## Training a classifier and cross validating it
In order to train more robust classifiers we need more training data. Here we again use the FST example dataset, for which we have 5 annotated videos. In general, this amount of training data is not sufficient to reach good performance, but it serves as a practical example that can be run in a reasonably short time. We start by defining our `pipeline()`:
```{r, warning=FALSE}
labeling.data <- read.table("example/FST/Lables/FSTLables_Oliver.csv",sep = ";", header = T)
pointinfo <- read.table("example/FST/FST_pointinfo.csv", sep = ";", header = T)
files <- unique(labeling.data$CSVname)
path <- "example/FST/DLC_Data/"
pipeline <- function(path){
Tracking <- ReadDLCDataFromCSV(path, fps = 25)
Tracking <- CutTrackingData(Tracking,start = 300,end = 300)
Tracking <- CalibrateTrackingData(Tracking, "distance",in.metric = 20, c("t","b"))
Tracking <- CleanTrackingData(Tracking, likelihoodcutoff = 0.95)
Tracking <- AddLabelingData(Tracking, labeling.data[labeling.data$CSVname == Tracking$filename,])
Tracking <- CalculateAccelerations(Tracking)
Tracking <- CreateAccelerationFeatures(Tracking)
Tracking <- CreateTrainingSet(Tracking, integration_period = 20)
Tracking <- AddPointInfo(Tracking,pointinfo)
Tracking <- FSTAnalysis(Tracking, cutoff_floating = 0.03, integration_period = 5, points = "bodycentre", Object = "Mouse")
return(Tracking)
}
```
Now we run this pipeline for all files and save the results in a `list()` of Tracking objects:
```{r, warning=FALSE, results = 'hide'}
TrackingAll <- RunPipeline(files,path,FUN = pipeline)
```
For the one-vs-all evaluation, we write a loop that, for each file, trains a model on all the other files and then evaluates this model on the held-out file, adding the classification results to it.
```{r, warning=FALSE, results = 'hide', eval = FALSE}
evaluation <- list()
for(i in names(TrackingAll)){
#combine training data from all files except the one we want to evaluate
MLData <- CombineTrainingsData(TrackingAll[names(TrackingAll) != i],shuffle = TRUE)
#initialize model
model <- keras_model_sequential()
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(MLData$parameters$N_input),kernel_regularizer = regularizer_l2(l = 0)) %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 128, activation = 'relu',kernel_regularizer = regularizer_l2(l = 0)) %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = MLData$parameters$N_features, activation = 'softmax')
#define optimizer
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_rmsprop(),
metrics = c('accuracy')
)
#train model
history <- model %>% fit(
MLData$train_x, MLData$train_y,
epochs = 5, batch_size = 32,
validation_split = 0
)
#we now use this model to classify the behavior in the cross validation file
TrackingAll[[i]] <- ClassifyBehaviors(TrackingAll[[i]],model,MLData$parameters)
}
```
We quickly inspect how it performed for the first 2 files:
```{r, include = FALSE}
#Here we are reading in pre-processed labeling data. set eval to TRUE for the above chunks and disable this command to interactively run ML training
labs <- readRDS(file = "example/LabsFST.Rds")
for(i in names(TrackingAll)){
TrackingAll[[i]]$labels$classifications <- labs[[i]]$classifications
}
```
```{r}
PlotLabels(TrackingAll$FST_1.csv)
PlotLabels(TrackingAll$FST_2.csv)
```
As we can see, at first glance the classifications look fairly accurate. To gain a more descriptive readout, we can evaluate our results within files and for the overall experiment. The second command will create a PDF document with all label plots in the working directory.
```{r}
EvaluateClassification(TrackingAll)
PlotLabels.Multi.PDF(TrackingAll)
```
As we can see, for floating we achieve an overall accuracy of ~80% and for None ~94%
If we want to investigate how well the classifier imitates the human experimenter across multiple files we can create a correlation plot of the final readouts. Here we are interested in the floating time for the different labeling approaches:
```{r}
CorrelationPlotLabels(TrackingAll, include = c("manual.Floating.time","classifications.Floating.time","cutoff.floating.Floating.time"))
```
As we can see, training with only 4 files leads to a classifier that correlates worse with the manual scoring than the pre-set cutoff does. To increase the correlation we would want to increase the amount of training data substantially.
DLCAnalyzer can also run unsupervised clustering methods. Here we use k-means clustering on our machine learning data to see if this approach can resolve any behavioral syllables that are floating-like.
```{r, results = 'hide', fig.width=10, fig.height=8}
TrackingAll <- UnsupervisedClusteringKmeans(TrackingAll,N_clusters = 10,Z_score_Normalize = TRUE)
PlotLabels(TrackingAll$FST_2.csv)
```
As we can see, there is one cluster that seems to be very similar to the classifier, the manual labels and the floating cutoff. Let's plot a correlation matrix to see if this holds across all files.
```{r, fig.width=10, fig.height=8}
CorrelationPlotLabels(TrackingAll, include = c("manual.Floating.time",
"classifications.Floating.time",
"cutoff.floating.Floating.time",
"unsupervised.1.time",
"unsupervised.2.time",
"unsupervised.3.time",
"unsupervised.4.time",
"unsupervised.5.time",
"unsupervised.6.time",
"unsupervised.7.time",
"unsupervised.8.time",
"unsupervised.9.time",
"unsupervised.10.time"))
```
## Functions Glossary
Main functions, intended to be used by the user:
Function | Description
------------- | -------------
`AddBinData()` | Adds bin data to an object of type TrackingData
`AddLabelingData()` | Adds labels to an object of type TrackingData
`AddOFTZones()`| Adds OFT zones to an object of type TrackingData
`AddPointInfo()` | Adds point info to an object of type TrackingData
`AddZones()` | Adds zones to an object of type TrackingData
`AddZonesToPlots()` | Adds zones to a density path plot. Requires existing zones
`BinAnalysis()` | Runs an analysis on an object of type TrackingData in bins. Requires existing bin data
`CalculateAccelerations()` | Adds accelerations to an object of type TrackingData
`CalculateMovement()` | Adds movement metrics to an object of type TrackingData
`CalibrateTrackingData()` | Calibrates an object of Type TrackingData
`ClassifyBehaviors()` | Classifies behaviors in an object of type TrackingData using a trained keras model. Requires existing training/testing data. REQUIRES KERAS LIBRARY!
`CleanTrackingData()` | Cleans up an object of type TrackingData
`CombineLabels()` | Combines labels in an object of type TrackingData. Requires existing labels
`CombineTrainingsData()` | Combines and prepares training data of multiple objects of type TrackingData. Requires existing training data. REQUIRES KERAS LIBRARY!
`CorrelationPlotLabels()` | Creates a correlation plot of different labels in a list of TrackingData objects. Requires existing labels
`CreateAccelerationFeatures()` | Creates acceleration based features for an object of type TrackingData. Requires any DLC network of original publication. Requires existing acceleration data
`CreateSkeletonData()` | Creates features based on skeleton data for an object of type TrackingData. Requires any DLC network of original publication
`CreateSkeletonData_FST_v2()` | Creates features based on skeleton data for an object of type TrackingData. Requires FST DLC network from original publication. Requires existing acceleration data
`CreateSkeletonData_OFT()` | Creates features based on skeleton data for an object of type TrackingData. Requires OFT DLC network from original publication. Requires existing OFT zones
`CreateSkeletonData_OFT_v2()` | Creates features based on skeleton data for an object of type TrackingData. Requires OFT DLC network from original publication. Requires existing OFT zones. requires existing acceleration data
`CreateSkeletonData_OFT_v3()` | Creates features based on skeleton data for an object of type TrackingData. Requires OFT DLC network from original publication. Requires existing OFT zones. Requires existing acceleration data
`CreateTestSet()` | Adds machine learning test data for an object of type TrackingData. Requires existing features
`CreateTrainingSet()`| Adds machine learning training data for an object of type TrackingData. Requires existing features and labeling data.
`CutTrackingData()`| Cuts an object of type TrackingData
`EPMAnalysis()`| Performs an EPM analysis on an object of type TrackingData. Requires EPM zones.
`EvaluateClassification()`| Evaluates classification performance/accuracies of an object or a list of objects of type TrackingData
`ExtractLabels()`| Extracts labels from a long format data frame by multiple selection criteria
`FSTAnalysis()`| Runs an FST analysis on an object of type TrackingData. Requires existing point info
`GetAngleClockwise()`| Gets the clockwise angle between a vector pair from an object of type TrackingData at each frame
`GetAngleTotal()`| Gets the total angle between two vectors from an object of type TrackingData at each frame
`GetDistances()`| Gets the distance between two points from an object of type TrackingData at each frame
`GetDistanceToZoneBorder()`| Gets the distance between a point and the border of a zone from an object of type TrackingData at each frame
`GetPolygonAreas()`| Gets area of a polygon from an object of type TrackingData at each frame
`HeadAngleAnalysis()`| Performs a head angle analysis for an object of type TrackingData
`IsInZone()`| Checks if a point is in a zone for an object of type TrackingData at each frame
`IsTrackingData()`| Checks if object is of type TrackingData
`LabelReport()`| Creates a Label report for an object of type TrackingData. requires labels.
`MedianMouseArea()`| Calculates the median mouse area for an object of type TrackingData
`MedianMouseLength()`| Calculates the median mouse length for an object of type TrackingData
`MultiFileBinanalysis()`| Performs a bin analysis for a list() of files of type TrackingData. requires bin data.
`MultiFileReport()`| Produces a combined report for a list() of files of type TrackingData.
`NormalizeZscore()`| Performs a Z score normalization on a matrix by columns
`NormalizeZscore_median()`| Performs a median Z score normalization on a matrix by columns
`OFTAnalysis()`| Performs an OFT analysis on an object of type TrackingData. Requires OFT zones.
`OverviewPlot()`| Creates an overview plot for an object of type TrackingData
`PlotDensityPaths()`| For an object of type TrackingData, plots the density path of a point
`PlotDensityPaths.Multi.PDF()`| For a list() of objects of type TrackingData. Prints density path plots to a pdf.
`PlotLabels()`| For an object of type TrackingData, plots label data
`PlotLabels.Multi.PDF()`| For a list() of objects of type TrackingData. Prints label plots to a pdf.
`PlotPointData()`| Plots point data for an object of type TrackingData.
`PlotZones()`| Plots zones of an object of type TrackingData. requires zones
`PlotZoneSelection()`| Plots the zone selection for an object of type TrackingData. requires zones
`PlotZoneVisits()`| Plots zone visits for an object of type TrackingData. requires zones
`PlotZoneVisits.Multi.PDF()`| For a list() of objects of type TrackingData. Prints zone visit plots to a pdf. requires zones
`PrepareMLData()`| Prepares x_train and y_train data for a keras network. REQUIRES KERAS LIBRARY!
`ReadDLCDataFromCSV()`| Creates an object of type TrackingData from a DLC .csv output file
`RotateTrackingData()`| Rotates an object of type TrackingData
`RunPipeline()`| Runs a specified pipeline that loads and processes multiple objects of type TrackingData
`ScaleFeatures()`| Linearly scales feature data of an object of type TrackingData. Requires feature data.
`SmoothLabels()`| Smooths all labeling data of an object of type TrackingData. requires labels
`UnsupervisedClusteringKmeans()`| Performs a k means clustering on an object (or list() of objects) of type TrackingData
`ZoneReport()`| Creates a zone report for an object of type TrackingData
`ZscoreNormalizeFeatures()`| Performs Z score normalization on features of an object of type TrackingData. Requires Features
Auxiliary functions:
Function | Description
------------- | -------------
`AreaPolygon2d()` | Calculates the area of a polygon
`AreaPolygon3d()` | Calculates the area of a polygon at each frame
`avgbool()` | Logical rolling mean of a boolean vector
`avgmean()` | Numeric rolling mean of a numeric vector
`CalculateTransitions()` | Calculates transitions for a boolean vector
`Distance2d()`| Distance between two points
`DistanceToPolygon()`| Distance from a point to the closest polygon edge
`EqualizeTrainingSet()`| Equalizes a training set
`integratevector()`| Integrates a numeric vector
`periodsum()`| For a numeric vector, sums up total values within a time period at each element
`RecenterPolygon()`| Recenters a polygon
`ScalePolygon()`| Linearly scales a polygon
`SmoothLabel()`| Smooths a single character vector with label data