---
title: Data science - statistics
output: html_document
---

# Data science - quantitative analysis

Data science methods can be applied to numerical data as well.
Numerical data does not need to be modified to suit computer analysis, so we can read it in directly.
Here we examine the World Values Survey dataset.

[Narrated code walkthrough](https://www.youtube.com/watch?v=aCbSM4Qiz_Q)

```{r}
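# read the survey data from a CSV file
data <- read.csv('./data/wvs.csv')
# treat V10 as a categorical variable rather than a number
data$V10 <- as.factor( data$V10 )
# number of rows per value of V10
table( data$V10 )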
```

## Supervised machine learning

We use the general-purpose library [caret](https://www.rdocumentation.org/packages/caret/) to develop and work with these exercises.
The workflow has three steps:

1. Split the data into training and test datasets.
1. Fit a model to the training data.
1. Examine the model's performance using the test dataset.

```{r}
library( caret )
```

```{r}
# random split stratified on V10: 70 % of rows for training, the rest for testing
index <- createDataPartition(y=data$V10, p=0.7, list=FALSE)
train <- data[index,]
test <- data[-index,]
```

```{r}
table( train$V10 )
table( test$V10 )
```
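
When the outcome is a factor, `createDataPartition` samples within each of its levels, so the shares of the `V10` values should be roughly the same in both tables above. The split is random, so fixing a seed makes it repeatable. A minimal sketch on R's built-in `iris` data, used here only as a stand-in for the survey data; the seed value and the `toy_` names are arbitrary choices for illustration:

```{r}
library( caret )

set.seed( 42 )  # arbitrary seed, only to make the random split repeatable

# iris stands in for the survey data; Species plays the role of V10
toy_index <- createDataPartition( y = iris$Species, p = 0.7, list = FALSE )
toy_train <- iris[toy_index, ]
toy_test <- iris[-toy_index, ]

# class shares should be nearly identical in the two parts
prop.table( table( toy_train$Species ) )
prop.table( table( toy_test$Species ) )
```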

```{r}
# fit a decision tree (rpart) that predicts V10 from V4 to V9
model <- train( V10 ~ V4 + V5 + V6 + V7 + V8 + V9,  data=train, method="rpart" )
```

```{r}
# predicted V10 value for each row of the test data
pred <- predict( model , newdata = test )
```

```{r}
# confusionMatrix expects predictions in rows and true values in columns
tab_class <- table( pred , test$V10 )
confusionMatrix(tab_class, mode = "everything")
```
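
To see where the main numbers in the `confusionMatrix` output come from, the sketch below computes accuracy and the per-class precision and recall by hand from a small made-up table. The counts and class labels are invented for illustration and are not survey results. Rows hold predictions and columns hold the true classes, which is the layout `confusionMatrix` assumes when it is given a table.

```{r}
# made-up counts: rows = predicted class, columns = true class
toy_tab <- matrix( c( 45, 10,  5,
                       5, 20,  5,
                       0,  0, 10 ),
                   nrow = 3, byrow = TRUE,
                   dimnames = list( predicted = c( "1", "2", "3" ),
                                    actual    = c( "1", "2", "3" ) ) )
toy_tab <- as.table( toy_tab )

# accuracy: correctly classified cases (the diagonal) over all cases
toy_accuracy <- sum( diag( toy_tab ) ) / sum( toy_tab )

# precision: of the cases predicted as a class, the share that truly belong to it
toy_precision <- diag( toy_tab ) / rowSums( toy_tab )

# recall (sensitivity): of the cases truly in a class, the share that was found
toy_recall <- diag( toy_tab ) / colSums( toy_tab )

toy_accuracy
toy_precision
toy_recall
```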

## Exercises

* `V10` has several unwanted values: `-5`, `-2` and `-1`. Remove them from the data and rerun the analysis.
* What other variables would you add to the analysis? Do they improve accuracy? See the [survey documentation](https://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp) for the meaning of the variables.
* What methods other than `rpart` exist (see the [caret documentation](http://topepo.github.io/caret/train-models-by-tag.html) on available models)? Try them out. Do you get better results?
* What does cross-validation mean? Try out cross-validation with the models above.
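
As a starting point for the first and last exercise, here is a minimal sketch on a small simulated data set rather than the survey data. It drops the rows that carry an unwanted code, removes the emptied factor level with `droplevels`, and then asks `train` for 5-fold cross-validation through `trainControl`. In cross-validation the training data is split into folds, and each fold in turn is held out for evaluation while the model is fitted on the rest. The data frame `toy`, its columns, the number of folds and the seed are all invented for illustration.

```{r}
library( caret )

set.seed( 1 )  # arbitrary seed for repeatable simulated data and folds

# simulated stand-in for the survey: outcome y with an unwanted code -1
toy <- data.frame( x1 = rnorm( 300 ), x2 = rnorm( 300 ) )
toy$y <- ifelse( toy$x1 + toy$x2 > 0, 1, 2 )
toy$y[ sample( 300, 20 ) ] <- -1
toy$y <- as.factor( toy$y )

# keep only rows with wanted values, then drop the now-empty level
toy_clean <- toy[ !( toy$y %in% c( "-5", "-2", "-1" ) ), ]
toy_clean$y <- droplevels( toy_clean$y )
table( toy_clean$y )

# 5-fold cross-validation: each fold is held out once for evaluation
ctrl <- trainControl( method = "cv", number = 5 )
toy_model <- train( y ~ x1 + x2, data = toy_clean, method = "rpart", trControl = ctrl )

# accuracy averaged over the held-out folds, one row per tuning value
toy_model$results
```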

## Unsupervised machine learning

Beyond classifying data based on existing attributes, for some questions you may want to find groups of similar entries in the data.
This is unsupervised machine learning; several methods exist for this.

```{r}
# k-means on the first 500 rows, asking for 5 clusters
clusters <- kmeans( data[1:500,], centers = 5 )
```

```{r}
# mean of each variable within each cluster
aggregate( data[1:500,], by=list(cluster=clusters$cluster), mean)
```

## Exercises

* Which cluster has the highest number of data points?
* How does changing the number of clusters change the results?
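
Two parts of the object returned by `kmeans` help with these questions: `size` holds the number of points in each cluster, and `tot.withinss` measures how tightly the points sit around their cluster centres. A minimal sketch on the numeric columns of the built-in `iris` data, used as a stand-in for the survey data; the seed, the `nstart` value and the range of cluster counts are arbitrary choices for illustration:

```{r}
set.seed( 7 )  # k-means starts from random centres, so fix a seed

toy_points <- iris[ , 1:4 ]  # numeric columns only

toy_clusters <- kmeans( toy_points, centers = 3, nstart = 10 )

# number of data points per cluster, and the index of the largest cluster
toy_clusters$size
which.max( toy_clusters$size )

# total within-cluster sum of squares for 2 to 6 clusters
toy_wss <- sapply( 2:6, function( k ) {
  kmeans( toy_points, centers = k, nstart = 10 )$tot.withinss
} )
toy_wss
```

The total within-cluster sum of squares generally shrinks as clusters are added, so a lower value on its own does not mean a better grouping; a common heuristic is to look for the point where the decrease levels off.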