docs: adding docs and new plots for blog
mariogarcia committed Dec 24, 2024
1 parent 111df6d commit 0953a6a
Showing 19 changed files with 199 additions and 35 deletions.
18 changes: 11 additions & 7 deletions docs/guide/docs/blog/posts/2024/12/classifying_food.md
Original file line number Diff line number Diff line change
@@ -4,6 +4,7 @@ categories:
- ml
tags:
- ml
- classification
---

# Classifying food
@@ -97,7 +98,7 @@ Each entry has a series of possible features and it’s labeled with a color val
3 | 4.2 | 4.2 | 17 | 0 | 0 | 0 | 0.004 | 0 | 0.01 |
```
However, the goal is to choose the smallest set of features that still classifies well. Too many features could classify well but would make the model too hard to use; too few would not classify well enough. I need to find the balance between the two. Once I’ve found it, I can use both features and labels to create training and test datasets. For that I use the train_test_split function from the scikit-learn library.
However, the goal is to choose the smallest set of features that still classifies well. Too many features could classify well but would make the model too hard to use; too few would not classify well enough. I need to find the balance between the two. Once I’ve found it, I can use both features and labels to create training and test datasets. For that I use the **trainTestSplit** function.
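As a rough illustration of what such a split does, the plain-Groovy sketch below shuffles row indices with a fixed seed and holds out a fraction of the rows for testing. The `splitRows` helper, its `testRatio` parameter and the 25% default are assumptions made for this example; this is not how Underdog's `trainTestSplit` is implemented.

```groovy
// Hypothetical helper: NOT Underdog's trainTestSplit, only the general idea.
def splitRows(List<List<Number>> features, List<Integer> labels, double testRatio = 0.25d, long seed = 0L) {
    // shuffle row indices reproducibly so features and labels stay aligned
    List<Integer> indices = (0..<features.size()).toList()
    Collections.shuffle(indices, new Random(seed))

    int testSize = (int) Math.round(features.size() * testRatio)
    List<Integer> testIdx = indices.take(testSize)
    List<Integer> trainIdx = indices.drop(testSize)

    // same order as in the post: X_train, X_test, y_train, y_test
    return [
        trainIdx.collect { features[it] },
        testIdx.collect { features[it] },
        trainIdx.collect { labels[it] },
        testIdx.collect { labels[it] }
    ]
}

def feats = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16]]
def label = [0, 1, 0, 1, 0, 1, 0, 1]

def (xTrain, xTest, yTrain, yTest) = splitRows(feats, label)
println "train rows: ${xTrain.size()}, test rows: ${xTest.size()}"
```

Fixing the seed plays the same role as `random_state: 0` in the post: the split is repeatable between runs.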
```groovy title="minimum set of features and creating training and test datasets"
--8<-- "src/test/groovy/underdog/blog/y2024/m12/ClassifyingFoodSpec.groovy:train_test_split"
@@ -110,8 +111,8 @@ Drawing a scatter matrix sometimes could help you to spot features that are part
```
<figure markdown="span">
![scatter matrix](images/classifying_food_scatter_matrix.png#only-light)
![scatter matrix](images/classifying_food_scatter_matrix_dark.png#only-dark)
![scatter matrix](images/classifying_food/scatter_matrix.png#only-light)
![scatter matrix](images/classifying_food/scatter_matrix_dark.png#only-dark)
</figure>
### Algorithm selection
@@ -121,22 +122,25 @@ In order to choose the algorithm, I needed to identify first the type of problem
- First, I’ve got a labeled dataset, so it looked like I could use the labeled data to train a supervised learning model.
- Second, I was looking for different types of discrete target values (values for green, orange, red), therefore it seemed to be a classification problem.
Once I confirmed it was a classification problem I chose the only classification algorithm I know so far, the k-nearest neighbors algorithm.
Once I confirmed it was a classification problem I picked the [k-nearest neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) algorithm.
## Evaluation Phase
Then we use both, dataset and algorithm, to train a software model to make predictions. Afterwards the model performance is evaluated with testing datasets. Training and testing are part of the evaluation phase.
### Model creation & training
The k-nearest neighbors algorithm tries to establish which type an element belongs to by checking its closest neighboring elements. You can customize the K parameter, which sets how many neighbors the algorithm has to check before emitting its verdict.
The k-nearest neighbors algorithm is implemented in scikit-learn via the KNeighborsClassifier class. The algorithm tries to establish which type an element belongs to by checking its closest neighboring elements. You can customize the K parameter, which sets how many neighbors the algorithm has to check before emitting its verdict.
Here I’m initializing the algorithm with k=5. Then I’m training the model using the fit function, and finally I’m checking how well the model is going to perform by passing the testing dataset (X_test, y_test) to the score function. After some tuning here and there I was able to get 90% accuracy using 6 features.
Here I’m initializing the algorithm with k=5. Then I’m training the model using the fit function, and finally I’m checking how well the model is going to perform by passing the testing dataset (X_test, y_test) to the score function. I was able to get more than 80% accuracy using 6 features.
```groovy title="model training and getting accuracy score with the testing dataset"
--8<-- "src/test/groovy/underdog/blog/y2024/m12/ClassifyingFoodSpec.groovy:knn_predictions"
```
```groovy title="accuracy check"
--8<-- "src/test/groovy/underdog/blog/y2024/m12/ClassifyingFoodSpec.groovy:accuracy_check"
```
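To make the neighbor vote and the accuracy score more concrete, here is a minimal plain-Groovy sketch of the idea. It assumes Euclidean distance and a simple majority vote; the helper names (`euclidean`, `knnPredict`, `accuracy`) and the toy data are made up for this example and are not the Underdog or Smile implementation used in the spec.

```groovy
// Plain-Groovy illustration of k-nearest neighbors; not Underdog's or Smile's code.
double euclidean(List<Number> a, List<Number> b) {
    double sum = 0d
    for (int i = 0; i < a.size(); i++) {
        double diff = a[i].doubleValue() - b[i].doubleValue()
        sum += diff * diff
    }
    return Math.sqrt(sum)
}

int knnPredict(List<List<Number>> xTrain, List<Integer> yTrain, List<Number> sample, int k) {
    // training row indices ordered by distance to the sample, closest first
    List<Integer> byDistance = (0..<xTrain.size()).toList().sort { int i -> euclidean(xTrain[i], sample) }
    List<Integer> nearest = byDistance.take(k)

    // majority vote among the labels of the k nearest rows
    Map<Integer, Integer> votes = nearest.countBy { int i -> yTrain[i] }
    return votes.max { it.value }.key
}

double accuracy(List<Integer> expected, List<Integer> predicted) {
    // fraction of predictions that match the expected label
    int hits = (0..<expected.size()).count { int i -> expected[i] == predicted[i] }
    return hits / (double) expected.size()
}

// toy data: two well separated groups labeled 0 and 1
def xTrain = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
def yTrain = [0, 0, 0, 1, 1, 1]
def xTest = [[2, 2], [9, 9], [1, 0]]
def yTest = [0, 1, 0]

def predictions = xTest.collect { knnPredict(xTrain, yTrain, it, 3) }
def score = accuracy(yTest, predictions)
println "predictions: ${predictions}, accuracy: ${score}"
```

Accuracy is the share of test rows whose predicted label matches the real one, which suits a classifier better than a regression metric such as R².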
### Model testing
To get a prediction I need to provide the following measurements to the model:
Original file line number Diff line number Diff line change
@@ -46,8 +46,8 @@ class ClassifyingFoodSpec extends Specification {
def label = df['TRAFFICLIGHT VALUE'] as int[]

def (X_train, X_test, y_train, y_test) = Underdog.ml()
.utils
.trainTestSplit(feats, label, random_state: 0)
.utils
.trainTestSplit(feats, label, random_state: 0)
// --8<-- [end:train_test_split]

and:
@@ -65,11 +65,21 @@ class ClassifyingFoodSpec extends Specification {
and:
// --8<-- [start:knn_predictions]
def ml = Underdog.ml()

// creates and trains the model
def knn = ml.classification.knn(X_train, y_train, k: 5)

// creating predictions with the test feature set
def predictions = knn.predict(X_test)
def score = ml.metrics.r2Score(y_test, predictions)

// getting the accuracy of the model when tested against the test set
def score = ml.metrics.accuracy(y_test, predictions)
// --8<-- [end:knn_predictions]

// --8<-- [start:accuracy_check]
assert score > 0.80
// --8<-- [end:accuracy_check]

and:
// --8<-- [start:food_samples_predictions]
// def sample = [CARBS, SUGAR, PROTEINS, FAT, SALT, FIBER]
3 changes: 2 additions & 1 deletion gradle.properties
Original file line number Diff line number Diff line change
@@ -8,4 +8,5 @@ tablesaw = 1.0.1
spock = 2.3-groovy-4.0
ajoberstar = 3.0.0
micronautVersion = 4.6.3
jgrapht = 1.5.2
jgrapht = 1.5.2
smile = 3.1.1
Original file line number Diff line number Diff line change
@@ -21,6 +21,10 @@ interface DataFrame extends Columnar {
*/
DataFrame copy()

double[][] corrMatrix()

double[][] corrMatrix(Integer round)

/**
* Fill NA/NaN values using the specified value passed as parameter
*
Original file line number Diff line number Diff line change
@@ -81,6 +81,33 @@ class TSDataFrame implements DataFrame {
return new TSDataFrame(table.copy())
}

    @Override
    double[][] corrMatrix() {
        // correlate every column against every other column
        def finalData = []
        this.columns.each { String left ->
            def next = []
            this.columns.each { String right ->
                next << this[left].corr(this[right])
            }
            finalData << next
        }

        return finalData as double[][]
    }

    @Override
    double[][] corrMatrix(Integer round) {
        // same as above, rounding each coefficient to `round` decimal places
        List<List<Double>> finalData = []
        this.columns.each { String left ->
            List<Double> next = []
            this.columns.each { String right ->
                next << this[left].corr(this[right]).doubleValue().round(round)
            }
            finalData << next
        }

        return finalData as double[][]
    }

@Override
DataFrame fillna(Object o) {
Table copied = table.copy()
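For readers who want to see what a rounded correlation matrix contains, the sketch below computes one in plain Groovy. It assumes the Pearson coefficient (the default method in `TSSeries.corr`) and represents the columns as a name-to-values map; `pearson` and `corrMatrixSketch` are hypothetical names for this example, not part of the `DataFrame` API.

```groovy
// Pearson correlation of two equally sized lists (plain Groovy, no Tablesaw involved).
double pearson(List<Number> xs, List<Number> ys) {
    int n = xs.size()
    double meanX = xs.collect { it.doubleValue() }.sum() / n
    double meanY = ys.collect { it.doubleValue() }.sum() / n

    double cov = 0d
    double varX = 0d
    double varY = 0d
    for (int i = 0; i < n; i++) {
        double dx = xs[i].doubleValue() - meanX
        double dy = ys[i].doubleValue() - meanY
        cov += dx * dy
        varX += dx * dx
        varY += dy * dy
    }
    return cov / Math.sqrt(varX * varY)
}

// Hypothetical stand-in for corrMatrix(round): columns come in as a name -> values map.
double[][] corrMatrixSketch(Map<String, List<Number>> columns, int precision) {
    List<String> names = columns.keySet().toList()
    List<List<Double>> rows = names.collect { String left ->
        names.collect { String right ->
            // round each coefficient to the requested number of decimals
            pearson(columns[left], columns[right]).round(precision)
        }
    }
    return rows as double[][]
}

def columns = [
    carbs: [1, 2, 3, 4],
    sugar: [2, 4, 6, 8],
    salt : [1, 3, 2, 4]
]
double[][] matrix = corrMatrixSketch(columns, 2)
matrix.each { println it.toList() }
```

The result is symmetric with ones on the diagonal, since every column correlates perfectly with itself.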
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
package underdog.impl

import groovy.util.logging.Slf4j
import underdog.DataFrame
import underdog.Series
import groovy.transform.NamedParam
@@ -26,6 +27,7 @@ import static underdog.Series.TypeCorrelation.KENDALL
import static underdog.Series.TypeCorrelation.PEARSON
import static underdog.Series.TypeCorrelation.SPEARMAN

@Slf4j
class TSSeries implements Series {
private final Column column

@@ -122,10 +124,11 @@ class TSSeries implements Series {
@NamedParam(required = true) Series other,
@NamedParam(required = false) TypeCorrelation method = PEARSON,
@NamedParam(required = false) Integer observations = 0) {
log.debug("correlation between ${this.name} - ${other.name}")

def (alignedX, alignedY) = [this as Double[], other as Double[]]
.transpose()
.<List<Double>>findAll(Object::every)
.<List<Double>>findAll { items -> items.every { it != null} }
.inject([[], []]) { agg, next ->
agg[0] << next[0]
agg[1] << next[1]
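The switch from `Object::every` to an explicit null check matters because a no-argument `every()` applies Groovy truth, under which `0` and `0.0` count as false. A pair containing a legitimate zero would therefore be dropped along with the pairs that contain nulls, which is presumably what motivated the change. The sketch below shows the difference; `alignNonNull` is a hypothetical helper written for this example.

```groovy
// Keep only the positions where both series have a value.
List<List<Double>> alignNonNull(List<Double> xs, List<Double> ys) {
    List<List<Double>> pairs = [xs, ys].transpose().findAll { List<Double> pair -> pair.every { it != null } }

    // split the surviving pairs back into two aligned lists
    return [pairs.collect { it[0] }, pairs.collect { it[1] }]
}

def (alignedX, alignedY) = alignNonNull([1.0d, null, 3.0d, 0.0d], [2.0d, 5.0d, null, 4.0d])
println "x: ${alignedX}, y: ${alignedY}"

// Groovy truth treats zero as false, so a truthiness check would reject this valid pair
def pairWithZero = [0.0d, 4.0d]
println "truthy check: ${pairWithZero.every()}, null check: ${pairWithZero.every { it != null }}"
```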
4 changes: 2 additions & 2 deletions modules/underdog-ml/underdog-ml.gradle
Original file line number Diff line number Diff line change
@@ -10,8 +10,8 @@ repositories {
dependencies {
api project(":underdog-dataframe")

api "com.github.haifengl:smile-core:3.1.1"
api "com.github.haifengl:smile-nlp:3.1.1"
api "com.github.haifengl:smile-core:$smile"
api "com.github.haifengl:smile-nlp:$smile"

implementation "org.apache.groovy:groovy-macro:$groovy"
testImplementation project(":underdog-dataframe")
Original file line number Diff line number Diff line change
@@ -10,6 +10,7 @@ import underdog.plots.dsl.Radar
import underdog.plots.dsl.Series
import underdog.plots.dsl.Title
import underdog.plots.dsl.Tooltip
import underdog.plots.dsl.VisualMap
import underdog.plots.dsl.XAxis
import underdog.plots.dsl.YAxis

@@ -18,6 +19,7 @@ class Options {
Title title
Tooltip tooltip
AxisPointer axisPointer
VisualMap visualMap

@RepeatableField Legend legend
@RepeatableField Grid grid
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
package underdog.plots.dsl

import underdog.plots.ast.Node

@Node
class VisualMap {
Number min
Number max
Boolean calculable
String orient
String left
String bottom
Boolean show
}
Original file line number Diff line number Diff line change
@@ -8,8 +8,10 @@ class XAxis {
String name
String type
String nameLocation
String position
Number nameGap
Number gridIndex
Boolean inverse
Number splitNumber
List boundaryGap
SplitLine splitLine
Original file line number Diff line number Diff line change
@@ -8,10 +8,12 @@ class YAxis {
String type
String name
String nameLocation
Boolean inverse
Number nameGap
Number gridIndex
Number splitNumber
List boundaryGap
List data
SplitLine splitLine
NameTextStyle nameTextStyle
Number min
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
package underdog.plots.dsl.series

import underdog.plots.ast.Node
import underdog.plots.dsl.Series

@Node
class HeatmapSeries extends Series {
String type = 'heatmap'
}
21 changes: 4 additions & 17 deletions modules/underdog-plots/src/main/groovy/underdog/plots/Plots.groovy
Original file line number Diff line number Diff line change
@@ -5,6 +5,7 @@ import groovy.transform.NamedVariant
import underdog.plots.Render.Meta
import underdog.plots.charts.Bar
import underdog.plots.charts.Graph
import underdog.plots.charts.CorrelationMatrix
import underdog.plots.charts.Histogram
import underdog.plots.charts.Line
import underdog.plots.charts.Pie
@@ -13,30 +14,16 @@ import underdog.plots.charts.Scatter
import underdog.plots.charts.ScatterMatrix

class Plots {
@Delegate Line line = new Line()
@Delegate Scatter scatter = new Scatter()
@Delegate Line lineDelegate = new Line()
@Delegate Scatter scatterDelegate = new Scatter()
@Delegate Histogram histogram = new Histogram()
@Delegate Graph graphDelegate = new Graph()
@Delegate Bar barDelegate = new Bar()
@Delegate Pie pieDelegate = new Pie()
@Delegate Radar radarDelegate = new Radar()
@Delegate ScatterMatrix scatterMatrixDelegate = new ScatterMatrix()
@Delegate CorrelationMatrix heatmapDelegate = new CorrelationMatrix()

@NamedVariant
Options plot(
List<Number> x,
List<Number> y,
@NamedParam(required = false, value='title') String chartTitle = "",
@NamedParam(required = false) boolean smooth = false) {
return line.line(x, y, title: chartTitle, smooth: smooth)
}

@NamedVariant
Options plot(
Map<String, List<Number>> data,
@NamedParam(required = false, value='title') String chartTitle = "") {
return line.lines(data, title: chartTitle)
}

@NamedVariant
static String show(
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
package underdog.plots.charts

import groovy.transform.InheritConstructors
import underdog.plots.Options
import underdog.plots.dsl.series.HeatmapSeries

/**
* @since 0.1.0
*/
@InheritConstructors
class CorrelationMatrix extends Chart {

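    /**
     * Builds a heatmap chart from a correlation matrix.
     *
     * @param labels names shown on both axes, one per row/column of the matrix
     * @param heatmapData square matrix of correlation values, indexed as [row][column]
     * @return the chart options
     * @since 0.1.0
     */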
Options correlationMatrix(
List labels,
double[][] heatmapData
) {
def finalData = []
labels.eachWithIndex { Object left, int i ->
labels.eachWithIndex { Object right, int j ->
finalData << [i, j, heatmapData[i][j]]
}
}

return create {
grid {
bottom('100')
}
xAxis {
show(true)
type('category')
data(labels)
axisLabel {
rotate(90)
}
}
yAxis {
show(true)
inverse(true)
data(labels)
}
visualMap {
min(0)
max(1)
show(false)
}
series(HeatmapSeries) {
data(finalData)
label {
show(true)
}
}
}
}
}
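The nested loops in `correlationMatrix` flatten the square matrix into `[i, j, value]` triples, one per cell, which is the shape the heatmap series receives as data. A standalone sketch of that flattening step follows; `toHeatmapTriples` is a hypothetical name used only here.

```groovy
// Flatten a matrix into [rowIndex, columnIndex, value] triples, one per cell.
List<List<Number>> toHeatmapTriples(double[][] matrix) {
    List<List<Number>> triples = []
    for (int i = 0; i < matrix.length; i++) {
        for (int j = 0; j < matrix[i].length; j++) {
            triples << [i, j, matrix[i][j]]
        }
    }
    return triples
}

double[][] sample = [[1.0d, 0.8d], [0.8d, 1.0d]] as double[][]
def triples = toHeatmapTriples(sample)
triples.each { println it }
```

Because a correlation matrix is symmetric, it makes no difference here which of the two indices ends up on which axis.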
Original file line number Diff line number Diff line change
@@ -88,9 +88,9 @@ class Histogram extends Chart {
}

static List<List<Number>> createHistogramDataFrom(List<Number> xs, Integer bins = 20) {
def min = Math.floor(xs.min().toDouble()).toInteger()
def max = Math.ceil(xs.max().toDouble()).toInteger()
def binSize = Math.ceil((max - min) / bins).toInteger()
def min = xs.min().toDouble()
def max = xs.max().toDouble()
def binSize = (max - min) / bins

def x = (0..bins).inject([min]) { agg, next ->
agg << (agg[-1] + binSize)
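Dropping the integer rounding matters for data that lives in a narrow range. With values between 0 and 1 and 20 bins, the earlier `Math.ceil((max - min) / bins)` yields a bin size of 1, so a single bin already spans the whole range. The sketch below computes equal-width bin edges with a fractional bin size; `binEdges` is a hypothetical helper for this example, not the library's `createHistogramDataFrom`.

```groovy
// Equal-width bin edges using a fractional bin size.
List<Double> binEdges(List<Number> xs, int bins = 20) {
    double min = xs.min().doubleValue()
    double max = xs.max().doubleValue()
    double binSize = (max - min) / bins

    // bins + 1 edges delimit `bins` intervals between min and max
    return (0..bins).collect { int i -> min + i * binSize }
}

def edges = binEdges([0.0d, 0.25d, 0.5d, 1.0d], 4)
println edges
```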