docs: adding docs and new plots for blog

grooviter · Dec 24, 2024 · 0953a6a
1 parent 111df6d
commit 0953a6a

Showing 19 changed files with 199 additions and 35 deletions.
diff --git a/docs/guide/docs/blog/posts/2024/12/classifying_food.md b/docs/guide/docs/blog/posts/2024/12/classifying_food.md
@@ -4,6 +4,7 @@ categories:
   - ml
 tags:
   - ml
+  - classification
 ---
 
 # Classifying food
@@ -97,7 +98,7 @@ Each entry has a series of possible features and it’s labeled with a color val
                   3  |    4.2  |    4.2  |      17  |         0  |              0  |    0  |   0.004  |      0  |  0.01  |
 ```
 
-However the goal is to choose the minimum set of features that maximizes the classification. Too many could classify well but it would become too hard to use, too few would not classify well enough. I need to find the balance between the two. Once I’ve found the balance I can use both, features and labels to create a training and test datasets. For that I use the train_test_split function from scikit-learn library.
+However, the goal is to choose the minimum set of features that maximizes classification accuracy. Too many features could classify well but would make the model too hard to use; too few would not classify well enough. I need to find the balance between the two. Once I’ve found it, I can use both features and labels to create training and test datasets. For that I use the **trainTestSplit** function.
 
 ```groovy title="minimum set of features and creating training and test datasets"
 --8<-- "src/test/groovy/underdog/blog/y2024/m12/ClassifyingFoodSpec.groovy:train_test_split"
@@ -110,8 +111,8 @@ Drawing a scatter matrix sometimes could help you to spot features that are part
 ```
 
 <figure markdown="span">
-![scatter matrix](images/classifying_food_scatter_matrix.png#only-light)
-![scatter matrix](images/classifying_food_scatter_matrix_dark.png#only-dark)
+![scatter matrix](images/classifying_food/scatter_matrix.png#only-light)
+![scatter matrix](images/classifying_food/scatter_matrix_dark.png#only-dark)
 </figure>
 
 ### Algorithm selection
@@ -121,22 +122,25 @@ In order to choose the algorithm, I needed to identify first the type of problem
 - First, I’ve got a labeled dataset, so it looked like I could use the labeled data to train a supervised learning model. 
 - Second, I was looking for different types of discrete target values (values for green, orange, red), therefore it seemed to be a classification problem.
 
-Once I confirmed it was a classification problem I chose the only classification algorithm I know so far, the k-nearest neighbors algorithm.
+Once I confirmed it was a classification problem I picked the [k-nearest neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) algorithm.
 
 ## Evaluation Phase
 
 Then we use both, dataset and algorithm, to train a software model to make predictions. Afterwards the model performance is evaluated with testing datasets. Training and testing are part of the evaluation phase.
 
 ### Model creation & training
+The k-nearest neighbors algorithm tries to establish which type an element belongs to by checking the closest neighboring elements around it. You can customize the K parameter, which sets how many neighbors the algorithm has to check before emitting its verdict.
 
-The k-nearest neighbors algorithm is implemented in scikit-learn via the KNeighborsClassifier class. The algorithm tries to establish to which type the element belongs by checking the closest neighborg elements around. You can customize the K parameter which sets how many neighbors does the algorithm have to check before emmiting its veredict.
-
-Here I’m initializing the algorithm with k=5. Then I’m training the model using the fit function and finally I’m checking how well the model is going to perform by passing the testing dataset (X_test, y_test) to the score function. After some tunes here and there I was able to get a 90% of accuracy by using 6 features.
+Here I’m creating and training the model with k=5. Then I’m generating predictions for the test features (X_test), and finally I’m checking how well the model performs by comparing those predictions with the test labels (y_test) using the accuracy metric. I was able to get more than 80% accuracy by using 6 features.
 
 ```groovy title="model training and getting accuracy score with the testing dataset"
 --8<-- "src/test/groovy/underdog/blog/y2024/m12/ClassifyingFoodSpec.groovy:knn_predictions"
 ```
 
+```groovy title="accuracy check"
+--8<-- "src/test/groovy/underdog/blog/y2024/m12/ClassifyingFoodSpec.groovy:accuracy_check"
+```
+
 ### Model testing
 
 To get a prediction I need to provide the following measurements to the model:

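The k-nearest neighbors description added to the blog post above can be illustrated with a small sketch. It is plain Groovy and is not how underdog or Smile implement the algorithm; `knnPredict`, the sample rows, and the labels are hypothetical names and values used only for illustration.

```groovy
// Hypothetical helper: classifies `sample` by majority vote among its k closest training rows.
int knnPredict(List<List<Double>> features, List<Integer> labels, List<Double> sample, int k) {
    // Euclidean distance from the sample to every training row, paired with that row's label
    def byDistance = (0..<features.size()).collect { int i ->
        double sum = 0.0d
        features[i].eachWithIndex { Double v, int j ->
            double d = v - sample[j]
            sum += d * d
        }
        [distance: Math.sqrt(sum), label: labels[i]]
    }.sort { it.distance }

    // count the labels of the k closest rows and return the most frequent one
    def votes = byDistance.take(k).countBy { it.label }
    return votes.max { it.value }.key
}

// made-up training rows: three rows labeled 0 and two rows labeled 2
def features = [[1.0d, 1.0d], [1.2d, 0.9d], [0.8d, 1.1d], [5.0d, 5.0d], [5.2d, 4.8d]]
def labels   = [0, 0, 0, 2, 2]

assert knnPredict(features, labels, [1.1d, 1.0d], 3) == 0
assert knnPredict(features, labels, [5.1d, 5.0d], 3) == 2
```

A larger k smooths the decision over more neighbors, while a smaller k follows the closest rows more literally.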
diff --git a/...mages/classifying_food_scatter_matrix.png → ...mages/classifying_food/scatter_matrix.png b/...mages/classifying_food_scatter_matrix.png → ...mages/classifying_food/scatter_matrix.png
diff --git a/.../classifying_food_scatter_matrix_dark.png → .../classifying_food/scatter_matrix_dark.png b/.../classifying_food_scatter_matrix_dark.png → .../classifying_food/scatter_matrix_dark.png
diff --git a/docs/guide/src/test/groovy/underdog/blog/y2024/m12/ClassifyingFoodSpec.groovy b/docs/guide/src/test/groovy/underdog/blog/y2024/m12/ClassifyingFoodSpec.groovy
@@ -46,8 +46,8 @@ class ClassifyingFoodSpec extends Specification {
         def label = df['TRAFFICLIGHT VALUE'] as int[]
 
         def (X_train, X_test, y_train, y_test) = Underdog.ml()
-                .utils
-                .trainTestSplit(feats, label, random_state: 0)
+            .utils
+            .trainTestSplit(feats, label, random_state: 0)
         // --8<-- [end:train_test_split]
 
         and:
@@ -65,11 +65,21 @@ class ClassifyingFoodSpec extends Specification {
         and:
         // --8<-- [start:knn_predictions]
         def ml = Underdog.ml()
+
+        // creates and trains the model
         def knn = ml.classification.knn(X_train, y_train, k: 5)
+
+        // creating predictions with the test feature set
         def predictions = knn.predict(X_test)
-        def score = ml.metrics.r2Score(y_test, predictions)
+
+        // getting the accuracy of the model when tested against the test set
+        def score = ml.metrics.accuracy(y_test, predictions)
         // --8<-- [end:knn_predictions]
 
+        // --8<-- [start:accuracy_check]
+        assert score > 0.80
+        // --8<-- [end:accuracy_check]
+
         and:
         // --8<-- [start:food_samples_predictions]
         // def sample      = [CARBS, SUGAR, PROTEINS, FAT, SALT, FIBER]

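The spec above switches the metric from `r2Score` to `accuracy`. R² is the coefficient of determination, normally used for regression, whereas accuracy is the fraction of labels predicted correctly. A minimal sketch of that fraction, assuming integer labels; the `accuracy` helper below is hypothetical and is not underdog's `ml.metrics.accuracy`:

```groovy
// Hypothetical helper: share of positions where the predicted label equals the expected one.
double accuracy(int[] expected, int[] predicted) {
    assert expected.length == predicted.length
    int hits = (0..<expected.length).count { expected[it] == predicted[it] }
    return hits / (double) expected.length
}

// made-up labels: three of the four predictions match
def expected  = [0, 1, 2, 2] as int[]
def predicted = [0, 1, 2, 0] as int[]

assert accuracy(expected, predicted) == 0.75d
```

With this definition the `score > 0.80` check reads as "more than 80% of the test rows were labeled correctly".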
diff --git a/gradle.properties b/gradle.properties
@@ -8,4 +8,5 @@ tablesaw            = 1.0.1
 spock               = 2.3-groovy-4.0
 ajoberstar          = 3.0.0
 micronautVersion    = 4.6.3
-jgrapht             = 1.5.2
+jgrapht             = 1.5.2
+smile               = 3.1.1
diff --git a/modules/underdog-dataframe/src/main/groovy/underdog/DataFrame.groovy b/modules/underdog-dataframe/src/main/groovy/underdog/DataFrame.groovy
@@ -21,6 +21,10 @@ interface DataFrame extends Columnar {
      */
     DataFrame copy()
 
+    double[][] corrMatrix()
+
+    double[][] corrMatrix(Integer round)
+
     /**
      * Fill NA/NaN values using the specified value passed as parameter
      *

diff --git a/modules/underdog-dataframe/src/main/groovy/underdog/impl/TSDataFrame.groovy b/modules/underdog-dataframe/src/main/groovy/underdog/impl/TSDataFrame.groovy
@@ -81,6 +81,33 @@ class TSDataFrame implements DataFrame {
         return new TSDataFrame(table.copy())
     }
 
+    @Override
+    double[][] corrMatrix() {
+        def finalData = []
+        this.columns.eachWithIndex { String left, int i ->
+            def next = []
+            this.columns.eachWithIndex { String right, int j ->
+                next << this[left].corr(this[right])
+            }
+            finalData << next
+        }
+
+        return finalData as double[][]
+    }
+
+    double[][] corrMatrix(Integer round) {
+        List<List<Double>> finalData = []
+        this.columns.eachWithIndex { String left, int i ->
+            List<Double> next = []
+            this.columns.eachWithIndex { String right, int j ->
+                next << this[left].corr(this[right]).doubleValue().round(round)
+            }
+            finalData << next
+        }
+
+        return finalData as double[][]
+    }
+
     @Override
     DataFrame fillna(Object o) {
         Table copied = table.copy()

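As a reading aid for the `corrMatrix` methods above, here is a minimal sketch of a pairwise Pearson correlation matrix in plain Groovy. It assumes the columns are plain lists without nulls; `pearson`, `corrMatrix` and the sample columns below are hypothetical, and underdog itself delegates each pair to `Series.corr`.

```groovy
// Hypothetical helper: Pearson correlation of two equally sized lists.
double pearson(List<Double> xs, List<Double> ys) {
    int n = xs.size()
    double meanX = xs.sum() / n
    double meanY = ys.sum() / n
    double cov = 0.0d
    double varX = 0.0d
    double varY = 0.0d
    for (int i = 0; i < n; i++) {
        double dx = xs[i] - meanX
        double dy = ys[i] - meanY
        cov += dx * dy
        varX += dx * dx
        varY += dy * dy
    }
    return cov / Math.sqrt(varX * varY)
}

// Hypothetical helper: correlates every column with every other one, rounding each value.
double[][] corrMatrix(Map<String, List<Double>> columns, int decimals) {
    def names = columns.keySet() as List
    def rows = names.collect { left ->
        names.collect { right -> pearson(columns[left], columns[right]).round(decimals) }
    }
    return rows as double[][]
}

// made-up columns: b grows with a, c shrinks as a grows
def matrix = corrMatrix([
    a: [1.0d, 2.0d, 3.0d, 4.0d],
    b: [2.0d, 4.0d, 6.0d, 8.0d],
    c: [4.0d, 3.0d, 2.0d, 1.0d]
], 2)

assert matrix[0][0] == 1.0d
assert matrix[0][1] == 1.0d
assert matrix[0][2] == -1.0d
```

The result is symmetric with ones on the diagonal, and values can be negative.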
diff --git a/modules/underdog-dataframe/src/main/groovy/underdog/impl/TSSeries.groovy b/modules/underdog-dataframe/src/main/groovy/underdog/impl/TSSeries.groovy
@@ -1,5 +1,6 @@
 package underdog.impl
 
+import groovy.util.logging.Slf4j
 import underdog.DataFrame
 import underdog.Series
 import groovy.transform.NamedParam
@@ -26,6 +27,7 @@ import static underdog.Series.TypeCorrelation.KENDALL
 import static underdog.Series.TypeCorrelation.PEARSON
 import static underdog.Series.TypeCorrelation.SPEARMAN
 
+@Slf4j
 class TSSeries implements Series {
     private final Column column
 
@@ -122,10 +124,11 @@ class TSSeries implements Series {
         @NamedParam(required = true) Series other,
         @NamedParam(required = false) TypeCorrelation method = PEARSON,
         @NamedParam(required = false) Integer observations = 0) {
+        log.debug("correlation between ${this.name} - ${other.name}")
 
         def (alignedX, alignedY) = [this as Double[], other as Double[]]
             .transpose()
-            .<List<Double>>findAll(Object::every)
+            .<List<Double>>findAll { items -> items.every { it != null} }
             .inject([[], []]) { agg,  next ->
                 agg[0] << next[0]
                 agg[1] << next[1]

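One plausible reason for replacing `findAll(Object::every)` with an explicit null check, offered here as an assumption rather than the author's stated motive: `every()` without a closure applies Groovy truth, under which `0.0` is falsy just like `null`. The made-up pairs below show the difference.

```groovy
// made-up aligned pairs: one contains a legitimate 0.0, one contains a missing value
def pairs = [[1.0d, 2.0d], [0.0d, 3.0d], [null, 4.0d]]

// Groovy truth: 0.0 and null are both falsy, so the pair containing 0.0 is dropped as well
def byTruth = pairs.findAll { it.every() }

// explicit null check: only the pair containing null is dropped
def byNullCheck = pairs.findAll { items -> items.every { it != null } }

assert byTruth == [[1.0d, 2.0d]]
assert byNullCheck == [[1.0d, 2.0d], [0.0d, 3.0d]]
```

For a correlation, silently discarding observations that happen to be zero would skew the result, so the null check is the safer filter.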
diff --git a/modules/underdog-ml/underdog-ml.gradle b/modules/underdog-ml/underdog-ml.gradle
@@ -10,8 +10,8 @@ repositories {
 dependencies {
     api project(":underdog-dataframe")
 
-    api "com.github.haifengl:smile-core:3.1.1"
-    api "com.github.haifengl:smile-nlp:3.1.1"
+    api "com.github.haifengl:smile-core:$smile"
+    api "com.github.haifengl:smile-nlp:$smile"
 
     implementation "org.apache.groovy:groovy-macro:$groovy"
     testImplementation project(":underdog-dataframe")

diff --git a/modules/underdog-plots-domain/src/main/groovy/underdog/plots/Options.groovy b/modules/underdog-plots-domain/src/main/groovy/underdog/plots/Options.groovy
@@ -10,6 +10,7 @@ import underdog.plots.dsl.Radar
 import underdog.plots.dsl.Series
 import underdog.plots.dsl.Title
 import underdog.plots.dsl.Tooltip
+import underdog.plots.dsl.VisualMap
 import underdog.plots.dsl.XAxis
 import underdog.plots.dsl.YAxis
 
@@ -18,6 +19,7 @@ class Options {
     Title title
     Tooltip tooltip
     AxisPointer axisPointer
+    VisualMap visualMap
 
     @RepeatableField Legend legend
     @RepeatableField Grid grid

diff --git a/modules/underdog-plots-domain/src/main/groovy/underdog/plots/dsl/VisualMap.groovy b/modules/underdog-plots-domain/src/main/groovy/underdog/plots/dsl/VisualMap.groovy
@@ -0,0 +1,14 @@
+package underdog.plots.dsl
+
+import underdog.plots.ast.Node
+
+@Node
+class VisualMap {
+    Number min
+    Number max
+    Boolean calculable
+    String orient
+    String left
+    String bottom
+    Boolean show
+}
diff --git a/modules/underdog-plots-domain/src/main/groovy/underdog/plots/dsl/XAxis.groovy b/modules/underdog-plots-domain/src/main/groovy/underdog/plots/dsl/XAxis.groovy
@@ -8,8 +8,10 @@ class XAxis {
     String name
     String type
     String nameLocation
+    String position
     Number nameGap
     Number gridIndex
+    Boolean inverse
     Number splitNumber
     List boundaryGap
     SplitLine splitLine

diff --git a/modules/underdog-plots-domain/src/main/groovy/underdog/plots/dsl/YAxis.groovy b/modules/underdog-plots-domain/src/main/groovy/underdog/plots/dsl/YAxis.groovy
@@ -8,10 +8,12 @@ class YAxis {
     String type
     String name
     String nameLocation
+    Boolean inverse
     Number nameGap
     Number gridIndex
     Number splitNumber
     List boundaryGap
+    List data
     SplitLine splitLine
     NameTextStyle nameTextStyle
     Number min

diff --git a/modules/underdog-plots-domain/src/main/groovy/underdog/plots/dsl/series/HeatmapSeries.groovy b/modules/underdog-plots-domain/src/main/groovy/underdog/plots/dsl/series/HeatmapSeries.groovy
@@ -0,0 +1,9 @@
+package underdog.plots.dsl.series
+
+import underdog.plots.ast.Node
+import underdog.plots.dsl.Series
+
+@Node
+class HeatmapSeries extends Series {
+    String type = 'heatmap'
+}
diff --git a/modules/underdog-plots/src/main/groovy/underdog/plots/Plots.groovy b/modules/underdog-plots/src/main/groovy/underdog/plots/Plots.groovy
@@ -5,6 +5,7 @@ import groovy.transform.NamedVariant
 import underdog.plots.Render.Meta
 import underdog.plots.charts.Bar
 import underdog.plots.charts.Graph
+import underdog.plots.charts.CorrelationMatrix
 import underdog.plots.charts.Histogram
 import underdog.plots.charts.Line
 import underdog.plots.charts.Pie
@@ -13,30 +14,16 @@ import underdog.plots.charts.Scatter
 import underdog.plots.charts.ScatterMatrix
 
 class Plots {
-    @Delegate Line line = new Line()
-    @Delegate Scatter scatter = new Scatter()
+    @Delegate Line lineDelegate = new Line()
+    @Delegate Scatter scatterDelegate = new Scatter()
     @Delegate Histogram histogram = new Histogram()
     @Delegate Graph graphDelegate = new Graph()
     @Delegate Bar barDelegate = new Bar()
     @Delegate Pie pieDelegate = new Pie()
     @Delegate Radar radarDelegate = new Radar()
     @Delegate ScatterMatrix scatterMatrixDelegate = new ScatterMatrix()
+    @Delegate CorrelationMatrix heatmapDelegate = new CorrelationMatrix()
 
-    @NamedVariant
-    Options plot(
-        List<Number> x,
-        List<Number> y,
-        @NamedParam(required = false, value='title') String chartTitle = "",
-        @NamedParam(required = false) boolean smooth = false) {
-        return line.line(x, y, title: chartTitle, smooth: smooth)
-    }
-
-    @NamedVariant
-    Options plot(
-        Map<String, List<Number>> data,
-        @NamedParam(required = false, value='title') String chartTitle = "") {
-        return line.lines(data, title: chartTitle)
-    }
 
     @NamedVariant
     static String show(

diff --git a/modules/underdog-plots/src/main/groovy/underdog/plots/charts/CorrelationMatrix.groovy b/modules/underdog-plots/src/main/groovy/underdog/plots/charts/CorrelationMatrix.groovy
@@ -0,0 +1,61 @@
+package underdog.plots.charts
+
+import groovy.transform.InheritConstructors
+import underdog.plots.Options
+import underdog.plots.dsl.series.HeatmapSeries
+
+/**
+ * @since 0.1.0
+ */
+@InheritConstructors
+class CorrelationMatrix extends Chart {
+
+    /**
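+     * Renders a correlation matrix as a heatmap
+     * @param labels names shown on both axes
+     * @param heatmapData correlation values indexed as [row][column]
+     * @return the chart options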
+     * @since 0.1.0
+     */
+    Options correlationMatrix(
+        List labels,
+        double[][] heatmapData
+    ) {
+        def finalData = []
+        labels.eachWithIndex { Object left, int i ->
+            labels.eachWithIndex { Object right, int j ->
+                finalData << [i, j, heatmapData[i][j]]
+            }
+        }
+
+        return create {
+            grid {
+                bottom('100')
+            }
+            xAxis {
+                show(true)
+                type('category')
+                data(labels)
+                axisLabel {
+                    rotate(90)
+                }
+            }
+            yAxis {
+                show(true)
+                inverse(true)
+                data(labels)
+            }
+            visualMap {
+                min(0)
+                max(1)
+                show(false)
+            }
+            series(HeatmapSeries) {
+                data(finalData)
+                label {
+                    show(true)
+                }
+            }
+        }
+    }
+}
diff --git a/modules/underdog-plots/src/main/groovy/underdog/plots/charts/Histogram.groovy b/modules/underdog-plots/src/main/groovy/underdog/plots/charts/Histogram.groovy
@@ -88,9 +88,9 @@ class Histogram extends Chart {
     }
 
     static List<List<Number>> createHistogramDataFrom(List<Number> xs, Integer bins = 20) {
-        def min = Math.floor(xs.min().toDouble()).toInteger()
-        def max = Math.ceil(xs.max().toDouble()).toInteger()
-        def binSize = Math.ceil((max - min) / bins).toInteger()
+        def min = xs.min().toDouble()
+        def max = xs.max().toDouble()
+        def binSize = (max - min) / bins
 
         def x = (0..bins).inject([min]) { agg, next ->
             agg << (agg[-1] + binSize)