Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kNN causes an error when the input has nominal data #1565

Closed
e10e3 opened this issue Jul 5, 2024 · 3 comments
Closed

kNN causes an error when the input has nominal data #1565

e10e3 opened this issue Jul 5, 2024 · 3 comments

Comments

@e10e3
Copy link
Contributor

e10e3 commented Jul 5, 2024

Versions

River version: 0.21.1
Python version: 3.12.4
Operating system: macOS 14.5

Describe the bug

When a kNN model does a prediction, it computes the distance of its input with previous data points.
If some of the feature are nominal (i.e. not numbers), kNN will cause an error because it tries to perform a subtraction on unsupported data types.

One would expect kNN to be resilient to nominal data, since there exist ways to derive a distance between non-numeric features. A simple one is to give a distance of 0 when they are equal and 1 if they are different.

Code to reproduce

from river import neighbors

m = neighbors.KNNClassifier()

m.learn_one({"a": "river"}, 0)
m.predict_one({"a": "river"})

Output

Traceback (most recent call last):
  File "/path/to/river/knn-strings.py", line 6, in <module>
    m.predict_one({"a": "river"})
  File "/path/to/river/river/base/classifier.py", line 66, in predict_one
    y_pred = self.predict_proba_one(x, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/river/river/neighbors/knn_classifier.py", line 149, in predict_proba_one
    nearest = self._nn.search((x, None), n_neighbors=self.n_neighbors, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/river/river/neighbors/ann/swinn.py", line 431, in search
    return self._linear_scan(item, n_neighbors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/river/river/neighbors/ann/swinn.py", line 339, in _linear_scan
    points = [(p.item, self.dist_func(item, p.item)) for p in self]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/river/river/neighbors/base.py", line 33, in __call__
    return self.distance_function(a[0], b[0])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/river/river/utils/math.py", line 165, in minkowski_distance
    return sum((abs(a.get(k, 0.0) - b.get(k, 0.0))) ** p for k in {*a.keys(), *b.keys()}) ** (1 / p)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/river/river/utils/math.py", line 165, in <genexpr>
    return sum((abs(a.get(k, 0.0) - b.get(k, 0.0))) ** p for k in {*a.keys(), *b.keys()}) ** (1 / p)
                    ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'str' and 'str'
@smastelini
Copy link
Member

smastelini commented Jul 5, 2024

Hi @e10e3, I see your point with the error. However, following the river convention, this type of situation should be dealt with on the application side. So, it is not a bug, but the intended behavior.

I am also against making kNN slow in cases where the data is numeric only.

Note that linear models, decisions trees and other models are also prone to fail if the user passes numeric and non-numeric data directly.

As possible solutions, the user can use a pipeline with One-Hot Encoding or supplying a custom distance metric as you propose in the related PR.

@e10e3
Copy link
Contributor Author

e10e3 commented Jul 5, 2024

I agree with your point.

I guess I won't be the last (nor the first) to propose such change, is this convention explained in a document for future reference?

@e10e3 e10e3 closed this as not planned Won't fix, can't repro, duplicate, stale Jul 5, 2024
@smastelini
Copy link
Member

Hi @e10e3, that is a good point. We have an entry on the River FAQ about input validation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants