kNN causes an error when the input has nominal data #1565
Comments
Hi @e10e3, I see your point with the error. However, following the river convention, this type of situation should be dealt with on the application side. So it is not a bug, but the intended behavior. I am also against making kNN slow in cases where the data is numeric only. Note that linear models, decision trees, and other models are also prone to fail if the user passes numeric and non-numeric data directly. As possible solutions, the user can use a pipeline with one-hot encoding, or supply a custom distance metric as you propose in the related PR.
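To illustrate the first suggestion without depending on river itself (river does ship a `preprocessing.OneHotEncoder` that can be piped into a model, but the sketch below is a standalone, simplified stand-in with hypothetical helper names), one-hot encoding replaces each nominal feature with a binary indicator feature, after which a subtraction-based distance is well defined:

```python
import math

def one_hot(x, nominal_keys):
    """Replace each nominal feature with a binary indicator feature.

    `nominal_keys` lists which features are nominal; all other
    features are assumed to already be numeric.
    """
    out = {}
    for key, value in x.items():
        if key in nominal_keys:
            out[f"{key}_{value}"] = 1.0  # indicator for the observed category
        else:
            out[key] = value
    return out

def euclidean(a, b):
    """Euclidean distance over the union of features; a feature missing
    from one dict contributes as if its value were 0."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

x1 = {"length": 1.0, "color": "red"}
x2 = {"length": 2.0, "color": "blue"}

# On the raw dicts, euclidean(x1, x2) would hit `"blue" - "red"` and raise
# TypeError; after encoding, "color" becomes two indicator features instead:
d = euclidean(one_hot(x1, {"color"}), one_hot(x2, {"color"}))
# d == sqrt((1-2)^2 + 1^2 + 1^2) == sqrt(3)
```

In river the same effect would be obtained by composing the encoder and the model in a pipeline, so the kNN only ever sees numeric features.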
I agree with your point. I guess I won't be the last (nor the first) to propose such a change. Is this convention explained in a document, for future reference?
Versions
River version: 0.21.1
Python version: 3.12.4
Operating system: macOS 14.5
Describe the bug
When a kNN model makes a prediction, it computes the distance between its input and previously seen data points.
If some of the features are nominal (i.e. not numbers), kNN raises an error because it tries to perform a subtraction on unsupported data types.
One would expect kNN to be resilient to nominal data, since there exist ways to derive a distance between non-numeric features. A simple one is to give a distance of 0 when two values are equal and 1 when they differ.
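The 0/1 rule above can be folded into a Euclidean-style metric. The following is a minimal sketch (a hypothetical `mixed_distance` helper, not river's actual API or the metric proposed in the PR) of what such a mixed-type distance could look like:

```python
import math
import numbers

def mixed_distance(a, b):
    """Euclidean-style distance over feature dicts that tolerates nominal
    features: numeric pairs contribute their squared difference, while
    non-numeric pairs contribute 0 if equal and 1 if different."""
    total = 0.0
    for key in set(a) | set(b):
        u, v = a.get(key), b.get(key)
        if isinstance(u, numbers.Number) and isinstance(v, numbers.Number):
            total += (u - v) ** 2
        else:
            total += 0.0 if u == v else 1.0
    return math.sqrt(total)

# Same color: only the numeric feature contributes.
same = mixed_distance({"length": 1.0, "color": "red"},
                      {"length": 2.0, "color": "red"})   # sqrt(1) == 1.0
# Different colors: the nominal mismatch adds 1 to the sum of squares.
diff = mixed_distance({"length": 1.0, "color": "red"},
                      {"length": 2.0, "color": "blue"})  # sqrt(2)
```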
Code to reproduce
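(The original reproduction snippet is not preserved here. As a river-free sketch, the failure reduces to the squared-difference step of a Euclidean distance being applied to string-valued features:)

```python
# Two feature dicts as river passes them around; "color" is nominal.
x_old = {"length": 1.0, "color": "red"}
x_new = {"length": 2.0, "color": "blue"}

try:
    # The distance computation eventually subtracts feature values pairwise;
    # with a string feature this subtraction is undefined.
    sq = sum((x_new[k] - x_old[k]) ** 2 for k in x_new)
except TypeError as exc:
    print(exc)  # unsupported operand type(s) for -: 'str' and 'str'
```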
Output