WIP: Initial commit of DBScan #12
Also, the DBSCAN implementation in sklearn isn't one that you can fit and then predict with; fitting is part of the prediction stage, so I've replicated that. Incremental DBSCAN is an extension of the original algorithm.
Codecov Report
@@            Coverage Diff             @@
##           master      #12     +/-   ##
==========================================
- Coverage   96.68%   94.46%   -2.23%
==========================================
  Files           7       10       +3
  Lines         181      271      +90
==========================================
+ Hits          175      256      +81
- Misses          6       15       +9
Nice! I'll have a look straight away 😁 There are other interesting implementations in Python-land that we should keep in mind - in particular, HDBSCAN and pyclustering.
* Remove Sync trait bounds
* Use ndarray_stats l2_norm
* Actually use observation in search queue to get neighbours
* Add two tests for noise points and nested dense clusters
It would be interesting to add a benchmark for DBSCAN as well - I believe we can do some optimisations in a couple of places, but I'd avoid starting on them before we can measure whether the gain is real 👍
So I realised when writing a quick benchmark that predict was a bad function name as a free function because of the public reexports, so I've currently renamed it to |
Yeah, we can work on the naming - I would probably suggest wrapping it in a struct anyway, for ease of saving/loading as well as future extensibility. But we can figure this out once we have nailed down the algorithm implementation (I think we are close 😁)
Right, benchmark added, which should look very familiar.
Change to return reference to the neighbour data as well to avoid lookups
I think we are there from an algorithmic point of view 😀 Next steps to get this ready to be merged:
Then we are good to go 🚀
I found a mistake in my implementation of the algorithm! Only minor, but the number of neighbours needs to be taken into account for each element in the search queue. I've added that and an example. Currently writing the doc comments, then I'll push something.
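For anyone following along, here is a minimal sketch of the point made above: a queued element only contributes its own neighbours to the search queue if it has at least the minimum number of neighbours itself. This is not the code from this PR. It assumes brute-force neighbour search, Euclidean distance, and points stored as `Vec<f64>`; the names `dbscan_labels`, `neighbours`, `tolerance` and `min_points` are hypothetical. Like the sklearn behaviour mentioned earlier, it labels the data it is given rather than fitting first and predicting later.

```rust
use std::collections::VecDeque;

/// Euclidean distance between two points of equal dimension.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f64>()
        .sqrt()
}

/// Indices of all points within `tolerance` of `points[idx]`, itself included.
/// Brute force (O(n) per query); an assumption made for brevity.
fn neighbours(points: &[Vec<f64>], idx: usize, tolerance: f64) -> Vec<usize> {
    (0..points.len())
        .filter(|&j| euclidean(&points[idx], &points[j]) <= tolerance)
        .collect()
}

/// Hypothetical free function: one label per point,
/// `Some(cluster_id)` for clustered points and `None` for noise.
fn dbscan_labels(
    points: &[Vec<f64>],
    tolerance: f64,
    min_points: usize,
) -> Vec<Option<usize>> {
    let mut labels = vec![None; points.len()];
    let mut visited = vec![false; points.len()];
    let mut cluster_id = 0;

    for i in 0..points.len() {
        if visited[i] {
            continue;
        }
        visited[i] = true;

        let seed = neighbours(points, i, tolerance);
        if seed.len() < min_points {
            // Not a core point: stays noise unless a later cluster claims it.
            continue;
        }

        labels[i] = Some(cluster_id);
        let mut queue: VecDeque<usize> = seed.into_iter().collect();

        while let Some(j) = queue.pop_front() {
            // Any unlabelled point reached from a core point joins the cluster.
            if labels[j].is_none() {
                labels[j] = Some(cluster_id);
            }
            if visited[j] {
                continue;
            }
            visited[j] = true;

            // The neighbour count is checked for every queued element:
            // only core points are allowed to extend the search queue.
            let reachable = neighbours(points, j, tolerance);
            if reachable.len() >= min_points {
                queue.extend(reachable);
            }
        }
        cluster_id += 1;
    }
    labels
}
```

With the one-dimensional points 0.0, 0.5, 1.0, 2.0, 3.0, a tolerance of 1.0 and a minimum of 4 points, only 1.0 is a core point. The point 2.0 joins its cluster as a border point, but because it has just three neighbours it does not pull 3.0 in, so 3.0 stays noise. Without the per-element check, 3.0 would be absorbed into the cluster.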
Rename DBScan to Dbscan because the whole thing is an acronym
Merged - thanks for all your work here @xd009642! 🙏
Looked to the k-means implementation to keep the design consistent. Comments and tests still need to be filled in. Also, distance is currently just Euclidean distance, but it's probably a good idea for there to be some sort of distance trait or enum so people can pass in whichever distance function they want for this or other algorithms.
I'll carry on filling in the rest; I just figured early feedback was better than late 😄
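To make the distance trait idea concrete, here is one possible shape for it. This is a sketch only: the `Distance` trait, the `Euclidean` and `Manhattan` types, and the `count_within` helper are all hypothetical names invented for illustration, and points are assumed to be plain `f64` slices rather than whatever array type the crate ends up using.

```rust
/// Hypothetical trait: anything that can measure the distance between
/// two points of equal dimension.
trait Distance {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64;
}

/// Straight-line (L2) distance.
struct Euclidean;

/// Sum of absolute coordinate differences (L1).
struct Manhattan;

impl Distance for Euclidean {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64 {
        a.iter()
            .zip(b.iter())
            .map(|(x, y)| (x - y).powi(2))
            .sum::<f64>()
            .sqrt()
    }
}

impl Distance for Manhattan {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64 {
        a.iter().zip(b.iter()).map(|(x, y)| (x - y).abs()).sum()
    }
}

/// Counts the points lying within `tolerance` of `query` under the given
/// metric. A neighbourhood query like this is the only place the
/// clustering code would need to touch the metric.
fn count_within<D: Distance>(
    metric: &D,
    points: &[Vec<f64>],
    query: &[f64],
    tolerance: f64,
) -> usize {
    points
        .iter()
        .filter(|p| metric.distance(query, p.as_slice()) <= tolerance)
        .count()
}
```

A generic parameter keeps the call statically dispatched, which matters when the metric sits in the inner loop. An enum would be the simpler choice if the set of metrics is meant to be closed, while a trait lets users plug in their own.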