
Strange classification when using rf_to_strings trees in EE RF? #528

Closed
matthiasdemuzere opened this issue Jun 17, 2021 · 9 comments · Fixed by #545

@matthiasdemuzere

Environment Information

  • geemap version: 0.8.16
  • Python version: 3.7.10
  • Operating System: Google Colab

Description

I am very enthusiastic about this new ml package and the ability to train a random forest locally and upload the trees into EE.

I gave it a spin with a reduced set of my own data (a 17-class classification problem). Yet for some reason, the predicted classes come out all wrong, with the classified image even showing class numbers that are not in the training data.

I am not sure where the problem lies. Since I only use a reduced dataset with a limited number of trees, I don't think my decision tree strings are too large (the limitation indicated by @KMarkert), so I get the feeling something goes wrong when converting the trees to strings with rf_to_strings?

What I Did

I followed the tutorial notebook, and just plugged in my data where needed:

  • My notebook is available here
  • Sample data is here: sample_data.csv
  • I source the EO input from a pre-processed asset. This should be accessible to read.
No errors, the code works. Yet the classification result is not what it should be, in several ways:
  • the classified map contains 0's and 7's, which are not present in the training data.
  • water is classified as 12; it should be 17.
  • for a better assessment, see the classified image below, produced completely in EE with the same inputs and settings (well, the RF settings in scikit-learn and EE of course differ).

[classified image: reference classification produced entirely in EE]
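For reference, the workflow I followed from the tutorial amounts to roughly the following. This is only a minimal sketch: the geemap.ml call signatures follow the tutorial notebook and may differ between versions, and the column names and asset id are placeholders for my actual data.

```python
import ee
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from geemap import ml

ee.Initialize()

# train a random forest locally on the sample CSV
# ("label" and the feature columns are placeholders for the actual column names)
df = pd.read_csv("sample_data.csv")
feature_names = [c for c in df.columns if c != "label"]
rf = RandomForestClassifier(n_estimators=10).fit(df[feature_names], df["label"])

# convert the sklearn trees to EE-compatible decision tree strings
trees = ml.rf_to_strings(rf, feature_names)

# build an EE classifier from the strings and classify the pre-processed asset
classifier = ml.strings_to_classifier(trees)
image = ee.Image("users/<user>/preprocessed_asset")  # placeholder asset id
classified = image.select(feature_names).classify(classifier)
```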

@matthiasdemuzere matthiasdemuzere added the bug Something isn't working label Jun 17, 2021
@KMarkert KMarkert self-assigned this Jun 24, 2021
@KMarkert
Member

Hi @matthiasdemuzere 👋

Thanks for the feedback! Glad you are trying this capability out and sorry it is not giving the expected results. It seems like some labels are being dropped internally when creating the label values for the trees.

As you noted in your notebook, the class values are 1, 2, 3, 4, 5, 6, 8, 9, 11, 12, 14, 15, 17, which is only 13 classes. rf_to_strings tries to guesstimate the class labels from the output by reducing the argmax index for each leaf (see here). This yields labels 0 through n-1 for the n classes present, which obviously doesn't provide the correct results when the original labels have gaps.

We can get the class labels directly from the rf classifier (i.e. rf.classes_) and use them as a lookup for the value indices when creating the trees. That should solve the problem.
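To illustrate with a toy sketch (made-up data, not the actual geemap code): the argmax of a leaf's value vector is an index into rf.classes_, so writing out the index itself only gives the right answer when the labels happen to be 0..n-1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy data with gappy labels, like the 13 classes drawn from 1..17 above
X = np.random.rand(200, 6)
y = np.random.choice([1, 2, 3, 4, 5, 6, 8, 9, 11, 12, 14, 15, 17], size=200)
rf = RandomForestClassifier(n_estimators=10).fit(X, y)

tree = rf.estimators_[0].tree_
leaf_id = np.where(tree.children_left == -1)[0][0]  # first leaf node
scores = tree.value[leaf_id][0]                     # per-class scores at that leaf

idx = int(np.argmax(scores))   # what was written out before: an index 0..12
label = rf.classes_[idx]       # what the fix writes out: the actual class label
print(idx, label)
```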

I am working to update based on your example and should have it completed soon.

@matthiasdemuzere
Author

Hey hey @KMarkert,

Thanks for following up. This classification problem indeed has 17 classes, yet not all labels are always present. For another region of interest, the distribution could be very different.

So it would be nice indeed to maintain the actual class labels; doing this via a look-up table is probably not a bad idea.

Thanks for looking into this, I can always test this out once done.

@KMarkert
Member

This seems to be fixed with 200b39c. Here is the notebook that I tried testing on: https://colab.research.google.com/drive/1WkqJ9mSY9Al4sRmJLdi2Ab4puafIe6BQ?usp=sharing

@matthiasdemuzere if you would be so kind as to test on your end and confirm that this bugfix is providing the correct labels. Once confirmed, we can close this issue and submit a PR to merge the bugfix into master.
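One way to double-check on your end (a sketch; `rf`, `classified`, and `region` stand in for the trained classifier, the classified image, and an area of interest from the notebook) is to compare the labels the random forest knows about with the values that actually show up in the EE output:

```python
import ee

# class labels the local random forest was trained on
print(sorted(rf.classes_))

# frequency histogram of the classified image over the area of interest
hist = classified.reduceRegion(
    reducer=ee.Reducer.frequencyHistogram(),
    geometry=region,
    scale=100,
    maxPixels=1e9,
).getInfo()
print(hist)  # the histogram keys should be a subset of rf.classes_ after the fix
```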

@matthiasdemuzere
Author

Hey @KMarkert,

I have tried the notebook, and there it seems to work as expected now.

Yet I also tried another example, adding more input features (33 instead of 6), a larger training sample, and more RF trees (I tried both 10 and 30, the latter being the default I normally use).

At least when using 30 trees, the class numbers seem very off again in the classified image.
Not sure whether this is due to the way the class values are treated, or to the size of the trees?

Should I update the notebook with the latter case, so that you can have another look?

@KMarkert
Member

Yes, please share the notebook that is not producing the expected results. I can look into the issue with scaling to more trees/features.

The bug fix is pulling the labels directly from the sklearn classifier object so I am really curious as to why (or how) the output labels are not correct...

@matthiasdemuzere
Author

Ok, so I prepared a notebook using my extended data (more input features, more training labels, more trees). See here.

Interestingly, this result looks fine, yet different from the result obtained when running with my local Python install?
Config of my local python install:
geemap: 0.8.17
python: 3.8.10

As this is a conda environment, I thought there was maybe a conflict when pip installing the geemap fix. So I reinstalled it, yet I still get wrong labels in the classified image (e.g. water is 12 instead of 17).

I checked whether the Python version influences the result by making a conda Python 3.7 install, but with this too I get a wrong result when executing locally ...

So I finally checked my local ml.py routine (in py37), and this file looks different from your bug fix reported here.
Not sure why that is? As far as I can see, the geemap versions are the same, and I pip installed from the branch as well:
pip install scikit-learn git+https://github.com/giswqs/geemap/@bugfix/rf_to_string-labels

So I just copied your ml.py routine in my install and used that. That seems to work?

So long story short: your bug fix seems to work (great!), yet I seem unable to install this bug fix properly via pip?

@KMarkert
Member

Glad to hear the bug fix was able to produce the expected results! It seems like the challenge now is installing the package locally from a specific git branch. Without having a look at your environment, my first guess is that there is a locally cached geemap v0.8.17 that pip is installing instead of downloading the specific branch you are pointing to. This sometimes happens when there isn't a version change for new code (see here), but this is just speculation and could be any number of things 🤷‍♂️.

I imagine once we merge the bug fix to the master branch and release a new version, this install issue will be resolved by updating the package locally. So, I will submit the PR to get the bug fix merged into the master branch. If this bug persists after a new release and an update on your end, then please feel free to reopen.
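In the meantime, if the cached-version guess is right, forcing pip to skip its cache and reinstall from the branch is worth a try (these are generic pip flags, nothing geemap-specific; use the same branch URL as in your earlier comment):

```
pip uninstall geemap
pip install --no-cache-dir --force-reinstall "git+https://github.com/giswqs/geemap@bugfix/rf_to_string-labels"
```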

Thanks for finding this bug and helping work through it!

@matthiasdemuzere
Author

Great, thanks for following up on this.

Once it is in the official repo, I'll continue testing this, as I still have to explore how it scales (cf. the max. length of the tree strings) with even more training data and input features ...
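As a rough gauge when I do (assuming, as in the tutorial, that ml.rf_to_strings returns one string per tree), something like this should show how large the tree strings get:

```python
trees = ml.rf_to_strings(rf, feature_names)
sizes = [len(t) for t in trees]
print(len(trees), "trees;", max(sizes), "chars in the largest tree;", sum(sizes), "chars in total")
```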

@giswqs
Member

giswqs commented Jun 30, 2021

@matthiasdemuzere Run geemap.update_package() once to install the package development version.
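For context, a minimal usage sketch:

```python
import geemap

# reinstalls geemap from the latest development version on GitHub
geemap.update_package()
# restart the kernel/runtime afterwards so the updated package is picked up
```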
