
Strange classification when using rf_to_strings trees in EE RF? #528

Closed
matthiasdemuzere opened this issue Jun 17, 2021 · 9 comments · Fixed by #545

@matthiasdemuzere

Environment Information

  • geemap version: 0.8.16
  • Python version: 3.7.10
  • Operating System: Google Colab

Description

I am very enthusiastic about this new ml package and the ability to train a random forest locally and upload the trees into EE.

I gave it a spin with a reduced set of my own data (a 17-class classification problem). Yet for some reason, the predicted classes come out all wrong, with the classified image even showing class numbers that are not in the training data.

I am not sure where the problem lies. Since I only use a reduced dataset with a limited number of trees, I don't think my decision tree strings are too large (the limitation indicated by @KMarkert), so I get the feeling something goes wrong when converting the trees to strings with rf_to_strings?

What I Did

I followed the tutorial notebook, and just plugged in my data where needed:

  • My notebook is available here
  • Sample data is here: sample_data.csv
  • I source the EO input from a pre-processed asset. This should be accessible to read.
No errors, the code works. Yet the classification result is not what it should be, in several ways:
  • the classified map contains 0's and 7's, which are not present in the training data.
  • water is classified as 12; it should be 17.
  • for a better assessment, see the classified image below, produced completely in EE with the same inputs and settings (well, the RF settings in scikit-learn and EE of course differ).

[classified image: reference classification produced entirely in EE]
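For reference, the workflow I followed from the tutorial amounts to roughly the following. This is only a minimal sketch: the geemap.ml call signatures follow the tutorial notebook and may differ between versions, and the column names and asset id are placeholders for my actual data.

```python
import ee
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from geemap import ml

ee.Initialize()

# train a random forest locally on the sample CSV
# ("label" and the feature columns are placeholders for the actual column names)
df = pd.read_csv("sample_data.csv")
feature_names = [c for c in df.columns if c != "label"]
rf = RandomForestClassifier(n_estimators=10).fit(df[feature_names], df["label"])

# convert the sklearn trees to EE-compatible decision tree strings
trees = ml.rf_to_strings(rf, feature_names)

# build an EE classifier from the strings and classify the pre-processed asset
classifier = ml.strings_to_classifier(trees)
image = ee.Image("users/<user>/preprocessed_asset")  # placeholder asset id
classified = image.select(feature_names).classify(classifier)
```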

@matthiasdemuzere matthiasdemuzere added the bug Something isn't working label Jun 17, 2021
@KMarkert KMarkert self-assigned this Jun 24, 2021
@KMarkert
Member

Hi @matthiasdemuzere 👋

Thanks for the feedback! Glad you are trying this capability out and sorry it is not giving the expected results. It seems like some labels are being dropped internally when creating the label values for the trees.

As you noted in your notebook, the class values are 1, 2, 3, 4, 5, 6, 8, 9, 11, 12, 14, 15, 17, which is only 13 classes. rf_to_strings tries to guesstimate the class labels from the output by reducing the argmax index for each leaf (see here). This yields labels 0 through n-1 for the n classes present, which obviously doesn't provide the correct results when the original labels have gaps.

We can get the class labels directly from the rf classifier (i.e. rf.classes_) and use them as a lookup for the value indices when creating the trees. That should solve the problem.
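To illustrate with a toy sketch (made-up data, not the actual geemap code): the argmax of a leaf's value vector is an index into rf.classes_, so writing out the index itself only gives the right answer when the labels happen to be 0..n-1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy data with gappy labels, like the 13 classes drawn from 1..17 above
X = np.random.rand(200, 6)
y = np.random.choice([1, 2, 3, 4, 5, 6, 8, 9, 11, 12, 14, 15, 17], size=200)
rf = RandomForestClassifier(n_estimators=10).fit(X, y)

tree = rf.estimators_[0].tree_
leaf_id = np.where(tree.children_left == -1)[0][0]  # first leaf node
scores = tree.value[leaf_id][0]                     # per-class scores at that leaf

idx = int(np.argmax(scores))   # what was written out before: an index 0..12
label = rf.classes_[idx]       # what the fix writes out: the actual class label
print(idx, label)
```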

I am working to update based on your example and should have it completed soon.

@matthiasdemuzere
Author

Hey hey @KMarkert,

Thanks for following up. This classification problem indeed has 17 classes, yet not all labels are always present. For another region of interest, the distribution could be very different.

So it would be nice indeed to maintain the actual class labels; doing this via a look-up table is probably not a bad idea.

Thanks for looking into this, I can always test this out once done.

@KMarkert
Member

This seems to be fixed with 200b39c. Here is the notebook that I tried testing on: https://colab.research.google.com/drive/1WkqJ9mSY9Al4sRmJLdi2Ab4puafIe6BQ?usp=sharing

@matthiasdemuzere if you would be so kind as to test on your end and confirm that this bugfix is providing the correct labels. Once confirmed, we can close this issue and submit a PR to merge the bugfix into master.
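One way to double-check on your end (a sketch; `rf`, `classified`, and `region` stand in for the trained classifier, the classified image, and an area of interest from the notebook) is to compare the labels the random forest knows about with the values that actually show up in the EE output:

```python
import ee

# class labels the local random forest was trained on
print(sorted(rf.classes_))

# frequency histogram of the classified image over the area of interest
hist = classified.reduceRegion(
    reducer=ee.Reducer.frequencyHistogram(),
    geometry=region,
    scale=100,
    maxPixels=1e9,
).getInfo()
print(hist)  # the histogram keys should be a subset of rf.classes_ after the fix
```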

@matthiasdemuzere
Author

Hey @KMarkert,

I have tried the notebook, and there it seems to work as expected now.

Yet I also tried another example, adding more input features (33 instead of 6), a larger training sample, and more RF trees (I tried both 10 and 30, the latter being the default I normally use).

At least when using 30 trees, the class numbers seem very off again in the classified image.
Not sure whether this is due to the way the class values are treated, or to the size of the trees?

Should I update the notebook with the latter case, so that you can have another look?

@KMarkert
Member

Yes, please share the notebook that is not producing the expected results. I can look into the issue with scaling to more trees/features.

The bug fix is pulling the labels directly from the sklearn classifier object so I am really curious as to why (or how) the output labels are not correct...

@matthiasdemuzere
Author

Ok, so I prepared a notebook using my extended data (more input features, more training labels, more trees). See here.

Interestingly, this result looks fine, yet different from the result obtained when running with my local Python install?
Config of my local python install:
geemap: 0.8.17
python: 3.8.10

As this is a conda environment, I thought there was maybe a conflict when pip installing the geemap fix. So I reinstalled it, yet I still get wrong labels in the classified image (e.g. water is 12 instead of 17).

I checked whether the Python version influences the result by making a conda Python 3.7 install, but with this too I get a wrong result when executing locally ...

So I finally checked my local ml.py routine (in py37), and this file looks different from your bug fix reported here.
Not sure why that is? As far as I can see, the geemap versions are the same, and I pip installed from the branch as well:
pip install scikit-learn git+https://github.com/giswqs/geemap/@bugfix/rf_to_string-labels

So I just copied your ml.py routine in my install and used that. That seems to work?

So long story short: your bug fix seems to work (great!), yet I seem unable to install this bug fix properly via pip?

@KMarkert
Member

Glad to hear the bug fix was able to produce the expected results! It seems like the challenge now is installing the package locally from a specific git branch. Without having a look at your environment, my first guess is that there is a locally cached geemap v0.8.17 that pip is installing instead of downloading the specific branch you are pointing to. This sometimes happens when there isn't a version change for new code (see here), but this is just speculation and could be any number of things 🤷‍♂️.

I imagine once we merge the bug fix to the master branch and release a new version, this install issue will be resolved by updating the package locally. So, I will submit the PR to get the bug fix merged into the master branch. If this bug persists after a new release and an update on your end, then please feel free to reopen.
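In the meantime, if the cached-version guess is right, forcing pip to skip its cache and reinstall from the branch is worth a try (these are generic pip flags, nothing geemap-specific; use the same branch URL as in your earlier comment):

```
pip uninstall geemap
pip install --no-cache-dir --force-reinstall "git+https://github.com/giswqs/geemap@bugfix/rf_to_string-labels"
```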

Thanks for finding this bug and helping work through it!

@matthiasdemuzere
Author

Great, thanks for following up on this.

Once it is in the official repo, I'll continue testing this, as I still have to explore how it scales (cf. the max. length of the tree strings) with even more training data and input features ...
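As a rough gauge when I do (assuming, as in the tutorial, that ml.rf_to_strings returns one string per tree), something like this should show how large the tree strings get:

```python
trees = ml.rf_to_strings(rf, feature_names)
sizes = [len(t) for t in trees]
print(len(trees), "trees;", max(sizes), "chars in the largest tree;", sum(sizes), "chars in total")
```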

@giswqs
Member

giswqs commented Jun 30, 2021

@matthiasdemuzere Run geemap.update_package() once to install the package development version.
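For context, a minimal usage sketch:

```python
import geemap

# reinstalls geemap from the latest development version on GitHub
geemap.update_package()
# restart the kernel/runtime afterwards so the updated package is picked up
```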
