Improve documentation and examples #101

Open
rodrigo-arenas opened this issue Jun 16, 2022 · 21 comments
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers help wanted Extra attention is needed up-for-grabs

Comments

@rodrigo-arenas
Owner

I'm opening this issue for newcomers who would like to contribute to an open-source project.

The idea is to improve the current docs and add more examples that use the library. You can see the current docs files here.

You could also add external articles to the package that showcase some applications; see these, for example.

Here are the stable docs.

@rodrigo-arenas rodrigo-arenas added documentation Improvements or additions to documentation help wanted Extra attention is needed good first issue Good for newcomers up-for-grabs labels Jun 16, 2022
@emirtarik

Hi @rodrigo-arenas

I'm having an issue when I'm trying to replicate the Boston House Pricing Prediction notebook.

I'm not sure if the package names are outdated or I made a mistake installing them but I get the following error when I'm importing the packages in the first block:

from sklearn_genetic import GASearchCV

ModuleNotFoundError: No module named 'sklearn_genetic'

In fact, none of the sklearn_genetic imports seem to work. I've checked this issue from the sklearn-genetic repository but it's not exactly the same problem.

I have no trouble with installing sklearn-genetic and I have:

Python==3.7.3
sklearn==0.23.1
deap==1.3

Is it an issue with the documentation? If so, could you please at least briefly explain how I would get started with sklearn-genetic with my RF algorithm on a Boston House Pricing-like dataset? I'm really into the idea of GA for my master's thesis and I can't really go back at this point :)

Thanks in advance

@rodrigo-arenas
Owner Author

Hi @emirtarik, I hope you are doing great.
I think there might be a misunderstanding: the sklearn-genetic package has nothing to do with sklearn-genetic-opt (this package); they just happen to share part of the name. Make sure you are installing the right one using

pip install sklearn-genetic-opt[all]

Let me know if this fixes the problem

@emirtarik

Thanks for the quick reply @rodrigo-arenas.

Now it makes sense. Sorry about the misunderstanding. This did fix my problem; however, I'm still having problems with .space, .plots and .callbacks. Do you know why these might be missing?

Thanks again

@rodrigo-arenas
Owner Author

No problem @emirtarik
Can you share what error you are getting? Is it an import error?

I just ran the whole notebook without issues. If you are using Jupyter notebooks directly, make sure you installed the package in the right environment and that you restarted the kernel.

I'd also suggest using a virtual environment, so dependencies from your other projects don't get mixed up.
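
A quick way to check the "right environment" point is to ask the running interpreter where it lives and whether it can locate the module. The following is only a minimal sketch using the Python standard library; `describe_module` is a hypothetical helper name, not part of sklearn-genetic-opt. Run it in the same notebook kernel that raises the import error.

```python
import importlib.util
import sys


def describe_module(name):
    """Report whether `name` is importable from the running interpreter."""
    spec = importlib.util.find_spec(name)
    return {
        "module": name,
        # Path of the Python executable behind this kernel or script.
        "interpreter": sys.executable,
        "found": spec is not None,
        # File the module would be loaded from, if it was found.
        "location": getattr(spec, "origin", None),
    }


# `sklearn_genetic` is the import name used in the notebook.
report = describe_module("sklearn_genetic")
for key, value in report.items():
    print(f"{key}: {value}")
```

If `found` is False, or `interpreter` points to a different environment than the one where pip was run, the package was probably installed somewhere the kernel does not look.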

@emirtarik

emirtarik commented Jul 19, 2022

Yes, it was an import error but I think it was related to the python environment on my work computer because after trying on my Jupyter server and local python, I gave it a try on Colab and it worked! I'll use it there instead.

Thanks a lot @rodrigo-arenas, maybe I can contribute on the docs once I understand more of how this works.

On a side note, do you know how this would run on sparse matrices? I have a lot of categorical variables to use which are all encoded.

@rodrigo-arenas
Owner Author

rodrigo-arenas commented Jul 19, 2022

No problem, I'm glad you made it work.
And for sure, new contributors are welcome!

Sparse matrices aren't really something specific to the package, in the sense that it doesn't have an explicit algorithm that gives special treatment to this kind of dataset. The direct impact is that you might need to increase the number of individuals and generations to explore the whole space.

On the other hand, you can treat it as a regular machine learning problem. You could, for example, use a preprocessing step such as t-SNE or PCA to reduce the number of dimensions in your dataset. You can also try not to one-hot encode all the variables, and instead use techniques (depending on the nature of your data) that don't create a new column for every new value. I hope it helps.
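
To illustrate the last point, here is a small sketch of one such technique, frequency encoding, written in plain Python. This is just one possible choice and not something the package provides; the function names and the sample column are invented for illustration.

```python
from collections import Counter


def one_hot_width(column):
    """Number of columns a one-hot encoding of `column` would create."""
    return len(set(column))


def frequency_encode(column):
    """Replace each category with its relative frequency (one numeric column)."""
    counts = Counter(column)
    total = len(column)
    return [counts[value] / total for value in column]


# Hypothetical categorical column, invented for illustration.
neighborhood = ["north", "south", "north", "east", "north", "south"]

print(one_hot_width(neighborhood))     # columns a one-hot encoding would need
print(frequency_encode(neighborhood))  # a single numeric column instead
```

Keep in mind that categories with the same frequency end up with the same value, so whether this is acceptable depends on the data.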

@emirtarik

Hi again @rodrigo-arenas,

Following your suggestions, I was able to work with a labeled dataset. I was hesitant to do this since my categoricals are not exactly ordinal, so I was afraid it would complicate interpretation. For instance, you wouldn't do this in a linear approach, so as not to attach any meaning to the marginal increase between categories under a single variable. With sparse matrices, as in a one-hot encoded dataset, I was getting 'nan' returns, so I had to find another way (I'm still not 100% sure, though, so I will check with my advisor). I'm still getting low fitness scores, but this is very likely related to the limits of the dataset I'm currently working with.

About dimension reduction, I was kinda hoping that the GA would provide an unconventional dimension reduction technique, as in I would be able to see which features are most important in choosing optimized new generations. Which brings me to today's question :) Do you think there would be a way to look at gene frequencies used in the GASearch process? I would want to compare it to the classic RF feature importances graph obtained by using MDI, or simply compare it with some coefficients obtained by my linear models. A good example would be this paper.

I realize this is getting off topic for this issue and I apologize, but maybe it will help others looking through the docs and issues with similar problems. Also, this is currently the only way I am aware of to reach you :)

@rodrigo-arenas
Owner Author

Hi @emirtarik
I understand; as you mentioned, the encoding strategy might have those impacts.

About the second part, the gene frequencies: you can check exactly which hyperparameters the model tried at each step in the case of GASearchCV, or which features it selected in GAFeatureSelectionCV. You have different options:

  • You can explore the logbook object which contains all this information.

  • You can also check the cv_results_ object, for example, check this notebook

  • You can plot the sampled space of hyperparameters, using this function
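
For the feature selection case, the "gene frequency" idea can be sketched as counting how often each feature is switched on across a set of individuals. The snippet below assumes the individuals have already been extracted as boolean masks; how to pull them out of the logbook or cv_results_ is not shown here, and `selection_frequency` and the sample data are hypothetical.

```python
def selection_frequency(masks, feature_names):
    """Fraction of individuals in which each feature was selected."""
    if not masks:
        raise ValueError("at least one mask is required")
    totals = [0] * len(feature_names)
    for mask in masks:
        if len(mask) != len(feature_names):
            raise ValueError("each mask needs one entry per feature")
        for index, selected in enumerate(mask):
            totals[index] += bool(selected)
    return {
        name: total / len(masks)
        for name, total in zip(feature_names, totals)
    }


# Invented example: four individuals over three features.
masks = [
    [True, False, True],
    [True, True, False],
    [True, False, False],
    [False, False, True],
]
feature_names = ["rooms", "age", "tax"]

print(selection_frequency(masks, feature_names))
```

A feature that is selected in most individuals of the later generations is a candidate for being important, which can then be compared with the RF feature importances.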

@GuiTaek
Contributor

GuiTaek commented Jul 25, 2022

Hi, as requested in CONTRIBUTING.md, I'm announcing that I'm working on this issue. I will likely not resolve it completely, but I can probably make things better.

@rodrigo-arenas
Owner Author

Hi @GuiTaek, for sure. Just let us know which sections you'll be working on, so other people don't overwrite them.
Thanks!

@GuiTaek
Contributor

GuiTaek commented Jul 30, 2022

You're welcome. For now, as I haven't used this library before (I came from the good first issue tag), I would like to tackle the first larger page, https://sklearn-genetic-opt.readthedocs.io/en/stable/tutorials/basic_usage.html. Maybe more later, I don't know yet.

Edit: Would you like more atomic pull requests, or would you rather I combine everything into one pull request?

@rodrigo-arenas
Owner Author

Hi @GuiTaek, yeah, for sure, you can start on that one. In this case, it would be great to have one pull request per page, so we keep subjects separated.

Thanks

@GuiTaek
Contributor

GuiTaek commented Jul 31, 2022

Hi @rodrigo-arenas, then I'll collect every suggestion I have for one page and make a pull request. May I also touch the content? E.g. I think it is possible to improve the example, as the max score doesn't increase much. It would be better advertising if it increased gradually from low to high. I already have an example, though I'm not satisfied with it, as it throws warnings.

@rodrigo-arenas
Owner Author

Hi @GuiTaek, yes, the examples can be improved. Just take into account that the ones shown in the tutorials section are usually pretty simple, so users can get started right away. What I mean is that if, for example, the tutorial is about adapters, I'd showcase how to set up that parameter, without adding callbacks or loggers, which would mix two subjects.

On the other hand, if you look at the Jupyter notebook examples, I think there is a big opportunity to make them better. There is no problem with modifying those notebooks and adding complex features in a single notebook, since they are meant to showcase all the library's capabilities in one place.

I hope it makes sense

@emirtarik

Hi @rodrigo-arenas, hope you're doing well.

I'm trying to use the GAFeatureSelectionCV as you suggested to understand the importance of attributes in my dataset, however I have a two-sided problem in this regard.

The first is that I'm trying to use labeled data instead of encoded and this directly results in the error below.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-34-81501a102f7c> in <module>()
      1 # Train and select the features
----> 2 evolved_selection.fit(x_train_labeled_ga, y_train_ga)

4 frames
/usr/local/lib/python3.7/dist-packages/deap/tools/support.py in record(self, **infos)
    336         """
    337         apply_to_all = {k: v for k, v in infos.items() if not isinstance(v, dict)}
--> 338         for key, value in infos.items():
    339             if isinstance(value, dict):
    340                 chapter_infos = value.copy()

RuntimeError: dictionary changed size during iteration

The second is that when I try this with encoded data, it works; however, it asks me to turn my sparse matrix into an array using .toarray(). After doing so, I am unsure how to interpret the resulting .best_features_ array. Can you please give a brief explanation of how I can pair this with my set of variables? I'm imagining something along the lines of OneHotEncoder.get_feature_names().

Thanks a lot

@rodrigo-arenas
Owner Author

Hi @emirtarik, as you mentioned, you can't pass a labeled dataset. This is not specifically because of this package, but because scikit-learn won't work with such a structure, so there is really nothing I can do from this library. It must be solved in a pipeline, with some encoding as a preprocessing step.

The best_features_ attribute returns one value per input column, so you must know what each column of your dataset means in order to interpret it. If the only transformation you are doing is a one-hot encoding, you can use, for example, the encoder's get_feature_names_out() method to get the name of each column. A value of True in best_features_ then means that column was selected, and False that it was not.
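
As a rough sketch of that pairing, the snippet below keeps only the column names whose mask entry is True. It uses plain Python with hard-coded, invented values; in practice `mask` would come from best_features_ and `column_names` from the encoder's get_feature_names_out(). `selected_feature_names` is a hypothetical helper, not part of the package.

```python
from itertools import compress


def selected_feature_names(mask, feature_names):
    """Return the names of the columns whose mask entry is truthy."""
    flags = [bool(flag) for flag in mask]
    names = list(feature_names)
    if len(flags) != len(names):
        raise ValueError("mask and feature_names must have the same length")
    return list(compress(names, flags))


# Invented stand-ins for the encoder output and the selection mask.
column_names = ["color_blue", "color_red", "size_large", "size_small"]
mask = [True, False, False, True]

print(selected_feature_names(mask, column_names))
```

The length check matters: if other preprocessing steps add or drop columns, the mask and the names no longer line up and the mapping would be wrong.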

For future questions, please make sure to create a new bug or question if it's not related to this issue (documentation improvement), so we don't mix different subjects in this thread.

Greetings

@GuiTaek
Contributor

GuiTaek commented Aug 13, 2022

> Hi @GuiTaek , yes the examples can be improved, just take into account that the ones shown in the tutorials section, usually are pretty simple, so the users can get started right away, what I mean is, for example, if the tutorial is about adapters, I'd showcase how to setup that parameter, and not adding callbacks, loggers that might mix two subjects.
>
> On the other hand, if you see the jupyter notebooks examples, I think there is a big opportunity to make them better, on those notebooks, there is no problem to modify them and add complex features in a single notebook, since those are meant to showcase all the library capabilities at one place
>
> I hope it makes sense

OK, I'll keep in mind that the tutorials should stay simple.

@GuiTaek
Contributor

GuiTaek commented Aug 28, 2022

Unfortunately I cannot make a draft pull request and request a review. I'd like a review, as I have quite a bundle of changes and I am particularly unsure about the whole example: is it OK that I intentionally use a "wrong" range to show the power of this library? Is it clear that a user has to change it according to what I have written? See also the draft pull request as well as the commits. I'm not finished, though, as there is more on this page that I haven't touched.

@GuiTaek
Contributor

GuiTaek commented Sep 10, 2022

I opened a full pull request, as I feared that you couldn't see the draft one.

@rodrigo-arenas
Owner Author

Hi @GuiTaek, thanks for notifying me. I just saw the PR, and I'll be reviewing it this weekend.

@GuiTaek
Contributor

GuiTaek commented Dec 27, 2022

Hi @rodrigo-arenas, I had a lot of university work, but it looks like you merged it. I didn't expect that, to be honest! Thank you very much! Sorry for the late response.


3 participants