Consider adding tabular dataset #19

Open
kachayev opened this issue Nov 21, 2023 · 8 comments

@kachayev
Collaborator

For example, the one with personal/business flights I've been experimenting with.

It would be nice to provide more than computer vision (CV) datasets out of the box.

tgnassou added a commit that referenced this issue Nov 22, 2023
[MRG] Use skorch class for deep DA
@YanisLalou
Collaborator

Just to be clear: the goal here is to create a file like _office.py to download and process a tabular dataset, right?

@tgnassou
Collaborator

Exactly, but I think we want an easy one to start with, ideally a standard, widely used dataset. Maybe we need to check the papers to find the most popular one. Later on, we can add more complex datasets to a bench_skada repo.

@kachayev
Collaborator Author

I would say that, first off, we need to pick a suitable tabular dataset for Domain Adaptation (DA). I've had some preliminary results with this dataset: Airline Passenger Satisfaction on Kaggle. It's perfect for our needs because you can easily differentiate between personal and business flights, giving us a clear source vs. target scenario.

Next up, let's create a concise tutorial. The goal here is to demonstrate how the performance of a classifier trained on one domain tends to decline when applied to another, and how to improve it using DA techniques. At this stage there's no need to worry about dataset processing: we're talking about maybe 10-20 lines of code to download and clean up the dataset.
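To make the tutorial idea concrete, here is a rough, dependency-free sketch of the effect it would demonstrate. Everything in it is an assumption made for illustration: the data is synthetic (not the airline dataset), a nearest-centroid rule stands in for a real classifier, and per-feature mean alignment stands in for the DA methods the library would actually provide.

```python
import random


def make_domain(n, shift, seed):
    """Draw n synthetic (features, label) pairs; `shift` offsets the first feature."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Feature 0 carries the class signal; feature 1 is pure noise.
        features = [rng.gauss(2.0 * label + shift, 1.0), rng.gauss(0.0, 1.0)]
        data.append((features, label))
    return data


def fit_centroids(data):
    """Nearest-centroid 'classifier': one mean vector per class."""
    centroids = {}
    for label in {y for _, y in data}:
        rows = [x for x, y in data if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, data):
    return sum(predict(centroids, x) == y for x, y in data) / len(data)


def feature_means(data):
    return [sum(col) / len(data) for col in zip(*(x for x, _ in data))]


def align_means(target, source):
    """Shift target features so their per-feature means match the source.

    Only features are used for the alignment; labels are carried along
    unchanged so that accuracy can be measured afterwards.
    """
    src_mu, tgt_mu = feature_means(source), feature_means(target)
    return [([v - t + s for v, t, s in zip(x, tgt_mu, src_mu)], y) for x, y in target]


# Source and target differ only by a shift in the first feature.
source_train = make_domain(2000, shift=0.0, seed=0)
source_test = make_domain(2000, shift=0.0, seed=1)
target = make_domain(2000, shift=2.0, seed=2)

model = fit_centroids(source_train)
acc_source = accuracy(model, source_test)
acc_target = accuracy(model, target)
acc_adapted = accuracy(model, align_means(target, source_train))
print(f"source: {acc_source:.2f}  target: {acc_target:.2f}  "
      f"target after alignment: {acc_adapted:.2f}")
```

The model is fit once on the source domain and never retrained; only the unlabeled target features are moved. A real tutorial would replace the synthetic generator with the chosen dataset and the mean alignment with an actual DA estimator.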

Once we have this in place, our next step is to package the dataset in a user-friendly way, similar to what we've seen with Office31. This way, it's ready to roll 'out-of-the-box' for anyone installing our library.
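As a sketch of what such packaging could boil down to, the helper below reads a CSV and groups its rows by a domain column. The function name `load_domains`, the column names, and the sample rows are all hypothetical placeholders; none of them reflect the schema of any real dataset or an existing loader in the library.

```python
import csv
import io


def load_domains(csv_file, domain_column, label_column):
    """Group CSV rows by domain; returns {domain: (X, y)} with X a list of dicts."""
    domains = {}
    for row in csv.DictReader(csv_file):
        # Minimal clean-up: skip rows that are malformed or have an empty field.
        if any(not isinstance(v, str) or not v.strip() for v in row.values()):
            continue
        domain = row.pop(domain_column)
        label = row.pop(label_column)
        X, y = domains.setdefault(domain, ([], []))
        X.append(row)  # remaining columns are the features, still as strings
        y.append(label)
    return domains


# Tiny in-memory example with made-up rows and placeholder column names.
sample = io.StringIO(
    "travel_type,age,delay_minutes,satisfied\n"
    "business,41,5,yes\n"
    "personal,29,,no\n"
    "personal,35,20,no\n"
    "business,52,0,yes\n"
)
domains = load_domains(sample, domain_column="travel_type", label_column="satisfied")

# Which domain plays the source and which the target is an arbitrary choice here.
X_source, y_source = domains["business"]
X_target, y_target = domains["personal"]
```

A packaged version would add the download step and type conversion on top of this, but the core contract stays the same: one call returns the data already split by domain.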

Why this order? Well, it's crucial to first make sure that the chosen dataset actually fits the needs of a DA library.

Let me know what you think.

@YanisLalou
Collaborator

About the dataset choice: at first glance, the Airline Passenger Satisfaction dataset on Kaggle has no license defined, no author names, and no DOI. So we don't even know whether it's open source or not.
Initially, we wanted to select one of the datasets used in this paper: https://arxiv.org/pdf/2312.07577.pdf
These datasets are all open source, and there are also benchmarks for them.
However, we haven't decided yet which one to add to skada first. Maybe the one with the most citations? Or the one that seems to get the best accuracy in the benchmarks with DA methods?

@kachayev
Collaborator Author

Oh, interesting. I haven't seen this paper yet. Were you able to re-run their experiments to verify results?

@YanisLalou
Collaborator

I don't think we've tried to reproduce the results, and I don't know if we plan to.

@tgnassou
Collaborator

It is a benchmark of tabular datasets with distribution shift, but they don't use any domain adaptation method in their benchmark :( So I didn't try to reproduce the code. But it will be interesting for our benchmark.

@kachayev
Collaborator Author

> they don't use any domain adaptation method in their benchmark

Yeah... whichever dataset we choose, it's essential to ensure that we can showcase the use of DA methods.


3 participants