Using sklearn data pre-processing pipelines inside LightningDataModule #19807
Unanswered
tiefenthaler asked this question in: Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
-
Any updates?
-
Does it make sense to use sklearn pipelines for data pre-processing within the LightningDataModule?
I am a big fan of sklearn pipelines since they structure the code well and make it easy to apply pre-processing steps correctly when splitting data into train, validation and test sets. Besides classical ML models, I am increasingly using NNs for tabular data and have gotten great results for some use cases. Some use cases require more individual handling of a variety of features.
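To make that pattern concrete, here is a minimal sketch of the leakage-safe behaviour sklearn pipelines give you: all fit statistics come from the training split only, and the fitted pipeline is merely applied to validation/test data. The data here is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_val = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# One pipeline object holds all pre-processing steps in order.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Fit statistics (medians, means, stds) are computed on the TRAIN split only...
X_train_t = pipe.fit_transform(X_train)
# ...and merely applied to the validation split, so no information leaks.
X_val_t = pipe.transform(X_val)
```

The same fitted pipeline object can later be pickled together with the model so inference uses exactly the same transformations.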
PyTorch Lightning uses the LightningDataModule so that data can be consumed efficiently when training NNs. The LightningDataModule organizes the pre-processing that any dataframe should undergo before being fed to the dataloaders (train_dataloader, val_dataloader, test_dataloader, predict_dataloader): shuffling, train-val-test splits, and transformations such as categorical encoding and normalization, with GPU support where applicable. To me it makes sense to define those pre-processing steps (categorical encoding, normalization, etc.) with sklearn pipelines.
But I have not seen anyone use sklearn pipelines in this context before. I wondered whether "PyTorch Tabular" uses sklearn pipelines, but it does not: it defines a separate method for pre-processing instead, which does the same job as described above and of course uses sklearn functions.
Is there a reason not to use sklearn pipelines for this (e.g. conflicts with enabling GPU acceleration, ...)?
Pseudo Code:
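A minimal sketch of what I have in mind, written as a plain Python class that mirrors the LightningDataModule `setup()` hook so it runs standalone; in a real project the class would subclass `lightning.LightningDataModule` and the dataloader methods would wrap the arrays in `torch.utils.data.TensorDataset` / `DataLoader`. The class name `TabularDataModule` and all column names are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler


class TabularDataModule:  # real code: class TabularDataModule(lightning.LightningDataModule)
    def __init__(self, df: pd.DataFrame, num_cols, cat_cols, target: str):
        self.df, self.num_cols, self.cat_cols, self.target = df, num_cols, cat_cols, target
        # The sklearn pipeline declares ALL pre-processing in one object.
        self.preprocessor = ColumnTransformer([
            ("num", Pipeline([
                ("impute", SimpleImputer(strategy="median")),
                ("scale", StandardScaler()),
            ]), num_cols),
            ("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                                   unknown_value=-1), cat_cols),
        ])

    def setup(self, stage=None):  # same signature as LightningDataModule.setup
        train_df, val_df = train_test_split(self.df, test_size=0.2, random_state=42)
        feats = self.num_cols + self.cat_cols
        # Fit on the train split only; val (and test) are only transformed -> no leakage.
        self.X_train = self.preprocessor.fit_transform(train_df[feats])
        self.X_val = self.preprocessor.transform(val_df[feats])
        self.y_train = train_df[self.target].to_numpy()
        self.y_val = val_df[self.target].to_numpy()
        # train_dataloader()/val_dataloader() would wrap these arrays in
        # TensorDataset + DataLoader for the Trainer to consume.
```

The key property is that `fit` happens exactly once, inside `setup()`, on the training split; everything downstream sees only the already-transformed arrays.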