Add a User Guide "How to handle large CSV?" #765

Open
zaleslaw opened this issue Jul 4, 2024 · 1 comment
Assignees: zaleslaw
Labels: documentation, enhancement
Milestone: 0.16.0

Comments

@zaleslaw
Collaborator

zaleslaw commented Jul 4, 2024

Users often ask about the limitations of KDF when handling large dataframes.

The User Guide should contain recommendations and code snippets to improve the user experience here:

  • some benchmarks on real-world or synthetic CSV files or datasets
  • snippets of code to handle it with SQL databases or Apache Spark
  • fine-tuning for KNB, IDEA or Jupyter Notebooks
  • an explanation of the limitations of the KDF reading model
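
As an illustration of the kind of snippet such a guide could include, here is a minimal sketch of processing a large CSV in bounded memory by streaming it in fixed-size chunks. It uses only the Kotlin standard library and is not the KDF API: `processCsvInChunks` is a hypothetical helper, and the sketch assumes a single header line and no quoted fields that contain line breaks. Each chunk could then be parsed, aggregated, or written to a SQL table; that step is left out here.

```kotlin
import java.io.File

// Hypothetical helper (not part of Kotlin DataFrame): streams a CSV file and
// passes it to `handle` in chunks of at most `chunkSize` data rows, so only
// one chunk is held in memory at a time.
// Assumes one header line and no quoted fields containing line breaks.
fun processCsvInChunks(
    file: File,
    chunkSize: Int,
    handle: (header: String, rows: List<String>) -> Unit,
) {
    file.useLines { lines ->
        val iterator = lines.iterator()
        if (!iterator.hasNext()) return@useLines // empty file: nothing to do
        val header = iterator.next()
        iterator.asSequence()
            .chunked(chunkSize) // lazy: the next chunk is read only when requested
            .forEach { chunk -> handle(header, chunk) }
    }
}

fun main() {
    // Small synthetic file so the sketch runs as-is.
    val file = File.createTempFile("sample", ".csv")
    file.writeText("id,value\n" + (1..10).joinToString("\n") { "$it,${it * 2}" })

    var totalRows = 0
    processCsvInChunks(file, chunkSize = 4) { _, rows ->
        totalRows += rows.size
        println("chunk of ${rows.size} rows")
    }
    println("total rows: $totalRows")
    file.delete()
}
```

The chunk size trades memory for per-chunk overhead; a real guide would need benchmarks to recommend a value.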

Related to #141

@zaleslaw zaleslaw added documentation Improvements or additions to documentation (not KDocs) enhancement New feature or request labels Jul 4, 2024
@zaleslaw zaleslaw added this to the 0.14.0 milestone Jul 4, 2024
@zaleslaw zaleslaw self-assigned this Jul 4, 2024
@zaleslaw zaleslaw changed the title from Add a User Guide **"How to handle large CSV?"** to Add a User Guide "How to handle large CSV?" Jul 4, 2024
@Jolanrensen
Collaborator

Jolanrensen commented Aug 23, 2024

I tried reading the 800+ MB CSV file from here and I keep running into OOM errors. It might be a good test case for loading into a DataFrame and working with it.

It contains about 34,959,672 rows of data.

(Edit: It only runs out of memory in JUnit tests. With enough memory, it works fine from a main() function or a notebook.)
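
The JUnit-only failure would be consistent with the test JVM running with a smaller maximum heap than the main() or notebook JVM, though that is an assumption and not something confirmed in this thread. A minimal sketch to check it: print the maximum heap from both a test and a main() function and compare the values. `maxHeapMiB` is a hypothetical helper name; `Runtime.maxMemory()` is the standard JVM call.

```kotlin
// Hypothetical helper: reports the maximum heap this JVM may use, in MiB.
fun maxHeapMiB(): Long = Runtime.getRuntime().maxMemory() / (1024 * 1024)

fun main() {
    // Call the same helper from a JUnit test to compare the two environments.
    println("Max heap available to this JVM: ${maxHeapMiB()} MiB")
}
```

If the test JVM reports a much smaller value, raising the heap in the test runner configuration would be the next thing to try.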

@Jolanrensen Jolanrensen mentioned this issue Aug 23, 2024
19 tasks
@zaleslaw zaleslaw modified the milestones: 0.14.0, 0.15.0 Sep 3, 2024
@zaleslaw zaleslaw modified the milestones: 0.15.0, 0.16.0 Oct 1, 2024