-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#4461: Fix S3 CSV data path #4462
FIX-#4461: Fix S3 CSV data path #4462
Conversation
Signed-off-by: jeffreykennethli <jkli@ponder.io>
Signed-off-by: jeffreykennethli <jkli@ponder.io>
Should wait to see if https://github.com/modin-project/modin/actions/runs/2321592702 passes |
Signed-off-by: jeffreykennethli <jkli@ponder.io>
Codecov Report
@@ Coverage Diff @@
## master #4462 +/- ##
==========================================
+ Coverage 86.71% 89.54% +2.82%
==========================================
Files 222 222
Lines 18038 18038
==========================================
+ Hits 15642 16152 +510
+ Misses 2396 1886 -510
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
Signed-off-by: jeffreykennethli <jkli@ponder.io>
Was able to get https://github.com/modin-project/modin/actions/runs/2321793092 working but will add another commit to unify the data sources to a |
Signed-off-by: jeffreykennethli <jkli@ponder.io>
Signed-off-by: jeffreykennethli <jkli@ponder.io>
Signed-off-by: jeffreykennethli <jkli@ponder.io>
a17eae8
to
79747fc
Compare
@@ -90,7 +88,7 @@ | |||
"source": [ | |||
"# [Optional] Download data locally. This may take a few minutes to download.\n", | |||
"# import urllib.request\n", | |||
"# url_path = \"https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv\"\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have this be the 1.8GB dask-data
dataset because in https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/test/test_notebooks.py#L57-L61, the line above is replace (commented out) and the smaller modin-test
dataset is used instead.
"\n", | ||
"We will be using a version of this data already in S3, originally posted in this blog post: https://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes\n", | ||
"\n", | ||
"**Size: ~2GB**" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can still use the dask-data
dataset here because it is replaced with the smaller modin-test
datatset in our CI (same as in the pandas_on_ray
case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jeffreykennethli, LGTM!
Signed-off-by: jeffreykennethli jkli@ponder.io
What do these changes do?
Fixes a bug where a CSV file in S3 was moved by changing that file path to one that works. From former (broken) to current (fixed).
Also fixes instances where we refer to the
nyc-tlc
dataset, which was ~120MB, but meant to refer to the 2GBdask-data
dataset.flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date