CI is broken because taxi data URL now gives 404 #4461

Closed
mvashishtha opened this issue May 13, 2022 · 1 comment · Fixed by #4462
Labels: CI, Testing 📈 Issues related to testing


@mvashishtha
Collaborator

Sample failure for examples/tutorial/jupyter/execution/pandas_on_dask/test/test_notebooks.py::test_exercise_2: https://github.com/modin-project/modin/runs/6427942517?check_suite_focus=true

To reproduce the 404:

import os
import urllib.request
url_path = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"
urllib.request.urlretrieve(url_path, "taxi.csv")
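A lighter-weight check might avoid downloading the file at all. The following is a minimal sketch, assuming a HEAD request is enough to tell whether the object still exists. The helper name `probe_status` and its `opener` parameter are hypothetical, introduced only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def probe_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for `url` without downloading the body.

    `opener` is a hypothetical injection point (it defaults to urlopen) so the
    function can be tested offline with a stand-in.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for error responses such as 404;
        # the status code is available on the exception.
        return err.code
```

Calling `probe_status(url_path)` with the URL above would be expected to return the same 404 that `urlretrieve` reports, assuming the server answers HEAD requests the same way it answers GET.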
_______________________________ test_exercise_2 ________________________________

    def test_exercise_2():
        modified_notebook_path = os.path.join(local_notebooks_dir, "exercise_2_test.ipynb")
        nb = nbformat.read(
            os.path.join(local_notebooks_dir, "exercise_2.ipynb"),
            as_version=nbformat.NO_CONVERT,
        )
    
        new_cell = f'path = "{test_dataset_path}"\n' + download_taxi_dataset
    
        _replace_str(
            nb,
            'path = "s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv"',
            new_cell,
        )
    
        nbformat.write(nb, modified_notebook_path)
>       _execute_notebook(modified_notebook_path)

examples/tutorial/jupyter/execution/pandas_on_dask/test/test_notebooks.py:65: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
examples/tutorial/jupyter/execution/test/utils.py:38: in _execute_notebook
    ep.preprocess(nb)
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py:89: in preprocess
    self.preprocess_cell(cell, resources, index)
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py:110: in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/coverage/control.py:793: CoverageWarning: No data was collected. (no-data-collected)
  self._warn("No data was collected.", slug="no-data-collected")
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/nbclient/util.py:87: in wrapped
    return just_run(coro(*args, **kwargs))
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/nbclient/util.py:62: in just_run
    return loop.run_until_complete(coro)
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/asyncio/base_events.py:616: in run_until_complete
    return future.result()
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/nbclient/client.py:1012: in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <nbconvert.preprocessors.execute.ExecutePreprocessor object at 0x7feb51cfe6d0>
cell = {'cell_type': 'code', 'execution_count': 2, 'metadata': {'execution': {'iopub.status.busy': '2022-05-13T18:29:45.90035...zonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"\n    urllib.request.urlretrieve(url_path, "taxi.csv")\n    '}
cell_index = 5
exec_reply = {'buffers': [], 'content': {'ename': 'HTTPError', 'engine_info': {'engine_id': -1, 'engine_uuid': '4c75e45e-b50d-4a6a-...e, 'engine': '4c75e45e-b50d-4a6a-88b1-9082a9259036', 'started': '2022-05-13T18:29:45.900930Z', 'status': 'error'}, ...}

    async def _check_raise_for_error(
        self, cell: NotebookNode, cell_index: int, exec_reply: t.Optional[t.Dict]
    ) -> None:
    
        if exec_reply is None:
            return None
    
        exec_reply_content = exec_reply['content']
        if exec_reply_content['status'] != 'error':
            return None
    
        cell_allows_errors = (not self.force_raise_errors) and (
            self.allow_errors
            or exec_reply_content.get('ename') in self.allow_error_names
            or "raises-exception" in cell.metadata.get("tags", [])
        )
        await run_hook(
            self.on_cell_error, cell=cell, cell_index=cell_index, execute_reply=exec_reply
        )
        if not cell_allows_errors:
>           raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
E           nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
E           ------------------
E           path = "taxi.csv"
E           import os
E           import urllib.request
E           if not os.path.exists("taxi.csv"):
E               url_path = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"
E               urllib.request.urlretrieve(url_path, "taxi.csv")
E               
E           ------------------
E           
E           ---------------------------------------------------------------------------
E           HTTPError                                 Traceback (most recent call last)
E           Input In [2], in <cell line: 4>()
E                 4 if not os.path.exists("taxi.csv"):
E                 5     url_path = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"
E           ----> 6     urllib.request.urlretrieve(url_path,"taxi.csv")
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:247, in urlretrieve(url, filename, reporthook, data)
E               230 """
E               231 Retrieve a URL into a temporary location on disk.
E               232 
E              (...)
E               243 data file as well as the resulting HTTPMessage object.
E               244 """
E               245 url_type, path = _splittype(url)
E           --> 247 with contextlib.closing(urlopen(url,data)) as fp:
E               248     headers = fp.info()
E               250     # Just return the local path and the "headers" for file://
E               251     # URLs. No sense in performing a copy unless requested.
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:222, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
E               220 else:
E               221     opener = _opener
E           --> 222 return opener.open(url,data,timeout)
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:531, in OpenerDirector.open(self, fullurl, data, timeout)
E               529 for processor in self.process_response.get(protocol, []):
E               530     meth = getattr(processor, meth_name)
E           --> 531     response = meth(req,response)
E               533 return response
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:640, in HTTPErrorProcessor.http_response(self, request, response)
E               637 # According to RFC 2616, "2xx" code indicates that the client's
E               638 # request was successfully received, understood, and accepted.
E               639 if not (200 <= code < 300):
E           --> 640     response = self.parent.error(
E               641 'http',request,response,code,msg,hdrs)
E               643 return response
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:569, in OpenerDirector.error(self, proto, *args)
E               567 if http_err:
E               568     args = (dict, 'default', 'http_error_default') + orig_args
E           --> 569     return self._call_chain(*args)
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:502, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
E               500 for handler in handlers:
E               501     func = getattr(handler, meth_name)
E           --> 502     result = func(*args)
E               503     if result is not None:
E               504         return result
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:649, in HTTPDefaultErrorHandler.http_error_default(self, req, fp, code, msg, hdrs)
E               648 def http_error_default(self, req, fp, code, msg, hdrs):
E           --> 649     raise HTTPError(req.full_url, code, msg, hdrs, fp)
E           
E           HTTPError: HTTP Error 404: Not Found
E           HTTPError: HTTP Error 404: Not Found

/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/nbclient/client.py:906: CellExecutionError
----------------------------- Captured stderr call -----------------------------
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 42 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
_______________________________ test_exercise_4 ________________________________

    def test_exercise_4():
        modified_notebook_path = os.path.join(local_notebooks_dir, "exercise_4_test.ipynb")
        nb = nbformat.read(
            os.path.join(local_notebooks_dir, "exercise_4.ipynb"),
            as_version=nbformat.NO_CONVERT,
        )
    
        s3_path_cell = f's3_path = "{test_dataset_path}"\n' + download_taxi_dataset
        _replace_str(
            nb,
            's3_path = "s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv"',
            s3_path_cell,
        )
    
        nbformat.write(nb, modified_notebook_path)
>       _execute_notebook(modified_notebook_path)

examples/tutorial/jupyter/execution/pandas_on_dask/test/test_notebooks.py:117: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
examples/tutorial/jupyter/execution/test/utils.py:38: in _execute_notebook
    ep.preprocess(nb)
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py:89: in preprocess
    self.preprocess_cell(cell, resources, index)
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py:110: in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/nbclient/util.py:87: in wrapped
    return just_run(coro(*args, **kwargs))
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/nbclient/util.py:62: in just_run
    return loop.run_until_complete(coro)
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/asyncio/base_events.py:616: in run_until_complete
    return future.result()
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/nbclient/client.py:1012: in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <nbconvert.preprocessors.execute.ExecutePreprocessor object at 0x7feb51cfe6d0>
cell = {'cell_type': 'code', 'execution_count': 2, 'id': 'dc8d5903', 'metadata': {'execution': {'iopub.status.busy': '2022-05...modin_df = pd.read_csv(s3_path, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3, nrows=1000)'}
cell_index = 4
exec_reply = {'buffers': [], 'content': {'ename': 'HTTPError', 'engine_info': {'engine_id': -1, 'engine_uuid': '710e2342-de2b-4329-...e, 'engine': '710e2342-de2b-4329-b189-83a0f25b0056', 'started': '2022-05-13T18:31:22.472191Z', 'status': 'error'}, ...}

    async def _check_raise_for_error(
        self, cell: NotebookNode, cell_index: int, exec_reply: t.Optional[t.Dict]
    ) -> None:
    
        if exec_reply is None:
            return None
    
        exec_reply_content = exec_reply['content']
        if exec_reply_content['status'] != 'error':
            return None
    
        cell_allows_errors = (not self.force_raise_errors) and (
            self.allow_errors
            or exec_reply_content.get('ename') in self.allow_error_names
            or "raises-exception" in cell.metadata.get("tags", [])
        )
        await run_hook(
            self.on_cell_error, cell=cell, cell_index=cell_index, execute_reply=exec_reply
        )
        if not cell_allows_errors:
>           raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
E           nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
E           ------------------
E           import modin.pandas as pd
E           import modin.experimental.spreadsheet as mss
E           from modin.config import Engine
E           Engine.put("dask")
E           
E           s3_path = "taxi.csv"
E           import os
E           import urllib.request
E           if not os.path.exists("taxi.csv"):
E               url_path = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"
E               urllib.request.urlretrieve(url_path, "taxi.csv")
E               
E           modin_df = pd.read_csv(s3_path, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3, nrows=1000)
E           ------------------
E           
E           ---------------------------------------------------------------------------
E           HTTPError                                 Traceback (most recent call last)
E           Input In [2], in <cell line: 9>()
E                 9 if not os.path.exists("taxi.csv"):
E                10     url_path = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"
E           ---> 11     urllib.request.urlretrieve(url_path,"taxi.csv")
E                13 modin_df = pd.read_csv(s3_path, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3, nrows=1000)
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:247, in urlretrieve(url, filename, reporthook, data)
E               230 """
E               231 Retrieve a URL into a temporary location on disk.
E               232 
E              (...)
E               243 data file as well as the resulting HTTPMessage object.
E               244 """
E               245 url_type, path = _splittype(url)
E           --> 247 with contextlib.closing(urlopen(url,data)) as fp:
E               248     headers = fp.info()
E               250     # Just return the local path and the "headers" for file://
E               251     # URLs. No sense in performing a copy unless requested.
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:222, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
E               220 else:
E               221     opener = _opener
E           --> 222 return opener.open(url,data,timeout)
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:531, in OpenerDirector.open(self, fullurl, data, timeout)
E               529 for processor in self.process_response.get(protocol, []):
E               530     meth = getattr(processor, meth_name)
E           --> 531     response = meth(req,response)
E               533 return response
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:640, in HTTPErrorProcessor.http_response(self, request, response)
E               637 # According to RFC 2616, "2xx" code indicates that the client's
E               638 # request was successfully received, understood, and accepted.
E               639 if not (200 <= code < 300):
E           --> 640     response = self.parent.error(
E               641 'http',request,response,code,msg,hdrs)
E               643 return response
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:569, in OpenerDirector.error(self, proto, *args)
E               567 if http_err:
E               568     args = (dict, 'default', 'http_error_default') + orig_args
E           --> 569     return self._call_chain(*args)
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:502, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
E               500 for handler in handlers:
E               501     func = getattr(handler, meth_name)
E           --> 502     result = func(*args)
E               503     if result is not None:
E               504         return result
E           
E           File /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/urllib/request.py:649, in HTTPDefaultErrorHandler.http_error_default(self, req, fp, code, msg, hdrs)
E               648 def http_error_default(self, req, fp, code, msg, hdrs):
E           --> 649     raise HTTPError(req.full_url, code, msg, hdrs, fp)
E           
E           HTTPError: HTTP Error 404: Not Found
E           HTTPError: HTTP Error 404: Not Found

/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/nbclient/client.py:906: CellExecutionError
----------------------------- Captured stderr call -----------------------------
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 33 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

---------- coverage: platform linux, python 3.8.12-final-0 -----------
Coverage XML written to file coverage.xml

=========================== short test summary info ============================
FAILED examples/tutorial/jupyter/execution/pandas_on_dask/test/test_notebooks.py::test_exercise_2
FAILED examples/tutorial/jupyter/execution/pandas_on_dask/test/test_notebooks.py::test_exercise_4
=================== 2 failed, 2 passed in 115.77s (0:01:55) ====================
Error: Process completed with exit code 1.
@mvashishtha mvashishtha added the Testing 📈 Issues related to testing label May 13, 2022
@jeffreykennethli jeffreykennethli self-assigned this May 13, 2022
@jeffreykennethli

Looks like it's related to https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page#:~:text=and%20accurate%20information.-,NOTICE!,-On%2005/13. A workaround would be to self-host this dataset, and replace all references in our test notebooks with the self-hosted dataset.

It looks like just nyc-tlc data is affected, not dask-data/nyc-taxi.
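The workaround described above could look roughly like the following sketch, which rewrites every occurrence of the old URL in a notebook's cells. It relies only on the fact that `.ipynb` files are JSON and uses the standard library rather than the project's own test helpers. The function names `replace_url_in_notebook` and `rewrite_notebook_file` are hypothetical, and no self-hosted location is given in this thread, so the caller has to supply the new URL.

```python
import json


def replace_url_in_notebook(notebook, old_url, new_url):
    """Replace old_url with new_url in each cell's source, in place.

    `notebook` is the dict obtained from json.load() on an .ipynb file.
    Returns the number of cells that were changed.
    """
    changed = 0
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # The notebook format allows source to be a string or a list of lines.
        if isinstance(source, list):
            updated = [line.replace(old_url, new_url) for line in source]
        else:
            updated = source.replace(old_url, new_url)
        if updated != source:
            cell["source"] = updated
            changed += 1
    return changed


def rewrite_notebook_file(path, old_url, new_url):
    """Load a notebook from `path`, replace the URL, and write it back."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    changed = replace_url_in_notebook(notebook, old_url, new_url)
    if changed:
        # Only touch the file when something was actually replaced.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(notebook, f, indent=1)
            f.write("\n")
    return changed
```

Running `rewrite_notebook_file` over each tutorial notebook with the `nyc-tlc` URL as `old_url` would cover the "replace all references" part of the workaround. Hosting the file itself is a separate step that this sketch does not address.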

jeffreykennethli pushed a commit to jeffreykennethli/modin that referenced this issue May 13, 2022
Signed-off-by: jeffreykennethli <jkli@ponder.io>
devin-petersohn pushed a commit that referenced this issue May 16, 2022
Signed-off-by: jeffreykennethli <jkli@ponder.io>