kserve-integration notebook fails on self-hosted runners with "Name or service not known" #47

Closed
orfeas-k opened this issue Nov 14, 2023 · 3 comments · Fixed by #51

Comments

@orfeas-k

orfeas-k commented Nov 14, 2023

Running the kserve-integration UAT notebook from the main branch on a self-hosted runner fails with the following chain of connection errors:

491 E           NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f8da4e75ac0>: Failed to establish a new connection: [Errno -2] Name or service not known
...
525 E           MaxRetryError: HTTPConnectionPool(host='sklearn-iris.test-kubeflow.svc.cluster.local', port=80): Max retries exceeded with url: /v1/models/sklearn-iris:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8da4e75ac0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
...
581 E           ConnectionError: HTTPConnectionPool(host='sklearn-iris.test-kubeflow.svc.cluster.local', port=80): Max retries exceeded with url: /v1/models/sklearn-iris:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8da4e75ac0>: Failed to establish a new connection: [Errno -2] Name or service not known'))

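The "Name or service not known" message (a gaierror raised by socket.getaddrinfo, see the logs below) probably means that DNS resolution failed for the host sklearn-iris.test-kubeflow.svc.cluster.local.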
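
To confirm that the failure is in name resolution rather than in the HTTP layer, the lookup can be reproduced without requests. The following is a minimal sketch, not part of the notebook: `can_resolve` is a hypothetical helper, and it only calls `socket.getaddrinfo`, the same standard-library function that appears at the bottom of the traceback.

```python
import socket


def can_resolve(host: str, port: int = 80) -> bool:
    """Return True if the local resolver maps `host` to at least one address.

    `can_resolve` is a hypothetical helper for illustration only.
    """
    try:
        results = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # socket.gaierror covers resolver failures, including
        # "[Errno -2] Name or service not known" seen in this issue.
        return False
    return len(results) > 0
```

Calling `can_resolve("sklearn-iris.test-kubeflow.svc.cluster.local")` from the environment that runs the notebook would show whether the name is resolvable there at all.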

Environment

  • Self-hosted runner
  • Microk8s 1.24
  • Juju 2.9
  • CKF 1.7/stable

Logs

______________________ test_notebook[kserve-integration] _______________________

test_notebook = '/tests/notebooks/kserve/kserve-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))
    
        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)
    
        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"
    
        try:
            log.info(f"Running {os.path.basename(test_notebook)}...")
>           output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})

/tests/test_notebooks.py:45: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py:100: in preprocess
    self.preprocess_cell(cell, resources, index)
/opt/conda/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py:121: in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
/opt/conda/lib/python3.8/site-packages/jupyter_core/utils/__init__.py:166: in wrapped
    return loop.run_until_complete(inner)
/opt/conda/lib/python3.8/asyncio/base_events.py:616: in run_until_complete
    return future.result()
/opt/conda/lib/python3.8/site-packages/nbclient/client.py:1021: in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <nbconvert.preprocessors.execute.ExecutePreprocessor object at 0x7fedb808a160>
cell = {'cell_type': 'code', 'execution_count': 9, 'id': '4ef27af2-9ae0-4adf-9058-ecc5ac84ef24', 'metadata': {'execution': {'...}\nresponse = requests.post(f"{isvc_url}/v1/models/sklearn-iris:predict", json=inference_input)\nprint(response.text)'}
cell_index = 18
exec_reply = {'buffers': [], 'content': {'ename': 'ConnectionError', 'engine_info': {'engine_id': -1, 'engine_uuid': 'a9889439-8e38...e, 'engine': 'a9889439-8e38-4cd4-91d1-fed3131c0170', 'started': '2023-11-14T11:40:36.221455Z', 'status': 'error'}, ...}

    async def _check_raise_for_error(
        self, cell: NotebookNode, cell_index: int, exec_reply: t.Optional[t.Dict]
    ) -> None:
    
        if exec_reply is None:
            return None
    
        exec_reply_content = exec_reply['content']
        if exec_reply_content['status'] != 'error':
            return None
    
        cell_allows_errors = (not self.force_raise_errors) and (
            self.allow_errors
            or exec_reply_content.get('ename') in self.allow_error_names
            or "raises-exception" in cell.metadata.get("tags", [])
        )
        await run_hook(
            self.on_cell_error, cell=cell, cell_index=cell_index, execute_reply=exec_reply
        )
        if not cell_allows_errors:
>           raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
E           nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
E           ------------------
E           inference_input = {
E             "instances": [
E               [6.8,  2.8,  4.8,  1.4],
E               [6.0,  3.4,  4.5,  1.6]
E             ]
E           }
E           response = requests.post(f"{isvc_url}/v1/models/sklearn-iris:predict", json=inference_input)
E           print(response.text)
E           ------------------
E           
E           ---------------------------------------------------------------------------
E           gaierror                                  Traceback (most recent call last)
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connection.py:174, in HTTPConnection._new_conn(self)
E               173 try:
E           --> 174     conn = connection.create_connection(
E               175 (self._dns_host,self.port),self.timeout,**extra_kw
E               176 )
E               178 except SocketTimeout:
E           
E           File /opt/conda/lib/python3.8/site-packages/urllib3/util/connection.py:72, in create_connection(address, timeout, source_address, socket_options)
E                68     return six.raise_from(
E                69         LocationParseError(u"'%s', label empty or too long" % host), None
E                70     )
E           ---> 72 for res in socket.getaddrinfo(host,port,family,socket.SOCK_STREAM):
E                73     af, socktype, proto, canonname, sa = res
E           
E           File /opt/conda/lib/python3.8/socket.py:918, in getaddrinfo(host, port, family, type, proto, flags)
E               917 addrlist = []
E           --> 918 for res in _socket.getaddrinfo(host,port,family,type,proto,flags):
E               919     af, socktype, proto, canonname, sa = res
E           
E           gaierror: [Errno -2] Name or service not known
E           
E           During handling of the above exception, another exception occurred:
E           
E           NewConnectionError                        Traceback (most recent call last)
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py:714, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
E               713 # Make the request on the httplib connection object.
E           --> 714 httplib_response = self._make_request(
E               715 conn,
E               716 method,
E               717 url,
E               718 timeout=timeout_obj,
E               719 body=body,
E               720 headers=headers,
E               721 chunked=chunked,
E               722 )
E               724 # If we're going to release the connection in ``finally:``, then
E               725 # the response doesn't need to know about the connection. Otherwise
E               726 # it will also try to release it and we'll have a double-release
E               727 # mess.
E           
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py:415, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
E               414     else:
E           --> 415         conn.request(method,url,**httplib_request_kw)
E               417 # We are swallowing BrokenPipeError (errno.EPIPE) since the server is
E               418 # legitimately able to close the connection after sending a valid response.
E               419 # With this behaviour, the received response is still readable.
E           
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connection.py:244, in HTTPConnection.request(self, method, url, body, headers)
E               243     headers["User-Agent"] = _get_default_user_agent()
E           --> 244 super(HTTPConnection,self).request(method,url,body=body,headers=headers)
E           
E           File /opt/conda/lib/python3.8/http/client.py:1252, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
E              1251 """Send a complete request to the server."""
E           -> 1252 self._send_request(method,url,body,headers,encode_chunked)
E           
E           File /opt/conda/lib/python3.8/http/client.py:1298, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
E              1297     body = _encode(body, 'body')
E           -> 1298 self.endheaders(body,encode_chunked=encode_chunked)
E           
E           File /opt/conda/lib/python3.8/http/client.py:1247, in HTTPConnection.endheaders(self, message_body, encode_chunked)
E              1246     raise CannotSendHeader()
E           -> 1247 self._send_output(message_body,encode_chunked=encode_chunked)
E           
E           File /opt/conda/lib/python3.8/http/client.py:1007, in HTTPConnection._send_output(self, message_body, encode_chunked)
E              1006 del self._buffer[:]
E           -> 1007 self.send(msg)
E              1009 if message_body is not None:
E              1010 
E              1011     # create a consistent interface to message_body
E           
E           File /opt/conda/lib/python3.8/http/client.py:947, in HTTPConnection.send(self, data)
E               946 if self.auto_open:
E           --> 947     self.connect()
E               948 else:
E           
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connection.py:205, in HTTPConnection.connect(self)
E               204 def connect(self):
E           --> 205     conn = self._new_conn()
E               206     self._prepare_conn(conn)
E           
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connection.py:186, in HTTPConnection._new_conn(self)
E               185 except SocketError as e:
E           --> 186     raise NewConnectionError(
E               187         self, "Failed to establish a new connection: %s" % e
E               188     )
E               190 return conn
E           
E           NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f8da4e75ac0>: Failed to establish a new connection: [Errno -2] Name or service not known
E           
E           During handling of the above exception, another exception occurred:
E           
E           MaxRetryError                             Traceback (most recent call last)
E           File /opt/conda/lib/python3.8/site-packages/requests/adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
E               485 try:
E           --> 486     resp = conn.urlopen(
E               487 method=request.method,
E               488 url=url,
E               489 body=request.body,
E               490 headers=request.headers,
E               491 redirect=False,
E               492 assert_same_host=False,
E               493 preload_content=False,
E               494 decode_content=False,
E               495 retries=self.max_retries,
E               496 timeout=timeout,
E               497 chunked=chunked,
E               498 )
E               500 except (ProtocolError, OSError) as err:
E           
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py:798, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
E               796     e = ProtocolError("Connection aborted.", e)
E           --> 798 retries = retries.increment(
E               799 method,url,error=e,_pool=self,_stacktrace=sys.exc_info()[2]
E               800 )
E               801 retries.sleep()
E           
E           File /opt/conda/lib/python3.8/site-packages/urllib3/util/retry.py:592, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
E               591 if new_retry.is_exhausted():
E           --> 592     raise MaxRetryError(_pool, url, error or ResponseError(cause))
E               594 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)
E           
E           MaxRetryError: HTTPConnectionPool(host='sklearn-iris.test-kubeflow.svc.cluster.local', port=80): Max retries exceeded with url: /v1/models/sklearn-iris:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8da4e75ac0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
E           
E           During handling of the above exception, another exception occurred:
E           
E           ConnectionError                           Traceback (most recent call last)
E           Cell In[9], line 7
E                 1 inference_input = {
E                 2   "instances": [
E                 3     [6.8,  2.8,  4.8,  1.4],
E                 4     [6.0,  3.4,  4.5,  1.6]
E                 5   ]
E                 6 }
E           ----> 7 response = requests.post(f"{isvc_url}/v1/models/sklearn-iris:predict",json=inference_input)
E                 8 print(response.text)
E           
E           File /opt/conda/lib/python3.8/site-packages/requests/api.py:115, in post(url, data, json, **kwargs)
E               103 def post(url, data=None, json=None, **kwargs):
E               104     r"""Sends a POST request.
E               105 
E               106     :param url: URL for the new :class:`Request` object.
E              (...)
E               112     :rtype: requests.Response
E               113     """
E           --> 115     return request("post",url,data=data,json=json,**kwargs)
E           
E           File /opt/conda/lib/python3.8/site-packages/requests/api.py:59, in request(method, url, **kwargs)
E                55 # By using the 'with' statement we are sure the session is closed, thus we
E                56 # avoid leaving sockets open which can trigger a ResourceWarning in some
E                57 # cases, and look like a memory leak in others.
E                58 with sessions.Session() as session:
E           ---> 59     return session.request(method=method,url=url,**kwargs)
E           
E           File /opt/conda/lib/python3.8/site-packages/requests/sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
E               584 send_kwargs = {
E               585     "timeout": timeout,
E               586     "allow_redirects": allow_redirects,
E               587 }
E               588 send_kwargs.update(settings)
E           --> 589 resp = self.send(prep,**send_kwargs)
E               591 return resp
E           
E           File /opt/conda/lib/python3.8/site-packages/requests/sessions.py:703, in Session.send(self, request, **kwargs)
E               700 start = preferred_clock()
E               702 # Send the request
E           --> 703 r = adapter.send(request,**kwargs)
E               705 # Total elapsed time of the request (approximately)
E               706 elapsed = preferred_clock() - start
E           
E           File /opt/conda/lib/python3.8/site-packages/requests/adapters.py:519, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
E               515     if isinstance(e.reason, _SSLError):
E               516         # This branch is for urllib3 v1.22 and later.
E               517         raise SSLError(e, request=request)
E           --> 519     raise ConnectionError(e, request=request)
E               521 except ClosedPoolError as e:
E               522     raise ConnectionError(e, request=request)
E           
E           ConnectionError: HTTPConnectionPool(host='sklearn-iris.test-kubeflow.svc.cluster.local', port=80): Max retries exceeded with url: /v1/models/sklearn-iris:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8da4e75ac0>: Failed to establish a new connection: [Errno -2] Name or service not known'))

/opt/conda/lib/python3.8/site-packages/nbclient/client.py:915: CellExecutionError

During handling of the above exception, another exception occurred:

test_notebook = '/tests/notebooks/kserve/kserve-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))
    
        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)
    
        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"
    
        try:
            log.info(f"Running {os.path.basename(test_notebook)}...")
            output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})
            # persist the notebook output to the original file for debugging purposes
            save_notebook(output_notebook, test_notebook)
        except CellExecutionError as e:
            # handle underlying error
>           pytest.fail(f"Notebook execution failed with {e.ename}: {e.evalue}")
E           Failed: Notebook execution failed with ConnectionError: HTTPConnectionPool(host='sklearn-iris.test-kubeflow.svc.cluster.local', port=80): Max retries exceeded with url: /v1/models/sklearn-iris:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8da4e75ac0>: Failed to establish a new connection: [Errno -2] Name or service not known'))

/tests/test_notebooks.py:50: Failed

...


----------------------------- Captured stderr call -----------------------------
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scipy 1.7.0 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.24.4 which is incompatible.
kubeflow-katib 0.15.0 requires grpcio==1.41.1, but you have grpcio 1.51.3 which is incompatible.
kubeflow-katib 0.15.0 requires protobuf==3.19.5, but you have protobuf 3.20.3 which is incompatible.
kfp 1.8.22 requires kubernetes<26,>=8.0.0, but you have kubernetes 28.1.0 which is incompatible.
jupyter-server 1.23.6 requires anyio<4,>=3.1.0, but you have anyio 4.0.0 which is incompatible.
@orfeas-k

orfeas-k commented Nov 15, 2023

An interesting finding: the kserve-integration UAT from the main branch passed when run against a deployment of the latest/edge bundle.yaml file. I will rerun the UATs from track/1.7 on the 1.7/stable bundle. The only difference between the UAT branches is this bugfix PR, but when upgrading from KServe 0.10 to 0.11 we also switched from RawDeployment mode to Serverless, which could affect the K8s Services that KServe creates.

@orfeas-k

As noted in canonical/kserve-operators#148 and in the notebook's PR #10:

This only works with Serverless deployment mode at the moment

Thus, the above behavior is expected, since we ran this against CKF 1.7, which deploys KServe in RawDeployment mode. We will close this issue, but we need to investigate canonical/kserve-operators#148 further in order to understand:

  1. Why a K8s Service isn't created in the first place in RawDeployment mode
  2. Whether we should stick with RawDeployment or switch to Serverless in 1.7 as well, to get KServe working
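
Since the notebook depends on the deployment mode, a pre-flight check could make the failure easier to read than a DNS error. The sketch below rests on assumptions that do not come from this issue: it follows the upstream KServe layout, where the inferenceservice-config ConfigMap carries a `deploy` key with a JSON `defaultDeploymentMode` field and an InferenceService can override it with the `serving.kserve.io/deploymentMode` annotation. The charm may name or place these differently. `effective_deployment_mode` is a hypothetical helper that works on plain dicts, so fetching the ConfigMap and the annotations is left to the caller.

```python
import json
from typing import Dict, Optional

# Annotation name taken from upstream KServe; assumed to apply here as well.
MODE_ANNOTATION = "serving.kserve.io/deploymentMode"


def effective_deployment_mode(
    configmap_data: Dict[str, str],
    isvc_annotations: Optional[Dict[str, str]] = None,
) -> str:
    """Return the deployment mode an InferenceService would use.

    Hypothetical helper: `configmap_data` is the `data` mapping of the
    (assumed) inferenceservice-config ConfigMap, `isvc_annotations` the
    metadata annotations of the InferenceService.
    """
    annotations = isvc_annotations or {}
    # A per-InferenceService annotation takes precedence over the default.
    if MODE_ANNOTATION in annotations:
        return annotations[MODE_ANNOTATION]
    # Otherwise read the cluster-wide default from the JSON `deploy` entry.
    deploy = json.loads(configmap_data.get("deploy", "{}"))
    return deploy.get("defaultDeploymentMode", "unknown")
```

A notebook cell could compare the returned value with "Serverless" and stop with an explicit message before sending the prediction request.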

@orfeas-k

Reopening. We will keep this open until we update the kserve-integration UAT notebook with a note stating that it only works with KServe's Serverless deployment mode.

@orfeas-k orfeas-k reopened this Nov 16, 2023
orfeas-k added a commit that referenced this issue Nov 17, 2023
Add note that UAT notebook is only compatible with serverless deployment
mode at the moment.
Closes #47