NoneType has no attribute mime #112

Closed
bjchambers opened this issue Mar 25, 2024 · 1 comment · Fixed by #113

@bjchambers
Contributor

Originally reported in Discord (https://discord.com/channels/1202000379907424337/1202000379907424340/1219910566735253534)

When creating a document from a URL:

$ curl -L -X POST 'http://localhost:8000/api/documents/' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
--data-raw '{
  "collection": "my_new_collection",
  "url": "https://python.langchain.com/docs/expression_language/cookbook/retrieval"
}'
{"id":2,"collection":"my_new_collection","extracted_text":null,"url":"https://python.langchain.com/docs/expression_language/cookbook/retrieval","ingest_state":"pending","ingest_error":null}

Ingestion then fails with the following error:

python-1      | INFO:     172.29.0.1:39804 - "POST /api/documents/ HTTP/1.1" 200 OK
python-1      | 2024-03-20 06:54:59.126 | INFO     | dewy.common.collection_embeddings:ingest:211 - Loading content for document 2 from url 'https://python.langchain.com/docs/expression_language/cookbook/retrieval'
python-1      | 2024-03-20 06:54:59.316 | DEBUG    | dewy.common.extract:extract_url:93 - Content type of https://python.langchain.com/docs/expression_language/cookbook/retrieval is text/html; charset=utf-8
python-1      | 2024-03-20 06:54:59.317 | DEBUG    | dewy.common.extract:extract_url:95 - Downloading https://python.langchain.com/docs/expression_language/cookbook/retrieval
python-1      | 2024-03-20 06:54:59.361 | INFO     | dewy.common.extract:extract_content:62 - Extracting content from 116836 bytes
python-1      | 2024-03-20 06:54:59.373 | ERROR    | dewy.document.router:ingest_document:43 - Failed to ingest 2: 'NoneType' object has no attribute 'mime'
python-1      | 2024-03-20 06:54:59.374 | INFO     | dewy.document.router:ingest_document:46 - Deleting embeddings for failed document 2
python-1      | 2024-03-20 06:54:59.375 | INFO     | dewy.document.router:ingest_document:56 - Deleting chunks for failed document 2
python-1      | 2024-03-20 06:54:59.376 | INFO     | dewy.document.router:ingest_document:65 - Updating status of failed document 2

When the user checks the status, the document shows as failed:

$ curl -L -X GET 'http://localhost:8000/api/documents/2/status' -H 'Accept: application/json'
{"id":2,"ingest_state":"failed","ingest_error":"'NoneType' object has no attribute 'mime'"}
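
For anyone who would rather reproduce this from Python than from curl, here is a minimal sketch using only the standard library. The base URL, paths, headers, and payload are copied from the curl commands above. The helper names (`create_document_request`, `status_request`, `send`) are made up for illustration and are not part of Dewy.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # taken from the curl commands above


def create_document_request(collection: str, url: str) -> urllib.request.Request:
    """Build the POST /api/documents/ request shown in the report."""
    body = json.dumps({"collection": collection, "url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/documents/",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def status_request(document_id: int) -> urllib.request.Request:
    """Build the GET /api/documents/{id}/status request shown in the report."""
    return urllib.request.Request(
        f"{BASE_URL}/api/documents/{document_id}/status",
        headers={"Accept": "application/json"},
        method="GET",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `send(create_document_request(...))` should return the document record, and `send(status_request(doc["id"]))` should return its `ingest_state` and `ingest_error`.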

bjchambers added the bug label Mar 25, 2024
bjchambers self-assigned this Mar 25, 2024

@bjchambers
Contributor Author

Changes in main provide more information about the error:

dewy-1  | 2024-03-25 15:10:07.369 | ERROR    | dewy.tasks._ingest:ingest_task:20 - Failed to ingest
dewy-1  | Traceback (most recent call last):
dewy-1  | 
dewy-1  |   File "<frozen runpy>", line 198, in _run_module_as_main
dewy-1  |   File "<frozen runpy>", line 88, in _run_code
dewy-1  | 
dewy-1  |   File "/code/dewy/__main__.py", line 4, in <module>
dewy-1  |     dewy()
dewy-1  |     └ <Group dewy>
dewy-1  | 
dewy-1  |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
dewy-1  |     return self.main(*args, **kwargs)
dewy-1  |            │    │     │       └ {}
dewy-1  |            │    │     └ ()
dewy-1  |            │    └ <function BaseCommand.main at 0xffff9b9be200>
dewy-1  |            └ <Group dewy>
dewy-1  |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1078, in main
dewy-1  |     rv = self.invoke(ctx)
dewy-1  |          │    │      └ <click.core.Context object at 0xffff9bc25e90>
dewy-1  |          │    └ <function MultiCommand.invoke at 0xffff9b9bf420>
dewy-1  |          └ <Group dewy>
dewy-1  |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
dewy-1  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
dewy-1  |            │               │       │       │      └ <click.core.Context object at 0xffff9bd7a6d0>
dewy-1  |            │               │       │       └ <function Command.invoke at 0xffff9b9bede0>
dewy-1  |            │               │       └ <Command serve>
dewy-1  |            │               └ <click.core.Context object at 0xffff9bd7a6d0>
dewy-1  |            └ <function MultiCommand.invoke.<locals>._process_result at 0xffff9bc15b20>
dewy-1  |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
dewy-1  |     return ctx.invoke(self.callback, **ctx.params)
dewy-1  |            │   │      │    │           │   └ {'port': 8000, 'admin_ui': True, 'openapi_ui': True, 'apply_migrations': True, 'openai_api_key': 'sk-GhUmxSst1aZGHomgYK2oT3Bl...
dewy-1  |            │   │      │    │           └ <click.core.Context object at 0xffff9bd7a6d0>
dewy-1  |            │   │      │    └ <function serve at 0xffff92975f80>
dewy-1  |            │   │      └ <Command serve>
dewy-1  |            │   └ <function Context.invoke at 0xffff9b9bd760>
dewy-1  |            └ <click.core.Context object at 0xffff9bd7a6d0>
dewy-1  |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 783, in invoke
dewy-1  |     return __callback(*args, **kwargs)
dewy-1  |                        │       └ {'port': 8000, 'admin_ui': True, 'openapi_ui': True, 'apply_migrations': True, 'openai_api_key': 'sk-GhUmxSst1aZGHomgYK2oT3Bl...
dewy-1  |                        └ ()
dewy-1  |   File "/usr/local/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
dewy-1  |     return f(get_current_context(), *args, **kwargs)
dewy-1  |            │ │                       │       └ {'port': 8000, 'admin_ui': True, 'openapi_ui': True, 'apply_migrations': True, 'openai_api_key': 'sk-GhUmxSst1aZGHomgYK2oT3Bl...
dewy-1  |            │ │                       └ ()
dewy-1  |            │ └ <function get_current_context at 0xffff9b9911c0>
dewy-1  |            └ <function serve at 0xffff92975ee0>
dewy-1  | 
dewy-1  |   File "/code/dewy/serve.py", line 205, in serve
dewy-1  |     uvicorn.run(app, host="0.0.0.0", port=port)
dewy-1  |     │       │   │                         └ 8000
dewy-1  |     │       │   └ <fastapi.applications.FastAPI object at 0xffff935a5c10>
dewy-1  |     │       └ <function run at 0xffff92993c40>
dewy-1  |     └ <module 'uvicorn' from '/usr/local/lib/python3.11/site-packages/uvicorn/__init__.py'>
dewy-1  | 
dewy-1  |   File "/usr/local/lib/python3.11/site-packages/uvicorn/main.py", line 587, in run
dewy-1  |     server.run()
dewy-1  |     │      └ <function Server.run at 0xffff929468e0>
dewy-1  |     └ <uvicorn.server.Server object at 0xffff929439d0>
dewy-1  |   File "/usr/local/lib/python3.11/site-packages/uvicorn/server.py", line 61, in run
dewy-1  |     return asyncio.run(self.serve(sockets=sockets))
dewy-1  |            │       │   │    │             └ None
dewy-1  |            │       │   │    └ <function Server.serve at 0xffff92946980>
dewy-1  |            │       │   └ <uvicorn.server.Server object at 0xffff929439d0>
dewy-1  |            │       └ <function run at 0xffff9adcea20>
dewy-1  |            └ <module 'asyncio' from '/usr/local/lib/python3.11/asyncio/__init__.py'>
dewy-1  |   File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
dewy-1  |     return runner.run(main)
dewy-1  |            │      │   └ <coroutine object Server.serve at 0xffff9296a9b0>
dewy-1  |            │      └ <function Runner.run at 0xffff9addac00>
dewy-1  |            └ <asyncio.runners.Runner object at 0xffff92a08650>
dewy-1  |   File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
dewy-1  |     return self._loop.run_until_complete(task)
dewy-1  |            │    │     │                  └ <Task pending name='Task-1' coro=<Server.serve() running at /usr/local/lib/python3.11/site-packages/uvicorn/server.py:81> wai...
dewy-1  |            │    │     └ <function BaseEventLoop.run_until_complete at 0xffff9add8860>
dewy-1  |            │    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
dewy-1  |            └ <asyncio.runners.Runner object at 0xffff92a08650>
dewy-1  |   File "/usr/local/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
dewy-1  |     self.run_forever()
dewy-1  |     │    └ <function BaseEventLoop.run_forever at 0xffff9add87c0>
dewy-1  |     └ <_UnixSelectorEventLoop running=True closed=False debug=False>
dewy-1  |   File "/usr/local/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
dewy-1  |     self._run_once()
dewy-1  |     │    └ <function BaseEventLoop._run_once at 0xffff9adda5c0>
dewy-1  |     └ <_UnixSelectorEventLoop running=True closed=False debug=False>
dewy-1  |   File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
dewy-1  |     handle._run()
dewy-1  |     │      └ <function Handle._run at 0xffff9b5ff420>
dewy-1  |     └ <Handle <TaskStepMethWrapper object at 0xffff9290c430>()>
dewy-1  |   File "/usr/local/lib/python3.11/asyncio/events.py", line 84, in _run
dewy-1  |     self._context.run(self._callback, *self._args)
dewy-1  |     │    │            │    │           │    └ <member '_args' of 'Handle' objects>
dewy-1  |     │    │            │    │           └ <Handle <TaskStepMethWrapper object at 0xffff9290c430>()>
dewy-1  |     │    │            │    └ <member '_callback' of 'Handle' objects>
dewy-1  |     │    │            └ <Handle <TaskStepMethWrapper object at 0xffff9290c430>()>
dewy-1  |     │    └ <member '_context' of 'Handle' objects>
dewy-1  |     └ <Handle <TaskStepMethWrapper object at 0xffff9290c430>()>
dewy-1  |   File "/usr/local/lib/python3.11/site-packages/taskiq/receiver/receiver.py", line 144, in callback
dewy-1  |     result = await self.run_task(
dewy-1  |                    │    └ <function Receiver.run_task at 0xffff92b220c0>
dewy-1  |                    └ <taskiq.receiver.receiver.Receiver object at 0xffff92898350>
dewy-1  |   File "/usr/local/lib/python3.11/site-packages/taskiq/receiver/receiver.py", line 267, in run_task
dewy-1  |     returned = await target_future
dewy-1  |                      └ <coroutine object ingest_task at 0xffff928a4280>
dewy-1  | 
dewy-1  | > File "/code/dewy/tasks/_ingest.py", line 18, in ingest_task
dewy-1  |     await ingest(document_id, request, conn, config)
dewy-1  |           │      │            │        │     └ ServeConfig(db='postgresql://dewydbuser:dewydbpwd@postgres/dewydb', broker=None, serve_openapi_ui=True, serve_admin_ui=True, ...
dewy-1  |           │      │            │        └ <PoolConnectionProxy <asyncpg.connection.Connection object at 0xffff929da5c0> 0xffff92867d30>
dewy-1  |           │      │            └ IngestURL(url='https://python.langchain.com/docs/expression_language/cookbook/retrieval')
dewy-1  |           │      └ 1
dewy-1  |           └ <function ingest at 0xffff9a47d580>
dewy-1  | 
dewy-1  |   File "/code/dewy/domain/ingest.py", line 50, in ingest
dewy-1  |     extracted = await _extract(request, collection_config=collection_config)
dewy-1  |                       │        │                          └ CollectionConfig(collection_id=1, text_embedding_model='openai:text-embedding-ada-002', text_distance_metric=<DistanceMetric....
dewy-1  |                       │        └ IngestURL(url='https://python.langchain.com/docs/expression_language/cookbook/retrieval')
dewy-1  |                       └ <function _extract at 0xffff92dfb920>
dewy-1  | 
dewy-1  |   File "/code/dewy/domain/ingest.py", line 106, in _extract
dewy-1  |     return await extract_url(
dewy-1  |                  └ <function extract_url at 0xffff92dfb2e0>
dewy-1  | 
dewy-1  |   File "/code/dewy/common/extract.py", line 242, in extract_url
dewy-1  |     return await extract_content(
dewy-1  |                  └ <function extract_content at 0xffff92dfb100>
dewy-1  | 
dewy-1  |   File "/code/dewy/common/extract.py", line 211, in extract_content
dewy-1  |     raise HTTPException(
dewy-1  |           └ <class 'fastapi.exceptions.HTTPException'>
dewy-1  | 
dewy-1  | fastapi.exceptions.HTTPException: 415: Cannot add document from unrecognized mimetype 'text/html; charset=utf-8' and file name '/docs/expression_language/cookbook/retrieval'

This looks like a mimetype mismatch. The reported content type is text/html; charset=utf-8, which starts with text/html but is not equal to it. I need to split off the ; charset=utf-8 parameter so the value compares equal to text/html; otherwise it falls through to the unknown-mimetype path.
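
A minimal sketch of that idea, not the actual Dewy code: the function name `normalize_mimetype` and the `SUPPORTED` set are hypothetical, and lowercasing is an extra assumption on my part (media types are case-insensitive, so it seemed harmless).

```python
def normalize_mimetype(content_type: str) -> str:
    """Return only the type/subtype part of a Content-Type value.

    Anything after the first semicolon (e.g. "; charset=utf-8") is dropped.
    """
    return content_type.split(";", 1)[0].strip().lower()


# Hypothetical set of mimetypes the extractor knows how to handle.
SUPPORTED = {"text/html", "application/pdf"}

raw = "text/html; charset=utf-8"
print(raw in SUPPORTED)                      # False: the exact match fails
print(normalize_mimetype(raw) in SUPPORTED)  # True: parameters stripped first
```

Values without parameters pass through unchanged, so the normalization can be applied unconditionally before matching.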

bjchambers added a commit that referenced this issue Mar 25, 2024
As reported in #112 and demonstrated in the test (prior to fixing) we
didn't handle mimetypes correctly if they contained an encoding. For
instance `text/html; charset=utf-8`. The fix is to split the string on
the semicolon, and only take the mimetype (appearing before the first
semicolon) for the matching.

This closes #112.