You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
All tests are currently failing due to the following error, indicating a dependency was updated which broke this import:
[ERRR] Error in unstructured.handle_event(FILESYSTEM("{'path': '/tmp/.bbot_test/scans/testexcavaterawdata_test_g2ykldx164/filedownload...", module=filedownload, tags={'in-scope', 'filedownload', 'file'})): /home/runner/work/bbot/bbot/bbot/modules/unstructured.py:101:handle_event(): numpy.core.multiarray failed to import
[TRCE] concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/home/runner/work/bbot/bbot/bbot/modules/unstructured.py", line 123, in extract_text
from unstructured.partition.auto import partition
File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/partition/auto.py", line 83, in <module>
from unstructured.partition.pdf import partition_pdf
File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 50, in <module>
from unstructured.partition.pdf_image.pdf_image_utils import (
File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/partition/pdf_image/pdf_image_utils.py", line 13, in <module>
import cv2
ImportError: numpy.core.multiarray failed to import
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/bbot/bbot/bbot/scanner/scanner.py", line 1062, in _acatch
yield
File "/home/runner/work/bbot/bbot/bbot/modules/base.py", line 629, in _worker
await handle_event_task
File "/home/runner/work/bbot/bbot/bbot/modules/unstructured.py", line 101, in handle_event
content = await self.scan.run_in_executor_mp(extract_text, file_path)
ImportError: numpy.core.multiarray failed to import
The text was updated successfully, but these errors were encountered:
unstructured has a lot of dependencies, which makes it unwieldy. However its functionality is really important to BBOT, so we need to find a way to prevent this kind of thing.
As we keep adding BBOT modules, there is more and more of a need for some kind of system that will let us cleanly package these bigger tools.
Docker is one solution, but we should keep an eye out for something more lightweight that doesn't require a running daemon. Something like zipapp, but better?
Ideally, this solution would not rely on the tests of the upstream package maintainer. Instead, it would cache a known-working version of the tool (including all its dependencies), and only upgrade it if all of our tests passed.
Getting a system like this in place will help us package/deploy these things in a reproduceable way across multiple linux distros, and make sure they don't break unexpectedly when an upstream dependency collapses.
All tests are currently failing due to the following error, indicating a dependency was updated which broke this import:
The text was updated successfully, but these errors were encountered: