Releases: Unstructured-IO/unstructured
0.15.5
Enhancements
Features
Fixes
- Revert to using the `unstructured.pytesseract` fork. Because some recent release versions of `pytesseract` are unavailable on PyPI, the project now uses the `unstructured.pytesseract` fork to ensure stability and continued support.
- Bump `libreoffice` version in image. Bumps the `libreoffice` version to `25.2.5.2` to address CVEs.
- Downgrade NLTK dependency version for compatibility. Because `nltk==3.8.2` is unavailable on PyPI, the NLTK dependency has been downgraded to `<3.8.2`. This change ensures continued functionality and compatibility.
0.15.4
Enhancements
Features
Fixes
- Resolve an installation error with `pytesseract>=0.3.12` that occurred during `pip install unstructured[pdf]==0.15.3`.
0.15.3
Enhancements
Features
Fixes
- Remove the custom index URL from `extra-paddleocr.in` to resolve the error in the `setup.py` configuration.
0.15.2
Enhancements
- Improve directory handling when extracting image blocks. The `figures` directory is no longer created when the `extract_image_block_to_payload` parameter is set to `True`.
Features
- Added per-class Object Detection metrics in the evaluation. The metrics include average precision, precision, recall, and f1-score for each class in the dataset.
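As a rough illustration of the per-class metrics named above, the sketch below computes precision, recall, and F1 for each class. It assumes detections have already been matched to ground-truth boxes, so each pair is simply (true class, predicted class). The helper name `per_class_prf` is hypothetical, and average precision is left out because it requires confidence scores.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Per-class precision, recall and F1 from paired class labels.

    Hypothetical helper: assumes box matching has already been done upstream.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics
```

Calling `per_class_prf(["Table", "Title"], ["Table", "Table"])` returns one entry per class, each holding the three scores.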
Fixes
- Updates NLTK data file for compatibility with `nltk>=3.8.2`. The NLTK data file now contains `punkt_tab`, making it possible to upgrade to `nltk>=3.8.2`. `nltk==3.8.2` patches CVE-2024-39705.
- Renames Astra to Astra DB. Conforms with DataStax internal naming conventions.
- Accommodate single-column CSV files. Resolves a limitation of `partition_csv()` where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
- Accommodate `image/jpg` in PPTX as alias for `image/jpeg`. Resolves a problem partitioning PPTX files having an invalid `image/jpg` (should be `image/jpeg`) MIME-type in the `[Content_Types].xml` member of the PPTX Zip archive.
- Fixes an issue in Object Detection metrics. The issue was in preprocessing/validating the ground truth and predicted data for object detection metrics.
- Removes dependency on `unstructured.pytesseract`. Unstructured forked pytesseract while waiting for code to be upstreamed. Now that the new version has been released, this fork can be removed.
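The single-column CSV limitation above can be reproduced with the standard library alone. The sketch below uses `csv.Sniffer`, which raises `csv.Error` when none of the candidate delimiters appears in the sample, and falls back to treating each line as one cell. The helper names and the candidate delimiter set are assumptions for illustration; this is not presented as how `partition_csv()` is implemented.

```python
import csv


def detect_delimiter(sample, candidates=",;\t|"):
    """Return the sniffed delimiter, or None when no candidate is found."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Raised when the sniffer cannot determine a delimiter,
        # which is the expected outcome for a single-column file.
        return None


def read_rows(text):
    """Parse CSV text, tolerating single-column input."""
    delimiter = detect_delimiter(text)
    if delimiter is None:
        # No delimiter: every non-empty line is a one-cell row.
        return [[line] for line in text.splitlines() if line]
    return list(csv.reader(text.splitlines(), delimiter=delimiter))
```

Restricting the sniffer to an explicit candidate set keeps it from picking an ordinary letter as the delimiter, which it can otherwise do on small samples.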
0.15.1
Enhancements
- Improve `pdfminer` embedded `image` extraction to exclude text elements and produce more accurate bounding boxes. This results in cleaner, more precise element extraction in `pdf` partitioning.
Features
- Update `partition_eml` and `partition_msg` to capture cc, bcc, and message_id fields. Cc, bcc, and message_id information is captured in element metadata for both msg and email partitioning, and `Recipient` elements are generated for cc and bcc when `include_headers=True` for email partitioning.
- Mark ingest as deprecated. Begin sunset of ingest code in this repo as it's been moved to a dedicated repo.
- Add `pdf_hi_res_max_pages` argument for partitioning, which allows rejecting PDF files that exceed this page number limit when the `hi_res` strategy is chosen. By default, it will allow parsing PDF files with an unlimited number of pages.
Fixes
- Update `HuggingFaceEmbeddingEncoder` to use `HuggingFaceEmbeddings` from the `langchain_huggingface` package instead of the deprecated version from `langchain-community`. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
- Update `OpenAIEmbeddingEncoder` to use `OpenAIEmbeddings` from the `langchain-openai` package instead of the deprecated version from `langchain-community`. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
- Update import of Pinecone exception. Adds compatibility for `pinecone-client>=5.0.0`.
- File-type detection catches non-existent file-path. `detect_filetype()` no longer silently falls back to detecting a file-type based on the extension when no file exists at the path provided. Instead, `FileNotFoundError` is raised. This provides consistent user notification of a mis-typed path rather than an unpredictable exception from a file-type specific partitioner when the file cannot be opened.
- EML files specified as a file-path are detected correctly. Resolved a bug where an EML file submitted to `partition()` as a file-path was identified as TXT and partitioned using `partition_text()`. EML files specified by path are now identified and processed correctly, including processing any attachments.
- A DOCX, PPTX, or XLSX file specified by path and ambiguously identified as MIME-type "application/octet-stream" is identified correctly. Resolves a shortcoming where a file specified by path immediately fell back to filename-extension based identification when misidentified as "application/octet-stream", either by asserted content type or a mis-guess by libmagic. An MS Office file misidentified in this way is now correctly identified regardless of its filename and whether it is specified by path or file-like object.
- Textual content retrieved from a URL with gzip transport compression now partitions correctly. Resolves a bug where a textual file-type (such as Markdown) retrieved by passing a URL to `partition()` would raise when `gzip` compression was used for transport by the server.
- A DOCX, PPTX, or XLSX content-type asserted on partition is confirmed or fixed. Resolves a bug where calling `partition()` with a swapped MS-Office `content_type` would cause the file-type to be misidentified. A DOCX, PPTX, or XLSX MIME-type received by `partition()` is now checked for accuracy and corrected if the file is for a different MS-Office 2007+ type.
- DOC, PPT, XLS, and MSG files are now auto-detected correctly. Resolves a bug where DOC, PPT, and XLS files were auto-detected as MSG files under certain circumstances.
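One way to tell DOCX, PPTX, and XLSX apart without trusting the filename or an asserted MIME-type, as the MS Office fixes above describe, is to look inside the Zip archive. The sketch below checks for the main part each format conventionally contains. The function name `sniff_ooxml_type` and the member-path table are illustrative assumptions, not the detection logic used by `detect_filetype()`.

```python
import io
import zipfile

# Main-part paths conventionally used by each Office Open XML format
# (an assumption for this sketch; unusual packages may place them elsewhere).
_MAIN_PARTS = {
    "word/document.xml": "docx",
    "ppt/presentation.xml": "pptx",
    "xl/workbook.xml": "xlsx",
}


def sniff_ooxml_type(file):
    """Return 'docx', 'pptx', 'xlsx', or None.

    `file` may be a path or a seekable binary file-like object.
    """
    if not zipfile.is_zipfile(file):
        return None
    with zipfile.ZipFile(file) as archive:
        names = set(archive.namelist())
    for member, kind in _MAIN_PARTS.items():
        if member in names:
            return kind
    return None
```

Because the check reads the archive contents, it gives the same answer for a path and for an `io.BytesIO` holding the same bytes.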
0.15.0
Enhancements
- Improve text clearing process in email partitioning. Updated the email partitioner to remove both `=\n` and `=\r\n` characters during the clearing process. Previously, only `=\n` characters were removed.
- Bump unstructured.paddleocr to 2.8.0.1.
- Refine HTML parser to accommodate block element nested in phrasing. The HTML parser no longer raises on a block element (e.g. `<p>`, `<div>`) nested inside a phrasing element (e.g. `<strong>` or `<cite>`). Instead it breaks the phrasing run (and therefore element) at the block-item start and begins a new phrasing run after the block-item. This is consistent with how the browser determines element boundaries in this situation.
- Install rewritten HTML parser to fix 12 existing bugs and provide headroom for refinement and growth. A rewritten HTML parser resolves a collection of outstanding bugs with HTML partitioning and provides a firm foundation for further elaborating that important partitioner.
- CI check for dependency licenses. Adds a CI check to ensure dependencies are appropriately licensed.
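For the email text-clearing enhancement above: in quoted-printable encoding, an `=` at the end of a line marks a soft line break, and the line ending may be either LF or CRLF. A minimal sketch of stripping both forms follows; the function name `remove_soft_line_breaks` is hypothetical and not taken from the partitioner.

```python
def remove_soft_line_breaks(text: str) -> str:
    """Join lines split by quoted-printable soft line breaks."""
    # Remove the CRLF form as well as the bare-LF form.
    return text.replace("=\r\n", "").replace("=\n", "")
```

With only the `=\n` case handled, a message using CRLF line endings would keep its `=\r\n` sequences and words would stay split mid-line.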
Features
- Add support for specifying OCR language to `partition_pdf()`. Extend language specification capability to `PaddleOCR` in addition to `TesseractOCR`. Users can now specify OCR languages for both OCR engines when using `partition_pdf()`.
- Add AstraDB source connector. Adds support for ingesting documents from AstraDB.
Fixes
- Remedy error on Windows when `nltk` binaries are downloaded. Work around a quirk in the Windows implementation of `tempfile.NamedTemporaryFile` where accessing the temporary file by name raises `PermissionError`.
- Move Astra embedded_dimension to write config.
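A common way to avoid the Windows `NamedTemporaryFile` quirk described above is to create the file with `delete=False`, close the handle before reopening the file by name, and delete it manually afterwards. The sketch below shows that general pattern; it is not necessarily the workaround applied in this release, and `write_then_reopen` is a hypothetical name.

```python
import os
import tempfile


def write_then_reopen(data: bytes) -> bytes:
    """Write bytes to a temp file, then read them back by file name."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp.write(data)
        # Release the handle first; on Windows the path cannot be
        # reopened while the original handle is still open.
        tmp.close()
        with open(tmp.name, "rb") as f:
            return f.read()
    finally:
        tmp.close()          # harmless if already closed
        os.unlink(tmp.name)  # manual cleanup, since delete=False
```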
0.14.10
Enhancements
- Update unstructured-client dependency. Change unstructured-client dependency pin back to greater than min version and update tests that were failing given the update.
- `.doc` files are now supported in the `arm64` image. `libreoffice24` is added to the `arm64` image, meaning `.doc` files are now supported. We have follow-on work planned to investigate adding `.ppt` support for `arm64` as well.
- Add table detection metrics: recall, precision and f1.
- Remove unused _with_spans metrics.
Features
Fixes
- Fix counting false negatives and false positives in table structure evaluation.
- Fix Slack CI test. Change channel that Slack test is pointing to because previous test bot expired.
- Remove NLTK download. Removes `nltk.download` in favor of downloading from an S3 bucket we host to mitigate CVE-2024-39705.
0.14.9
Enhancements
- Added visualization and OD model result dump for PDF. In the PDF `hi_res` strategy, the `analysis` parameter can be used to visualize the result of the OD model and dump the result to a file. Additionally, the visualization of bounding boxes of each layout source is rendered and saved for each page.
- `partition_docx()` distinguishes "file not found" from "not a ZIP archive" error. `partition_docx()` now provides different error messages for "file not found" and "file is not a ZIP archive (and therefore not a DOCX file)". This aids diagnosis since these two conditions generally point in different directions as to the cause and fix.
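The two conditions that `partition_docx()` now reports separately can be told apart with a pair of standard-library checks. The sketch below is illustrative only; the function name `check_docx_path`, the exception types, and the message wording are assumptions rather than the library's actual behavior.

```python
import os
import zipfile


def check_docx_path(path: str) -> None:
    """Raise a distinct error for each way a DOCX path can be unusable."""
    if not os.path.isfile(path):
        # Usually a mis-typed path or a missing download.
        raise FileNotFoundError(f"no such file: {path!r}")
    if not zipfile.is_zipfile(path):
        # The file exists but is some other format, or is corrupt.
        raise ValueError(f"not a ZIP archive (so not a DOCX file): {path!r}")
```

The existence check has to run first, because `zipfile.is_zipfile()` also returns `False` for a path that does not exist.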
Features
Fixes
- Fix a bug where multiple `soffice` processes could be attempted. Add a wait mechanism in `convert_office_doc` so that the function first checks whether another `soffice` process is already running. If so, it waits until the other process finishes or the wait timeout elapses before spawning a subprocess to run `soffice`.
- `partition()` now forwards `strategy` arg to `partition_docx()`, `partition_pptx()`, and their brokering partitioners for DOC, ODT, and PPT formats. A `strategy` argument passed to `partition()` (or the default value "auto" assigned by `partition()`) is now forwarded to `partition_docx()`, `partition_pptx()`, and their brokering partitioners when those filetypes are detected.
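The wait mechanism described for `convert_office_doc` above amounts to polling with a deadline. A minimal sketch under stated assumptions follows: `is_busy` is a hypothetical callable standing in for "another `soffice` process is running", and the default timeout and poll interval are arbitrary values, not the library's.

```python
import time
from typing import Callable


def wait_until_idle(
    is_busy: Callable[[], bool],
    timeout: float = 10.0,
    poll_interval: float = 0.1,
) -> bool:
    """Poll until `is_busy()` is False or the timeout elapses.

    Returns True if the resource became idle, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while is_busy():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

Following the release note, a caller would go on to spawn its own subprocess in either case; the return value only records whether the wait ended because the other process finished or because the timeout was reached.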
0.14.8
Enhancements
- Move arm64 image to wolfi-base. The `arm64` image now runs on `wolfi-base`. The `arm64` build for `wolfi-base` does not yet include `libreoffice`, and so `arm64` does not currently support processing `.doc`, `.ppt`, or `.xls` files. If you need to process those files on `arm64`, use the legacy `rockylinux` image.
Features
Fixes
- Bump unstructured-inference==0.7.36. Fix `ValueError` when converting cells to html.
- `partition()` now forwards `strategy` arg to `partition_docx()`, `partition_ppt()`, and `partition_pptx()`. A `strategy` argument passed to `partition()` (or the default value "auto" assigned by `partition()`) is now forwarded to `partition_docx()`, `partition_ppt()`, and `partition_pptx()` when those filetypes are detected.
- Fix missing sensitive field markers for embedders.
0.14.7
Enhancements
- Pull from `wolfi-base` image. The amd64 image now pulls from the `unstructured` `wolfi-base` image to avoid duplication of dependency setup steps.
- Fix Windows temp file. Make the creation of a temp file in unstructured/partition/pdf_image/ocr.py Windows-compatible.
Features
- Expose conversion functions for tables. Adds public functions to convert tables from HTML to the Deckerd format and back.
Fixes
- Fix an error publishing docker images. Update user in docker-smoke-test to reflect changes made by the amd64 image pull from the "unstructured" "wolfi-base" image.
- Fix an `IndexError` when partitioning a PDF with values for both `extract_image_block_types` and `starting_page_number`.