Releases: Unstructured-IO/unstructured
0.4.3
- Adds `requests` as a base dependency
- Fix in `exceeds_cap_ratio` so the function doesn't break with empty text
- Fix bug in `_parse_received_data`
- Update `detect_filetype` to properly handle `.doc`, `.xls`, and `.ppt`
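The empty-text fix in 0.4.3 can be illustrated with a small sketch. The release notes do not say how `exceeds_cap_ratio` computes its ratio, so the function name `cap_ratio_exceeds`, the `threshold` parameter, and the word-level counting below are all assumptions made for illustration, not the library's implementation. The point is only the guard that returns early when there are no words to count:

```python
def cap_ratio_exceeds(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical helper: True if the share of capitalized words is above threshold."""
    words = text.split()
    # Guard for empty or whitespace-only input; without it the division
    # below would raise ZeroDivisionError.
    if not words:
        return False
    capitalized = sum(1 for word in words if word[:1].isupper())
    return capitalized / len(words) > threshold
```

With the guard in place, `cap_ratio_exceeds("")` returns `False` instead of raising.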
0.4.2
- Added `partition_image` to process documents in an image format.
- Fixed utf-8 encoding error in `partition_email` with attachments for `text/html`
0.4.1
- Added support for text files in the `partition` function
- Pinned `opencv-python` for easier installation on Linux
0.4.0
- Added generic `partition` brick that detects the file type and routes a file to the appropriate partitioning brick.
- Added a file type detection module.
- Updated `partition_html` and `partition_eml` to support file-like objects in 'rb' mode.
- Cleaning brick for removing ordered bullets `clean_ordered_bullets`.
- Extract brick method for ordered bullets `extract_ordered_bullets`.
- Test for `clean_ordered_bullets`.
- Test for `extract_ordered_bullets`.
- Added `partition_docx` for pre-processing Word Documents.
- Added new REGEX patterns to extract email header information
- Added new functions to extract header information `parse_received_data` and `partition_header`
- Added new function to parse plain text files `partition_text`
- Added new cleaner functions `extract_ip_address`, `extract_ip_address_name`, `extract_mapi_id`, and `extract_datetimetz`
- Add new `Image` element and function to find embedded images `find_embedded_images`
- Added `get_directory_file_info` for summarizing information about source documents
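The detect-then-route behavior described for the generic `partition` brick in 0.4.0 can be sketched as follows. This is a minimal illustration under stated assumptions: detection here uses only the file extension, and `EXTENSION_TO_TYPE`, `guess_filetype`, and `route_file` are hypothetical names, not the library's file type detection module or its `partition` signature.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical extension table; the real detection module may use other signals.
EXTENSION_TO_TYPE: Dict[str, str] = {
    ".txt": "text",
    ".html": "html",
    ".eml": "email",
    ".docx": "docx",
}


def guess_filetype(filename: str) -> str:
    """Map a filename to a coarse file type label using its extension."""
    return EXTENSION_TO_TYPE.get(Path(filename).suffix.lower(), "unknown")


def route_file(
    filename: str, handlers: Dict[str, Callable[[str], List[str]]]
) -> List[str]:
    """Detect the file type, then call the handler registered for that type."""
    filetype = guess_filetype(filename)
    if filetype not in handlers:
        raise ValueError(f"No handler registered for file type: {filetype}")
    return handlers[filetype](filename)
```

A caller would register one handler per file type, for example `route_file("notes.txt", {"text": my_text_handler})`, where `my_text_handler` is whatever function should process plain text files.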
0.3.5
- Add support for local inference
- Add new pattern to recognize plain text dash bullets
- Add test for bullet patterns
- Fix for partition_html that allows for processing div tags that have both text and child elements
- Add ability to extract document metadata from .docx, .xlsx, and .jpg files.
- Helper functions for identifying and extracting phone numbers
- Add new function extract_attachment_info that extracts and decodes the attachment of an email.
- Staging brick to convert a list of Elements to a pandas dataframe.
0.3.4
- Python 3.7 compatibility
0.3.3
- Removes BasicConfig from logger configuration
- Adds the `partition_email` partitioning brick
- Adds the `replace_mime_encodings` cleaning brick
- Small fix to HTML parsing related to processing list items with sub-tags
0.3.2
- Added `translate_text` brick for translating text between languages
- Add an `apply` method to make it easier to apply cleaners to elements
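As a rough illustration of what an `apply` method for cleaners can look like, the sketch below chains cleaner functions over an element's text. `TextElement`, `collapse_whitespace`, and `strip_leading_dash` are hypothetical stand-ins written for this example; the release notes do not give the library's class or method signature, so none of this should be read as its API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TextElement:
    """Hypothetical stand-in for a text element."""

    text: str

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner in order and store the result on the element."""
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
            if not isinstance(cleaned, str):
                raise ValueError("Each cleaner must return a string.")
        self.text = cleaned


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with single spaces and trim the ends."""
    return " ".join(text.split())


def strip_leading_dash(text: str) -> str:
    """Drop leading dash and space characters, as in a plain text bullet."""
    return text.lstrip("- ").strip()
```

Calling `element.apply(collapse_whitespace, strip_leading_dash)` runs the cleaners left to right, so the order of the arguments matters.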
0.3.1
- Added `__init__.py` to `partition`
0.3.0
- Implement staging brick for Argilla. Converts lists of `Text` elements to `argilla` dataset classes.
- Removing the local PDF parsing code and any dependencies and tests.
- Reorganizes the staging bricks in the unstructured.partition module
- Allow entities to be passed into the Datasaur staging brick
- Added HTML escapes to the `replace_unicode_quotes` brick
- Fix bad responses in partition_pdf to raise ValueError
- Adds `partition_html` for partitioning HTML documents.