Releases: Unstructured-IO/unstructured
0.5.8
Enhancements
- Update elements_to_json to return a string when filename is not specified.
- elements_from_json may take a string instead of a filename with the text kwarg.
- detect_filetype now does a final fallback to file extension.
- Empty tags are now skipped during the depth check for HTML processing.
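The serialization change above can be illustrated with a minimal sketch. It assumes elements are plain dicts rather than the library's element classes, and the `_sketch` function names are hypothetical stand-ins, not the library's implementation.

```python
import json
from typing import Optional


def elements_to_json_sketch(elements: list, filename: Optional[str] = None) -> Optional[str]:
    """Serialize element dicts; return the JSON string when no filename is given."""
    payload = json.dumps(elements, indent=2)
    if filename is None:
        return payload
    with open(filename, "w", encoding="utf-8") as f:
        f.write(payload)
    return None


def elements_from_json_sketch(filename: str = "", text: str = "") -> list:
    """Load element dicts from a file, or directly from a JSON string via `text`."""
    if bool(filename) == bool(text):
        raise ValueError("Specify exactly one of filename or text.")
    if text:
        return json.loads(text)
    with open(filename, encoding="utf-8") as f:
        return json.load(f)


# Round trip entirely in memory: no filename on either side.
elements = [{"type": "Title", "text": "Release notes"}]
as_string = elements_to_json_sketch(elements)
round_tripped = elements_from_json_sketch(text=as_string)
```

Returning the string when no filename is given lets callers pass serialized elements between processes without touching the file system.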
Features
- Add local file system to unstructured-ingest
- Add --max-docs parameter to unstructured-ingest
- Added partition_msg for processing MSFT Outlook .msg files.
Fixes
- convert_file_to_text now passes through the source_format and target_format kwargs. Previously they were hard coded.
- Partitioning functions that accept a text kwarg no longer raise an error if an empty string is passed (an empty list of elements is returned instead).
- partition_json no longer fails if the input is an empty list.
- Fixed bug in chunk_by_attention_window that caused the last word in segments to be cut off in some cases.
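Two of the fixes above concern edge cases: empty input and words lost at segment boundaries. A minimal sketch of boundary-safe chunking follows. It assumes whitespace tokenization and a simple word budget instead of a real tokenizer, and `chunk_by_word_budget` is a hypothetical name, not the library's chunk_by_attention_window.

```python
def chunk_by_word_budget(text: str, max_words: int) -> list:
    """Split text into segments of at most max_words words without dropping any word."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    chunks = []
    current = []
    for word in text.split():
        # Close the segment only once it is full, then start a new one
        # with the current word so nothing is lost at the boundary.
        if len(current) == max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(word)
    # Flush the final, possibly shorter, segment.
    if current:
        chunks.append(" ".join(current))
    return chunks
```

An empty string yields an empty list rather than an error, mirroring the behavior described for the partitioning functions.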
BREAKING CHANGES
- stage_for_transformers now returns a list of elements, making it consistent with other staging bricks
0.5.7
Enhancements
- Refactored codebase using exactly_one
- Adds ability to pass headers when passing a url in partition_html()
- Added optional content_type and file_filename parameters to partition() to bypass file detection
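A helper like exactly_one typically validates that a caller supplied one, and only one, of several mutually exclusive inputs. The sketch below shows that idea; the `_sketch` name, the treatment of empty strings as "not specified", and the error message are assumptions, not the library's code.

```python
def exactly_one_sketch(**kwargs) -> None:
    """Raise ValueError unless exactly one keyword argument is specified."""
    specified = [
        name for name, value in kwargs.items()
        if value is not None and value != ""
    ]
    if len(specified) != 1:
        names = ", ".join(kwargs)
        raise ValueError(f"Exactly one of {names} must be specified.")


# Passes silently: only `filename` is set.
exactly_one_sketch(filename="doc.html", file=None, text=None)
```

Centralizing the check keeps every partitioning function's argument validation identical.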
Features
- Add --flatten-metadata parameter to unstructured-ingest
- Add --fields-include parameter to unstructured-ingest
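One plausible reading of these two options is sketched below: flattening lifts the keys of a nested metadata dict to the top level of each element, and field inclusion keeps only the named top-level keys. The function names and the element layout are assumptions for illustration, not the ingest CLI's actual code.

```python
def flatten_metadata(element: dict) -> dict:
    """Lift the keys under "metadata" to the top level and drop the "metadata" key."""
    flat = {key: value for key, value in element.items() if key != "metadata"}
    flat.update(element.get("metadata", {}))
    return flat


def include_fields(element: dict, fields: list) -> dict:
    """Keep only the listed top-level fields that are present in the element."""
    return {key: element[key] for key in fields if key in element}


element = {
    "type": "Title",
    "text": "Hello",
    "metadata": {"filename": "a.pdf", "page_number": 1},
}
flat = flatten_metadata(element)
trimmed = include_fields(element, ["text", "type"])
```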
Fixes
0.5.6
- Fix problem with PDF partition (duplicated test)
Enhancements
- contains_english_word(), used heavily in text processing, is 10x faster.
Features
- Add --metadata-include and --metadata-exclude parameters to unstructured-ingest
- Add clean_non_ascii_chars to remove non-ascii characters from unicode string
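Removing non-ASCII characters from a string can be done with an encode/decode round trip. The sketch below is one common way to do it; the `_sketch` name is hypothetical and the library's own implementation may differ.

```python
def clean_non_ascii_chars_sketch(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently discards characters outside the ASCII range.
    return text.encode("ascii", "ignore").decode("ascii")


cleaned = clean_non_ascii_chars_sketch("\x88This text contains non-ascii characters!\x88")
```

Note that this discards characters rather than transliterating them, so accented letters are removed, not replaced with their base letter.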
Fixes
- Fixes duplicated elements issue with partition_pdf(..., strategy="fast")
0.5.4
Enhancements
- Added Biomedical literature connector for ingest cli.
- Add FsspecConnector to easily integrate any existing fsspec filesystem as a connector.
- Rename s3_connector.py to s3.py for readability and consistency with the rest of the connectors.
- Now S3Connector relies on s3fs instead of on boto3, and it inherits from FsspecConnector.
- Adds an UNSTRUCTURED_LANGUAGE_CHECKS environment variable to control whether or not language specific checks like vocabulary and POS tagging are applied. Set to "true" for higher resolution partitioning and "false" for faster processing.
- Improves detect_filetype warning to include filename when provided.
- Adds a "fast" strategy for partitioning PDFs with PDFMiner. Also falls back to the "fast" strategy if detectron2 is not available.
- Start deprecation life cycle for unstructured-ingest --s3-url option, to be deprecated in favor of --remote-url.
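A boolean environment switch such as UNSTRUCTURED_LANGUAGE_CHECKS is usually read once and compared against "true". The sketch below shows that pattern; the helper name, the case-insensitive comparison, and the default of "false" when the variable is unset are assumptions, not confirmed library behavior.

```python
import os
from typing import Mapping, Optional


def language_checks_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when UNSTRUCTURED_LANGUAGE_CHECKS is set to "true"."""
    env = os.environ if environ is None else environ
    # Assumed default: checks are off when the variable is unset.
    value = env.get("UNSTRUCTURED_LANGUAGE_CHECKS", "false")
    return value.strip().lower() == "true"
```

Accepting an optional mapping makes the helper easy to exercise without modifying the real process environment.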
Features
- Add AzureBlobStorageConnector based on its fsspec implementation inheriting from FsspecConnector
- Add partition_epub for partitioning e-books in EPUB3 format.
Fixes
- Fixes processing for text files with message/rfc822 MIME type.
- Open xml files in read-only mode when reading contents to construct an XMLDocument.
0.5.3
Enhancements
- auto.partition() can now load Unstructured ISD json documents.
- Simplify partitioning functions.
- Improve logging for ingest CLI.
Features
- Add --wikipedia-auto-suggest argument to the ingest CLI to disable automatic redirection to pages with similar names.
- Add setup script for Amazon Linux 2
- Add optional encoding argument to the partition_(text/email/html) functions.
- Added Google Drive connector for ingest cli.
- Added Gitlab connector for ingest cli.
Fixes
0.5.2
Enhancements
- unstructured-ingest now uses a default --download-dir of $HOME/.cache/unstructured/ingest rather than a "tmp-ingest-" dir in the working directory.
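Resolving a per-user cache directory as the default can be sketched as follows. The function name is hypothetical, and the logic only mirrors the path stated above; it is not the CLI's actual code.

```python
from pathlib import Path
from typing import Optional


def resolve_download_dir(download_dir: Optional[str] = None) -> Path:
    """Return the explicit directory if given, else $HOME/.cache/unstructured/ingest."""
    if download_dir:
        return Path(download_dir).expanduser()
    return Path.home() / ".cache" / "unstructured" / "ingest"
```

A stable default location under the home directory means repeated runs can find earlier downloads regardless of the current working directory.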
Features
Fixes
- setup_ubuntu.sh no longer fails in some contexts by interpreting DEBIAN_FRONTEND=noninteractive as a command
- unstructured-ingest no longer re-downloads files when --preserve-downloads is used without --download-dir.
- Fixed an issue that was causing text to be skipped in some HTML documents.
0.5.1
Enhancements
Features
Fixes
- Fixes an error causing JavaScript to appear in the output of partition_html sometimes.
- Fix several issues with the requires_dependencies decorator, including the error message and how it was used, which had caused an error for unstructured-ingest --github-url ...
0.5.0
Enhancements
- Add requires_dependencies Python decorator to check dependencies are installed before instantiating a class or running a function
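A dependency-checking decorator of this kind can be sketched with the standard library alone. The `_sketch` name, the signature, and the error message below are assumptions, and the check only handles top-level module names.

```python
import importlib.util
from functools import wraps
from typing import Callable, List, Union


def requires_dependencies_sketch(dependencies: Union[str, List[str]]) -> Callable:
    """Raise ImportError at call time if any named top-level module is not installed."""
    names = [dependencies] if isinstance(dependencies, str) else list(dependencies)

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None for a top-level module that cannot be found.
            missing = [name for name in names if importlib.util.find_spec(name) is None]
            if missing:
                raise ImportError(
                    f"Missing dependencies: {', '.join(missing)}. Please install them."
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies_sketch("json")
def dump_ok() -> str:
    return "ok"
```

To guard class instantiation with this sketch, the decorator would be applied to the class's `__init__` method.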
Features
- Added Wikipedia connector for ingest cli.
Fixes
- Fix process_document file cleaning on failure
- Fixes an error introduced in the metadata tracking commit that caused NarrativeText and FigureCaption elements to be represented as Text in HTML documents.
0.4.16
Enhancements
- Fallback to using file extensions for filetype detection if libmagic is not present
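The fallback described above can be sketched as "use content sniffing when a sniffer is available, otherwise guess from the extension". In this sketch the `sniffer` parameter is a hypothetical stand-in for a libmagic wrapper, and the standard-library mimetypes module does the extension lookup; the library's own detection logic may differ.

```python
import mimetypes
from typing import Callable, Optional


def detect_mime_type(
    filename: str,
    sniffer: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return a MIME type from content sniffing if possible, else from the extension."""
    if sniffer is not None:
        # Content-based detection, e.g. a wrapper around libmagic.
        return sniffer(filename)
    # No sniffer available: fall back to the file extension.
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed
```

Extension-based guessing is less reliable than inspecting file contents, which is why it serves only as the fallback path.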
Features
- Added setup script for Ubuntu
- Added GitHub connector for ingest cli.
- Added partition_md partitioner.
- Added Reddit connector for ingest cli.
Fixes
- Initializes connector properly in ingest.main::MainProcess
- Restricts version of unstructured-inference to avoid multithreading issue
0.4.15
Enhancements
- Added elements_to_json and elements_from_json for easier serialization/deserialization
- convert_to_dict, dict_to_elements and convert_to_csv are now aliases for functions that use the ISD terminology.
Fixes
- Update to ensure all elements are preserved during serialization/deserialization