Releases: Unstructured-IO/unstructured

0.5.8

30 Mar 20:57
4148834

Enhancements

  • Update elements_to_json to return a string when filename is not specified
  • elements_from_json may take a string instead of a filename with the text kwarg
  • detect_filetype now does a final fallback to file extension.
  • Empty tags are now skipped during the depth check for HTML processing.
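
The first two items describe a common serialization pattern: return the JSON as a string unless a filename is given, and accept either a filename or raw text when loading. A minimal stdlib-only sketch of that pattern follows. The names to_json and from_json are hypothetical stand-ins that operate on plain dicts; they are not the library's own functions or element classes.

```python
import json
from typing import List, Optional


def to_json(elements: List[dict], filename: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: write JSON to filename if given, else return a string."""
    payload = json.dumps(elements, indent=2)
    if filename is None:
        return payload
    with open(filename, "w", encoding="utf-8") as f:
        f.write(payload)
    return None


def from_json(filename: Optional[str] = None, text: Optional[str] = None) -> List[dict]:
    """Hypothetical helper: load from a file, or from a JSON string passed as text."""
    if (filename is None) == (text is None):
        raise ValueError("Specify exactly one of filename or text.")
    if text is not None:
        return json.loads(text)
    with open(filename, encoding="utf-8") as f:
        return json.load(f)


# Round trip through a string, with no file on disk.
elements = [{"type": "Title", "text": "Hello"}]
serialized = to_json(elements)
restored = from_json(text=serialized)
```

Returning a string when no filename is passed lets callers forward the result to an API or a queue without touching the file system.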

Features

  • Add local file system to unstructured-ingest
  • Add --max-docs parameter to unstructured-ingest
  • Added partition_msg for processing Microsoft Outlook .msg files.

Fixes

  • convert_file_to_text now passes through the source_format and target_format kwargs.
    Previously they were hard coded.
  • Partitioning functions that accept a text kwarg no longer raise an error if an empty
    string is passed (an empty list of elements is returned instead).
  • partition_json no longer fails if the input is an empty list.
  • Fixed a bug in chunk_by_attention_window that caused the last word in segments to be cut off
    in some cases.
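
To illustrate the kind of edge case behind the last fix, here is a rough sketch of chunking text on word boundaries so that no trailing word is dropped. It is not the library's implementation: chunk_by_word_budget is a hypothetical name, and a plain word count stands in for the tokenizer-based length check that an attention window would actually use.

```python
from typing import List


def chunk_by_word_budget(text: str, max_words: int) -> List[str]:
    """Hypothetical sketch: split text into chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        if len(current) == max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(word)
    # Flush the remainder; skipping this step is how a final word gets lost.
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = chunk_by_word_budget("a b c d e", max_words=2)
```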

BREAKING CHANGES

  • stage_for_transformers now returns a list of elements, making it consistent with other
    staging bricks

0.5.7

24 Mar 23:38
71e035c

Enhancements

  • Refactored codebase using exactly_one
  • Adds the ability to pass headers when passing a URL in partition_html()
  • Added optional content_type and file_filename parameters to partition() to bypass file detection
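
The exactly_one refactor suggests a small validation helper for functions that take several mutually exclusive inputs. The sketch below shows one way such a helper could look; the signature and the error message are assumptions, not taken from the library source.

```python
def exactly_one(**kwargs) -> None:
    """Sketch: raise ValueError unless exactly one keyword argument is set.

    "Set" here means neither None nor an empty string (an assumption).
    """
    provided = [
        name for name, value in kwargs.items()
        if value is not None and value != ""
    ]
    if len(provided) != 1:
        names = ", ".join(kwargs)
        raise ValueError(f"Exactly one of {names} must be specified.")


def describe_source(filename=None, text=None) -> str:
    """Hypothetical caller that accepts either a filename or raw text."""
    exactly_one(filename=filename, text=text)
    return "file" if filename is not None else "text"


source_kind = describe_source(text="Some inline content")
```

Centralizing the check keeps each function's argument validation to a single line and gives every caller the same error.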

Features

  • Add --flatten-metadata parameter to unstructured-ingest
  • Add --fields-include parameter to unstructured-ingest
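
As a rough illustration of what these two options might do to each output record, the sketch below lifts metadata keys to the top level and then keeps only selected fields. Both helper names are hypothetical, and the exact key naming and collision handling in the CLI output are assumptions.

```python
from typing import Iterable


def flatten_metadata(element: dict) -> dict:
    """Hypothetical: move the keys under "metadata" to the top level."""
    flat = {k: v for k, v in element.items() if k != "metadata"}
    # On a name collision the metadata value wins in this sketch.
    flat.update(element.get("metadata", {}))
    return flat


def include_fields(element: dict, fields: Iterable[str]) -> dict:
    """Hypothetical: keep only the named top-level fields."""
    wanted = set(fields)
    return {k: v for k, v in element.items() if k in wanted}


record = {
    "type": "Title",
    "text": "Hello",
    "metadata": {"filename": "a.html", "page_number": 1},
}
flat = flatten_metadata(record)
trimmed = include_fields(flat, ["text", "filename"])
```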

Fixes

0.5.6

21 Mar 20:42
3c95b97

  • Fix problem with PDF partition (duplicated test)

Enhancements

  • contains_english_word(), used heavily in text processing, is 10x faster.

Features

  • Add --metadata-include and --metadata-exclude parameters to unstructured-ingest
  • Add clean_non_ascii_chars to remove non-ascii characters from unicode string
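
Removing non-ASCII characters from a string can be done with an encode/decode round trip. The sketch below uses a hypothetical name, strip_non_ascii, to make clear that it is a stand-in and not necessarily how clean_non_ascii_chars is implemented.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")


cleaned = strip_non_ascii("\x88This text contains non-ascii characters!\x88")
```

Note that this approach deletes characters rather than transliterating them, so "café" becomes "caf".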

Fixes

  • Fixes duplicated elements issue with partition_pdf(..., strategy="fast")

0.5.4

14 Mar 15:54
e43cb0e

Enhancements

  • Added Biomedical literature connector for ingest CLI.
  • Add FsspecConnector to easily integrate any existing fsspec filesystem as a connector.
  • Rename s3_connector.py to s3.py for readability and consistency with the
    rest of the connectors.
  • Now S3Connector relies on s3fs instead of on boto3, and it inherits
    from FsspecConnector.
  • Adds an UNSTRUCTURED_LANGUAGE_CHECKS environment variable to control whether or not language
    specific checks like vocabulary and POS tagging are applied. Set to "true" for higher
    resolution partitioning and "false" for faster processing.
  • Improves detect_filetype warning to include filename when provided.
  • Adds a "fast" strategy for partitioning PDFs with PDFMiner. Also falls back to the "fast"
    strategy if detectron2 is not available.
  • Start deprecation life cycle for unstructured-ingest --s3-url option, to be deprecated in
    favor of --remote-url.
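
A feature toggle such as UNSTRUCTURED_LANGUAGE_CHECKS is typically read from the environment at call time. The sketch below shows one way to do that; language_checks_enabled is a hypothetical helper, and the default used when the variable is unset is an assumption, since the notes above do not state it.

```python
import os


def language_checks_enabled(default: bool = False) -> bool:
    """Hypothetical: return True when UNSTRUCTURED_LANGUAGE_CHECKS is "true"."""
    value = os.environ.get("UNSTRUCTURED_LANGUAGE_CHECKS")
    if value is None:
        return default  # assumed default; not stated in the release notes
    return value.strip().lower() == "true"


os.environ["UNSTRUCTURED_LANGUAGE_CHECKS"] = "true"
slow_but_thorough = language_checks_enabled()

os.environ["UNSTRUCTURED_LANGUAGE_CHECKS"] = "false"
fast_mode = not language_checks_enabled()
```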

Features

  • Add AzureBlobStorageConnector based on its fsspec implementation inheriting
    from FsspecConnector
  • Add partition_epub for partitioning e-books in EPUB3 format.

Fixes

  • Fixes processing for text files with message/rfc822 MIME type.
  • Open XML files in read-only mode when reading contents to construct an XMLDocument.

0.5.3

09 Mar 15:13
e43e917

Enhancements

  • auto.partition() can now load Unstructured ISD json documents.
  • Simplify partitioning functions.
  • Improve logging for ingest CLI.

Features

  • Add --wikipedia-auto-suggest argument to the ingest CLI to disable automatic redirection
    to pages with similar names.
  • Add setup script for Amazon Linux 2
  • Add optional encoding argument to the partition_(text/email/html) functions.
  • Added Google Drive connector for ingest CLI.
  • Added Gitlab connector for ingest CLI.

Fixes

0.5.2

02 Mar 19:04
a5da3de

Enhancements

  • unstructured-ingest now uses a default --download-dir of $HOME/.cache/unstructured/ingest
    rather than a "tmp-ingest-" dir in the working directory.

Features

Fixes

  • setup_ubuntu.sh no longer fails in some contexts by interpreting
    DEBIAN_FRONTEND=noninteractive as a command
  • unstructured-ingest no longer re-downloads files when --preserve-downloads
    is used without --download-dir.
  • Fixed an issue that was causing text to be skipped in some HTML documents.

0.5.1

01 Mar 00:17
a6f8256

Enhancements

Features

Fixes

  • Fixes an error causing JavaScript to appear in the output of partition_html sometimes.
  • Fix several issues with the requires_dependencies decorator, including the error message
    and how it was used, which had caused an error for unstructured-ingest --github-url ....

0.5.0

28 Feb 15:45
6966178

Enhancements

  • Add requires_dependencies Python decorator to check dependencies are installed before
    instantiating a class or running a function
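
A dependency-checking decorator of this kind can be sketched with the standard library alone. The version below is an illustration under assumptions: the signature, the error type, and the message are not taken from the library, and only top-level module names are checked.

```python
import functools
import importlib.util
from typing import Callable, List, Union


def requires_dependencies(dependencies: Union[str, List[str]]) -> Callable:
    """Sketch: fail early with a clear error if any required module is missing."""
    if isinstance(dependencies, str):
        dependencies = [dependencies]

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level module cannot be found.
            missing = [d for d in dependencies if importlib.util.find_spec(d) is None]
            if missing:
                raise ImportError(f"Missing dependencies: {', '.join(missing)}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies("json")
def dump_ok() -> str:
    import json
    return json.dumps({"ok": True})


@requires_dependencies(["surely_missing_module_for_demo"])
def needs_missing() -> None:
    pass


result = dump_ok()
```

Checking at call time rather than import time means the package can still be imported when optional extras are not installed.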

Features

  • Added Wikipedia connector for ingest CLI.

Fixes

  • Fix process_document file cleaning on failure
  • Fixes an error introduced in the metadata tracking commit that caused NarrativeText
    and FigureCaption elements to be represented as Text in HTML documents.

0.4.16

28 Feb 04:50
5eaf449

Enhancements

  • Fall back to using file extensions for filetype detection if libmagic is not present
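
The fallback idea can be sketched as "try content sniffing first, then guess from the extension". In the sketch below, the optional sniff callable stands in for a libmagic-based detector, and the standard mimetypes module stands in for the library's own extension mapping. Both substitutions, and the name detect_mime_type, are assumptions for illustration.

```python
import mimetypes
from typing import Callable, Optional


def detect_mime_type(
    filename: str,
    sniff: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Hypothetical: use the sniffer when available, else the file extension."""
    if sniff is not None:
        detected = sniff(filename)
        if detected:
            return detected
    # No sniffer, or it returned nothing: fall back to the extension.
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed


by_extension = detect_mime_type("report.html")
by_sniffer = detect_mime_type("report.html", sniff=lambda _name: "application/pdf")
```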

Features

  • Added setup script for Ubuntu
  • Added GitHub connector for ingest CLI.
  • Added partition_md partitioner.
  • Added Reddit connector for ingest CLI.

Fixes

  • Initializes connector properly in ingest.main::MainProcess
  • Restricts version of unstructured-inference to avoid multithreading issue

0.4.15

23 Feb 21:59
0d229f0

Enhancements

  • Added elements_to_json and elements_from_json for easier serialization/deserialization
  • convert_to_dict, dict_to_elements and convert_to_csv are now aliases for functions
    that use the ISD terminology.

Fixes

  • Update to ensure all elements are preserved during serialization/deserialization