Releases: Unstructured-IO/unstructured

0.4.3

18 Jan 17:31
59f972d

  • Adds requests as a base dependency
  • Fix in exceeds_cap_ratio so the function doesn't break with empty text
  • Fix bug in _parse_received_data.
  • Update detect_filetype to properly handle .doc, .xls, and .ppt.
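
The exceeds_cap_ratio fix concerns text that has no words to count. As a rough illustration only, and not the library's actual implementation, a capitalization-ratio check with an empty-text guard might look like the sketch below. The function name cap_ratio_exceeded, the default threshold, and the whitespace tokenization are all assumptions made for this example.

```python
def cap_ratio_exceeded(text: str, threshold: float = 0.5) -> bool:
    """Return True if the share of capitalized words exceeds `threshold`.

    Hypothetical stand-in for illustration; not unstructured's own code.
    """
    # Keep only tokens that start with a letter (skips "1A", "-", etc.).
    words = [w for w in text.split() if w[0].isalpha()]
    # Guard: empty or non-alphabetic text would otherwise divide by zero.
    if not words:
        return False
    capitalized = sum(1 for w in words if w[0].isupper())
    return capitalized / len(words) > threshold
```

Returning False for empty input is a design choice for this sketch: with nothing to measure, the text is treated as not exceeding the ratio.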

0.4.2

17 Jan 16:36
9c3c14e

  • Added partition_image to process documents in an image format.
  • Fixed utf-8 encoding error in partition_email with attachments for text/html

0.4.1

13 Jan 22:23
419c086

  • Added support for text files in the partition function
  • Pinned opencv-python for easier installation on Linux

0.4.0

11 Jan 18:05
eba4c80

  • Added generic partition brick that detects the file type and routes a file to the appropriate
    partitioning brick.
  • Added a file type detection module.
  • Updated partition_html and partition_eml to support file-like objects in 'rb' mode.
  • Added the clean_ordered_bullets cleaning brick for removing ordered bullets.
  • Added the extract_ordered_bullets extraction brick for ordered bullets.
  • Added a test for clean_ordered_bullets.
  • Added a test for extract_ordered_bullets.
  • Added partition_docx for pre-processing Word Documents.
  • Added new regex patterns to extract email header information
  • Added new functions to extract header information: parse_received_data and partition_header
  • Added a new function to parse plain text files: partition_text
  • Added new cleaning functions: extract_ip_address, extract_ip_address_name, extract_mapi_id, and extract_datetimetz
  • Added a new Image element and a function to find embedded images: find_embedded_images
  • Added get_directory_file_info for summarizing information about source documents
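
The generic partition brick is described as detecting a file type and routing the file to the matching partitioning brick. The sketch below shows that dispatch pattern in miniature. It is a simplified illustration under stated assumptions, not the library's code: detection here uses the file extension only, and the names split_text, split_html, ROUTES, and route are hypothetical.

```python
import re
from pathlib import Path
from typing import Callable, Dict, List


def split_text(raw: str) -> List[str]:
    """Hypothetical plain-text partitioner: one entry per non-empty line."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def split_html(raw: str) -> List[str]:
    """Hypothetical HTML partitioner: drop tags, then split like text."""
    return split_text(re.sub(r"<[^>]+>", "\n", raw))


# Map each detected file type to the function that handles it.
ROUTES: Dict[str, Callable[[str], List[str]]] = {
    ".txt": split_text,
    ".html": split_html,
}


def route(filename: str, raw: str) -> List[str]:
    """Detect the type from the extension (an assumption) and dispatch."""
    ext = Path(filename).suffix.lower()
    if ext not in ROUTES:
        raise ValueError(f"Unsupported file type: {ext!r}")
    return ROUTES[ext](raw)
```

Keeping the routing table separate from the partitioners means a new file type needs only one new entry in ROUTES.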

0.3.5

05 Jan 00:50
a75499d

  • Add support for local inference
  • Add new pattern to recognize plain text dash bullets
  • Add test for bullet patterns
  • Fix for partition_html that allows for processing div tags that have both text and child elements
  • Add ability to extract document metadata from .docx, .xlsx, and .jpg files.
  • Add helper functions for identifying and extracting phone numbers
  • Add new function extract_attachment_info that extracts and decodes the attachment of an email.
  • Add staging brick to convert a list of Elements to a pandas dataframe.

0.3.4

21 Dec 15:29
962c9dc

  • Python 3.7 compatibility

0.3.3

20 Dec 20:03
de4d0d4

  • Removes BasicConfig from logger configuration
  • Adds the partition_email partitioning brick
  • Adds the replace_mime_encodings cleaning brick
  • Small fix to HTML parsing related to processing list items with sub-tags
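
Email bodies often contain MIME quoted-printable sequences such as =E2=80=99 in place of non-ASCII characters. Assuming that is the kind of encoding replace_mime_encodings targets, the cleanup can be sketched with Python's standard quopri module. This is not the library's implementation, and the helper name decode_quoted_printable is hypothetical.

```python
import quopri


def decode_quoted_printable(text: str, encoding: str = "utf-8") -> str:
    """Replace quoted-printable escapes with the characters they encode."""
    # quopri works on bytes, so encode first and decode the result.
    return quopri.decodestring(text.encode(encoding)).decode(encoding)
```

For example, decode_quoted_printable("5 w=E2=80=99s") turns the escaped bytes back into a right single quotation mark.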

0.3.2

15 Dec 22:20
1d68bb2

  • Added translate_text brick for translating text between languages
  • Add an apply method to make it easier to apply cleaners to elements
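
An apply method of this kind lets several cleaners run over an element's text in a single call. The sketch below illustrates the idea with a hypothetical TextElement class. It is a minimal stand-in under that assumption, not the library's element class, and the exact signature and error handling in unstructured may differ.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TextElement:
    """Hypothetical stand-in for an element that holds text."""

    text: str

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner in order and store the final result."""
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
        # Reject cleaners that return something other than text.
        if not isinstance(cleaned, str):
            raise ValueError("Cleaner result must be a string.")
        self.text = cleaned


element = TextElement("  [1] A Textbook  ")
element.apply(str.strip, str.lower)
```

After the call, element.text holds the stripped, lowercased string; any function that takes a string and returns a string can be passed as a cleaner.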

0.3.1

14 Dec 18:00
1700d4d

  • Added __init__.py to partition

0.3.0

14 Dec 16:39
151732c

  • Implement staging brick for Argilla. Converts lists of Text elements to argilla dataset classes.
  • Removing the local PDF parsing code and any dependencies and tests.
  • Reorganizes the staging bricks in the unstructured.partition module
  • Allow entities to be passed into the Datasaur staging brick
  • Added HTML escapes to the replace_unicode_quotes brick
  • Fix bad responses in partition_pdf to raise ValueError
  • Adds partition_html for partitioning HTML documents.