Skip to content

Navigation Menu

Explore
By company size
By use case
By industry
View all solutions
Topics
- AI
- DevOps
- Security
- Software Development
- View all
Explore
- GitHub Sponsors
  Fund open source developers
- The ReadME Project
  GitHub community articles
Repositories
- Enterprise platform
  AI-powered developer platform
Available add-ons
Pricing

Search code, repositories, users, issues, pull requests...

Search

Clear

Search syntax tips

Provide feedback

We read every piece of feedback, and take your input very seriously.

Include my email address so I can be contacted

Saved searches

Use saved searches to filter your results more quickly

Name

Query

To see all available qualifiers, see our documentation.

You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session.

Dismiss alert

Unstructured-IO / unstructured Public

Notifications You must be signed in to change notification settings
Fork 803
Star 9.5k

Code
Issues 119
Pull requests 49
Discussions
Actions
Projects 1
Security
Insights

Additional navigation options

Code
Issues
Pull requests
Discussions
Actions
Projects
Security
Insights

Releases: Unstructured-IO/unstructured

Releases · Unstructured-IO/unstructured

0.4.3

18 Jan 17:31

MthwRobinson

This commit was created on github.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.4.3

0.4.3

Adds requests as a base dependency
Fix in exceeds_cap_ratio so the function doesn't break with empty text
Fix bug in _parse_received_data.
Update detect_filetype to properly handle .doc, .xls, and .ppt.

Assets 2

Loading

All reactions

0.4.2

17 Jan 16:36

MthwRobinson

This commit was created on github.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.4.2

0.4.2

Added partition_image to process documents in an image format.
Fixed utf-8 encoding error in partition_email with attachments for text/html

Assets 2

Loading

All reactions

0.4.1

13 Jan 22:23

MthwRobinson

This commit was created on github.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.4.1

0.4.1

Added support for text files in the partition function
Pinned opencv-python for easier installation on Linux

Assets 2

Loading

All reactions

0.4.0

11 Jan 18:05

MthwRobinson

This commit was created on github.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.4.0

0.4.0

Added generic partition brick that detects the file type and routes a file to the appropriate
partitioning brick.
Added a file type detection module.
Updated partition_html and partition_eml to support file-like objects in 'rb' mode.
Cleaning brick for removing ordered bullets clean_ordered_bullets.
Extract brick method for ordered bullets extract_ordered_bullets.
Test for clean_ordered_bullets.
Test for extract_ordered_bullets.
Added partition_docx for pre-processing Word Documents.
Added new REGEX patterns to extract email header information
Added new functions to extract header information parse_received_data and partition_header
Added new function to parse plain text files partition_text
Added new cleaners functions extract_ip_address, extract_ip_address_name, extract_mapi_id, extract_datetimetz
Add new Image element and function to find embedded images find_embedded_images
Added get_directory_file_info for summarizing information about source documents

Assets 2

Loading

All reactions

0.3.5

05 Jan 00:50

qued

This commit was created on github.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.3.5

0.3.5

Add support for local inference
Add new pattern to recognize plain text dash bullets
Add test for bullet patterns
Fix for partition_html that allows for processing div tags that have both text and child elements
Add ability to extract document metadata from .docx, .xlsx, and .jpg files.
Helper functions for identifying and extracting phone numbers
Add new function extract_attachment_info that extracts and decode the attachment of an email.
Staging brick to convert a list of Elements to a pandas dataframe.

Assets 2

Loading

All reactions

0.3.4

21 Dec 15:29

MthwRobinson

This commit was created on github.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.3.4

0.3.4

Python-3.7 compat

Assets 2

Loading

All reactions

0.3.3

20 Dec 20:03

yuming-long

This commit was created on github.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.3.3

0.3.3

Removes BasicConfig from logger configuration
Adds the partition_email partitioning brick
Adds the replace_mime_encodings cleaning bricks
Small fix to HTML parsing related to processing list items with sub-tags

Assets 2

Loading

All reactions

0.3.2

15 Dec 22:20

MthwRobinson

This commit was created on github.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.3.2

0.3.2

Added translate_text brick for translating text between languages
Add an apply method to make it easier to apply cleaners to elements

Assets 2

Loading

All reactions

0.3.1

14 Dec 18:00

MthwRobinson

This commit was created on github.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.3.1

0.3.1

Added __init.py__ to partition

Assets 2

Loading

All reactions

0.3.0

14 Dec 16:39

MthwRobinson

This commit was created on github.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.3.0

0.3.0

Implement staging brick for Argilla. Converts lists of Text elements to argilla dataset classes.
Removing the local PDF parsing code and any dependencies and tests.
Reorganizes the staging bricks in the unstructured.partition module
Allow entities to be passed into the Datasaur staging brick
Added HTML escapes to the replace_unicode_quotes brick
Fix bad responses in partition_pdf to raise ValueError
Adds partition_html for partitioning HTML documents.

Assets 2

Loading

All reactions

Previous 1 2 … 13 14 15 16 17 Next

Footer

© 2024 GitHub, Inc.

Footer navigation

Terms
Privacy
Security
Status
Docs
Contact

You can’t perform that action at this time.