This repository has been archived by the owner on Mar 12, 2024. It is now read-only.
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Unstructured-IO/community

Open-Source Pre-Processing Tools for Unstructured Data

Welcome to the Unstructured Community! 😊

We are building an ecosystem of preprocessing pipeline tools for Data Scientists and Data Engineers, so they may quickly work through the challenge of extracting structured data from unstructured raw documents.

☕ Getting Started

Unstructured's open-source packages currently target Python 3.8. If you are using or contributing to Unstructured code, we encourage you to work with Python 3.8 in a virtual environment. You can use the following instructions to get up and running with a Python 3.8 virtual environment with pyenv-virtualenv:

Mac / Homebrew

  1. Install pyenv with brew install pyenv.
  2. Install pyenv-virtualenv with brew install pyenv-virtualenv.
  3. Follow the instructions here to add the pyenv-virtualenv startup code to your terminal profile.
  4. Install Python 3.8 by running pyenv install 3.8.15.
  5. Create and activate a virtual environment by running:
pyenv virtualenv 3.8.15 unstructured
pyenv activate unstructured

You can change the name of the virtual environment from unstructured to another name if you're creating a virtual environment for a specific pipeline. For example, if you're creating a virtual environment for the SEC preprocessing pipeline, you can run pyenv virtualenv 3.8.15 sec.

Linux

  1. Run git clone https://github.com/pyenv/pyenv.git ~/.pyenv to install pyenv.
  2. Run git clone https://github.com/pyenv/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv to install pyenv-virtualenv as a pyenv plugin.
  3. Follow steps 3-5 from the Mac/Homebrew instructions.

👐 Contributions

We welcome contributions! See all open issues for bugs, features, and enhancement requests in the community.

When contributing, please follow our Contributing to Unstructured guidelines.

Don't hesitate to reach out to us on Slack with any questions. Thank you!

📗 Key Concepts

🧱 Bricks

Bricks are the building blocks of preprocessing pipelines: Python functions organized in the Unstructured library. Together they form a Swiss Army knife that Python developers can use to extract structured data from raw documents in the format they want. The library may be used independently of any other Unstructured repos under the terms of its license. Run pip install unstructured and you are good to go.
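
To make the brick idea concrete, here is a minimal sketch in plain Python. The function names below (partition_paragraphs, clean_whitespace, drop_bullets, apply_bricks) are hypothetical and written only for illustration; they are not the unstructured package's API. The sketch assumes that a brick is simply a small function that either splits a raw document into pieces or cleans one piece of text, so that bricks can be chained freely.

```python
import re
from typing import Callable, List

# NOTE: hypothetical bricks for illustration only. These are NOT functions
# from the unstructured package; they only mimic the idea of small,
# composable preprocessing functions.


def partition_paragraphs(raw: str) -> List[str]:
    """Partitioning brick: split raw text into paragraphs on blank lines."""
    return [p for p in re.split(r"\n\s*\n", raw) if p.strip()]


def clean_whitespace(text: str) -> str:
    """Cleaning brick: collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def drop_bullets(text: str) -> str:
    """Cleaning brick: remove a leading bullet marker ('-', '*', or a bullet dot)."""
    return re.sub(r"^\s*[-*\u2022]\s+", "", text)


def apply_bricks(text: str, bricks: List[Callable[[str], str]]) -> str:
    """Run one piece of text through a sequence of cleaning bricks, in order."""
    for brick in bricks:
        text = brick(text)
    return text


raw_document = "ITEM 1.   BUSINESS\n\n  - We make\n    widgets.\n\n\n* We sell   them."

# Partition first, then clean each paragraph with the chosen bricks.
elements = [
    apply_bricks(paragraph, [drop_bullets, clean_whitespace])
    for paragraph in partition_paragraphs(raw_document)
]
print(elements)
```

Because each brick takes text in and returns text (or a list of text pieces) out, a pipeline can add, drop, or reorder bricks without changing the others.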

🔹 Preprocessing pipeline APIs

A preprocessing pipeline API (or just "pipeline API") is a notebook that includes a Python function capable of transforming a raw document to structured data. By following the documented conventions, FastAPI APIs may be auto-generated from a pipeline notebook.

See pipeline-sec-filings for an example repo that includes a preprocessing pipeline API and an auto-generated FastAPI.
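
As a rough sketch of what the function at the heart of a pipeline notebook might look like, the example below takes raw text and returns JSON-serializable records. The function name pipeline_api, its signature, the "Title" and "NarrativeText" labels, and the uppercase-title heuristic are all assumptions made for this illustration; the actual conventions are defined in the unstructured-api-tools documentation, and no FastAPI code is generated here.

```python
import json
import re
from typing import Dict, List


def pipeline_api(text: str) -> List[Dict[str, str]]:
    """Hypothetical pipeline function: raw text in, structured records out.

    The name, signature, and output schema are assumptions for illustration,
    not the documented Unstructured conventions.
    """
    records: List[Dict[str, str]] = []
    for block in re.split(r"\n\s*\n", text):
        # Normalize internal whitespace and skip empty blocks.
        block = " ".join(block.split())
        if not block:
            continue
        # Toy heuristic: short, all-uppercase blocks are treated as titles.
        kind = "Title" if block.isupper() and len(block) < 80 else "NarrativeText"
        records.append({"type": kind, "text": block})
    return records


sample = "RISK FACTORS\n\nOur business depends on\nwidget demand."
result = pipeline_api(sample)
print(json.dumps(result, indent=2))
```

Keeping the output as plain lists and dictionaries means the result can be serialized to JSON directly, which is the shape a web API wrapper would typically return.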

🔩 Developer tools for generating FastAPIs

The unstructured-api-tools library includes the tooling required to create FastAPIs from pipeline notebooks.

🤗 Hugging Face

Hugging Face Spaces offer a simple way to host ML demo apps, models and datasets directly on our organization’s profile. This allows us to showcase our projects and work collaboratively with other people in the ML ecosystem. Visit our space here!