DocumentAtom

DocumentAtom is a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.

Packages
DocumentAtom.Excel
DocumentAtom.Image
DocumentAtom.Markdown
DocumentAtom.Pdf
DocumentAtom.PowerPoint
DocumentAtom.Ocr
DocumentAtom.Text
DocumentAtom.Word

New in v1.0.x

Initial release

Motivation

Parsing documents and extracting constituent parts is one part science and one part black magic. If you find ways to improve processing and extraction that are broadly useful, I'd love your feedback on how to make this library more accurate, more useful, faster, and better overall. My goal in building this library is to make it easier to analyze input data assets and make them more consumable by other systems, including analytics and artificial intelligence.

Bugs, Quality, Feedback, or Enhancement Requests

Please feel free to file issues, enhancement requests, or start discussions about use of the library, improvements, or fixes.

Types Supported

DocumentAtom supports the following input file types:

Text
Markdown
Microsoft Word (.docx)
Microsoft Excel (.xlsx)
Microsoft PowerPoint (.pptx)
PNG images (requires Tesseract on the host)
PDF

Simple Example

Refer to the various Test projects for working examples.

The following example shows how to process a Markdown (.md) file.

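using System;
using DocumentAtom.Core.Atoms;
using DocumentAtom.Markdown;

// Path to the input file; "sample.md" is a placeholder
string filename = "sample.md";

MarkdownProcessorSettings settings = new MarkdownProcessorSettings();
// Pass the settings instance created above
MarkdownProcessor processor = new MarkdownProcessor(settings);

// Enumerate the atoms extracted from the file
foreach (Atom atom in processor.Extract(filename))
{
    Console.WriteLine(atom.ToString());
}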

Atom Types

DocumentAtom parses input data assets into a variety of Atom objects. Each Atom carries the following top-level metadata:

GUID
Type - including Text, Image, Binary, Table, and List
PageNumber - where available; some document types do not explicitly indicate page numbers, and page numbers are inferred when rendered
Position - the ordinal position of the Atom, relative to others
Length - the length of the Atom's content
MD5Hash - the MD5 hash of the Atom content
SHA1Hash - the SHA1 hash of the Atom content
SHA256Hash - the SHA256 hash of the Atom content
Quarks - sub-atomic particles created from the Atom content, for instance, when chunking text
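
The hash fields listed above can be reproduced outside the library when you want to compare or deduplicate content. The following is a minimal sketch using only the standard .NET cryptography classes. The class name AtomHashSketch and its methods are hypothetical and are not part of DocumentAtom, and the exact byte representation and output format the library uses when hashing an Atom's content are assumptions here.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of DocumentAtom.
public static class AtomHashSketch
{
    // Hex-encode a digest as lowercase text (output format is an assumption).
    private static string ToHex(byte[] digest)
    {
        return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
    }

    // MD5 digest of the supplied content bytes.
    public static string Md5Hex(byte[] content)
    {
        using (MD5 md5 = MD5.Create())
        {
            return ToHex(md5.ComputeHash(content));
        }
    }

    // SHA1 digest of the supplied content bytes.
    public static string Sha1Hex(byte[] content)
    {
        using (SHA1 sha1 = SHA1.Create())
        {
            return ToHex(sha1.ComputeHash(content));
        }
    }

    // SHA256 digest of the supplied content bytes.
    public static string Sha256Hex(byte[] content)
    {
        using (SHA256 sha256 = SHA256.Create())
        {
            return ToHex(sha256.ComputeHash(content));
        }
    }
}
```

For text content, one option is to hash its UTF-8 bytes, for example AtomHashSketch.Sha256Hex(Encoding.UTF8.GetBytes(text)). Whether that matches the values DocumentAtom reports depends on how the library encodes content before hashing.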

The AtomBase class provides the aforementioned metadata, and several type-specific Atoms are returned from the various processors, including:

BinaryAtom - includes a Bytes property
DocxAtom - includes Text, HeaderLevel, UnorderedList, OrderedList, Table, and Binary properties
ImageAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
MarkdownAtom - includes Formatting, Text, UnorderedList, OrderedList, and Table properties
PdfAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
PptxAtom - includes Title, Subtitle, Text, UnorderedList, OrderedList, Table, and Binary properties
TableAtom - includes Rows, Columns, Irregular, and Table properties
TextAtom - includes Text
XlsxAtom - includes SheetName, CellIdentifier, Text, Table, and Binary properties

Table objects inside of Atom objects are always presented as SerializableDataTable objects (see SerializableDataTable for more information) to provide simple serialization and conversion to native System.Data.DataTable objects.
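
Once a table has been converted to a native System.Data.DataTable, it can be walked with the standard System.Data API. The sketch below renders a DataTable as tab-separated text. The class TableSketch and its ToTsv method are hypothetical helpers for illustration only, and the conversion from SerializableDataTable is deliberately not shown because its exact call is not described here.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Hypothetical helper, not part of DocumentAtom or SerializableDataTable.
public static class TableSketch
{
    // Render a DataTable as tab-separated lines: the header first, then one line per row.
    public static string ToTsv(DataTable table)
    {
        List<string> lines = new List<string>();

        // Header line built from the column names.
        lines.Add(string.Join("\t", table.Columns.Cast<DataColumn>().Select(c => c.ColumnName)));

        // One line per row; each cell is converted with its ToString().
        foreach (DataRow row in table.Rows)
        {
            lines.Add(string.Join("\t", row.ItemArray));
        }

        return string.Join("\n", lines);
    }
}
```

This form is convenient for logging or for handing table content to downstream text processing, at the cost of losing column type information.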

Underlying Libraries

DocumentAtom is built on the shoulders of several libraries, without which, this work would not be possible.

Each of these libraries was integrated as a NuGet package, and no source from these packages was included or modified.

My libraries used within DocumentAtom:

Version History

Please refer to CHANGELOG.md for version history.

Thanks

Special thanks to iconduck.com and the content authors for producing this icon.
