Anchor structured extraction #4

hoijui · 2022-08-01T13:35:00Z

As it is now, anchors are extracted from documents into a flat space, while they usually exist in a tree-shaped namespace. That structure is described by the header level of the anchor, or the header level of the section the anchor sits in (if there are anchors other than the headers themselves), together with all the super-headers of that header.
This is at least the case with Markdown and HTML, but probably also most other document formats.

example markdown document (doc.md):

# Top

## First Sub

bla bla bla

### A Sub Sub

bli bli bli

## Second Sub

blu blu blu

### B Sub Sub

tri tra tralala

<a name="in-text"/>

flat extraction:

doc.md#top
doc.md#first-sub
doc.md#a-sub-sub
doc.md#second-sub
doc.md#b-sub-sub
doc.md#in-text

structured extraction:

doc.md#
    \ top
        \ first-sub
            \ a-sub-sub
        \ second-sub
            \ b-sub-sub
                \ in-text

Why

This is useful when analyzing changes in documents. For example, if a title has been renamed but the overall structure has stayed the same, one might be able to generate an auto-fix for a link whose fragment (which is meant to map to an anchor) no longer resolves.
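
One way such an auto-fix could work, sketched under assumptions: anchor trees are given as nested lists of `(name, children)` pairs, and a rename is only suggested when the old and new trees have exactly the same shape. The function names `index_paths`, `shapes_match` and `suggest_fix` are hypothetical and only illustrate the idea.

```python
def index_paths(tree, prefix=()):
    """Map each anchor name to its position in the tree, e.g. (0, 1, 0).

    If a name occurs more than once, the first occurrence wins.
    """
    result = {}
    for i, (name, children) in enumerate(tree):
        pos = prefix + (i,)
        result.setdefault(name, pos)
        for child_name, child_pos in index_paths(children, pos).items():
            result.setdefault(child_name, child_pos)
    return result


def shapes_match(old, new):
    """True if both trees have the same nesting, ignoring anchor names."""
    return len(old) == len(new) and all(
        shapes_match(o[1], n[1]) for o, n in zip(old, new)
    )


def suggest_fix(old_tree, new_tree, fragment):
    """Return a replacement fragment, or None if no safe guess exists."""
    new_by_name = index_paths(new_tree)
    if fragment in new_by_name:
        return fragment  # the link still resolves, nothing to fix
    if not shapes_match(old_tree, new_tree):
        return None  # structure changed, a positional guess would be unsafe
    pos = index_paths(old_tree).get(fragment)
    if pos is None:
        return None  # the fragment never existed in the old document
    by_pos = {p: name for name, p in new_by_name.items()}
    return by_pos.get(pos)
```

For instance, if `first-sub` in the old tree was renamed to `introduction` and nothing else changed, `suggest_fix(old, new, "first-sub")` returns `"introduction"`. With only the flat list of anchors, the same guess would need fuzzy name matching instead.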
