Anchor structured extraction #4

Open
hoijui opened this issue Aug 1, 2022 · 0 comments

As it is now, anchors extracted from documents end up in a flat namespace, while they usually exist in a tree-shaped namespace. This structure is given by the header level of the anchor, or by the header level of the section the anchor sits in (for anchors other than the headers themselves), together with all the super-headers of that header.
This is at least the case with Markdown and HTML, and probably with most other document formats too.

example markdown document (doc.md):

# Top

## First Sub

bla bla bla

### A Sub Sub

bli bli bli

## Second Sub

blu blu blu

### B Sub Sub

tri tra tralala

<a name="in-text"/>

flat extraction:

doc.md#top
doc.md#first-sub
doc.md#a-sub-sub
doc.md#second-sub
doc.md#b-sub-sub
doc.md#in-text

structured extraction:

doc.md#
    \ top
        \ first-sub
            \ a-sub-sub
        \ second-sub
            \ b-sub-sub
                \ in-text
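
A minimal sketch of how such a tree could be built from the example above, written in Python for illustration only. Everything in it is an assumption rather than the tool's actual code: the names `slugify`, `extract_anchor_tree` and `paths` are hypothetical, the slug rule is a simplified GitHub-style one, and fenced code blocks and setext headers are not handled.

```python
import re

# ATX headers ("# Title") and inline named anchors (<a name="...">).
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")
NAMED_ANCHOR = re.compile(r'<a\s+name="([^"]+)"')

DOC = """\
# Top

## First Sub

bla bla bla

### A Sub Sub

bli bli bli

## Second Sub

blu blu blu

### B Sub Sub

tri tra tralala

<a name="in-text"/>
"""


def slugify(title):
    """Simplified slug (assumption): lowercase, drop punctuation, spaces to hyphens."""
    slug = re.sub(r"[^\w\s-]", "", title.strip().lower())
    return re.sub(r"\s+", "-", slug)


def extract_anchor_tree(markdown):
    """Return a nested dict tree of anchors, keyed on header nesting."""
    root = {"anchor": "", "level": 0, "children": []}
    stack = [root]  # chain of currently open headers, outermost first
    for line in markdown.splitlines():
        heading = HEADING.match(line)
        if heading:
            level = len(heading.group(1))
            node = {"anchor": slugify(heading.group(2)), "level": level, "children": []}
            # Close every open header that is at the same or a deeper level.
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
            continue
        # Non-header anchors hang below the innermost open header.
        for name in NAMED_ANCHOR.findall(line):
            stack[-1]["children"].append({"anchor": name, "level": None, "children": []})
    return root


def paths(node, prefix=()):
    """Yield the full path of every anchor, in document order."""
    for child in node["children"]:
        path = prefix + (child["anchor"],)
        yield path
        yield from paths(child, path)


for path in paths(extract_anchor_tree(DOC)):
    print("doc.md#" + "/".join(path))
```

The `/` separator in the printed paths is only a display choice here; the last path printed is `doc.md#top/second-sub/b-sub-sub/in-text`, matching the tree above.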

Why

This is useful when analyzing changes in documents. For example, if a title has been renamed but the overall structure has stayed the same, one might be able to generate an auto-fix for a broken link that includes a fragment (which is meant to map to an anchor).
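
To make the idea concrete, here is a rough Python sketch of one possible heuristic, not an existing feature. It assumes the anchor tree is available as nested `(anchor, children)` tuples, that anchors are unique within a document, and that a positional match is only trustworthy when the old and new trees have exactly the same shape. The names `positions` and `suggest_fix` are hypothetical.

```python
def positions(tree, prefix=()):
    """Map each anchor to its structural position, e.g. (0, 1) is the
    second child of the first top-level node."""
    result = {}
    for index, (anchor, children) in enumerate(tree):
        pos = prefix + (index,)
        result[anchor] = pos
        result.update(positions(children, pos))
    return result


def suggest_fix(old_tree, new_tree, broken_fragment):
    """Suggest the new anchor sitting where broken_fragment used to be, or None."""
    old_pos = positions(old_tree)
    new_pos = positions(new_tree)
    # Only trust positional matching if the overall shape is unchanged.
    if set(old_pos.values()) != set(new_pos.values()):
        return None
    target = old_pos.get(broken_fragment)
    if target is None:
        return None
    by_position = {pos: anchor for anchor, pos in new_pos.items()}
    return by_position[target]


old = [("top", [("first-sub", [("a-sub-sub", [])]),
                ("second-sub", [("b-sub-sub", [])])])]
# Same structure, but "First Sub" has been renamed to "Introduction".
new = [("top", [("introduction", [("a-sub-sub", [])]),
                ("second-sub", [("b-sub-sub", [])])])]

print(suggest_fix(old, new, "first-sub"))
```

With a flat list of anchors, the renamed header could only be guessed by string similarity; with the tree, its unchanged position is enough to propose `introduction` as the replacement fragment.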
