feat: Separate markdown into headings and paragraphs #9173

@NaokiHigashi28 NaokiHigashi28 commented Sep 25, 2024

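#153983 [RAG] Markdown can be split recursively by header section (and further by child section, so that each piece stays under a specified character count)
#154085 Parse Markdown and split it by header
#154087 Add tests as well, based on output generated by ChatGPT or Copilot

Notes

The following Markdown is split into the chunks listed after it: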

Introduction without a header.

# Chapter 1
Content of chapter 1.

### Section 1.1.1
Content of section 1.1.1.

## Section 1.2
Content of section 1.2.

# Chapter 2
Content of chapter 2.

## Section 2.1
Content of section 2.1.

{ label: '0-content', text: 'Introduction without a header.' },
{ label: '1-heading', text: '# Chapter 1' },
{ label: '1-content', text: 'Content of chapter 1.' },
{ label: '1-1-1-heading', text: '### Section 1.1.1' },
{ label: '1-1-1-content', text: 'Content of section 1.1.1.' },
{ label: '1-2-heading', text: '## Section 1.2' },
{ label: '1-2-content', text: 'Content of section 1.2.' },
{ label: '2-heading', text: '# Chapter 2' },
{ label: '2-content', text: 'Content of chapter 2.' },
{ label: '2-1-heading', text: '## Section 2.1' },
{ label: '2-1-content', text: 'Content of section 2.1.' }
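
The labeling shown above can be reproduced with a small sketch. This is not the code from this PR: the function name `splitMarkdownIntoChunks` is hypothetical, the numbering rule is inferred only from the example above (a deeper heading that skips levels fills the skipped levels with `1`), and it uses a line-based regex instead of a real Markdown parser, so fenced code blocks containing `#` lines and setext-style headings are not handled.

```typescript
type Chunk = { label: string; text: string };

// Hypothetical helper: splits Markdown into heading and content chunks,
// labeling each chunk with its section numbers joined by '-'.
function splitMarkdownIntoChunks(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  const sectionNumbers: number[] = []; // one counter per heading depth
  let buffer: string[] = [];
  let currentLabel = '0'; // content before the first heading is labeled '0'

  // Emit the buffered body text (if any) as a content chunk.
  const flushContent = (): void => {
    const text = buffer.join('\n').trim();
    if (text.length > 0) {
      chunks.push({ label: `${currentLabel}-content`, text });
    }
    buffer = [];
  };

  for (const line of markdown.split('\n')) {
    const match = /^(#{1,6})\s+\S/.exec(line); // ATX headings only
    if (match == null) {
      buffer.push(line);
      continue;
    }

    flushContent();
    const depth = match[1].length;
    if (depth > sectionNumbers.length) {
      // Going deeper: start every new (or skipped) level at 1.
      while (sectionNumbers.length < depth) {
        sectionNumbers.push(1);
      }
    } else {
      // Same level or shallower: bump this level and drop deeper counters.
      sectionNumbers[depth - 1] += 1;
      sectionNumbers.length = depth;
    }
    currentLabel = sectionNumbers.join('-');
    chunks.push({ label: `${currentLabel}-heading`, text: line.trimEnd() });
  }

  flushContent();
  return chunks;
}
```

Calling `splitMarkdownIntoChunks` on the sample Markdown above should yield the same eleven labeled chunks, under the assumptions stated.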


changeset-bot bot commented Sep 25, 2024

⚠️ No Changeset found

Latest commit: 324b56e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@miya miya changed the base branch from master to feat/openai-vector-searching October 1, 2024 07:24
@miya miya requested review from yuki-takei and miya October 1, 2024 07:58
@@ -64,6 +64,7 @@
"@azure/openai": "^2.0.0-beta.2",
"@azure/storage-blob": "^12.16.0",
"@browser-bunyan/console-formatted-stream": "^1.8.0",
"@dqbd/tiktoken": "^1.0.16",
Member

This does not appear to be used at the moment.

* @param label - The label of the chunk
*/
function createChunk(chunks: Chunk[], content: string, label: string) {
const trimmedContent = content.trimEnd(); // 末尾の空白と改行を削除 (Remove trailing whitespace and newlines)
Member

Please write all comments in English.

@NaokiHigashi28 NaokiHigashi28 changed the title feat: Implement Markdown Section Splitting by Headers and Token Count feat: Separate markdown into headings and paragraphs Oct 7, 2024
@yuki-takei yuki-takei removed the request for review from miya October 7, 2024 09:31
mergify bot added a commit that referenced this pull request Oct 7, 2024
@mergify mergify bot merged commit cb56e0c into feat/openai-vector-searching Oct 7, 2024
12 checks passed
@mergify mergify bot deleted the feat/153983-154087-split-markdown-per-header-sections branch October 7, 2024 09:56
This was referenced Oct 15, 2024
@yuki-takei yuki-takei mentioned this pull request Oct 31, 2024
@github-actions github-actions bot mentioned this pull request Feb 3, 2025