feat: Enhance the triplets extraction in the knowledge graph by the batch size #2091

Merged
Aries-ckt merged 20 commits into eosphoros-ai:main from feat/async_triplets_extraction
Nov 5, 2024

Conversation

Appointat
Contributor

@Appointat Appointat commented Oct 23, 2024

Description

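This PR speeds up triplet extraction from chunk text by running the extraction calls asynchronously: chunks are processed in batches, and the extractions within a batch are awaited together. The batch size is configured in .env through TRIPLET_EXTRACTION_BATCH_SIZE (default: 20).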
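
A minimal sketch of how such a setting could be read, assuming it arrives as an environment variable. The helper name read_batch_size and the validation rule are hypothetical and are not taken from this PR; only the variable name and the default of 20 come from the description above.

```python
import os

DEFAULT_BATCH_SIZE = 20  # default stated in the PR description


def read_batch_size(env=None):
    """Return the batch size from TRIPLET_EXTRACTION_BATCH_SIZE, or the default.

    `env` is any mapping of strings; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    raw = env.get("TRIPLET_EXTRACTION_BATCH_SIZE")
    if raw is None or not raw.strip():
        return DEFAULT_BATCH_SIZE
    value = int(raw)
    if value < 1:
        # A step of 0 makes range() raise, and a negative step yields no
        # batches at all, so reject non-positive sizes early (assumed policy).
        raise ValueError("TRIPLET_EXTRACTION_BATCH_SIZE must be >= 1")
    return value
```

Passing a plain dict instead of os.environ keeps the helper easy to check in isolation.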

How Has This Been Tested?

I ran the app server with TRIPLET_EXTRACTION_BATCH_SIZE set to different values (1, 5, and 100); the running time varies with the batch size.

Snapshots:

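        # Number of chunks whose triplets are extracted concurrently
        batch_size = self._triplet_extraction_batch_size

        for i in range(0, len(chunks), batch_size):
            batch_chunks = chunks[i : i + batch_size]

            # Build one extraction coroutine per chunk in the batch
            extraction_tasks = [
                self._graph_extractor.extract(chunk.content) for chunk in batch_chunks
            ]
            # Run the batch concurrently; results come back in input order
            async_graphs: List[List[MemoryGraph]] = await asyncio.gather(
                *extraction_tasks
            )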

            for chunk, graphs in zip(batch_chunks, async_graphs):
                for graph in graphs:
                    if document_graph_enabled:
                        # append the chunk id to the edge
                        for edge in graph.edges():
                            edge.set_prop("_chunk_id", chunk.chunk_id)
                            graph.append_edge(edge=edge)

                    # upsert the graph
                    self._graph_store_apdater.upsert_graph(graph)

                    # chunk -> include -> entity
                    if document_graph_enabled:
                        for vertex in graph.vertices():
                            self._graph_store_apdater.upsert_chunk_include_entity(
                                chunk=chunk, entity=vertex
                            )

Checklist:

  • My code follows the style guidelines of this project
  • I have already rebased the commits and made the commit messages conform to the project standard.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • Any dependent changes have been merged and published in downstream modules

Co-authored-by: Appointat <appointat@shu.edu.cn>
@github-actions bot added the enhancement label (New feature or request) on Oct 23, 2024
@Appointat
Contributor Author

Appointat commented Oct 23, 2024

@Aries-ckt @fanzhidongyzby Could you please review this and add some tags? Thanks.

Aries-ckt
Aries-ckt previously approved these changes Oct 24, 2024
Collaborator

@Aries-ckt Aries-ckt left a comment

LGTM.

Collaborator

@fanzhidongyzby fanzhidongyzby left a comment

The read and write order of chunk_history in _graph_extractor needs to be adjusted, otherwise it will lead to inconsistency between text block recall and serial semantics.
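
One possible reading of this concern, sketched with a toy extractor: when each call reads a shared history before an await and writes to it afterwards, calls that run concurrently in the same batch no longer see each other's writes, unlike a serial run. ToyExtractor and its history list are hypothetical stand-ins for _graph_extractor and chunk_history; the real implementation and the fix adopted in this PR are not shown in this conversation.

```python
import asyncio
from typing import List, Tuple


class ToyExtractor:
    """Hypothetical extractor that keeps a shared history of processed texts."""

    def __init__(self) -> None:
        self.history: List[str] = []

    async def extract(self, text: str) -> Tuple[str, List[str]]:
        seen = list(self.history)  # read: the context visible to this chunk
        await asyncio.sleep(0)     # yield control, as a real LLM call would
        self.history.append(text)  # write: happens only after the await
        return text, seen


async def serial(texts: List[str]) -> List[Tuple[str, List[str]]]:
    """Process one text at a time; each sees all earlier texts."""
    ex = ToyExtractor()
    return [await ex.extract(t) for t in texts]


async def batched(texts: List[str], batch_size: int) -> List[Tuple[str, List[str]]]:
    """Process texts in concurrent batches sharing one history."""
    ex = ToyExtractor()
    out: List[Tuple[str, List[str]]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        out.extend(await asyncio.gather(*(ex.extract(t) for t in batch)))
    return out
```

With texts a, b, c, the serial run lets b see a in its history, while a batch size of 2 makes both a and b read an empty history. One way to restore deterministic behavior would be to read the context for a whole batch up front and write the history back in input order once the batch completes; that is an illustration, not a description of the change made here.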

Appointat and others added 4 commits October 28, 2024 16:30
Co-authored-by: Appointat <appointat@shu.edu.cn>
Co-authored-by: Appointat <appointat@shu.edu.cn>
Co-authored-by: Appointat <appointat@shu.edu.cn>
…thod

Co-authored-by: Appointat <appointat@shu.edu.cn>
@Appointat
Contributor Author

> The read and write order of chunk_history in _graph_extractor needs to be adjusted, otherwise it will lead to inconsistency between text block recall and serial semantics.

Thank you for your comment; I have just fixed it.

Collaborator

@fanzhidongyzby fanzhidongyzby left a comment

Please refine the code according to the following comments.

.env.template (outdated, resolved)
dbgpt/rag/transformer/graph_extractor.py (outdated, resolved)
@Appointat
Contributor Author

Appointat commented Oct 30, 2024

[Two images attached; not preserved.]

Collaborator

@fanzhidongyzby fanzhidongyzby left a comment

lgtm

Collaborator

@Aries-ckt Aries-ckt left a comment

LGTM

@Aries-ckt Aries-ckt merged commit 25d47ce into eosphoros-ai:main Nov 5, 2024
4 checks passed
@Appointat Appointat deleted the feat/async_triplets_extraction branch November 5, 2024 12:17
Labels
enhancement (New feature or request), hacktoberfest

3 participants