MongoDB + new abstraction of vectordb #2942

Closed

@ranfysvalle02 (Contributor) commented Jun 14, 2024

Why are these changes needed?

MongoDB Atlas Vector Search received the highest developer NPS among vector databases in Retool's State of AI 2023 survey (https://www.mongodb.com/blog/post/atlas-vector-search-commands-highest-developer-nps-retool-state-ai-2023-survey), so it is worth adding MongoDB vector search as an option for AutoGen RAG.

You can start using MongoDB vector search on a free-tier M0 MongoDB Atlas cluster. The free-tier cluster provides the full functionality of MongoDB vector search. https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-overview/

But why is MongoDB such a standout? Well, there are a few key reasons.

MongoDB Atlas integrates smoothly with existing databases. For organizations already using MongoDB, this means a seamless expansion into vector storage, with no major system overhauls required.
MongoDB Atlas is built to handle operational heavy-lifting. It excels when serving large-scale, mission-critical applications, offering robustness and reliability where it counts.
MongoDB's flexibility in handling a variety of data types and structures makes it perfectly suited to the complexity of vector embeddings.

As such, implementing MongoDB as a Retrieval Agent can unlock new potential in your AI applications, bringing the full power of vector storage to bear.
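
To make the vector search step concrete, here is a minimal sketch of the aggregation pipeline an Atlas Vector Search query typically uses. It is not the code from this PR: the helper name `build_vector_search_pipeline`, the field names `embedding` and `content`, and the default values are assumptions for illustration. Only the `$vectorSearch` stage and its `index`, `path`, `queryVector`, `numCandidates`, and `limit` fields come from MongoDB's documented query syntax.

```python
from typing import Any


def build_vector_search_pipeline(
    query_vector: list[float],
    index_name: str,
    path: str = "embedding",  # assumed name of the field holding the vectors
    limit: int = 5,
    num_candidates: int = 100,
) -> list[dict[str, Any]]:
    """Build an Atlas Vector Search aggregation pipeline (hypothetical helper)."""
    # Atlas requires numCandidates to be at least as large as limit.
    if num_candidates < limit:
        raise ValueError("num_candidates must be >= limit")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        # Return the document text (assumed field "content") with its score.
        {
            "$project": {
                "_id": 1,
                "content": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], index_name="flaml_index_two")
# With pymongo, this list would be passed to collection.aggregate(pipeline).
```

Building the pipeline as plain data keeps it easy to inspect and unit-test without a running cluster.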

Related issue number: 711

Closes #711

@ranfysvalle02
Contributor Author

@microsoft-github-policy-service agree

@ranfysvalle02
Contributor Author

@microsoft-github-policy-service agree company="MongoDB"

@Hk669 Hk669 added the rag retrieve-augmented generative agents label Jun 14, 2024
@Josephrp

Nice addition. An example would be helpful, and in my opinion an "advanced example" would be too, since it also appears to be possible to use MongoDB on cloud services, for example Cosmos DB on Azure :-)

Contributor

@Jibola Jibola left a comment

Thanks for putting up this PR. Looks great and is helpful for us!

I've added some suggestions around implementation details, but overall it's a solid PR. Let me know what you think.

[10 resolved review comments on autogen/agentchat/contrib/vectordb/mongodb.py, 9 of them outdated]
codecov-commenter commented Jun 14, 2024

Codecov Report

Attention: Patch coverage is 0% with 114 lines in your changes missing coverage. Please review.

Project coverage is 15.71%. Comparing base (bf7e4d6) to head (5972fd0).
Report is 16 commits behind head on main.

Files                                            Patch %   Lines
autogen/agentchat/contrib/vectordb/mongodb.py    0.00%     111 Missing ⚠️
autogen/agentchat/contrib/vectordb/base.py       0.00%     3 Missing ⚠️

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2942       +/-   ##
===========================================
- Coverage   33.99%   15.71%   -18.29%     
===========================================
  Files          89       90        +1     
  Lines        9593     9719      +126     
  Branches     2054     2242      +188     
===========================================
- Hits         3261     1527     -1734     
- Misses       6057     8142     +2085     
+ Partials      275       50      -225     
Flag Coverage Δ
unittests 15.67% <0.00%> (-18.33%) ⬇️
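
The percentages in the report can be reproduced from the hit and line counts in the diff above. The sketch below assumes coverage is simply hits divided by total lines; the function name `coverage_percent` is made up for this illustration. The third figure is a hypothetical scenario, not something Codecov reported: it shows what coverage would be if the base hits were unchanged and only the newly added lines were uncovered.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, rounded to two decimals."""
    return round(100 * hits / lines, 2)


base = coverage_percent(3261, 9593)  # main: 33.99
head = coverage_percent(1527, 9719)  # this PR: 15.71

# Hypothetical: same hits as main, but with the PR's larger line count.
only_new_lines_uncovered = coverage_percent(3261, 9719)  # 33.55

print(base, head, only_new_lines_uncovered)
```

Since untested new lines alone would lower coverage by less than half a point, most of the reported drop comes from the 1734 fewer hits, which may mean part of the test suite did not run or did not report for this commit.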


Collaborator

@thinkall thinkall left a comment

Thank you very much for the PR, @ranfysvalle02 !
Could you provide an example of how to set up MongoDB locally? You could refer to https://github.com/microsoft/autogen/blob/main/notebook/agentchat_pgvector_RetrieveChat.ipynb

Could you also update the dependencies in setup.py?

[1 resolved, outdated review comment on autogen/agentchat/contrib/vectordb/base.py]
@thinkall
Collaborator

To fix the code formatting issue:

pip install pre-commit
pre-commit run --show-diff-on-failure --color=always --all-files

@thinkall thinkall mentioned this pull request Jun 15, 2024
@ranfysvalle02
Contributor Author

Will be working on the feedback received from this thread as well as internally from MongoDB. I'll update the PR soon

@ranfysvalle02
Contributor Author

I think I've addressed most of the feedback, @thinkall @Jibola. Let me know what you think.

@ranfysvalle02
Contributor Author

There are some issues with the tests --- working them out now.

Collaborator

@thinkall thinkall left a comment

Hi @ranfysvalle02, the formatting issue still exists. Please let me know once it's ready for review. Thanks.

@thinkall thinkall self-assigned this Jun 19, 2024
@ranfysvalle02
Contributor Author

@thinkall @Jibola I think this PR is ready for another look whenever you guys have the chance

[2 resolved, outdated review comments on autogen/agentchat/contrib/vectordb/mongodb.py]
Contributor Author

@ranfysvalle02 ranfysvalle02 left a comment

code has been updated ;)

@ranfysvalle02
Contributor Author

It keeps showing "one change requested", but I'm not sure what this one is about @Hk669 -- Let me know if I covered it with the code update or if there is anything else I should change

Collaborator

@Hk669 Hk669 left a comment

Looks good to me, provided the comments are addressed. Thanks, @ranfysvalle02, for your contribution.

@Hk669
Collaborator

Hk669 commented Jun 20, 2024

@ranfysvalle02 can you run pre-commit to fix the formatting?

@Hk669
Collaborator

Hk669 commented Jun 20, 2024

@thinkall can you please review the PR?

@ranfysvalle02
Contributor Author

pre-commit run --show-diff-on-failure --color=always --all-files

Fixed the formatting issue! thanks @Hk669

@thinkall let me know if there is anything else that needs work, I have some availability today to get this knocked out :)


gitguardian bot commented Jun 21, 2024

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id   GitGuardian status   Secret               Commit    Filename
10404662         Triggered            Generic CLI Secret   c44c8bd   .github/workflows/dotnet-build.yml
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
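
As one illustration of "replace and store your secret safely", a connection string can be read from the environment instead of being written into a notebook or workflow file. This is only a sketch: the environment variable name `MONGODB_URI` and the helper `build_db_config` are assumptions, and the finding above concerns a different file (`.github/workflows/dotnet-build.yml`). The keys `collection_name`, `index_name`, `db_config`, and `connection_string` mirror the notebook configuration in this PR.

```python
import os


def build_db_config(env_var: str = "MONGODB_URI") -> dict:
    """Read the MongoDB connection string from the environment (hypothetical helper)."""
    connection_string = os.environ.get(env_var)
    if not connection_string:
        raise RuntimeError(f"Set {env_var} to your MongoDB connection string")
    return {"connection_string": connection_string}


# Demo only: provide a credential-free local default so the sketch runs as-is.
os.environ.setdefault("MONGODB_URI", "mongodb://localhost:27017/?directConnection=true")

config = {
    "collection_name": "flaml_collection_two",
    "index_name": "flaml_index_two",
    "db_config": build_db_config(),
}
```

Failing loudly when the variable is missing avoids silently falling back to a placeholder host that cannot be resolved.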


" \"collection_name\": \"flaml_collection_two\",\n",
" \"index_name\": \"flaml_index_two\",\n",
" \"db_config\": {\n",
" \"connection_string\": \"mongodb+srv://user:password@shared.demo.mongodb.net/test\", # MongoDB Atlas connection string\n",
Collaborator

Hi @ranfysvalle02 , I still see ConfigurationError: The DNS query name does not exist: _mongodb._tcp.shared.demo.mongodb.net..
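
The error names `_mongodb._tcp.shared.demo.mongodb.net` because a `mongodb+srv://` connection string makes the driver look up a DNS SRV record for `_mongodb._tcp.<host>` before connecting, and the placeholder host has no such record. The sketch below only derives that query name with the standard library; it performs no DNS lookup, and the function name `srv_query_name` is made up for this illustration.

```python
from typing import Optional
from urllib.parse import urlsplit


def srv_query_name(uri: str) -> Optional[str]:
    """Return the SRV record name a mongodb+srv URI resolves, else None."""
    parts = urlsplit(uri)
    if parts.scheme != "mongodb+srv":
        # Plain mongodb:// URIs connect to the listed host(s) directly.
        return None
    return f"_mongodb._tcp.{parts.hostname}"


print(srv_query_name("mongodb+srv://user:password@shared.demo.mongodb.net/test"))
# → _mongodb._tcp.shared.demo.mongodb.net

print(srv_query_name("mongodb://username:password@localhost/?directConnection=true"))
# → None
```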

Contributor Author

I have updated the notebook -- I think this will do it :)

Collaborator

The connection string mongodb://username:password@localhost/?directConnection=true worked for me. Then I see:

  1. Trying to create collection.
     2024-06-21 21:44:25,912 - autogen.agentchat.contrib.retrieve_user_proxy_agent - INFO - Found 2 chunks.
     VectorDB returns doc_ids: [[]]
     This means retrieve_docs doesn't work as expected.

  2. ValueError: Collection flaml_collection_two already exists.
     This means overwrite=True doesn't work as expected.
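
The second point can be pinned down with a small in-memory sketch of the behaviour the reviewer seems to expect from `create_collection`. This is not the PR's MongoDB code: the class `InMemoryVectorDB` is hypothetical, and the `get_or_create` parameter and the exact three-way contract are assumptions about the intended semantics. Only `overwrite=True` and the error message come from the thread.

```python
class InMemoryVectorDB:
    """Toy stand-in for a vector DB, used only to illustrate overwrite semantics."""

    def __init__(self) -> None:
        self.collections: dict[str, list] = {}

    def create_collection(
        self,
        collection_name: str,
        overwrite: bool = False,
        get_or_create: bool = True,
    ) -> list:
        if collection_name not in self.collections:
            # Case 1: the collection does not exist yet, so create it.
            self.collections[collection_name] = []
        elif overwrite:
            # Case 2: it exists and overwrite is set, so drop and recreate it.
            self.collections[collection_name] = []
        elif not get_or_create:
            # Case 3: it exists and the caller allows neither option.
            raise ValueError(f"Collection {collection_name} already exists.")
        return self.collections[collection_name]


db = InMemoryVectorDB()
db.create_collection("flaml_collection_two").append("chunk")
fresh = db.create_collection("flaml_collection_two", overwrite=True)
print(fresh)  # the recreated collection is empty and no ValueError is raised
```

Under this reading, a `ValueError` when `overwrite=True` is passed would indicate that the existence check runs before the overwrite branch.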

@thinkall
Collaborator

@ranfysvalle02 If you run pre-commit install you should get auto code formatting before you make a commit.

@ranfysvalle02
Contributor Author

@ranfysvalle02 If you run pre-commit install you should get auto code formatting before you make a commit.

I'll be better about this one, sorry :)

@ranfysvalle02
Contributor Author

@thinkall @Jibola @Hk669

We're getting closer! Would appreciate any help getting this down the line! Let me know what else needs to be updated! Going to disconnect for a while but will try to get back on this evening. Excited to get MongoDB into Autogen :)

@ranfysvalle02 ranfysvalle02 requested a review from Hk669 June 21, 2024 21:33
@ranfysvalle02
Contributor Author

Sorry guys!!!!! Here is the new Pull Request with a fresh commit history... Did a lot of learning here :)

#2996

Labels
rag retrieve-augmented generative agents
6 participants