SciPhi/AgentSearch-V1 · Datasets at Hugging Face #386
# Time LLM Embed Multi Commands
This markdown document contains the output of various time llm embed-multi commands.
CPU, batch size 10 (the maximum I can run in 64GB):
time llm embed-multi prompts-cpu-b10-run1 -d embeddings.db --attach logs $(llm logs path) --sql 'SELECT id, prompt FROM logs.responses LIMIT 1000' -m jina-embeddings-v2-base-en --prefix prompt/ --batch-size 10
Output:
GPU, batch size 1:
time llm embed-multi prompts-gpu-b1-run2 -d embeddings.db --attach logs --sql -m jina-embeddings-v2-base-en --prefix prompt/ --batch-size 1
Output:
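The runs above can be repeated from a small harness rather than typed by hand. The following is a minimal sketch, not the commenter's actual setup: the helper names `build_embed_multi_cmd` and `time_command` are hypothetical, the flags are copied from the CPU command above, and the path to the logs database (the value `$(llm logs path)` expands to in the shell) is assumed to be passed in by the caller.

```python
import subprocess
import time


def build_embed_multi_cmd(collection, batch_size, logs_db_path,
                          database="embeddings.db",
                          model="jina-embeddings-v2-base-en",
                          limit=1000):
    """Build the argv list for one `llm embed-multi` run.

    Hypothetical helper: the flags mirror the CPU command quoted above;
    nothing here is taken from the llm source code.
    """
    sql = f"SELECT id, prompt FROM logs.responses LIMIT {limit}"
    return [
        "llm", "embed-multi", collection,
        "-d", database,
        "--attach", "logs", logs_db_path,
        "--sql", sql,
        "-m", model,
        "--prefix", "prompt/",
        "--batch-size", str(batch_size),
    ]


def time_command(argv):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)  # raises CalledProcessError on failure
    return time.perf_counter() - start


# Example (requires the llm CLI and the embedding plugin to be installed):
# cmd = build_embed_multi_cmd("prompts-cpu-b10-run1", 10, "/path/to/logs.db")
# print(f"{time_command(cmd):.1f}s")
```

Unlike the shell `time` builtin, this reports wall-clock time only, so it would not separate user and system CPU time.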
Getting Started
The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a wide range of content from sources such as Arxiv, Wikipedia, and Project Gutenberg, as well as carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions; please feel free to reach out with your ideas!
To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:
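A minimal sketch of such streaming, assuming the Hugging Face `datasets` library and the repository id `SciPhi/AgentSearch-V1` from the issue title. The record layout is an assumption rather than something stated here: the sketch supposes each entry carries `embeddings` as raw float32 bytes (768 values per passage, the output size of jina-embeddings-v2-base-en) and `text_chunks` and `metadata` as JSON strings. Check the field names against a real record before relying on them.

```python
import json
from array import array


def stream_agent_search(data_files="**/*"):
    """Return a streaming iterator over the dataset.

    Needs the third-party `datasets` package and network access.
    """
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset(
        "SciPhi/AgentSearch-V1",
        data_files=data_files,
        split="train",
        streaming=True,
    )


def decode_entry(entry, dim=768):
    """Decode one record into (vectors, text_chunks, metadata).

    ASSUMPTION: `embeddings` is a bytes blob of native-order float32
    values, and `text_chunks` / `metadata` are JSON-encoded strings.
    """
    flat = array("f")  # "f" is a 4-byte C float on common platforms
    flat.frombytes(entry["embeddings"])
    if len(flat) % dim != 0:
        raise ValueError(f"{len(flat)} floats is not a multiple of dim={dim}")
    vectors = [flat[i:i + dim].tolist() for i in range(0, len(flat), dim)]
    text_chunks = json.loads(entry["text_chunks"])
    metadata = json.loads(entry["metadata"])
    return vectors, text_chunks, metadata


# Example (downloads data on first iteration):
# for entry in stream_agent_search():
#     vectors, chunks, meta = decode_entry(entry)
#     print(len(vectors), "passages;", meta)
#     break
```

Streaming avoids downloading the full corpus up front; passing a narrower `data_files` glob would restrict the stream to a subset, assuming the repository is laid out in per-source directories.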
A full set of scripts to recreate the dataset from scratch can be found here. Further, you may check the docs for details on how to perform RAG over AgentSearch.
Languages
English.
Dataset Structure
The raw dataset structure is as follows:
Dataset Creation
This dataset was created as a step towards making humanity's most important knowledge openly searchable and LLM-optimal. It was created by locally filtering, cleaning, and augmenting publicly available datasets.
To cite our work, please use the following:
@software{SciPhi2023AgentSearch,
author = {SciPhi},
title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search},
year = {2023},
url = {https://github.com/SciPhi-AI/agent-search}
}
Source Data
@online{wikidump,
author = "Wikimedia Foundation",
title = "Wikimedia Downloads",
url = "https://dumps.wikimedia.org"
}
@misc{paster2023openwebmath,
title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
year={2023},
eprint={2310.06786},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = April,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}
License
Please refer to the licenses of the data subsets you use.
Suggested labels
{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }