Researchify is a specialized chatbot that streamlines scientific research by helping academics and researchers find, analyze, and understand scientific papers. It acts as a research assistant that searches academic databases, processes scholarly articles, analyzes citation patterns, and discusses research topics in natural language.
This project aims to solve common challenges in academic research:
- Time-consuming literature search and analysis
- Complex paper interpretation and summarization
- Citation impact assessment
- Research trend identification
- Cross-format document processing
Through natural language conversation, researchers can:
- Search for relevant papers using plain language queries
- Get paper summaries and key findings
- Analyze citation patterns and impact
- Process and extract information from various document formats
- Explore research trends and connections
Key features:
- Conversational Interface: Natural language interaction for research queries and paper discussions
- Smart Search:
  - Vector similarity search using FAISS and sentence transformers
  - Complex query parsing with field specifications
  - Hybrid retrieval combining semantic and metadata filters
- Document Processing:
  - Multi-format support (PDF, DOCX, XLSX, TXT)
  - Automatic text extraction and analysis
  - Metadata extraction and processing
- Research Analysis:
  - Paper content analysis and summarization
  - Citation impact evaluation
  - Geographic and temporal citation patterns
- RAG System:
  - Context-aware response generation
  - Document chunking with overlap
  - Efficient vector storage and retrieval
Requirements:
Python 3.8+
ollama
PyMuPDF
flask
numpy
faiss-cpu
sentence-transformers
pandas
pypdf2
python-docx
chardet
Installation:
- Clone the repository:
    git clone https://github.com/yourusername/researchify.git
    cd researchify
- Install dependencies:
    pip install -r requirements.txt
- Start Ollama with the Gemma 2 model (for the 2B variant, use the `gemma2:2b` tag and set MODEL_NAME to match):
    ollama run gemma2
- Run the application:
    python app.py
Configuration:
UPLOAD_FOLDER=uploads
MAX_CONTENT_LENGTH=16777216 # 16MB
PAPER_ANALYSIS_CACHE_SIZE=100
MODEL_NAME=gemma2
VECTOR_STORE_PATH=vector_store
SENTENCE_TRANSFORMER_MODEL=all-MiniLM-L6-v2
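A minimal sketch of how these settings could be read at startup, assuming they are supplied as environment variables. The `load_config` helper and the `DEFAULTS` and `INT_KEYS` names are hypothetical and are not taken from the project's code; only the keys and default values come from the listing above.

```python
import os
from typing import Dict, Mapping, Optional, Union

# Defaults mirror the configuration listing above.
DEFAULTS = {
    "UPLOAD_FOLDER": "uploads",
    "MAX_CONTENT_LENGTH": "16777216",  # 16 * 1024 * 1024 bytes
    "PAPER_ANALYSIS_CACHE_SIZE": "100",
    "MODEL_NAME": "gemma2",
    "VECTOR_STORE_PATH": "vector_store",
    "SENTENCE_TRANSFORMER_MODEL": "all-MiniLM-L6-v2",
}

# Settings that should be converted from strings to integers (assumption).
INT_KEYS = {"MAX_CONTENT_LENGTH", "PAPER_ANALYSIS_CACHE_SIZE"}


def load_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, Union[str, int]]:
    """Return settings from `env` (os.environ by default), falling back to DEFAULTS."""
    env = os.environ if env is None else env
    config: Dict[str, Union[str, int]] = {}
    for key, default in DEFAULTS.items():
        raw = env.get(key, default)
        config[key] = int(raw) if key in INT_KEYS else raw
    return config
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests, for example `load_config({"MODEL_NAME": "gemma2:2b"})`.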
The system consists of several interconnected components:
├── Core Components
│   ├── Vector Store (FAISS + Sentence Transformers)
│   ├── RAG System
│   └── Query Processing
├── Document Handlers
│   ├── PDF Processor
│   ├── Word Processor
│   ├── Excel Processor
│   └── Text Processor
├── Analysis Modules
│   ├── Paper Analyzer
│   ├── Citation Analyzer
│   └── Impact Assessor
└── API Layer
    ├── Flask Server
    ├── Chat Interface
    └── Research Endpoints
Vector store:
- FAISS similarity search implementation
- Sentence Transformer embeddings (all-MiniLM-L6-v2)
- Complex query support:
  - Field-specific search (title, abstract, category)
  - Boolean operations
  - Metadata filtering
- Thread-safe concurrent access
- Persistent index storage
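To illustrate field-specific search, here is a small sketch of a query parser. The `field:value` and `field:"quoted phrase"` syntax and the `parse_query` name are assumptions made for this example; the project's actual query grammar may differ. Only the field names (title, abstract, category) come from the list above.

```python
import re
from typing import Dict, List, Tuple

# Fields named in the feature list; the query syntax itself is an assumption.
FIELDS = ("title", "abstract", "category")

# Matches an optional "field:" prefix followed by a quoted phrase or a bare token.
TOKEN = re.compile(r'(?:(\w+):)?(?:"([^"]*)"|(\S+))')


def parse_query(query: str) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split a query into per-field terms and free-text terms."""
    fields: Dict[str, List[str]] = {}
    free_text: List[str] = []
    for match in TOKEN.finditer(query):
        field, quoted, bare = match.groups()
        value = quoted if quoted is not None else bare
        if field in FIELDS:
            fields.setdefault(field, []).append(value)
        elif field is None:
            free_text.append(value)
        else:
            # Unknown prefix (for example a URL): keep the token untouched.
            free_text.append(match.group(0))
    return fields, free_text
```

In a design like this, the per-field terms would drive metadata filters while the free-text terms would be embedded and sent to the similarity search.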
Specialized handlers for each format:
- PDF Processing:
  - PyPDF2-based text extraction
  - Encryption detection
  - Metadata parsing
- Word Documents:
  - Full text extraction
  - Core properties retrieval
- Excel/CSV:
  - Data preview generation
  - Statistical summaries
- Text Files:
  - Encoding detection
  - Format preservation
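The encoding detection step for text files can be approximated with a fallback chain. The sketch below uses only the standard library as a stand-in; the project lists chardet as a dependency, which guesses encodings statistically and is not shown here. The `decode_text` name is hypothetical.

```python
import codecs
from typing import Tuple


def decode_text(raw: bytes) -> Tuple[str, str]:
    """Decode bytes to text and report which encoding was used.

    Tries a UTF-8 byte order mark, then plain UTF-8, then Latin-1.
    Latin-1 maps every byte to a character, so the last step never fails,
    although it may pick the wrong characters for other legacy encodings.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig"), "utf-8-sig"
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"
```

Decoding without stripping or normalizing whitespace keeps line breaks and indentation intact, which is one simple reading of "format preservation".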
RAG system:
- Chunk-based document processing
- Configurable overlap for context preservation
- Hybrid search combining:
  - Vector similarity
  - Category filtering
  - Date-based filtering
- Response generation with context integration
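The chunking and hybrid search steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: chunks are measured in words, similarity is a pure-Python cosine computation standing in for FAISS, and the document fields (`id`, `vector`, `category`, `year`) and the function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into word chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(
    query_vec: Sequence[float],
    docs: List[Dict],
    category: Optional[str] = None,
    min_year: Optional[int] = None,
    top_k: int = 5,
) -> List[Dict]:
    """Filter by metadata first, then rank the survivors by similarity."""
    candidates = [
        d for d in docs
        if (category is None or d["category"] == category)
        and (min_year is None or d["year"] >= min_year)
    ]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return ranked[:top_k]
```

Filtering before ranking keeps the similarity computation limited to documents that can actually be returned. With an approximate index, the opposite order (search, then filter) is also common, at the cost of sometimes returning fewer than `top_k` results.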
Paper analysis includes:
- Content extraction and summarization
- Methodology identification
- Results analysis
- Limitations assessment
- Future work extraction
Citation analysis provides:
- Citation counts and trends
- Geographic distribution
- Research impact metrics
- Venue analysis
- Temporal patterns
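As a rough illustration of the citation counts and temporal patterns listed above, the sketch below aggregates a list of citation years. The input format (one integer year per citing work) and both function names are assumptions made for this example; how the project actually retrieves and structures citation data is not shown here.

```python
from collections import Counter
from typing import Dict, Iterable


def citations_per_year(years: Iterable[int]) -> Dict[int, int]:
    """Count citations per year, returned in chronological order."""
    counts = Counter(years)
    return dict(sorted(counts.items()))


def year_over_year_change(per_year: Dict[int, int]) -> Dict[int, int]:
    """Difference in citation count between each year and the previous recorded year.

    Years with no citations are absent from the input, so a gap is compared
    against the last year that did have citations.
    """
    years = sorted(per_year)
    return {curr: per_year[curr] - per_year[prev] for prev, curr in zip(years, years[1:])}
```

The same grouping approach would extend to venue or geographic analysis by counting on a different key instead of the year.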
Contributions are welcome! For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
MIT License - see the LICENSE file for details.
Acknowledgments:
- arXiv API for paper access
- OpenAlex for citation data
- FAISS for vector search capabilities
- Sentence Transformers for embeddings
- Ollama and Gemma-2B for LLM support