🔬 Scientific chatbot that instantly searches arXiv.org papers, transforming an ocean of preprints into clear research insights. Powered by local LLMs from Ollama.

KazKozDev/researchify

🔬 Researchify - Scientific Research Assistant

Researchify is a specialized chatbot that streamlines scientific research by helping academics and researchers find, analyze, and understand scientific papers. It acts as a research assistant that searches academic databases, processes scholarly articles, analyzes citation patterns, and holds natural-language conversations about research topics.

This project aims to solve common challenges in academic research:

  • Time-consuming literature search and analysis
  • Complex paper interpretation and summarization
  • Citation impact assessment
  • Research trend identification
  • Cross-format document processing

Through natural language conversation, researchers can:

  • Search for relevant papers using plain language queries
  • Get paper summaries and key findings
  • Analyze citation patterns and impact
  • Process and extract information from various document formats
  • Explore research trends and connections

✨ Features

  • Conversational Interface: Natural language interaction for research queries and paper discussions
  • Smart Search:
    • Vector similarity search using FAISS and sentence transformers
    • Complex query parsing with field specifications
    • Hybrid retrieval combining semantic and metadata filters
  • Document Processing:
    • Multi-format support (PDF, DOCX, XLSX, TXT)
    • Automatic text extraction and analysis
    • Metadata extraction and processing
  • Research Analysis:
    • Paper content analysis and summarization
    • Citation impact evaluation
    • Geographic and temporal citation patterns
  • RAG System:
    • Context-aware response generation
    • Document chunking with overlap
    • Efficient vector storage and retrieval

🚀 Getting Started

Prerequisites

python 3.8+
ollama
PyMuPDF
flask
numpy
faiss-cpu
sentence-transformers
pandas
pypdf2
python-docx
chardet

Installation

  1. Clone the repository:
git clone https://github.com/KazKozDev/researchify.git
cd researchify
  2. Install dependencies:
pip install -r requirements.txt
  3. Start the Ollama server with the Gemma-2B model:
ollama run gemma2
  4. Run the application:
python app.py

🔧 Configuration

UPLOAD_FOLDER=uploads
MAX_CONTENT_LENGTH=16777216  # 16MB
PAPER_ANALYSIS_CACHE_SIZE=100
MODEL_NAME=gemma2
VECTOR_STORE_PATH=vector_store
SENTENCE_TRANSFORMER_MODEL=all-MiniLM-L6-v2
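
The README does not say how these values are read at startup. As a minimal sketch, assuming they are supplied as environment variables, they could be loaded with the defaults listed above. The `Settings` class and `load_settings` function are hypothetical names, not part of the project.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the configuration keys listed above."""
    upload_folder: str
    max_content_length: int
    paper_analysis_cache_size: int
    model_name: str
    vector_store_path: str
    sentence_transformer_model: str


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default).

    Assumption: missing keys fall back to the defaults shown in this README.
    """
    env = os.environ if env is None else env
    return Settings(
        upload_folder=env.get("UPLOAD_FOLDER", "uploads"),
        # 16 MB expressed in bytes (16 * 1024 * 1024 = 16777216)
        max_content_length=int(env.get("MAX_CONTENT_LENGTH", 16 * 1024 * 1024)),
        paper_analysis_cache_size=int(env.get("PAPER_ANALYSIS_CACHE_SIZE", 100)),
        model_name=env.get("MODEL_NAME", "gemma2"),
        vector_store_path=env.get("VECTOR_STORE_PATH", "vector_store"),
        sentence_transformer_model=env.get(
            "SENTENCE_TRANSFORMER_MODEL", "all-MiniLM-L6-v2"
        ),
    )
```

Calling `load_settings({"MODEL_NAME": "other-model"})` overrides a single key and keeps the remaining defaults, which makes the sketch easy to exercise without touching the real environment.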

🏗️ Architecture

The system consists of several interconnected components:

├── Core Components
│   ├── Vector Store (FAISS + Sentence Transformers)
│   ├── RAG System
│   └── Query Processing
├── Document Handlers
│   ├── PDF Processor
│   ├── Word Processor
│   ├── Excel Processor
│   └── Text Processor
├── Analysis Modules
│   ├── Paper Analyzer
│   ├── Citation Analyzer
│   └── Impact Assessor
└── API Layer
    ├── Flask Server
    ├── Chat Interface
    └── Research Endpoints

📊 Technical Details

Vector Search System

  • FAISS similarity search implementation
  • Sentence Transformer embeddings (all-MiniLM-L6-v2)
  • Complex query support:
    • Field-specific search (title, abstract, category)
    • Boolean operations
    • Metadata filtering
  • Thread-safe concurrent access
  • Persistent index storage
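
To illustrate how similarity ranking and metadata filtering can work together, here is a small sketch in plain Python. It is not the project's implementation: it swaps FAISS for a brute-force cosine similarity, uses hand-written toy vectors instead of Sentence Transformer embeddings, and the record fields (`id`, `category`, `year`, `vector`) and the `search` function are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, docs, k=3, category=None, min_year=None):
    """Return the top-k documents by cosine similarity.

    Metadata filters are applied first, then the survivors are ranked.
    Assumption: each document is a dict with 'category', 'year', and 'vector'.
    """
    candidates = [
        d for d in docs
        if (category is None or d["category"] == category)
        and (min_year is None or d["year"] >= min_year)
    ]
    ranked = sorted(
        candidates, key=lambda d: cosine(query_vec, d["vector"]), reverse=True
    )
    return ranked[:k]


# Toy records with made-up 3-dimensional vectors, for illustration only.
DOCS = [
    {"id": "a", "category": "cs.CL", "year": 2023, "vector": [0.9, 0.1, 0.0]},
    {"id": "b", "category": "cs.CV", "year": 2022, "vector": [0.8, 0.2, 0.1]},
    {"id": "c", "category": "cs.CL", "year": 2019, "vector": [0.1, 0.9, 0.2]},
]

top = search([1.0, 0.0, 0.0], DOCS, k=2, category="cs.CL")
print([d["id"] for d in top])  # ['a', 'c']
```

Filtering before ranking keeps the example short. With a real FAISS index, one common alternative is to retrieve more neighbours than needed and filter the results afterwards.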

Document Processing

Specialized handlers for each format:

  • PDF Processing:
    • PyPDF2-based text extraction
    • Encryption detection
    • Metadata parsing
  • Word Documents:
    • Full text extraction
    • Core properties retrieval
  • Excel/CSV:
    • Data preview generation
    • Statistical summaries
  • Text Files:
    • Encoding detection
    • Format preservation
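
One way to organize per-format handlers is a registry keyed by file extension. The sketch below shows that idea for text files only, under stated assumptions: `HANDLERS`, `process_document`, and `read_text_file` are hypothetical names, and a simple UTF-8 then Latin-1 fallback stands in for the chardet-based encoding detection the project lists as a dependency.

```python
from pathlib import Path


def read_text_file(path):
    """Decode a text file, trying UTF-8 first and falling back to Latin-1.

    Assumption: this fallback replaces real encoding detection (e.g. chardet).
    Latin-1 maps every byte to a character, so the fallback never raises.
    """
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# Hypothetical registry: PDF, DOCX, and XLSX handlers would be added here.
HANDLERS = {
    ".txt": read_text_file,
}


def process_document(path):
    """Dispatch a file to the handler registered for its extension."""
    suffix = Path(path).suffix.lower()
    handler = HANDLERS.get(suffix)
    if handler is None:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return handler(path)
```

Because the extension is checked before the file is opened, unsupported uploads are rejected without reading their contents.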

RAG Implementation

  • Chunk-based document processing
  • Configurable overlap for context preservation
  • Hybrid search combining:
    • Vector similarity
    • Category filtering
    • Date-based filtering
  • Response generation with context integration
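
Chunking with overlap can be sketched as a sliding window over the document's words. This is an illustration rather than the project's code: the `chunk_text` name, the word-based splitting, and the default sizes are all assumptions.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words, so a sentence cut at a chunk
    boundary still appears with some surrounding context in the next chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_text(sample, chunk_size=4, overlap=2))
# ['0 1 2 3', '2 3 4 5', '4 5 6 7', '6 7 8 9']
```

A larger overlap preserves more context across boundaries at the cost of storing and embedding more duplicated text.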

Analysis Capabilities

Paper analysis includes:

  • Content extraction and summarization
  • Methodology identification
  • Results analysis
  • Limitations assessment
  • Future work extraction

Citation analysis provides:

  • Citation counts and trends
  • Geographic distribution
  • Research impact metrics
  • Venue analysis
  • Temporal patterns
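
As a rough illustration of the temporal and geographic summaries, the sketch below aggregates a list of citing works with `collections.Counter`. The record shape (`year`, `countries`) is a simplified assumption and does not mirror the OpenAlex response format, the function names are hypothetical, and the sample records are made up.

```python
from collections import Counter


def citations_per_year(citing_works):
    """Count citing works per publication year, sorted by year."""
    counts = Counter(w["year"] for w in citing_works)
    return dict(sorted(counts.items()))


def top_countries(citing_works, n=3):
    """Return the n most frequent countries among citing works."""
    counts = Counter(
        country for w in citing_works for country in w.get("countries", [])
    )
    return counts.most_common(n)


# Made-up records, for illustration only.
SAMPLE_WORKS = [
    {"year": 2021, "countries": ["US"]},
    {"year": 2022, "countries": ["US", "DE"]},
    {"year": 2022, "countries": ["ES"]},
    {"year": 2023, "countries": ["US"]},
]

print(citations_per_year(SAMPLE_WORKS))  # {2021: 1, 2022: 2, 2023: 1}
print(top_countries(SAMPLE_WORKS, n=1))  # [('US', 3)]
```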

🀝 Contributing

Contributions are welcome! For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch
  3. Commit your changes
  4. Push to the branch
  5. Open a Pull Request

📝 License

MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • arXiv API for paper access
  • OpenAlex for citation data
  • FAISS for vector search capabilities
  • Sentence Transformers for embeddings
  • Ollama and Gemma-2B for LLM support

From Bcn with ❤️ by KazKozDev