A Python script that automatically archives Wikipedia articles related to Python programming language, maintaining a structured and searchable knowledge base. 🐍
open_wiki/
├── script.py # Main entry point for archiving
├── wiki_archiver/ # Modular package
│ ├── config/ # Configuration settings
│ ├── core/ # Core archiving logic
│ ├── logging/ # Thread-safe logging
│ └── utils/ # Utility functions
├── wiki_articles/ # Archived article storage
└── requirements.txt # Project dependencies
- 🌐 Wikipedia article archiving
- 🔍 Category-based article discovery
- 💾 Progress tracking and resuming
- 🚀 Parallel processing
- 📝 Markdown content generation
-
Install Dependencies:
pip install -r requirements.txt
-
Run Locally:
python script.py [OPTIONS]
Options:
-c, --category
: Wikipedia category to archive (default: Python (programming language))-l, --language
: Wikipedia language (default: en)-d, --depth
: Maximum category recursion depth (default: 1)-o, --output
: Output directory for archived articles
-
GitHub Actions (automatic):
- ⏰ Runs daily at midnight UTC
- 💾 Caches dependencies and progress
- 🔄 Updates articles automatically
- 📤 Commits changes to repository
Edit these variables in wiki_archiver/config/__init__.py
:
LANGUAGE = "en" # Article language
CATEGORY = "Python (programming language)" # Main category
MAX_DEPTH = 1 # Category depth
MAX_WORKERS = 10 # Parallel threads
- ⚡ Parallel Processing: Uses ThreadPoolExecutor for concurrent downloads
- 💾 Progress Caching: Saves and resumes from last state
- 🔄 Rate Limiting: Smart API request management
- 📦 Efficient Storage: Deduplicates articles across categories
- ⚡ GitHub Actions Optimization: Caches dependencies and article data
Current archiving progress by category:
[░░░░░░░░░░] 1% 61.0% - Overall progress
[░░░░░░░░░░] 0% 0.0% - Core Language
[░░░░░░░░░░] 1% 100.0% - Libraries & Frameworks
[░░░░░░░░░░] 0% 0.0% - Development Tools
[░░░░░░░░░░] 0% 0.0% - Community & Culture
[░░░░░░░░░░] 0% 0.0% - Applications
- 🔱 Fork the repository
- 🌿 Create your feature branch
- 💾 Commit your changes
- 🚀 Push to the branch
- 📬 Create a Pull Request
- 📚 Support for additional programming language categories
- 🔍 Full-text search capabilities
- 📊 Article diff tracking
- ⚙️ Custom category configuration
- 🔌 API for programmatic access
Check out our TODO list for upcoming features, improvements, and project goals. We welcome contributions and suggestions!
Track changes in Wikipedia articles over time with our built-in diff tracking feature.
Quick example:
from wiki_archiver.diff_tracker import track_wikipedia_article_changes
# Track changes for a specific article
diff_info = track_wikipedia_article_changes('Python (programming language)')
print(diff_info['change_summary'])
For detailed documentation, see the Diff Tracking Guide.
For detailed information about programmatically using the Wikipedia Article Archiver, please refer to the API Documentation.
Quick example:
from wiki_archiver.api import archive_wikipedia
# Archive Wikipedia articles
results = archive_wikipedia(
categories=['Python (programming language)']
)
This project is licensed under the MIT License - see the LICENSE file for details.
- Ensure you have the latest version of Python
- Check network connectivity
- Verify API rate limits
- Review logs in
wiki_archiver.log
For issues or suggestions, please open an issue on GitHub.