news-please - an integrated web crawler and information extractor for news that just works
Wiktionary dump file parser and multilingual data extractor
SEO and security audit for websites: Lighthouse and security headers crawler, sitemap/keywords/images extractor, summarizer, etc.
A framework for creating semi-automatic web content extractors
Python-based open-source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, with ingestion into a Solr or Elasticsearch index and a linked data graph database
URLExtract is a Python class for collecting (extracting) URLs from a given text by locating TLDs.
A multiprocess decompression tool for high-performance systems
Burp Suite extension for checking and extracting sensitive request parameters
Python reader of LabVIEW RSRC files (VI, CTL, LLB). File format description on the Wiki.
Extracts features from URLs to build a data set for machine learning. The goal is to find a machine learning model that predicts phishing URLs targeting the Brazilian population.
Extract an article or news item from a URL or HTML, parse the title and content, and output it in Markdown format.
Basic website cloner written in Python
This Python script extracts Macromedia / Adobe Director movies and casts from Windows and Mac executables.
Anatomy and visualization of the network structure of the dark web using a multi-threaded crawler