GitHub - owen9825/captcha-middleware: A middleware layer for Scrapy that detects CAPTCHA tests and solves them

captchaMiddleware

Checks for a CAPTCHA test and tries solving it. This is open-source so as to prevent slaves from being forced to solve CAPTCHA tests.

Installation

Note that this program relies on Tesseract v4, which is available on Ubuntu 18.04.

Install Tesseract and the language file

sudo apt-get install tesseract-ocr
sudo mkdir /usr/local/share/tessdata
wget https://github.com/tesseract-ocr/tessdata/raw/4.00/eng.traineddata
sudo mv eng.traineddata /usr/local/share/tessdata
sudo chmod a+w /usr/local/share/tessdata/eng.traineddata
export TESSDATA_PREFIX=/usr/local/share/tessdata
which tesseract

Make sure to include the TESSDATA_PREFIX in your bash profile.

Install Pillow in Python to substitute for PIL

pip install pillow

Install captchaMiddleware

python setup.py test
python setup.py install

If the tests fail, test your tesseract installation:

tesseract "unknown letter 0.jpg" prediction --psm 10 --oem 0

Configuration

Include this in the Downloader Middleware

DOWNLOADER_MIDDLEWARES = {
    …
    'captchaMiddleware.middleware.CaptchaMiddleware':500
}

In your spider, set a meta key to prevent trying the tests too many times:

from captchaMiddleware.middleware import RETRY_KEY
…
def start_requests(self):
     …
     for url in urls:
          yield scrapy.Request(url=url, callback=self.parse,
               errback=self.errorHandler, meta={RETRY_KEY:0})

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
captchaMiddleware		captchaMiddleware
.gitignore		.gitignore
LICENSE		LICENSE
README.rst		README.rst
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

captchaMiddleware

Installation

Configuration

About

Releases

Packages

Contributors 3

Languages

License

owen9825/captcha-middleware

Folders and files

Latest commit

History

Repository files navigation

captchaMiddleware

Installation

Configuration

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages