Skip to content

A Powerful CLI Tool to automatically scrape information from Controller Of Examination AU written in Python 🐍.

License

Notifications You must be signed in to change notification settings

antony-jr/CoePy

Repository files navigation

CoePy GitHub issues GitHub forks GitHub stars GitHub license

This is a simple and powerful CLI tool written in python which can automate student information extraction from AU , and eventually makes our lives better. This script simply uses selenium to enter form feilds , the interesting part is that this script automatically solves the captcha generated by COE AU Website. Thus saves a lot of time , This can also help you if you want to check marks in bulk , Just execute this script , drink your coffee and wait for the script to show your requested information.

IMPORTANT : Only tested the login process , Which is quite good. But Still cannot test the scraping of information since the website is not responding to anyone , neither humans nor bots.

NOTE: Only tested on linux , may or may not work in other platforms.

Installation

I did not publish this in pypi(Python Package Index) since the code will be updated frequently , and cannot release it a gazillion times ,And also this project can be discontinued anytime.

Therefore you have to install it manually from source , Don't worry it will be easy. Before you do anything , Make sure you have google chrome or chromium installed in your computer. (Which will be used by this script to render the website since it depends so much on a real browser , I will be honest , the website is very cranky when scraped with requests)

Now execute the following commands in your terminal ,

 $ git clone https://github.com/antony-jr/CoePy
 $ cd CoePy
 $ sudo pip3 install -r requirements.txt
 $ ./coepy.py --help

Note : This script is only tested on Python 3.7.

Cracking Captcha

The most interesting part of this project is cracking the captcha generated by the website , Thats the biggest hurdle in automating this process , right ? So here is how I cracked it.

So first lets take a look at captcha's generated by COE AU Website ,

Note : On observing a sample of 1500 captcha's generated by COE AU Website , the captcha seems to only follow the above two patterns.

By Observations ,
The Static Properties of Captcha's generated by COE AU Website are,

  • Resolution of any arbitary captcha is always 70x20 Pixels (can be taken as 70x20 Matrix).
  • Any arbitary captcha is always binary , (i.e) It will always use only black and white colours.
  • Only uses numericals 0-9 and alphabets A-Z
  • Has very few noise.
  • Any arbitary captcha has 6 characters inscribed in it.
  • A Single character can be fitted inside a 10x8 Matrix.

Now lets do some math ,
Let 'R' be a matrix of a arbitary captcha image which is in binary.(i.e White Pixel is 255 and Black Pixel is 0) ,
Now from the static properties we know that 'R' must be a 70x20 Matrix , like so...

Let 'Cn' be a matrix that represents a single character from the captcha , Where 'n' represents the nth character from the captcha , Thus the range of 'n' must be 0 <= n < Number of Characters in the Captcha. Now from the static properties we know that 'Cn' must be a 10x8 Matrix , like so...

Let T0 T1 T2 ... T35 be a set of matrix which is of order 10x8 , Let this set be the test set., These matrices are obtained from coverting all possible characters in the captcha's generated by COE AU Website to a matrix which has the character in white pixels and background in black pixels , These characters are un-noised and filtered from all samples. From the static properties we know that the maximum number of characters is 36. Now these matrices are stored for later use.

Now each matrix in the set {C0 C1 C2 ... C5} is compared againts each element in the test set. The Percentage of Match is calculated between a single character(Cn) in the arbitary captcha and all the characters in the test set , The element in test set with the highest Percentage of Match is the most appropriate character in the position n of the resultant string which must have a length of 6(Since the captcha has only 6 characters inscribed in it).

T0 corresponds to the character '0' , T1 corresponds to the character '1' , and thus Tm corresponds to the character in chronological order in the list of all possible characters '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'.

The Percentage of Match is calculated like so... ,

Therefore The most appropriate character for nth character in the resultant string is obtained , This process is repeated upto n=6(Since the maximum character inscribed in the captcha is six). Thus the resultant string is obtained.

To see this in action , You can use the coepy-cpt.py script which is used to test the captcha parser ,

 $ cd CoePy
 $ pip3 install -r requirements.txt # Install the dependencies.
 $ python coepy-cpt.py CaptchaSamples/CaptchaSample0.png # The first captcha 
 $ python coepy-cpt.py CaptchaSamples/CaptchaSample1.png # The second captcha

Still this can get us wrong results sometimes , but we can just re-try , We just need the most likely answers.

License

The MIT License.
Copyright (C) 2018 Antony Jr.

About

A Powerful CLI Tool to automatically scrape information from Controller Of Examination AU written in Python 🐍.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published