CLI tool to recognise handwritten text from answer sheets using Tesseract OCR.
The extracted text is then used to evaluate marks using NLP.

Installation:
Install the Tesseract OCR engine: https://github.com/tesseract-ocr/tesseract/wiki
Install the Python dependencies: pytesseract, pillow, pandas, numpy, matplotlib (the HOCR conversion code also imports lxml)
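
Before running the tool, it can help to confirm that the dependencies are actually visible to Python. The following is a minimal sketch, not part of the repository: the helper names `missing_modules` and `tesseract_on_path` are hypothetical, and it assumes the `tesseract` executable is expected on the PATH (on Windows the project sets the path explicitly instead).

```python
import importlib.util
import shutil

# Import names of the dependencies; note that pillow is imported as "PIL".
# lxml is included because the HOCR conversion code imports it.
REQUIRED_MODULES = ["pytesseract", "PIL", "pandas", "numpy", "matplotlib", "lxml"]


def missing_modules(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def tesseract_on_path():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(REQUIRED_MODULES) or "none")
    print("tesseract executable:", tesseract_on_path() or "not found on PATH")
```

If the executable is reported as not found, the path assigned to `pytesseract.tesseract_cmd` in main.py has to point at the installed binary.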

Usage:
1) Clone the repository into your working directory.
2) Update the path to the Tesseract executable in main.py.
3) Add the image you want to test to the images folder.
4) Run main.py imagename
It returns a HOCR file, which is very similar to XHTML.
5) Run file_conversion.py hocrfilename
It converts the HOCR file into a dataframe and stores the output in a pickle file and a JSON file.

Phase 1: demonstration of OCR on handwritten text and export of the result to JSON
(Rendered Python notebook displayed as markdown using nbconvert)

Phase 2: using nltk to create an NLP model to evaluate answers

Download all the packages using the nltk downloader

import nltk
nltk.download()

from datetime import datetime
import os
import sys

from pytesseract import pytesseract

# Edit the path to the tesseract executable if your installation directory is different
pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract'

def replaceMultiple(mainString, toBeReplaces, newString):
    # Replace every occurrence of each element of toBeReplaces with newString
    for elem in toBeReplaces:
        if elem in mainString:
            mainString = mainString.replace(elem, newString)
    return mainString

def generateFilename():
    # Build a file name from the current timestamp with all separators removed
    mainStr = str(datetime.now())
    file_name = replaceMultiple(mainStr, [':', '-', '.', ' '], "")
    return file_name

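
The same timestamp-based name can also be produced with `strftime`. This is only a sketch of an alternative, and `generate_filename_strftime` is a hypothetical name that does not exist in the repository. One difference worth knowing: `str(datetime.now())` omits the fractional part when the microsecond value happens to be zero, whereas `%f` always emits six digits, so the name below always has 20 digits.

```python
from datetime import datetime


def generate_filename_strftime(now=None):
    """Hypothetical alternative to generateFilename(): format the timestamp directly."""
    if now is None:
        now = datetime.now()
    # %f is always six digits, so the result always has 20 digits.
    return now.strftime("%Y%m%d%H%M%S%f")


# A fixed timestamp makes the result reproducible.
print(generate_filename_strftime(datetime(2018, 10, 21, 4, 23, 17, 89205)))
# prints 20181021042317089205
```
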
from PIL import Image
from IPython.display import display
import matplotlib.pyplot as plt

im = Image.open("testfile1.jpg")
fig, ax = plt.subplots()
ax.imshow(im)
print("(width,height):" + str(im.size))

Output:
(width,height):(3000, 3115)

box = (250, 180, 2800, 400)  # (left, upper, right, lower) in pixels
cropped_image = im.crop(box)
display(cropped_image)
cropped_text = pytesseract.image_to_string(cropped_image, lang='eng')
print(cropped_text)

Output:
Conductor wn magnetic Field Produce voltage :

def createHOCR(imagepath):
    # Run Tesseract with the "hocr" config so that it writes a HOCR file
    filename = generateFilename()
    pytesseract.run_tesseract(imagepath, filename, lang=None, extension='html', config="hocr")
    print("HOCR file generated: " + str(filename) + ".hocr")

createHOCR("testfile.jpg")

Output:
HOCR file generated: 20181021042317089205.hocr

from lxml import etree
import pandas as pd
import os
import sys
import generate_filename as gf
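def hocr_to_dataframe(fp):
    """Parse a HOCR file and return a dataframe of words and their confidences."""
    doc = etree.parse(fp)
    words = []
    wordConf = []

    for path in doc.xpath('//*'):
        # Word elements carry the attribute value 'ocrx_word' (their class)
        if 'ocrx_word' in path.values():
            # The confidence is stored in the title attribute as 'x_wconf <number>'
            conf = [x for x in path.values() if 'x_wconf' in x][0]
            wordConf.append(int(conf.split('x_wconf ')[1]))
            words.append(path.text)

    dfReturn = pd.DataFrame({'word': words,
                             'confidence': wordConf})

    return dfReturn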
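
To make the HOCR structure concrete, here is a small self-contained sketch of the same extraction using only the standard library. It is not the repository's code: `hocr_words`, `WCONF_PATTERN` and the `SAMPLE_HOCR` string are hypothetical, and the sample is hand-written to resemble Tesseract output rather than copied from a real file.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample in the style of HOCR output (an assumption, not a real file).
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <span class='ocr_line' title='bbox 0 0 100 20'>
   <span class='ocrx_word' id='word_1_1' title='bbox 10 5 40 18; x_wconf 96'>Define</span>
   <span class='ocrx_word' id='word_1_2' title='bbox 45 5 90 18; x_wconf 23'>Sane</span>
  </span>
 </body>
</html>"""

# The word confidence appears in the title attribute as "x_wconf <number>".
WCONF_PATTERN = re.compile(r"x_wconf\s+(\d+)")


def hocr_words(hocr_text):
    """Return a list of (word, confidence) tuples found in an HOCR string."""
    root = ET.fromstring(hocr_text)
    rows = []
    for element in root.iter():
        # Only elements whose class list contains 'ocrx_word' are words.
        if "ocrx_word" not in element.get("class", "").split():
            continue
        match = WCONF_PATTERN.search(element.get("title", ""))
        if match is None:
            continue
        rows.append((element.text or "", int(match.group(1))))
    return rows


print(hocr_words(SAMPLE_HOCR))
# prints [('Define', 96), ('Sane', 23)]
```

Checking the `class` attribute by name, rather than scanning every attribute value, avoids matching an element that merely mentions `ocrx_word` somewhere else.
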
filename=generateFilename()
dataframe=hocr_to_dataframe("20181021041156998790.hocr")
dataframe
    word                              confidence
0                                     95
1                                     95
2   Q1.                               89
3   Define                            96
4   electromagnetic                   96
5   induction.                        95
6   Sane                              23
7   |                                 90
8   Conductor                         93
9   mM                                42
10  magnetic                          70
11  Field                             63
12  produce                           67
13  voltage                           65
14  ‘Seconaewctntmnstnn               0
15  esionainsnenaneenrenncconanniiti  0
16  Q2.                               89
17  What                              96
18  are                               96
19  3                                 96
20  examples                          96
21  of                                95
22  transparent                       95
23  objects?                          96
24  (Professor                        96
25  provides                          96
26  5                                 96
27  as                                95
28  input)                            90
29                                    95
30  Q3.                               92
31  Complete                          96
32  the                               96
33  network                           95
34  tree.                             96
35                                    95

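
The confidence column makes it possible to discard tokens that Tesseract was unsure about before any further processing. The sketch below is an illustration only: the threshold `MIN_CONFIDENCE` and the helper `keep_confident` are hypothetical and not taken from the project, and the rows are a handful copied from the output above.

```python
# A few (word, confidence) rows copied from the output above.
rows = [
    ("Define", 96),
    ("electromagnetic", 96),
    ("induction.", 95),
    ("Sane", 23),
    ("Conductor", 93),
    ("mM", 42),
    ("magnetic", 70),
    ("esionainsnenaneenrenncconanniiti", 0),
]

MIN_CONFIDENCE = 60  # hypothetical cut-off, not taken from the project


def keep_confident(rows, min_confidence):
    """Keep non-empty words whose confidence is at least min_confidence."""
    return [word for word, conf in rows if word.strip() and conf >= min_confidence]


print(keep_confident(rows, MIN_CONFIDENCE))
# prints ['Define', 'electromagnetic', 'induction.', 'Conductor', 'magnetic']
```

A suitable threshold depends on the handwriting and image quality, so it would need tuning against real answer sheets.
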
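# Note: the file is written with a lowercase .json extension,
# although the message below prints it as .JSON
dataframe.to_json(filename + ".json", orient='columns')
print("JSON generated: " + filename + ".JSON")
dataframe.to_pickle(filename + ".pkl")
print("Pickle generated: " + filename + ".pkl")

Output:
JSON generated: 20181021042319190731.JSON
Pickle generated: 20181021042319190731.pkl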
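
With `orient='columns'`, pandas writes one JSON object per column, keyed by the row index as a string. The sketch below shows how such a file could be read back without pandas. `json_columns_to_rows` is a hypothetical helper and `SAMPLE_JSON` is a hand-written two-row example in that layout, not the actual generated file.

```python
import json

# Hand-written example of the layout produced by to_json(orient='columns'):
# one object per column, keyed by the row index as a string.
SAMPLE_JSON = '{"word": {"0": "Q1.", "1": "Define"}, "confidence": {"0": 89, "1": 96}}'


def json_columns_to_rows(text):
    """Rebuild (word, confidence) rows, ordered by their numeric row index."""
    columns = json.loads(text)
    indices = sorted(columns["word"], key=int)
    return [(columns["word"][i], columns["confidence"][i]) for i in indices]


print(json_columns_to_rows(SAMPLE_JSON))
# prints [('Q1.', 89), ('Define', 96)]
```

Sorting the keys with `key=int` matters once there are ten or more rows, because the string keys would otherwise sort as "0", "1", "10", "11", and so on.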