Fully automated end-to-end framework to extract data from bar plots and other figures in scientific research papers, using modules such as OpenCV and AWS Rekognition for text detection in images.
pdffigures2 is used to extract/download images (charts + tables) from the research papers.
Bar plots used are here: https://drive.google.com/drive/u/1/folders/154sgx3M49NoKOoOjoppsSuvqd2WzqZqX
Step 1: The google_images_download Python module is used to download Google Images for each type of chart (area chart, line chart, bar plot, pie chart, Venn diagram, etc.) based on their corresponding keywords.
$ git clone https://github.com/Joeclinton1/google-images-download.git
$ cd google-images-download && python setup.py install
Step 2: The downloaded images are carefully reviewed and the incorrect images are removed.
The following are the training data and model files used:
training corpus: https://drive.google.com/drive/u/1/folders/1M8kwdQE7bpjpdT08ldBURFdzLaQR9n5h
model: https://drive.google.com/drive/u/1/folders/1GVW_MtFFYT-Tj44p0_QLKM7hVnn_AcKI
Below is the count of images for each type:
A pretrained VGG19 model is fine-tuned on these images and then run on the test images to classify them into 13 different types, such as bar chart, line graph, pie chart, etc.
The accuracy is measured using stratified five-fold cross-validation; the model achieves 84.08% accuracy. The following are the training accuracy and loss curves captured during the training phase for each fold of cross-validation.
The following are 100 randomly picked images that were predicted as bar plots. The highlighted images (6 of the 100) are incorrectly classified as bar plots.
- First, the image is converted to a black-and-white (binary) image, and the longest run of consecutive 1s along each row and each column is computed.
- Next, the maximum of these run lengths over all columns is found.
- Assuming a threshold (here 10), the first column whose longest run falls in the region [max - threshold, max + threshold] is taken as the y-axis.
- A similar approach is followed for the x-axis, but the last row whose longest run falls in [max - threshold, max + threshold] is picked.
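The steps above can be sketched as follows. This is a minimal NumPy sketch, not the repository's actual code: the helper names `max_run_of_ones` and `detect_axes` are hypothetical, and the threshold of 10 follows the text.

```python
import numpy as np

def max_run_of_ones(arr):
    """Length of the longest run of consecutive 1s in a 1-D 0/1 array."""
    best = run = 0
    for v in arr:
        run = run + 1 if v else 0
        best = max(best, run)
    return best

def detect_axes(binary, threshold=10):
    """Return (y_axis_column, x_axis_row) for a 0/1 image (1 = dark pixel).

    The y-axis is the first column whose longest vertical run of 1s lies
    within `threshold` of the best column's run; the x-axis is the last
    such row, per the procedure described above.
    """
    col_runs = [max_run_of_ones(binary[:, c]) for c in range(binary.shape[1])]
    row_runs = [max_run_of_ones(binary[r, :]) for r in range(binary.shape[0])]
    col_max, row_max = max(col_runs), max(row_runs)
    y_axis = next(c for c, v in enumerate(col_runs) if v >= col_max - threshold)
    x_axis = max(r for r, v in enumerate(row_runs) if v >= row_max - threshold)
    return y_axis, x_axis
```

On a synthetic image with an L-shaped pair of axes, this returns the column of the vertical line and the row of the horizontal line.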
Both axes are detected correctly for 1006 of the 1254 images in the test data set. Below are some of the failed cases in axes detection.
AWS Rekognition's DetectText API is used to detect text in the image. Only text with confidence >= 80 is considered.
To improve text detection, a double-pass algorithm is employed:
- Detect text using the detect_text AWS Rekognition API, keeping only the text boxes with confidence >= 80.
- Fill the polygons corresponding to these text boxes with white.
- Run text detection a second time on the resulting image, again keeping only detections with confidence >= 80.
The bounding box that Rekognition returns is unreliable for vertical or angled text. Therefore, the bounding box is instead computed from the polygon coordinates (vertices) in the AWS Rekognition output.
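A sketch of the double pass under these assumptions: the helper names `polygon_to_bbox`, `whiten_detections`, and `detect_text_two_pass` are my own, while the response shape (a `TextDetections` list whose entries carry `Confidence` and a normalized `Geometry.Polygon`) is from the Rekognition `detect_text` API. The driver function needs AWS credentials and is only illustrative.

```python
import numpy as np

def polygon_to_bbox(polygon, width, height):
    """Axis-aligned pixel bbox from Rekognition's normalized polygon vertices.

    Rekognition's BoundingBox is unreliable for rotated text, so we take
    min/max over the Polygon vertices instead, as described above.
    """
    xs = [p["X"] * width for p in polygon]
    ys = [p["Y"] * height for p in polygon]
    return int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))

def whiten_detections(image, detections, min_conf=80):
    """White out every detected word box with confidence >= min_conf."""
    h, w = image.shape[:2]
    for det in detections:
        if det["Type"] == "WORD" and det["Confidence"] >= min_conf:
            x0, y0, x1, y1 = polygon_to_bbox(det["Geometry"]["Polygon"], w, h)
            image[y0:y1 + 1, x0:x1 + 1] = 255
    return image

def detect_text_two_pass(image, min_conf=80):
    """Illustrative two-pass driver (requires AWS credentials and OpenCV):
    detect, whiten the first-pass boxes, then detect again."""
    import boto3, cv2  # assumed installed and configured
    client = boto3.client("rekognition")
    png = cv2.imencode(".png", image)[1].tobytes()
    first = [d for d in client.detect_text(Image={"Bytes": png})["TextDetections"]
             if d["Type"] == "WORD" and d["Confidence"] >= min_conf]
    whitened = whiten_detections(image.copy(), first, min_conf)
    png2 = cv2.imencode(".png", whitened)[1].tobytes()
    second = [d for d in client.detect_text(Image={"Bytes": png2})["TextDetections"]
              if d["Type"] == "WORD" and d["Confidence"] >= min_conf]
    return first + second
```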
- Filter the text boxes that are below the x-axis and to the right of the y-axis.
- Run a sweeping (horizontal) line from the x-axis (found by the axes-detection algorithm) toward the bottom of the image, and find the position where it intersects the maximum number of text boxes.
- This position of maximum intersection gives the bounding boxes of all the x-labels.
- Filter the text boxes that are below the x-labels.
- Run a sweeping line from the x-labels toward the bottom of the image, and again find the position of maximum intersection.
- This gives the bounding boxes of all the x-text.
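The sweeping-line search used above (and reused for y-labels and legends below) can be sketched as a single helper; `max_intersection` is a hypothetical name, and boxes are assumed to be `(x0, y0, x1, y1)` pixel tuples.

```python
def max_intersection(boxes, start, end, axis="y"):
    """Sweep a line from `start` to `end` (pixel positions) and return
    (position, boxes_hit) where the line crosses the most boxes.

    axis="y" sweeps a horizontal line over y positions (x-label search);
    axis="x" sweeps a vertical line over x positions (y-label search).
    """
    best_pos, best_hits = start, []
    step = 1 if end >= start else -1
    for pos in range(start, end, step):
        if axis == "y":
            hits = [b for b in boxes if b[1] <= pos <= b[3]]
        else:
            hits = [b for b in boxes if b[0] <= pos <= b[2]]
        if len(hits) > len(best_hits):
            best_pos, best_hits = pos, hits
    return best_pos, best_hits
```

For x-labels one would sweep from the detected x-axis row to the image bottom; the boxes hit at the best position are the x-label boxes.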
- Filter the text boxes that are to the left of the y-axis.
- Run a sweeping (vertical) line from the y-axis toward the left edge of the image, and find the position where it intersects the maximum number of text boxes.
- Take the text boxes at that position of maximum intersection, and filter them further using a Python regex so that only numeric values remain; these are the y-labels.
- The remaining text boxes to the left of the y-axis, i.e. those not classified as y-labels, are considered y-text.
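The numeric-regex split described above might look like the following sketch; `split_y_labels` and the exact regex are my own choices (the source only says "numeric values"), and each text box is assumed to be a dict with a `"text"` key.

```python
import re

# Matches integers and simple decimal/thousands-separated numbers,
# e.g. "0", "-5", "2.5", "1,000"; anything else is treated as y-text.
NUMERIC = re.compile(r"^-?\d+(?:[.,]\d+)?$")

def split_y_labels(left_boxes):
    """Split text boxes left of the y-axis into numeric y-labels and y-text."""
    y_labels = [b for b in left_boxes if NUMERIC.match(b["text"])]
    y_text = [b for b in left_boxes if not NUMERIC.match(b["text"])]
    return y_labels, y_text
```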
- Filter the text boxes that are above the x-axis and to the right of the y-axis.
- Clean the text to remove stray 'I' characters; these arise because error bars in the charts are detected as 'I' by the AWS Rekognition OCR.
- Use an appropriate regex to discard purely numeric values; these are mostly the numbers printed on top of the bars to denote bar values.
- Merge the remaining text boxes (with an x-distance threshold of 10 pixels) so that multi-word legends fall within a single bounding box.
- Since legends can be grouped horizontally or vertically, two sweeping lines (in the x and y directions) are needed to detect them.
- Run a sweeping line from the y-axis toward the right edge of the image, and find the position where it intersects the maximum number of text boxes.
- Likewise, run a sweeping line from the x-axis toward the top of the image, and find its position of maximum intersection.
- The larger of the two maximum intersections gives the bounding boxes of all the legends.
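The merge step above (joining word boxes into multi-word legend boxes) can be sketched as follows; `merge_boxes` is a hypothetical helper, boxes are `(x0, y0, x1, y1)` tuples, and the 10-pixel x-threshold follows the text.

```python
def merge_boxes(boxes, x_threshold=10):
    """Merge horizontally adjacent word boxes into one box per legend.

    Two boxes are merged when the horizontal gap between them is at most
    x_threshold pixels and their vertical extents overlap (same line).
    """
    merged = []
    for box in sorted(boxes):  # left-to-right by x0
        if merged:
            last = merged[-1]
            same_line = not (box[1] > last[3] or box[3] < last[1])
            if same_line and box[0] - last[2] <= x_threshold:
                merged[-1] = (last[0], min(last[1], box[1]),
                              max(last[2], box[2]), max(last[3], box[3]))
                continue
        merged.append(box)
    return merged
```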
Y-axis ticks are detected as the text boxes left-bounding the y-axis. Since text detection of numeric values isn't perfect, once the pixel positions of the ticks and the actual y-label values are obtained, outliers are removed by assuming a normal distribution and discarding values that deviate too much from the mean. Then the mean pixel distance between consecutive ticks and the mean difference between consecutive y-label values are calculated, and the value-tick ratio is obtained as:

value-tick ratio = (mean difference between consecutive y-label values) / (mean pixel distance between consecutive ticks)

This ratio is used to calculate the y-values from each bar plot using the pixel-projection method.
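A sketch of the ratio computation under these assumptions; `value_tick_ratio` is a hypothetical name, and the outlier rule (drop gaps more than `z` standard deviations from the mean gap) is one concrete reading of "assuming a normal distribution".

```python
import numpy as np

def value_tick_ratio(tick_pixels, label_values, z=2.0):
    """(mean y-label difference) / (mean tick pixel distance).

    Consecutive gaps more than z standard deviations from the mean gap
    are treated as outliers (mis-detected ticks/labels) and dropped.
    """
    def clean_gaps(vals):
        gaps = np.abs(np.diff(np.sort(np.asarray(vals, dtype=float))))
        if gaps.std() > 0:
            gaps = gaps[np.abs(gaps - gaps.mean()) <= z * gaps.std()]
        return gaps
    return clean_gaps(label_values).mean() / clean_gaps(tick_pixels).mean()
```

For ticks 50 pixels apart labeled 0, 10, 20, 30 this yields 0.2 data units per pixel.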
- As an initial step, all the bounding boxes for the text in the image are whitened.
- Convert the resulting image into a binary image.
- Find contours (and bounding rectangles) in the resulting image.
- For each legend, find the nearest bounding rectangle to the left of, and at the same height as, the legend.
- Find the dominant color (or pattern) of that nearest bounding rectangle for each legend.
- All the pixel values of the image are divided into clusters. Prior to clustering, all white pixels are removed, and the bounding boxes found above for each legend are whitened.
- The number of clusters is determined by the number of legends detected; the colors found in the previous step serve as the initial cluster centers.
- The given plot is then decomposed into multiple plots (one per cluster), each of which is a simple bar plot; i.e., clustering converts a stacked bar chart into multiple simpler bar plots.
- Contours are then found for each simplified plot, and bounding rectangles are computed for those contours.
- For each label, the closest bounding rectangle is picked.
- The height of each bounding rectangle is recorded (using the merged rectangles obtained above) and multiplied by the value-tick ratio to obtain the y-values.
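The last two steps, picking the closest bar rectangle for each x-label and converting its pixel height to a data value, can be sketched as follows; `bar_values` is a hypothetical helper, with rectangles as `(x0, y0, x1, y1)` tuples and `ratio` as the value-tick ratio computed earlier.

```python
def bar_values(label_centers, bar_rects, ratio):
    """For each x-label center (x pixel), pick the horizontally closest
    bar rectangle and convert its pixel height to a data value."""
    values = []
    for cx in label_centers:
        rect = min(bar_rects, key=lambda r: abs((r[0] + r[2]) / 2 - cx))
        values.append((rect[3] - rect[1]) * ratio)  # height in px * ratio
    return values
```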
Below are the data extraction results on a sample image.
The results (axes, legends, labels, values, captions and file names) are written to an Excel sheet.