Fully automated end-to-end framework to extract data from bar plots and other figures in scientific research papers using modules such as OpenCV, AWS-Rekognition.

dsandeep0138/ChartReader

ChartReader

A fully automated end-to-end framework for extracting data from bar plots and other figures in scientific research papers, using modules such as OpenCV and AWS Rekognition (for text detection in images).

Figure extraction

pdffigures2 is used to extract/download images (charts + tables) from the research papers.

Image set

The bar plots used are available here: https://drive.google.com/drive/u/1/folders/154sgx3M49NoKOoOjoppsSuvqd2WzqZqX

Chart classification (Accuracy: 84.08%)

Data preparation

Step 1: The google_images_download Python module is used to download Google Images results for each type of chart (area chart, line chart, bar plot, pie chart, Venn diagram, etc.) based on their corresponding keywords.

$ git clone https://github.com/Joeclinton1/google-images-download.git
$ cd google-images-download && python setup.py install

Step 2: The downloaded images are carefully reviewed and the incorrect images are removed.

The training data and model files are available at the following locations.
training corpus: https://drive.google.com/drive/u/1/folders/1M8kwdQE7bpjpdT08ldBURFdzLaQR9n5h
model: https://drive.google.com/drive/u/1/folders/1GVW_MtFFYT-Tj44p0_QLKM7hVnn_AcKI

Below is the count of images for each type:

Plot type        Count
BarGraph         528
VennDiagram      364
PieChart         355
ScatterGraph     335
TreeDiagram      297
FlowChart        293
Map              276
ParetoChart      329
BubbleChart      311
LineGraph        300
AreaGraph        299
NetworkDiagram   321
BoxPlot          312

Training phase:

A pretrained VGG19 model is trained on these images and then run on the test images to classify them into 13 different types, such as bar chart, line graph, and pie chart.

The accuracy is calculated using stratified five-fold cross-validation. The accuracy of the model is 84.08%. The following are the training accuracy and loss curves captured during the training phase for each fold of cross-validation.

Results (predictions on test data)

The following are 100 randomly picked images that were predicted as bar plots. The highlighted images (6 out of the 100) are incorrectly classified as bar plots.

Axes Detection (Accuracy: 80.22%) [1006/1254 correct]

  1. First, the image is converted into a binary (black-and-white) image, and the longest run of consecutive 1s along each row and each column is obtained.
  2. Next, across all columns, the maximum of these longest runs is picked.
  3. A threshold of 10 is assumed, and the first column whose longest run of 1s falls in the region [max - threshold, max + threshold] is taken as the y-axis.
  4. A similar approach is followed for the x-axis, except that the last row whose longest run of 1s falls in the region [max - threshold, max + threshold] is picked.
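
The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the repository's code: the function names `longest_run` and `detect_axes` are hypothetical, and the sketch assumes the binary image is a 2-D list in which 1 marks a dark (ink) pixel and 0 marks background. Since no run can exceed the maximum, the region [max - threshold, max + threshold] reduces to "at least max - threshold".

```python
def longest_run(values):
    """Return the length of the longest run of consecutive 1s."""
    best = current = 0
    for v in values:
        current = current + 1 if v == 1 else 0
        best = max(best, current)
    return best


def detect_axes(binary, threshold=10):
    """Return (x_axis_row, y_axis_col) for a 2-D list of 0/1 pixels.

    Assumption: 1 marks a dark (ink) pixel and 0 marks background.
    """
    n_rows, n_cols = len(binary), len(binary[0])
    row_runs = [longest_run(row) for row in binary]
    col_runs = [longest_run([binary[r][c] for r in range(n_rows)])
                for c in range(n_cols)]

    # y-axis: FIRST column whose longest run is within `threshold` of the max.
    col_max = max(col_runs)
    y_axis_col = next(c for c, run in enumerate(col_runs)
                      if run >= col_max - threshold)

    # x-axis: LAST row whose longest run is within `threshold` of the max.
    row_max = max(row_runs)
    x_axis_row = max(r for r, run in enumerate(row_runs)
                     if run >= row_max - threshold)
    return x_axis_row, y_axis_col
```

Picking the first qualifying column and the last qualifying row favours the left-most and bottom-most long lines, which is where the axes usually sit in a bar plot.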

Results

Both x and y axes are detected correctly for 1006 images out of 1254 images (test data set). Below are some of the failed cases in axes detection.

Text detection

AWS Rekognition is used to detect text in the image through its DetectText API. Only text detections with confidence >= 80 are considered.
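
As a rough sketch of the confidence filter, the snippet below works on the dictionary returned by DetectText, whose `TextDetections` entries each carry a `Confidence` value. The helper name `confident_text` and the file name in the commented call site are hypothetical; the actual AWS call needs boto3 and credentials, so it is shown only as a comment.

```python
def confident_text(response, min_confidence=80.0):
    """Keep the detections whose confidence is at least `min_confidence`."""
    return [d for d in response.get("TextDetections", [])
            if d["Confidence"] >= min_confidence]


# Hypothetical call site (needs boto3 and AWS credentials, not run here):
#   import boto3
#   client = boto3.client("rekognition")
#   with open("chart.png", "rb") as f:
#       response = client.detect_text(Image={"Bytes": f.read()})
#   detections = confident_text(response)
```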

Double-pass algorithm for text detection

To improve text detection, a double-pass algorithm is employed.

  1. Run text detection using the detect_text AWS Rekognition API, and keep only the text boxes for which confidence >= 80.
  2. Fill the polygons corresponding to this text with white.
  3. Run text detection (2nd pass) on the new image, and again keep only the boxes with confidence >= 80.
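
The control flow of the double pass might look like the sketch below. It is an illustration only: `double_pass`, `detect`, and `whiten` are hypothetical names. `detect` stands in for a wrapper around the Rekognition call that returns the list of text detections, and `whiten` stands in for a polygon-filling routine (for example one built on OpenCV's `cv2.fillPoly`); both are passed in so the flow can be read and tested without AWS access.

```python
def double_pass(image, detect, whiten, min_confidence=80.0):
    """Run two detection passes, masking first-pass text before the second.

    `detect(image)` is assumed to return a list of detection dicts with
    "Confidence" and "Geometry" -> "Polygon" keys; `whiten(image, polygons)`
    is assumed to return a copy of the image with those polygons filled white.
    """
    # Pass 1: keep only confident detections.
    first = [d for d in detect(image) if d["Confidence"] >= min_confidence]

    # Whiten what was already found so it cannot mask nearby text.
    masked = whiten(image, [d["Geometry"]["Polygon"] for d in first])

    # Pass 2: detect again on the masked image with the same cut-off.
    second = [d for d in detect(masked) if d["Confidence"] >= min_confidence]
    return first + second
```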

Bounding Box calculation

The bounding box returned for vertical or angled text is unreliable. Therefore, the bounding box is calculated from the polygon coordinates (vertices) in the AWS Rekognition output.
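
A minimal sketch of that calculation is shown below. Rekognition reports polygon vertices as `X`/`Y` ratios of the image width and height, so they are scaled to pixels before taking the extremes. The function name `polygon_to_bbox` and the `(x, y, width, height)` return layout are choices made for this example.

```python
def polygon_to_bbox(polygon, image_width, image_height):
    """Axis-aligned (x, y, width, height) box enclosing a Rekognition polygon.

    `polygon` is a list of {"X": ratio, "Y": ratio} vertices.
    """
    xs = [p["X"] * image_width for p in polygon]
    ys = [p["Y"] * image_height for p in polygon]
    left, top = min(xs), min(ys)
    return (round(left), round(top),
            round(max(xs) - left), round(max(ys) - top))
```

For a rotated label the four vertices form a tilted quadrilateral, and the box above is simply the smallest upright rectangle that contains it.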

Label Detection

X-labels:

  1. Keep only the text boxes that are below the x-axis (and to the right of the y-axis).
  2. Run a sweeping line from the x-axis (detected by the axes detection algorithm) to the bottom of the image, and find the position where the sweeping line intersects the maximum number of text boxes.
  3. The text boxes at this maximum intersection are the bounding boxes of the x-labels.
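
The sweeping-line idea can be sketched as below, assuming each text box is an `(x, y, width, height)` tuple in pixel coordinates with the origin at the top-left of the image. The function name `sweep_x_labels` is hypothetical, and ties are resolved in favour of the line closest to the x-axis, which is an assumption of this sketch.

```python
def sweep_x_labels(boxes, x_axis_row, y_axis_col, image_height):
    """Return the text boxes crossed by the busiest horizontal line.

    Boxes are (x, y, width, height) tuples; y grows downwards.
    """
    # Step 1: keep boxes below the x-axis and to the right of the y-axis.
    candidates = [b for b in boxes
                  if b[1] > x_axis_row and b[0] > y_axis_col]

    # Step 2: sweep a horizontal line from the x-axis to the image bottom.
    best = []
    for line in range(x_axis_row, image_height):
        hits = [b for b in candidates if b[1] <= line <= b[1] + b[3]]
        if len(hits) > len(best):  # strict '>' keeps the first maximum
            best = hits

    # Step 3: the boxes at the maximum intersection are the x-labels.
    return best
```

The same routine, rotated by 90 degrees, covers the vertical sweeps used for the other label types.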

X-text

  1. Keep only the text boxes that are below the x-labels.
  2. Run a sweeping line from the x-labels to the bottom of the image, and find the position where the sweeping line intersects the maximum number of text boxes.
  3. The text boxes at this maximum intersection are the bounding boxes of the x-text.

Y-labels:

  1. Keep only the text boxes that are to the left of the y-axis.
  2. Run a sweeping line from the y-axis towards the left, and find the position where the sweeping line intersects the maximum number of text boxes.
  3. Pick the text boxes at this maximum intersection, and filter them further using a Python regex to keep only numeric values.
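
The numeric filter in step 3 could look like the following. The exact pattern used by the project is not given here, so the regex below is an assumption: it accepts optionally signed integers and decimals and rejects everything else, including common OCR confusions such as a letter "O" in place of a zero.

```python
import re

# Assumed pattern: optional sign, then digits with an optional decimal part.
NUMERIC = re.compile(r"^[-+]?(?:\d+(?:\.\d*)?|\.\d+)$")


def numeric_labels(texts):
    """Keep only the strings that look like plain numbers."""
    return [t for t in texts if NUMERIC.match(t.strip())]
```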

Y-text:

  1. Filter the text boxes which are to the left of y-axis.
  2. The remaining text boxes that are not classified as y-labels will be considered as y-text.

Legend detection

  1. Keep only the text boxes that are above the x-axis and to the right of the y-axis.
  2. Clean the text to remove 'I'. These appear because error bars in the charts are detected as 'I' by the AWS Rekognition OCR API.
  3. Use an appropriate regex to discard numerical values. These are mostly the values printed on top of the bars.
  4. Merge the remaining text boxes (with an x-value threshold of 10) so that every multi-word legend ends up in a single bounding box.
  5. Since legends can be grouped horizontally or vertically, two sweeping lines (in the x and y directions) are needed to detect them.
  6. Run a sweeping line from the y-axis towards the right, and find the position where the sweeping line intersects the maximum number of text boxes.
  7. Repeat Step 6 with a sweeping line from the x-axis towards the top of the image, and find the position where it intersects the maximum number of text boxes.
  8. The maximum intersection obtained from Steps 6 and 7 combined gives the bounding boxes for all the legends.
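
Steps 2 to 4 (cleaning and merging) can be sketched as follows. This is an illustration under stated assumptions: each item is a `(text, (x, y, width, height))` pair, the numeric regex is a guess at "an appropriate regex", and two boxes are merged when they overlap vertically and the horizontal gap between them is at most 10 pixels. The name `clean_and_merge` is hypothetical.

```python
import re

# Assumed pattern for bar-value annotations such as "42.5" or "12%".
NUMERIC = re.compile(r"^[-+]?(?:\d+(?:\.\d*)?|\.\d+)%?$")


def clean_and_merge(items, x_gap=10):
    """Drop 'I' and numeric boxes, then merge neighbouring words.

    `items` is a list of (text, (x, y, width, height)) pairs.
    """
    kept = [(t, b) for t, b in items
            if t.strip() != "I" and not NUMERIC.match(t.strip())]

    merged = []
    # Process left to right so each word can attach to the phrase before it.
    for text, box in sorted(kept, key=lambda item: item[1][0]):
        x, y, w, h = box
        for i, (ptext, (px, py, pw, ph)) in enumerate(merged):
            same_row = y < py + ph and py < y + h   # vertical overlap
            gap = x - (px + pw)                     # horizontal distance
            if same_row and abs(gap) <= x_gap:
                top, bottom = min(py, y), max(py + ph, y + h)
                merged[i] = (ptext + " " + text,
                             (px, top, x + w - px, bottom - top))
                break
        else:
            merged.append((text, box))
    return merged
```

The merged boxes would then be passed to the two sweeping lines of Steps 5 to 8.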

Data extraction

Value-tick ratio calculation:

This ratio is used to calculate the y-values of each bar plot using the pixel projection method. Y-axis ticks are located using the text bounding boxes to the left of the y-axis.

Since detection of the numeric text isn't perfect, once the pixel positions of the ticks and the actual y-label texts are obtained, outliers are removed by assuming a normal distribution and discarding values that deviate strongly from the mean. Then the mean pixel distance between the ticks is calculated, followed by the mean value of the actual y-label ticks. Finally, the value-tick ratio is calculated from these two means.
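
One plausible reading of this calculation is sketched below; it is an assumption, not the project's confirmed formula. Here the ratio is taken as the mean difference between consecutive y-label values divided by the mean pixel distance between consecutive ticks, and outliers are values more than `k` standard deviations from the mean. The function names and the default `k=1.5` are hypothetical.

```python
from statistics import mean, pstdev


def drop_outliers(values, k=1.5):
    """Discard values more than k standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)
    return [v for v in values if abs(v - mu) <= k * sigma]


def value_tick_ratio(tick_rows, tick_values, k=1.5):
    """Data units per pixel, from tick pixel rows and their numeric labels.

    Assumed formula: mean label step / mean pixel step between ticks.
    """
    pairs = sorted(zip(tick_rows, tick_values))      # top of image first
    pixel_steps = [b[0] - a[0] for a, b in zip(pairs, pairs[1:])]
    value_steps = [abs(b[1] - a[1]) for a, b in zip(pairs, pairs[1:])]
    return mean(drop_outliers(value_steps, k)) / mean(drop_outliers(pixel_steps, k))
```

With ticks every 40 pixels labelled 0, 10, 20, and so on, the ratio is 0.25 units per pixel, so a bar 120 pixels tall would map to a value of 30. A single misread label (for example 500 instead of 50) produces one abnormal step, which the outlier filter discards.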

Pattern (or color) estimation

  1. As an initial step, all the bounding boxes for the text in the image are whitened.
  2. Convert the resulting image into a binary image.
  3. Find contours (and bounding rectangles) in the resulting image.
  4. For each legend, find the nearest bounding box to its left and at the same height as the legend.
  5. Find the major color (or pattern) from the nearest bounding box obtained for each legend in Step 4.
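
Steps 4 and 5 can be illustrated with the sketch below. It assumes the image is a 2-D list of RGB tuples, that boxes are `(x, y, width, height)` tuples, and that "same height" means the legend's vertical centre falls inside the candidate box. Both function names are hypothetical.

```python
from collections import Counter


def nearest_left_box(legend_box, boxes):
    """Closest box lying entirely to the left of the legend, on its row."""
    lx, ly, lw, lh = legend_box
    centre = ly + lh / 2
    left = [b for b in boxes
            if b[0] + b[2] <= lx and b[1] <= centre <= b[1] + b[3]]
    # The nearest one is the box whose right edge is furthest to the right.
    return max(left, key=lambda b: b[0] + b[2], default=None)


def major_color(image, box, background=(255, 255, 255)):
    """Most frequent non-background colour inside `box`, or None."""
    x, y, w, h = box
    counts = Counter(image[r][c]
                     for r in range(y, y + h)
                     for c in range(x, x + w)
                     if image[r][c] != background)
    return counts.most_common(1)[0][0] if counts else None
```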

Getting bar plot for each legend

  1. All the pixel values of the image are divided into clusters. Prior to clustering, all the white pixels are removed, and the bounding boxes found by the above procedure for each legend are whitened.
  2. The number of clusters is determined by the number of legends detected. The colors finalized in the above procedure form the initial clusters.
  3. The given plot is then simplified into multiple plots, one per cluster, each of which is a simple bar plot. That is, clustering converts a stacked bar chart into multiple simpler bar plots.
  4. The contours of each plot are obtained, and subsequently the bounding rectangles of those contours.
  5. For each label, the closest bounding rectangle is picked.
  6. The height of each bounding box is recorded with the help of the merged rectangles obtained by the above procedure. The value-tick ratio is then used to calculate the y-values from these heights.
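
The splitting step can be sketched as below. This is a simplification, not the project's method: instead of a full clustering run it performs only the initial assignment, sending each non-white pixel to the nearest legend colour, and instead of contours it measures a bar's height by counting mask pixels in one column. The image is assumed to be a 2-D list of RGB tuples, and all function names are hypothetical.

```python
def nearest_color_index(pixel, centres):
    """Index of the centre with the smallest squared RGB distance."""
    return min(range(len(centres)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(pixel, centres[i])))


def split_by_legend(image, legend_colors, background=(255, 255, 255)):
    """Return one 0/1 mask per legend colour (one simple bar plot each)."""
    masks = [[[0] * len(row) for row in image] for _ in legend_colors]
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            if pixel == background:
                continue  # white pixels are removed before clustering
            masks[nearest_color_index(pixel, legend_colors)][r][c] = 1
    return masks


def bar_height(mask, col):
    """Pixel height of the bar in column `col` of a single-series mask."""
    return sum(mask[r][col] for r in range(len(mask)))
```

Multiplying `bar_height(...)` by the value-tick ratio gives the bar's value in data units; for a stacked chart each mask yields the height of one segment.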

The data extraction results on a sample image are shown below.

Reporting results

The results (axes, legends, labels, values, captions and file names) are written to an Excel sheet.
