Automatic Identification and Data Extraction from 2-Dimensional Plots in Digital Documents

09/10/2008
by William Brouwer, et al.

Most search engines index the textual content of documents in digital libraries. However, scholarly articles frequently report important findings in figures for visual impact, and the contents of these figures are not indexed. These contents are often invaluable to researchers in various fields, who want to compare them directly with their own work. Searching for figures and extracting figure data are therefore important problems. To the best of our knowledge, no tool exists that automatically extracts data from figures in digital documents. If data can be extracted from these images automatically and stored in a database, an end user can query and combine data from multiple digital documents simultaneously and efficiently. We propose a framework based on image analysis and machine learning that extracts information from 2-D plot images and stores it in a database. The proposed algorithm identifies a 2-D plot and extracts its axis labels, legend, and data points. We also separate overlapping shapes that correspond to different data points. We demonstrate the performance of the individual algorithms using a combination of generated and real-life images.
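
The abstract does not describe the database design, so the following is only a minimal sketch of the "store extracted data, then query across documents" idea, using Python's standard sqlite3 module. The table name plot_points, its columns, and the helper functions are hypothetical choices made for illustration, and the sample values are invented; none of this is the authors' implementation.

```python
import sqlite3


def build_store(conn):
    """Create a hypothetical table holding one row per extracted data point."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS plot_points (
               doc_id    TEXT,  -- source document identifier (assumed)
               figure_id TEXT,  -- figure within the document (assumed)
               series    TEXT,  -- legend entry for the data series
               x_label   TEXT,  -- extracted x-axis label
               y_label   TEXT,  -- extracted y-axis label
               x         REAL,
               y         REAL
           )"""
    )


def add_points(conn, doc_id, figure_id, series, x_label, y_label, points):
    """Insert the (x, y) pairs extracted for one series of one plot."""
    conn.executemany(
        "INSERT INTO plot_points VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(doc_id, figure_id, series, x_label, y_label, x, y) for x, y in points],
    )


def query_by_axes(conn, x_label, y_label):
    """Return points from all documents whose plots share the given axis labels."""
    cur = conn.execute(
        "SELECT doc_id, series, x, y FROM plot_points "
        "WHERE x_label = ? AND y_label = ? ORDER BY doc_id, x",
        (x_label, y_label),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_store(conn)
    # Invented values standing in for the output of a plot-extraction step.
    add_points(conn, "docA", "fig1", "sample 1",
               "Temperature (K)", "Resistance (ohm)", [(10.0, 1.5), (20.0, 2.5)])
    add_points(conn, "docB", "fig3", "run 2",
               "Temperature (K)", "Resistance (ohm)", [(15.0, 2.0)])
    for row in query_by_axes(conn, "Temperature (K)", "Resistance (ohm)"):
        print(row)
```

Matching plots on exact axis-label strings is a simplification; labels extracted from real figures would likely need normalization (units, spelling, case) before data from different documents could be combined reliably.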
