Detecting Layout Templates in Complex Multiregion Files

by Gerardo Vitagliano, et al.

Spreadsheets are among the most commonly used file formats for data management, distribution, and analysis. Their widespread use makes it easy to gather large collections of data, but their flexible canvas-based structure makes automated analysis difficult without heavy preparation. One common problem that practitioners face is the presence of multiple independent regions in a single spreadsheet, possibly separated by runs of empty cells. We define such files as "multiregion" files. In collections of spreadsheets, we observe that some files share the same layout. We present the Mondrian approach to automatically identify layout templates across multiple files and systematically extract the corresponding regions. Our approach is composed of three phases: first, each file is rendered as an image and inspected for elements that could form regions; then, using a clustering algorithm, the identified elements are grouped to form regions; finally, every file layout is represented as a graph and compared with the others to find layout templates. We compare our method to state-of-the-art table recognition algorithms on two corpora of real-world enterprise spreadsheets. Our approach shows the best performance in detecting reliable region boundaries within each file and can correctly identify recurring layouts across files.
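
To make the notion of "elements that could form regions" more concrete, the following is a minimal sketch of one simple way to find candidate elements in a sheet. It is not the Mondrian implementation: it assumes the sheet is already loaded as a rectangular grid of cell values, and it substitutes a plain connected-component search over non-empty cells for the image-based inspection described above. The function name `find_elements` and the grid representation are hypothetical, chosen only for illustration.

```python
from collections import deque


def find_elements(grid):
    """Return bounding boxes (top, left, bottom, right) of groups of
    non-empty cells that touch horizontally or vertically.

    Assumption: `grid` is a rectangular list of rows; a cell is empty
    when it is None or the empty string.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []

    def is_filled(y, x):
        return grid[y][x] not in (None, "")

    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or not is_filled(r, c):
                continue
            # Breadth-first search over the 4-connected neighbourhood,
            # tracking the extent of the component as we go.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and is_filled(ny, nx)):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# A toy sheet with three separate groups of filled cells.
sheet = [
    ["a", "b", "", "x"],
    ["c", "d", "", "y"],
    ["",  "",  "", ""],
    ["t", "",  "", ""],
]
elements = find_elements(sheet)
print(elements)
```

In a pipeline like the one the abstract describes, bounding boxes of this kind would only be the input to the later phases: a clustering step would still have to merge them into regions, and the resulting layouts would then be compared across files.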




