• search hit 4 of 15
Back to Result List

Detecting layout templates in complex multiregion files

  • Spreadsheets are among the most commonly used file formats for data management, distribution, and analysis. Their widespread employment makes it easy to gather large collections of data, but their flexible canvas-based structure makes automated analysis difficult without heavy preparation. One of the common problems that practitioners face is the presence of multiple, independent regions in a single spreadsheet, possibly separated by repeated empty cells. We define such files as "multiregion" files. In collections of various spreadsheets, we can observe that some share the same layout. We present the Mondrian approach to automatically identify layout templates across multiple files and systematically extract the corresponding regions. Our approach is composed of three phases: first, each file is rendered as an image and inspected for elements that could form regions; then, using a clustering algorithm, the identified elements are grouped to form regions; finally, every file layout is represented as a graph and compared with others toSpreadsheets are among the most commonly used file formats for data management, distribution, and analysis. Their widespread employment makes it easy to gather large collections of data, but their flexible canvas-based structure makes automated analysis difficult without heavy preparation. One of the common problems that practitioners face is the presence of multiple, independent regions in a single spreadsheet, possibly separated by repeated empty cells. We define such files as "multiregion" files. In collections of various spreadsheets, we can observe that some share the same layout. We present the Mondrian approach to automatically identify layout templates across multiple files and systematically extract the corresponding regions. Our approach is composed of three phases: first, each file is rendered as an image and inspected for elements that could form regions; then, using a clustering algorithm, the identified elements are grouped to form regions; finally, every file layout is represented as a graph and compared with others to find layout templates. We compare our method to state-of-the-art table recognition algorithms on two corpora of real-world enterprise spreadsheets. Our approach shows the best performances in detecting reliable region boundaries within each file and can correctly identify recurring layouts across files.show moreshow less

Export metadata

Additional Services

Search Google Scholar Statistics
Metadaten
Author details:Gerardo VitaglianoORCiDGND, Lan JiangORCiDGND, Felix NaumannORCiDGND
DOI:https://doi.org/10.14778/3494124.3494145
ISSN:2150-8097
Title of parent work (English):Proceedings of the VLDB Endowment
Publisher:Association for Computing Machinery
Place of publishing:New York
Publication type:Article
Language:English
Date of first publication:2022/02/04
Publication year:2021
Release date:2023/01/05
Volume:15
Issue:3
Number of pages:13
First page:646
Last Page:658
Funding institution:HPI research school on Data Science and Engineering
Organizational units:An-Institute / Hasso-Plattner-Institut für Digital Engineering gGmbH
DDC classification:0 Informatik, Informationswissenschaft, allgemeine Werke / 00 Informatik, Wissen, Systeme / 004 Datenverarbeitung; Informatik
Peer review:Referiert
Accept ✔
This website uses technically necessary session cookies. By continuing to use the website, you agree to this. You can find our privacy policy here.