An Algebraic Approach to Building Category Parse Trees for Web Tables

Dongpu Jin

Abstract


Web tables are commonly used to convey information on the Internet. A good understanding of web tables’ syntactic structures will not only improve the efficiency of information retrieval, but also provide a variety of angles to view the information. The web tables we study are well-formed and structured in accordance with the common human understanding. A well-formed table is comprised of a column header, a row header, a grid region to store meaningful data, and an empty stub region. The underlying layout and relational structures of a web table are revealed in its headers. Each column/row can be defined uniquely by a header-path that passes through the initial headers. Within the corpus of web tables we studied, most tables have very simple header structures, such as a linear array of symbols. Each symbol defines a particular column or row. However, some header structures are more complex. They consist of multiple layers of symbols and often have a hierarchical relation between each layer. According to a commonly used layout convention, symbols are repeated at lower layers to convey hierarchical relationship with higher layers. These observations motivate us to further explore the structures of web tables. Our goal is to discover more generic and fundamental ways of representing web table headers and improve the performance of information retrieval from web tables. We use an algebraic approach to analyze table headers. The starting point of our analysis is to represent headers using Sum of Product (SOP) equations, where ‘+’ and ‘*’ are redefined to correspond to spatial relationship between the header cells. Inspired by Boolean algebraic expression simplification techniques, we decomposed headers using a logic synthesis tool. As a result, header symbols are gathered into groups, which can be thought of as basic building blocks. We extract each group from the equation and call it a category. Utilizing the Shunting-Yard Algorithm and the Reverse Polish Notation, we build a parse tree representation of each category according to a specific grammar. A simple tree traversal on category parse trees produces all the header-paths. If we traverse in another order, we generate a different set of header-paths. This allows us not only to retrieve information more efficiently, but also view table information in a variety of ways. We processed 200 well-formed web tables through our system and generated production rules, SOP equations, RPNs and parse trees for all 200 tables.

Keywords


web tables; algebraic analysis; parse tree

Full Text:

PDF

Refbacks

  • There are currently no refbacks.


The Proceedings is produced as a service of UNC Asheville.