Balmin, Andrey

Structured, unstructured, and semistructured search in semistructured databases

2006

Abstract

A single framework for storing and querying XML data, based on denormalized schema decompositions, can support both structured queries and unstructured searches, and can serve as a foundation for combining the two forms of information access. The XML data format is becoming increasingly popular in applications that mix structured data and unstructured text. These applications require the integration of structured query and text search mechanisms to access XML data. First, we introduce a framework for storing and querying XML data using denormalized schema decompositions. This framework was initially implemented in the XCacheDB XML database system, which uses XML schemas to shred XML data into relational storage. XCacheDB supports a subset of the XQuery language and emphasizes query optimization to reduce latency and output the first results quickly. XCacheDB relies on XML schemas, which poses a novel challenge for validating XML updates. We investigate the incremental validation of XML documents with respect to DTDs and XML Schemas. We exhibit an O(m log n) algorithm using an auxiliary structure of size O(n), where n is the size of the document and m is the number of updates. We exhibit a restricted class of DTDs, called "local", that arise commonly in practice and for which incremental validation can be done in practically constant time by maintaining only a list of counters. We present implementations and experimental evaluations of both general incremental validation and local validation in the XCacheDB system. We then present the XKeyword system, which uses a variation of the XCacheDB schema decompositions to support keyword proximity searches in XML databases. XKeyword decompositions include "ID relations", which store IDs of target objects and pre-compute common joins. Finally, we present the architecture of the Semi-Structured Search System (S4), designed to bridge the gap between traditional database and information retrieval systems. The S4QL query language combines features of structured queries and text search to facilitate information discovery without knowledge of the schema. S4 is based on the same schema decomposition framework as XCacheDB and XKeyword. However, the combination of structured and unstructured query features poses novel challenges to efficient query processing. We outline these issues and possible ways of addressing them.

UC San Diego
