eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Architectures for stateful data-intensive analytics

Abstract

The ability to do rich analytics on massive sets of unstructured data drives the operation of many organizations today and has given rise to a new class of data-intensive computing systems. Many of these analytics are update-driven: they must constantly integrate new data into the analysis, and a fundamental requirement for efficiency is the ability to maintain state. However, current data-intensive computing systems do not directly support stateful analytics, making programming harder and resulting in inefficient processing.

This dissertation proposes that state become a first-class abstraction in data-intensive computing. It introduces stateful groupwise processing, a programming abstraction that integrates data-parallelism and state, allowing sophisticated, easily parallelizable stateful analytics. The explicit modeling of state abstracts the details of state management, making programming easier, and allows the runtime system to optimize state management. This work investigates the use of stateful groupwise processing in two distinct phases of the data management lifecycle: (i) the extraction of data from its sources and online analysis, and (ii) its storage and follow-on analysis. We propose two complementary architectures that manage data in these two phases.

This work proposes In-situ MapReduce (iMR), a model and architecture for efficient online analytics. The iMR model combines stateful groupwise processing with windowed processing for analyzing streams of unstructured data. To allow timely analytics, the iMR model supports reduced data fidelity through partial data processing and introduces a novel metric for the systematic characterization of partial data. For efficiency, the iMR architecture moves the data analysis from dedicated compute clusters onto the sources themselves, avoiding costly data migrations.

Once data are extracted and stored, a fundamental challenge is how to write rich analytics to gain deeper insights from bulk data. This work introduces Continuous Bulk Processing (CBP), a model and architecture for sophisticated dataflows on bulk data. CBP uses stateful groupwise processing as the building block for expressing analytics, lending itself to incremental and iterative analytics. Further, CBP provides primitives for dataflow control that simplify the composition of sophisticated analytics. Leveraging the explicit modeling of state, CBP executes these dataflows in a scalable, efficient, and fault-tolerant manner.
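
As a rough illustration of the general idea behind stateful groupwise processing, the sketch below groups each incoming batch by key and hands every group its previously saved state, so that only new data is processed on each update. It is not taken from the dissertation: the names `run_stateful_groupwise` and `running_sum`, the `(key, value)` record shape, and the integer state are all assumptions made for this example, and it does not reproduce the iMR or CBP interfaces.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

# Hypothetical record shape for this sketch: (key, integer value).
Record = Tuple[Hashable, int]


def run_stateful_groupwise(
    batch: Iterable[Record],
    state: Dict[Hashable, int],
    update: Callable[[Hashable, List[int], int], Tuple[int, int]],
) -> Dict[Hashable, int]:
    """Group a batch by key and apply `update` once per group.

    `update(key, new_values, old_state)` returns (new_state, output).
    `state` is mutated in place so it carries over to the next batch.
    """
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for key, value in batch:
        groups[key].append(value)

    outputs: Dict[Hashable, int] = {}
    # Each group reads and writes only its own key's state, so groups
    # could be processed in parallel; this sketch runs them sequentially.
    for key, values in groups.items():
        new_state, output = update(key, values, state.get(key, 0))
        state[key] = new_state
        outputs[key] = output
    return outputs


def running_sum(key: Hashable, new_values: List[int], old_state: int) -> Tuple[int, int]:
    """Hypothetical user function: fold new values into a per-key total."""
    total = old_state + sum(new_values)
    return total, total


state: Dict[Hashable, int] = {}
first = run_stateful_groupwise([("a", 1), ("b", 2), ("a", 3)], state, running_sum)
second = run_stateful_groupwise([("a", 10)], state, running_sum)
```

In this sketch the second call sees only the new record for key `a` and updates its total from the saved state rather than rescanning the first batch, which is the kind of incremental behavior the abstract attributes to explicit state.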
