Authors

Chun Wang

Date of Completion

5-10-2016

Embargo Period

5-10-2017

Keywords

Big Data; Online Updating; Estimating Equation; Added Variable

Major Advisor

Jun Yan

Associate Advisor

Elizabeth Schifano

Associate Advisor

Ming-Hui Chen

Field of Study

Statistics

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard analytic tools. They present opportunities as well as challenges to statisticians. This dissertation first summarizes recent methodological and software developments in statistics that address big data challenges. It then presents online updating methods: statistical methods for big data arising from online analytical processing, where large amounts of data arrive in streams and require fast analysis without storing or accessing the historical data. In particular, iterative estimating algorithms and statistical inference procedures are developed for linear models and estimating equations that update as new data arrive. These algorithms are computationally efficient, minimally storage-intensive, and allow for possible rank deficiencies in the subset design matrices due to rare-event covariates. Goodness-of-fit tests, model diagnostics, and variable selection criteria are also developed under the same framework. When new variables become available, a method is presented that uses the information from earlier data in the online updating algorithm, with corrections to reduce bias and improve efficiency.
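
As a rough illustration of the online updating idea for linear models, the sketch below accumulates the cross-products X'X and X'y block by block, so each block can be discarded after it is processed. It is a generic textbook construction, not the dissertation's algorithm: the class name OnlineLeastSquares, its methods, and the use of a pseudoinverse to tolerate rank deficiency are all assumptions made for this example.

```python
import numpy as np


class OnlineLeastSquares:
    """Hypothetical helper: least squares from cumulative sufficient statistics."""

    def __init__(self, n_features):
        # Only a p x p matrix and a length-p vector are stored,
        # regardless of how many observations have been seen.
        self.xtx = np.zeros((n_features, n_features))
        self.xty = np.zeros(n_features)
        self.n_obs = 0

    def update(self, X, y):
        """Fold one block of data into the running sums, then forget the block."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.xtx += X.T @ X
        self.xty += X.T @ y
        self.n_obs += X.shape[0]
        return self

    def coef(self):
        """Current estimate. The pseudoinverse (an assumption of this sketch)
        keeps the call well defined even if the accumulated X'X is singular."""
        return np.linalg.pinv(self.xtx) @ self.xty


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 3))
    X[:100, 2] = 0.0  # a "rare-event" covariate that is all zero in the first block
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

    model = OnlineLeastSquares(n_features=3)
    for start in range(0, 300, 100):
        model.update(X[start:start + 100], y[start:start + 100])
    print(model.coef())
```

In this sketch the first block alone has a rank-deficient design matrix because its third column is all zero, yet the update is still well defined since no block-level inverse is ever taken. Once all blocks are processed, the estimate agrees with an ordinary least squares fit on the pooled data.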
