Date of Completion
Big Data; Online Updating; Estimating Equation; Added Variable
Field of Study
Doctor of Philosophy
Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard analytic tools. They present opportunities as well as challenges to statisticians. This dissertation first summarizes recent methodological and software developments in statistics that address big data challenges. It then presents online updating methods: statistical methods for big data arising from online analytical processing, where large amounts of data arrive in streams and require fast analysis without storing or accessing the historical data. In particular, iterative estimating algorithms and statistical inferences are developed for linear models and estimating equations that update as new data arrive. These algorithms are computationally efficient, minimally storage-intensive, and allow for possible rank deficiencies in the subset design matrices due to rare-event covariates. Goodness-of-fit tests, model diagnostics, and variable selection criteria are also developed under the same framework. For the case where new variables become available partway through the stream, a method is presented that uses the information from earlier data in the online updating algorithm, with corrections to reduce bias and improve efficiency.
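As an illustration of the online updating idea for linear models, the following is a minimal sketch and not the dissertation's own algorithm. It uses the standard cumulative form of least squares, in which only the running sums X'X and X'y are kept and each data block is discarded after use. The class name `OnlineLeastSquares`, the simulated data, and the use of a pseudoinverse to tolerate singular matrices are assumptions made for this example.

```python
import numpy as np


class OnlineLeastSquares:
    """Hypothetical helper: keeps running X'X and X'y, never the raw rows."""

    def __init__(self, p):
        self.xtx = np.zeros((p, p))  # running sum of X_k' X_k
        self.xty = np.zeros(p)       # running sum of X_k' y_k
        self.n = 0                   # number of rows seen so far

    def update(self, X, y):
        """Fold one new block (X, y) into the running sums."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.xtx += X.T @ X
        self.xty += X.T @ y
        self.n += X.shape[0]

    def coef(self):
        """Current estimate; pinv is used so a singular X'X does not fail."""
        return np.linalg.pinv(self.xtx) @ self.xty


# Simulated stream (assumed data): intercept, one continuous covariate,
# and one rare-event indicator that is all zero in most blocks.
rng = np.random.default_rng(0)
n, block_size = 2000, 100
rare = np.zeros(n)
rare[::250] = 1.0  # 8 events in 2000 rows
X_all = np.column_stack([np.ones(n), rng.normal(size=n), rare])
beta_true = np.array([1.0, 2.0, -3.0])
y_all = X_all @ beta_true + rng.normal(scale=0.5, size=n)

model = OnlineLeastSquares(p=3)
for start in range(0, n, block_size):
    stop = start + block_size
    # Each block is used once and then dropped.
    model.update(X_all[start:stop], y_all[start:stop])

# Reference fit on the full data, for comparison only.
beta_full, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
print(model.coef())
print(beta_full)
```

In this sketch, blocks whose rare-event column is entirely zero have rank-deficient design matrices, yet they never need to be inverted on their own because only the accumulated sums enter the estimate. Storage is fixed at a p-by-p matrix and a length-p vector regardless of how many blocks arrive.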
Wang, Chun, "Online Updating Methods for Big Data Streams" (2016). Doctoral Dissertations. 1146.