Date of Completion

2-25-2020

Embargo Period

8-23-2020

Keywords

Apache Spark, Performance Modeling, Performance Prediction, Performance Interference, Job Scheduling, Straggler, Performance Optimization, Resource Allocation

Major Advisor

Mohammad Maifi Hasan Khan

Associate Advisor

Swapna Gokhale

Associate Advisor

Song Han

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

Software service providers are increasingly adopting cloud-based solutions to maximize resource utilization while minimizing operating cost. Performance predictability is becoming paramount as such systems are increasingly used in safety-critical settings (e.g., IoT applications, infrastructure monitoring). However, large scale, a high degree of concurrency, and dynamic allocation of resources make traditional performance modeling and tuning frameworks, which are not readily extendable, ill-suited to these systems. To address this challenge, this thesis focuses on developing a data-driven performance modeling framework. First, hierarchical performance models are developed that can accurately predict the execution time of a given job based on limited-scale execution data. Subsequently, the models are extended to account for the underlying interactions among multiple jobs and to predict the execution time of a job when it experiences interference from other jobs. The extended models are then leveraged to design and implement a dynamic job scheduler that can automatically predict potential interference and reschedule the affected jobs to significantly reduce interference and job execution time. Second, analytical models are developed to predict the likelihood of suboptimal performance caused by inefficient partitioning of input data and/or skewed task distribution across worker nodes, and to recommend ways to address the identified problems by repartitioning the input data (in the case of the task straggler problem) and/or changing the locality configuration setting (in the case of the skewed task distribution problem).
Finally, the thesis addresses dynamic allocation of computing resources for cloud platforms, leveraging kernel-level, application-specific resource usage metrics to allocate resources dynamically, improving application performance while significantly reducing resource requirements compared to static resource allocation strategies. The effectiveness of the approach is evaluated on a real cluster using Apache Spark jobs and is presented in the thesis. We believe that the presented approach will guide future research and help improve resource utilization while significantly reducing operating costs in cloud settings.

Available for download on Sunday, August 23, 2020
