Enterprise workloads (e.g., search and encryption) and mission-critical scientific simulations (e.g., climate and fluid-dynamics simulations) running on large-scale parallel systems are jeopardized by the growing rate of faults and errors in hardware and software. Understanding the vulnerability of these large-scale applications is important for minimizing the performance and power costs of protecting them. A lack of knowledge about application vulnerability is a major bottleneck for execution efficiency and jeopardizes HPC simulation capabilities. Previous work relies on random fault injection or detailed architectural analysis to evaluate application vulnerability; both can be slow and inaccurate. There is a large gap between the needs of reliable, efficient HPC and what current methodologies can provide.
This project explores a new methodology for understanding application vulnerability. It investigates new analytical and statistical models that quantify and characterize application vulnerability based on novel metrics and application semantics (including algorithm semantics and data semantics). We integrate these modeling techniques into a broader context (hardware design and power management) to improve modeling accuracy, explore reliable and efficient protection for applications, and examine the interplay between reliability, power, and performance.
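To make the baseline concrete: the random fault injection mentioned above typically flips a randomly chosen bit in application state and checks whether the output is corrupted. The following is a minimal, self-contained sketch of such a campaign in Python; the names `flip_random_bit` and `fault_injection_campaign` are illustrative, not part of any project tool, and a real campaign would inject into a running binary rather than a Python list. The sketch also hints at why this approach is slow: the estimate only converges over many trials.

```python
import random
import struct

def flip_random_bit(value: float) -> float:
    """Flip one randomly chosen bit in the IEEE-754 double encoding of value."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    corrupted = bits ^ (1 << random.randrange(64))
    return struct.unpack("<d", struct.pack("<Q", corrupted))[0]

def fault_injection_campaign(kernel, inputs, golden, trials=1000, tol=1e-6):
    """Estimate the fraction of single-bit input faults that silently
    corrupt the kernel's output (silent data corruption, SDC).

    Each trial flips one bit in one randomly chosen input element,
    reruns the kernel, and compares against the fault-free result.
    """
    sdc = 0
    for _ in range(trials):
        faulty = list(inputs)
        idx = random.randrange(len(faulty))
        faulty[idx] = flip_random_bit(faulty[idx])
        try:
            result = kernel(faulty)
        except (OverflowError, ValueError):
            continue  # fault caused a detectable failure, not an SDC
        # "not (<= tol)" also counts NaN results as corruption
        if not (abs(result - golden) <= tol):
            sdc += 1
    return sdc / trials
```

For example, estimating the SDC rate of a simple reduction: `fault_injection_campaign(sum, [1.0] * 64, golden=64.0)` returns an estimate between 0 and 1, and its variance shrinks only as the number of trials grows, which is one reason purely statistical injection can be expensive for large applications.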
- Dong Li (Faculty)
- Luanzheng Guo (PhD student)
- Wenqian Dong (PhD student)
- [SC'16] Luanzheng Guo, Jing Liang, and Dong Li. "Understanding Ineffectiveness of Application-Level Fault Injection". Poster at the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2016 (nominated for Best Poster; 2.9% of all poster submissions).
- [SC'14] Li Yu, Dong Li, Sparsh Mittal, and Jeffrey S. Vetter. "Quantitatively Modeling Application Resiliency with the Data Vulnerability Factor". In the 26th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2014 (acceptance rate: 21%; nominated for Best Student Paper).
- [SC'13] Dong Li, Zizhong Chen, Panruo Wu, and Jeffrey S. Vetter. "Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach". In the 25th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2013 (acceptance rate: 20%).
- [SC'12] Dong Li, Jeffrey S. Vetter, and Weikuan Yu. "Classifying Soft Error Vulnerabilities in Extreme-Scale Scientific Applications Using a Binary Instrumentation Tool". In the 24th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 (acceptance rate: 21%).
This research is based on work supported by NSF and on collaboration with Lawrence Livermore National Laboratory and Los Alamos National Laboratory. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.