Application-Centric, Reliable High Performance Computing

Enterprise workloads (e.g., search and encryption) and mission-critical scientific simulations (e.g., climate simulation and fluid dynamics simulation) running on large-scale parallel systems are jeopardized by the growing number of faults and errors in hardware and software. Understanding the vulnerability of these large-scale applications is important for minimizing the performance and power costs of protecting them. Lack of knowledge about application vulnerability is a major bottleneck for execution efficiency and jeopardizes HPC simulation capabilities. Previous work relies on random fault injection or detailed architecture analysis to evaluate application vulnerability, and both approaches can be slow and inaccurate. There is a big gap between the needs of reliable and efficient HPC and what current methodologies can provide.
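
For readers unfamiliar with random fault injection, the idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling used in prior work or in this project: the kernel (a dot product), the single-bit-flip fault model, the outcome categories, and the names `flip_bit` and `run_campaign` are all hypothetical choices made for the example.

```python
import math
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0..63) of its IEEE 754 double encoding flipped."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    return struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))[0]


def dot(xs, ys):
    """Toy kernel standing in for an application (assumption for this sketch)."""
    return sum(x * y for x, y in zip(xs, ys))


def run_campaign(xs, ys, trials=1000, rel_tol=1e-6, seed=0):
    """Inject one random single-bit flip into `xs` per trial and classify the outcome."""
    rng = random.Random(seed)
    golden = dot(xs, ys)  # fault-free reference result
    counts = {"masked": 0, "corrupted": 0, "non_finite": 0}
    for _ in range(trials):
        faulty = list(xs)
        i = rng.randrange(len(faulty))   # random element
        b = rng.randrange(64)            # random bit position
        faulty[i] = flip_bit(faulty[i], b)
        result = dot(faulty, ys)
        if not math.isfinite(result):
            counts["non_finite"] += 1    # fault produced inf or NaN
        elif math.isclose(result, golden, rel_tol=rel_tol):
            counts["masked"] += 1        # fault had no visible effect
        else:
            counts["corrupted"] += 1     # silent change in the output
    return counts


if __name__ == "__main__":
    xs = [float(k + 1) for k in range(16)]
    ys = [1.0 / (k + 1) for k in range(16)]
    print(run_campaign(xs, ys))
```

Each trial re-runs the whole kernel, and many trials are needed before the outcome fractions stabilize. For a real application, where one run may take hours on many nodes, that cost is one reason injection campaigns can be slow.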

This project explores a new methodology for understanding application vulnerability. It investigates new analytical and statistical models that quantify and characterize application vulnerability based on novel metrics and application semantics (including algorithm semantics and data semantics). We integrate these modeling techniques into a broader context (hardware design and power management) to improve modeling accuracy and to explore reliable and efficient protection for applications, while examining the interplay between reliability, power, and performance.


  • Dong Li (Faculty)
  • Luanzheng Guo (PhD student)
  • Wenqian Dong (PhD student)

Recent Publications:



This research is based on work supported by NSF and collaboration with Lawrence Livermore National Lab and Los Alamos National Lab. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.
