Published 5 months ago by HPC Knowledge Portal

Taxonomy of errors for large-scale HPC/AI environments

Joshua Mora, CEO, System-Stack

Operating a large-scale HPC/AI environment presents many challenges, among them getting the highest ROI (return on investment). A wide range of errors can keep you from achieving a high ROI. This presentation raises awareness and understanding of the different types of errors that one can run into over the 3-5 year life cycle of an HPC/AI environment, and helps you start thinking about avenues for addressing them. It provides a comprehensive classification of errors by source (hardware, software, user), by type (misconfiguration, unplanned downtime, degradation) and by frequency (rare, occasional, frequent), along with corresponding examples. The classification is presented in order of increasing severity and impact on ROI. A reliability metric is also provided to pragmatically assess the likelihood that a set of systems and components does not fail within a window of time (i.e. the duration of a strong-scaling parallel job).

More information:
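The abstract does not spell out the reliability metric, so the following is only an illustrative sketch of the kind of calculation it describes, not the presenter's formula. It assumes identical components that fail independently at a constant rate (an exponential failure model), in which case the probability that none of N components fails within a window t is exp(-N * t / MTBF). The function name `survival_probability` and the 50,000-hour MTBF are hypothetical values chosen for illustration.

```python
import math


def survival_probability(n_components: int, mtbf_hours: float, window_hours: float) -> float:
    """Probability that none of n_components fails within window_hours.

    Assumption (not from the talk): components are identical, fail
    independently, and have a constant failure rate of 1 / mtbf_hours.
    """
    if n_components < 0 or window_hours < 0 or mtbf_hours <= 0:
        raise ValueError("counts and durations must be non-negative, MTBF positive")
    # Independent exponential lifetimes: the failure rates simply add up.
    return math.exp(-n_components * window_hours / mtbf_hours)


if __name__ == "__main__":
    # Hypothetical numbers: a 24-hour job on clusters of increasing size.
    for nodes in (1, 100, 1000):
        p = survival_probability(nodes, mtbf_hours=50_000.0, window_hours=24.0)
        print(f"{nodes:>5} nodes: P(no failure in 24 h) = {p:.3f}")
```

Under these assumptions the survival probability drops quickly as the node count or job duration grows, which is why a per-job reliability estimate matters more for large strong-scaling runs than for small ones.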