SC22 BoF131: Disaggregated Heterogeneous Architectures
16-Nov-2022 @ Kay Bailey Hutchison Convention Center Dallas, 650 S Griffin St, Dallas, TX 75202, United States
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22), Nov 13–18, 2022, Dallas, Texas.
BoF Session 131. Schedule: November 16th, Wednesday 12:15 - 13:15
Today’s HPC systems are highly heterogeneous machines combining different processor, network, memory, and storage technologies. This diversification is expected to grow further with the integration of disruptive technologies, such as AI accelerators, neuromorphic devices, or even quantum computers. Orchestrating and using this hardware zoo poses enormous challenges. System developers and operators require scalable ways to interconnect the different technologies, advanced scheduling and management techniques, and I/O and data management mechanisms to deal with increasingly data-intensive workflows. Users, for their part, need methods to efficiently transfer data between compute, memory and storage elements, and strategies for programming thousands of devices with partially different instruction set architectures and vendor-specific environments.
The exact manifestation of the above challenges depends on how the hardware resources are organised at system level. Some experts advocate for a monolithic approach in which all nodes are equal, each node containing a variety of computing elements. Others go in exactly the opposite direction and segregate the resources at system level, grouping the different types into partitions or modules. This latter category is the focus of this BoF.
“Disaggregated” or “modular” supercomputing refers to a system-level architecture in which heterogeneous resources are organised in partitions or modules, each with a different type of node configuration. This approach is gaining traction in the HPC landscape, with Perlmutter, LUMI and JUWELS being just some examples. This BoF will be a forum to discuss the most recent research topics around disaggregated heterogeneous architectures, their operation and use. Discussions will include the challenges seen by operators, vendors, developers of system software, programming models and tools, as well as application developers when adapting their codes to make use of such machines.
The target audience comprises HPC centres operating or planning to deploy modular/disaggregated supercomputers, vendors building them, including storage and network administrators, developers of system software, programming models and tools that address system-level heterogeneity, and application developers who are adapting their codes to make use of such machines. The panel of speakers represents these sectors and will raise their respective challenges.
Questions & Discussions:
This BoF is a forum for open and constructive discussion. We encourage every participant to speak up directly during the session. If you would rather not speak in public, you can post your questions in this Google Doc, and our moderator will pose them for you.
Agenda:
- 5 min: motivation for heterogeneous disaggregation (Nick Wright, NERSC)
- 5 min: main challenges from operation perspective (Kengo Nakajima, Univ. Tokyo and RIKEN R-CCS)
- 5 min: main challenges for data management (Philippe Deniel, CEA)
- 5 min: main challenges for programming and use (Anshu Dubey, ANL)
- 40 min: moderated discussion between panel and audience