System management in the BlueGene/L supercomputer

G. Almasi, L. Bachega, R. Bellofatto, J. Brunheroto, C. Cascaval, J. Castaños, P. Crumley, C. Erway, J. Gagliano, D. Lieber, P. Mindlin, J. E. Moreira, R. K. Sahoo, A. Sanomiya, E. Schenfeld, R. Swetz, M. Bae, G. Laib, K. Ranganathan, Y. Aridor, T. Domany, Y. Gal, O. Goldshmidt, E. Shmueli

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

12 Scopus citations

Abstract

The BlueGene/L supercomputer will use system-on-a-chip integration and a highly scalable cellular architecture to deliver 360 teraflops of peak computing power. With 65536 compute nodes, BlueGene/L represents a new level of scalability for parallel systems. As such, it is natural for many scalability challenges to arise. In this paper, we discuss system management and control, including machine booting, software installation, user account management, system monitoring, and job execution. We address the issue of scalability by organizing the system hierarchically. The 65536 compute nodes are organized in 1024 clusters of 64 compute nodes each, called processing sets. Each processing set is under control of a 65th node, called an I/O node. The 1024 processing sets can then be managed to a great extent as a regular Linux cluster, of which there are several successful examples. Regular cluster management is complemented by BlueGene/L specific services, performed by a service node over a separate control network. Our software development and experiments have been conducted so far using an architecturally accurate simulator of BlueGene/L, and we are gearing up to test real prototypes in 2003.
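
The hierarchy described in the abstract can be illustrated with a little arithmetic. The following is a minimal sketch, not the authors' implementation: it assumes compute nodes are numbered contiguously from 0 and assigned to processing sets in order, which the abstract does not state. The names `pset_of` and `compute_nodes_in_pset` are hypothetical.

```python
# Sizes taken from the abstract.
COMPUTE_NODES = 65536
PSET_SIZE = 64                           # compute nodes per processing set
NUM_PSETS = COMPUTE_NODES // PSET_SIZE   # one I/O node per processing set


def pset_of(compute_node: int) -> int:
    """Return the processing set (and hence I/O node) index for a compute node.

    Assumption: contiguous numbering, so nodes 0..63 form set 0, and so on.
    """
    if not 0 <= compute_node < COMPUTE_NODES:
        raise ValueError(f"compute node index out of range: {compute_node}")
    return compute_node // PSET_SIZE


def compute_nodes_in_pset(pset: int) -> range:
    """Return the compute node indices controlled by one I/O node."""
    if not 0 <= pset < NUM_PSETS:
        raise ValueError(f"processing set index out of range: {pset}")
    return range(pset * PSET_SIZE, (pset + 1) * PSET_SIZE)


# Total node count once the 65th (I/O) node of each set is included.
TOTAL_NODES = COMPUTE_NODES + NUM_PSETS
```

Under this numbering, cluster-style management tools would only need to address the 1024 I/O nodes, with each I/O node responsible for its own 64 compute nodes.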

Original language: English
Title of host publication: Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003
Publisher: Institute of Electrical and Electronics Engineers
ISBN (Electronic): 0769519261, 9780769519265
State: Published - 1 Jan 2003
Externally published: Yes
Event: International Parallel and Distributed Processing Symposium, IPDPS 2003 - Nice, France
Duration: 22 Apr 2003 to 26 Apr 2003

Publication series

Name: Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003

Conference

Conference: International Parallel and Distributed Processing Symposium, IPDPS 2003
Country/Territory: France
City: Nice
Period: 22/04/03 to 26/04/03

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Theoretical Computer Science
  • Software
