TY - GEN
T1 - System management in the BlueGene/L supercomputer
AU - Almasi, G.
AU - Bachega, L.
AU - Bellofatto, R.
AU - Brunheroto, J.
AU - Cascaval, C.
AU - Castaños, J.
AU - Crumley, P.
AU - Erway, C.
AU - Gagliano, J.
AU - Lieber, D.
AU - Mindlin, P.
AU - Moreira, J. E.
AU - Sahoo, R. K.
AU - Sanomiya, A.
AU - Schenfeld, E.
AU - Swetz, R.
AU - Bae, M.
AU - Laib, G.
AU - Ranganathan, K.
AU - Aridor, Y.
AU - Domany, T.
AU - Gal, Y.
AU - Goldshmidt, O.
AU - Shmueli, E.
N1 - Publisher Copyright:
© 2003 IEEE.
PY - 2003/1/1
Y1 - 2003/1/1
N2 - The BlueGene/L supercomputer will use system-on-a-chip integration and a highly scalable cellular architecture to deliver 360 teraflops of peak computing power. With 65,536 compute nodes, BlueGene/L represents a new level of scalability for parallel systems, and many scalability challenges naturally arise. In this paper, we discuss system management and control, including machine booting, software installation, user account management, system monitoring, and job execution. We address the issue of scalability by organizing the system hierarchically. The 65,536 compute nodes are organized into 1024 clusters of 64 compute nodes each, called processing sets. Each processing set is under the control of a 65th node, called an I/O node. The 1024 processing sets can then be managed to a great extent as a regular Linux cluster, of which there are several successful examples. Regular cluster management is complemented by BlueGene/L-specific services, performed by a service node over a separate control network. Our software development and experiments have so far been conducted using an architecturally accurate simulator of BlueGene/L, and we are gearing up to test real prototypes in 2003.
AB - The BlueGene/L supercomputer will use system-on-a-chip integration and a highly scalable cellular architecture to deliver 360 teraflops of peak computing power. With 65,536 compute nodes, BlueGene/L represents a new level of scalability for parallel systems, and many scalability challenges naturally arise. In this paper, we discuss system management and control, including machine booting, software installation, user account management, system monitoring, and job execution. We address the issue of scalability by organizing the system hierarchically. The 65,536 compute nodes are organized into 1024 clusters of 64 compute nodes each, called processing sets. Each processing set is under the control of a 65th node, called an I/O node. The 1024 processing sets can then be managed to a great extent as a regular Linux cluster, of which there are several successful examples. Regular cluster management is complemented by BlueGene/L-specific services, performed by a service node over a separate control network. Our software development and experiments have so far been conducted using an architecturally accurate simulator of BlueGene/L, and we are gearing up to test real prototypes in 2003.
UR - http://www.scopus.com/inward/record.url?scp=33847123003&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2003.1213483
DO - 10.1109/IPDPS.2003.1213483
M3 - Conference contribution
AN - SCOPUS:33847123003
T3 - Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003
BT - Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003
PB - Institute of Electrical and Electronics Engineers
T2 - International Parallel and Distributed Processing Symposium, IPDPS 2003
Y2 - 22 April 2003 through 26 April 2003
ER -