The Cray XD1 is a highly modular, Linux-based supercomputer that scales easily. The smallest complete base unit is a chassis, a 12-processor Opteron machine. (This is where the name “mini” comes in.) Up to 12 chassis can be installed in a rack, and multirack configurations integrate hundreds of processors into a single system. The system is a great marriage of Cray, Linux and AMD technologies.
One of the greatest features of the XD1 is its ability to self-heal whenever a problem is detected. It uses an elaborate fault detection system that monitors over 200 critical hardware functions as well as the Linux OS, predicting imminent failures and automatically recovering from them.
The Cray XD1 system provides extensive fault detection, isolation, and prediction capabilities, coupled with automated proactive and reactive self-healing intelligence.
Fault Detection: In each chassis, a dedicated management processor with its own supervisory network continuously monitors over 200 critical hardware functions, including temperatures, voltages, in-rush currents, parity errors, and component diagnostics. The management system also monitors the sanity and operation of the Linux operating system and key internal services such as DNS, NIS, and LDAP.
Proactive Management: Sophisticated proactive controls adjust a broad range of operating parameters to maintain peak performance and optimal operating conditions. The periodic refresh of system software in an SMP helps avoid problems with corrupted software. These proactive measures improve the mean time between failures (MTBF) and ensure system resiliency and job completion.
Recovery from Failures: The Cray XD1’s self-healing intelligence facilitates a quick and automated recovery in the event of a hardware failure, reducing outages from hours to minutes.
Redundancy features include “N+1 sparing” and the ability to reallocate resources in the event of a failure, enabling a replacement SMP to assume the persona of a failed SMP and restore full capacity to the affected partition. Jobs are automatically rescheduled from the last checkpoint.
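The idea behind N+1 sparing can be illustrated with a small Python sketch. This is not Cray's Active Manager code or API; the Smp and Partition classes, the persona field, and the fail_over method are hypothetical names invented here purely to show a spare taking over the logical identity of a failed node so the partition keeps its full capacity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Smp:
    """One SMP node. `persona` is a hypothetical logical identity."""
    hardware_id: str
    persona: Optional[str] = None
    healthy: bool = True


@dataclass
class Partition:
    """N active SMPs plus spare SMPs held in reserve (N+1 when one spare)."""
    active: List[Smp]
    spares: List[Smp]

    def fail_over(self, failed_hardware_id: str) -> Smp:
        """Replace a failed SMP with a spare that assumes its persona."""
        for slot, smp in enumerate(self.active):
            if smp.hardware_id == failed_hardware_id:
                break
        else:
            raise KeyError(f"no active SMP with id {failed_hardware_id}")
        if not self.spares:
            raise RuntimeError("no spare SMP available")

        spare = self.spares.pop()
        spare.persona = smp.persona   # spare takes over the logical identity
        smp.persona = None
        smp.healthy = False
        self.active[slot] = spare     # capacity of the partition is unchanged
        return spare


partition = Partition(
    active=[Smp(f"hw{i}", persona=f"node{i}") for i in range(3)],
    spares=[Smp("hw-spare")],
)
replacement = partition.fail_over("hw1")
print(replacement.hardware_id, "now runs as", replacement.persona)
```

In a real system the scheduler would then restart the affected jobs from their last checkpoint on the replacement node; that step is outside this sketch.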
Before you even look at the special technology, Cray makes sure that the system is adequately equipped.
The Cray XD1 features the direct connect processor (DCP) architecture, which removes PCI bottlenecks and memory contention to deliver superior sustained performance. According to the HPC Challenge benchmarks, the Cray XD1 has the lowest latency of any HPC system, with MPI latency of 1.8 microseconds and random ring latency of 1.3 microseconds. Tests conducted by the Ohio Supercomputer Center show that the Cray XD1 delivers messages with four times lower MPI latency than common cluster interconnects such as InfiniBand, Quadrics or Myrinet, and 30 times lower than the Gigabit Ethernet employed in the lowest-cost clusters. The Cray XD1’s interconnect delivers twice the bandwidth of 4X InfiniBand for messages up to 1 KB and 60 percent higher throughput for very large messages.
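To put those ratios in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes that "four times lower" and "30 times lower" mean one quarter and one thirtieth of the competing latency, and that the 1.8-microsecond HPC Challenge figure is the right baseline; the Ohio Supercomputer Center tests may have measured a different value, so treat the results as illustrative only.

```python
# MPI latency quoted for the Cray XD1, in microseconds.
XD1_MPI_LATENCY_US = 1.8


def implied_latency_us(xd1_latency_us: float, factor: float) -> float:
    """Latency of a competing interconnect if the XD1's is `factor` times lower."""
    return xd1_latency_us * factor


# Factors taken from the text; the baseline pairing is an assumption.
common_cluster_us = implied_latency_us(XD1_MPI_LATENCY_US, 4)
gigabit_ethernet_us = implied_latency_us(XD1_MPI_LATENCY_US, 30)

print(f"Common cluster interconnect: about {common_cluster_us:.1f} microseconds")
print(f"Gigabit Ethernet: about {gigabit_ethernet_us:.1f} microseconds")
```

Under those assumptions, the common cluster interconnects would sit at roughly 7 microseconds and Gigabit Ethernet at roughly 54 microseconds.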
The Linux/Opteron system runs 32-bit and 64-bit x86 codes. Field programmable gate arrays (FPGAs) are available to accelerate applications, and the Active Manager subsystem provides single system command and control and high availability features. A 3VU (5.25″) chassis provides 12 compute processors, 58 peak gigaflops, 96 GB/second aggregate switching capacity, 1.8-microsecond MPI interprocessor latency, 84 GB maximum memory and 1.5 TB maximum disk storage. A 12-chassis rack provides 144 compute processors, 691 peak gigaflops, 1 TB/second aggregate switching capacity, 2-microsecond MPI interprocessor latency, 922 GB/second aggregate memory bandwidth, 1 TB maximum memory and 18 TB maximum disk storage.
So if you or your company needs one of these systems, you can pick one up for anywhere from just under $100,000 to about $2 million. That is really cheap as far as supercomputers are concerned.
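The rack figures follow almost directly from multiplying the chassis figures by 12, as this short Python check shows. The dictionary keys and the rack_totals helper are names made up for this illustration; the numbers themselves come from the specifications above.

```python
# Per-chassis figures quoted in the text.
CHASSIS = {
    "compute_processors": 12,
    "peak_gflops": 58,
    "switching_gb_per_s": 96,
    "max_memory_gb": 84,
    "max_disk_tb": 1.5,
}
CHASSIS_PER_RACK = 12


def rack_totals(chassis=CHASSIS, chassis_per_rack=CHASSIS_PER_RACK):
    """Scale every per-chassis figure linearly to a full rack."""
    return {name: value * chassis_per_rack for name, value in chassis.items()}


totals = rack_totals()
for name, value in totals.items():
    print(f"{name}: {value}")
```

The scaled values are 144 processors, 696 gigaflops, 1,152 GB/second of switching capacity, 1,008 GB of memory and 18 TB of disk. The switching and memory totals match the quoted "1 TB" figures after rounding. The quoted 691 peak gigaflops is slightly below 696, which would be consistent with the per-chassis number having been rounded up from about 57.6, although the source does not say so.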