
High Performance and Super Computing

(Supercomputer, Lawrence Livermore National Laboratory)

- Overview

High-performance computing is undergoing a period of rapid evolution involving changes in computational architectures and the types of problems that are being attacked. Aside from the traditional types of “hard” computational problems, there is now considerably more emphasis on “system-scale” simulation, data-intensive problems, and large-scale artificial intelligence/machine learning applications.


- The Summit Supercomputer

[MIT]: Summit Supercomputer, Oak Ridge National Laboratory (ORNL) - The world’s most powerful supercomputer, as of June 2018, is tailor-made for the AI era.

The new machine is capable, at peak performance, of 200 petaflops, or 200 million billion calculations a second. To put that in context, everyone on earth would have to do a calculation every second of every day for 305 days to match what the new machine can do in a single second. Summit’s peak performance is about 60 percent higher than that of the Chinese Sunway TaihuLight (神威·太湖之光), which has a peak of about 125 petaflops and a LINPACK benchmark rating of 93 petaflops as of March 2018. Summit is also almost eight times as fast as a machine called Titan, which is also housed at ORNL and held the US supercomputing speed record until Summit’s arrival.
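The 305-day comparison above can be checked with a few lines of arithmetic. This is only a rough sketch: the world population of about 7.6 billion is an assumption for 2018 and does not come from the text, and the function and constant names are illustrative.

```python
# Figure quoted in the text: Summit's peak performance
SUMMIT_PEAK_FLOPS = 200e15  # 200 petaflops

# Assumption (not from the text): roughly 7.6 billion people in 2018
WORLD_POPULATION = 7.6e9

SECONDS_PER_DAY = 24 * 60 * 60


def days_for_humanity(flops, population=WORLD_POPULATION):
    """Days needed for `population` people, each doing one calculation
    per second, to match one second of work at `flops`."""
    seconds = flops / population
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(round(days_for_humanity(SUMMIT_PEAK_FLOPS)))  # prints 305
```

Under that population assumption, the result is about 305 days for one second of peak machine time.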

Beyond its peak performance of 200,000 trillion calculations per second, or 200 petaflops, Summit will also be capable, for certain scientific applications, of more than three billion billion mixed-precision calculations per second, or 3.3 exaops. Summit will provide unprecedented computing power for research in energy, advanced materials and artificial intelligence (AI), among other domains, enabling scientific discoveries that were previously impractical or impossible.

Summit is the first supercomputer designed from the ground up to run AI applications, such as machine learning and neural networks. It has over 27,000 GPU chips from Nvidia, whose products have supercharged plenty of AI applications, and also includes some of IBM’s Power9 chips, which the company launched in 2017 specifically for AI workloads. There’s also an ultrafast communications link for shipping data between these silicon workhorses.

All this allows Summit to run some applications up to 10 times faster than Titan while using only 50 percent more electrical power. Among the AI-related projects slated to run on the new supercomputer is one that will crunch through huge volumes of written reports and medical images to try to identify possible relationships between genes and cancer. Another will try to identify genetic traits that could predispose people to opioid addiction and other afflictions.

Summit is also an important stepping stone to the next big prize in computing: machines capable of an exaflop, or a billion billion calculations a second. The experience of building Summit, which fills an area the size of two tennis courts and pumps 4,000 gallons of water a minute through its cooling system to carry away about 13 megawatts of heat, will help inform work on exascale machines, which will require even more impressive infrastructure. Things like Summit’s advanced memory management and the novel, high-bandwidth linkages that connect its chips will be essential for handling the vast amounts of data exascale machines will generate.


- Parallel Supercomputing

Parallel computing is the concurrent use of multiple processors (CPUs) to do computational work. 

In traditional (serial) programming, a single processor executes program instructions in a step-by-step manner. Some operations, however, have multiple steps that do not have time dependencies and therefore can be separated into multiple tasks to be executed simultaneously. For example, adding a number to all the elements of a matrix does not require that the result for one element be available before the next element is processed. Elements in the matrix can be made available to several processors, and the additions performed simultaneously, with the results available faster than if all operations had been performed serially.
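The matrix example above can be sketched in Python. This is a minimal illustration rather than a tuned implementation: the function names `add_to_row` and `parallel_add` are hypothetical, and the matrix is assumed to be a plain list of rows. Each row becomes an independent task because no row needs another row's result.

```python
from concurrent.futures import ThreadPoolExecutor


def add_to_row(row, value):
    """Add `value` to every element of one row; no other row is needed."""
    return [element + value for element in row]


def parallel_add(matrix, value, workers=4):
    """Add `value` to every element of `matrix`, one independent task per row."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in row order, even if tasks finish out of order
        return list(pool.map(add_to_row, matrix, [value] * len(matrix)))


if __name__ == "__main__":
    matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(parallel_add(matrix, 10))
    # prints [[11, 12, 13], [14, 15, 16], [17, 18, 19]]
```

Threads are used here only to keep the sketch short. In CPython, the global interpreter lock limits the speedup threads give on CPU-bound work, so `ProcessPoolExecutor` from the same module, which offers the same `map` interface, is the usual substitute when the goal is to use several processors.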

Parallel computations can be performed on shared-memory systems with multiple CPUs, on distributed-memory clusters made up of smaller shared-memory systems, or even on single-CPU systems with multiple cores. Coordinating the concurrent work of the multiple processors and synchronizing the results are handled by program calls to parallel libraries; these tasks usually require parallel programming expertise.

Parallel supercomputers have been in the mainstream of high-performance computing for decades. However, their popularity is waning. The reasons for this decline are many: such machines are expensive to purchase and run, potentially difficult to program, slow to evolve in the face of emerging hardware technologies, and generally difficult to upgrade without replacing the whole system. The decline of dedicated parallel supercomputers has been compounded by the emergence of commodity off-the-shelf clusters of PCs and workstations.

- Cluster Supercomputing

A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. 

The components of a cluster are usually connected to each other through fast local area networks, with each node (a computer used as a server) running its own instance of an operating system. In most circumstances, all of the nodes use the same hardware and the same operating system, although some setups use different hardware or different operating systems on individual nodes.

Clusters are typically used either for high availability (HA), to provide greater reliability, or for high-performance computing (HPC), to provide greater computational power than a single computer can. As HPC clusters grow in size, they become increasingly complex and time-consuming to manage. Tasks such as deployment, maintenance, and monitoring of these clusters can be effectively managed using an automated cluster computing solution. Cluster computing can scale to very large systems: hundreds or even thousands of machines can be networked. In fact, the entire Internet can be viewed as one truly huge cluster.

Computer clusters emerged from the convergence of a number of computing trends, including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing. They have a wide range of applicability and deployment, ranging from small business clusters with a handful of nodes to some of the fastest supercomputers in the world, such as IBM's Sequoia.


- Grid Supercomputing

Grids are a form of distributed computing whereby a "super virtual computer" is composed of many networked, loosely coupled computers acting together to perform large tasks, such as analyzing huge sets of data or weather modeling. Through the cloud, you can assemble and use vast computer grids for specific time periods and purposes, paying, if necessary, only for what you use, which saves both the time and expense of purchasing and deploying the necessary resources yourself. Splitting tasks over multiple machines also significantly reduces processing time, which increases efficiency and minimizes wasted resources.

Unlike parallel computing projects, grid computing projects typically have no time dependencies associated with them. They use computers that are part of the grid only when idle, and operators can perform tasks unrelated to the grid at any time. Security must be considered when using computer grids, as controls on member nodes are usually very loose. Redundancy should also be built in, as many computers may disconnect or fail during processing.
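One simple way to build in the redundancy mentioned above is to reassign a task to another node whenever a node drops out. The sketch below is illustrative only: `run_with_redundancy`, the node names, and the use of `ConnectionError` to signal a lost node are all assumptions, not part of any particular grid middleware. It runs tasks one after another to keep the reassignment logic visible.

```python
def run_with_redundancy(tasks, nodes, run_on_node, max_attempts=3):
    """Run each task, moving it to the next node when a node fails.

    tasks:        list of work items
    nodes:        list of node identifiers (hypothetical names)
    run_on_node:  callable (node, task) -> result; assumed to raise
                  ConnectionError when the node disconnects or fails
    """
    results = []
    for index, task in enumerate(tasks):
        for attempt in range(max_attempts):
            # Rotate through the nodes so a retry lands on a different one
            node = nodes[(index + attempt) % len(nodes)]
            try:
                results.append(run_on_node(node, task))
                break
            except ConnectionError:
                continue  # node dropped out; try the next node
        else:
            raise RuntimeError(f"task {index} failed after {max_attempts} attempts")
    return results


if __name__ == "__main__":
    def flaky_square(node, task):
        # Simulated grid: "node-b" is always offline
        if node == "node-b":
            raise ConnectionError(f"{node} is offline")
        return task * task

    print(run_with_redundancy([1, 2, 3, 4], ["node-a", "node-b", "node-c"], flaky_square))
    # prints [1, 4, 9, 16]
```

In the usage example, the second task is first assigned to the offline node and is then completed by the next node in the list, so the overall result is unaffected by the failure.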

The grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files. Grid computing is distinguished from conventional high-performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application. Grid computers also tend to be more heterogeneous and geographically dispersed (thus not physically coupled) than cluster computers. Although a single grid can be dedicated to a particular application, commonly a grid is used for a variety of purposes. Grids are often constructed with general-purpose grid middleware software libraries. Grid sizes can be quite large.



[More to come ...]


