“The logical prototype simulates the entire circuit design of the proposed processor,” says John Shalf of the NERSC Division, principal investigator of Green Flash.

The prototype was designed in collaboration with Tensilica, using Tensilica’s Xtensa LX extensible processor core as the basic building block, and was run on a RAMP (Research Accelerator for Multiple Processors) BEE3 hardware emulator, which is used for computer architecture research.

The climate code that ran on the prototype was a next-generation, limited area model version of the geodesic Global Cloud-Resolving Model (GCRM) developed by the research group of David Randall, Director of the Center for Multiscale Modeling of Atmospheric Processes at Colorado State University (CSU) and principal investigator of the DOE SciDAC project “Design and Testing of a Global Cloud-Resolving Model.” Randall, Ross Heikes, and Celal Konor of CSU are collaborating with the Green Flash team.

David Donofrio of Berkeley Lab’s Computational Research Division (CRD), who works on the hardware design of Green Flash, ran the prototype in a demonstration at the SC08 conference in Austin, Texas, in November 2008.

Green Flash was first proposed publicly in the paper “Towards Ultra-High Resolution Models of Climate and Weather,” written by Michael Wehner and Lenny Oliker of CRD and Shalf of NERSC. The paper was published in the May 2008 issue of the International Journal of High Performance Computing Applications.

Green Flash Simultaneously Addresses Three Problems

The Green Flash project addresses three research problems simultaneously — a climate science problem, a computer architecture/hardware problem, and a software problem.

The climate science problem stems from current low-resolution climate models, which cannot capture the behavior of cumulus convective cloud systems. Instead, researchers must use statistical patterns to approximate this important mechanism of heat and moisture transport. Direct numerical simulation of individual cloud systems has been proposed as a more rigorous replacement for these approximations, but it would require horizontal grid resolutions approaching 1 km. Randall’s research group is gradually working toward that goal; their current SciDAC project aims to develop a cloud model with 3 km resolution.
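To give a rough sense of why resolution drives cost, the sketch below estimates how many horizontal grid cells are needed to cover the globe at a given spacing. It is only an illustration under stated assumptions: a uniform grid of square cells and an Earth surface area of about 5.1 × 10^8 km². The function name is hypothetical, and the actual GCRM uses a geodesic grid with many vertical levels, which this sketch does not model.

```python
EARTH_SURFACE_KM2 = 5.1e8  # approximate surface area of the Earth (assumption)


def horizontal_cells(spacing_km: float) -> float:
    """Estimate the number of square horizontal cells of side `spacing_km`."""
    if spacing_km <= 0:
        raise ValueError("spacing_km must be positive")
    return EARTH_SURFACE_KM2 / spacing_km**2


if __name__ == "__main__":
    for dx in (3.0, 1.0):
        print(f"{dx:g} km spacing: about {horizontal_cells(dx):.2e} cells")
    ratio = horizontal_cells(1.0) / horizontal_cells(3.0)
    print(f"1 km needs about {ratio:.0f}x the horizontal cells of 3 km")
```

In a sketch like this, moving from 3 km to 1 km spacing multiplies the horizontal cell count by nine. Finer grids generally also need shorter time steps, so the total computing cost is likely to grow faster than the cell count alone suggests.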

Increasing Model Resolution: Topography of California and Nevada at three different model resolutions. The left panel shows the relatively low resolution typical of the models used for the Intergovernmental Panel on Climate Change’s Fourth Assessment Report, published in 2007. The center panel shows the upper limit of current climate models with statistical approximations of cloud systems. The right panel shows the resolution needed for direct numerical simulation of individual cloud systems.

To develop a 1 km cloud model, scientists would need a supercomputer that is 1,000 times more powerful than what is available today. But building a supercomputer that powerful with conventional microprocessors (the kind used to build personal computers) would cost about $1 billion and would require 200 megawatts of electricity to operate — enough energy to power a small city of 100,000 residents. That constitutes the computer architecture problem. In fact, the energy consumption of conventional computers is now recognized as a major problem not just for climate science, but for all large-scale computing.

Shalf, Wehner, and Oliker see a possible solution to these challenges — achieving high performance with a limited power budget and with economic viability — in the low-power embedded microprocessors found in cell phones, iPods, and other electronic devices. Unlike the general-purpose processors found in personal computers and most supercomputers, where versatility comes at a high cost in power consumption and heat generation, embedded processors are designed to perform only what is required for specific applications, so their power needs are much lower. The embedded processor market also offers a robust set of design tools and a well-established economic model for developing application-specific integrated circuits (ASICs) that achieve power efficiency by tailoring the design to the requirements of the application. Chuck McParland of CRD has been examining issues of manufacturability and cost projections for the Green Flash design to demonstrate the cost-effectiveness of this approach.

Meeting the performance target for the climate model using this technology approach will require on the order of 20 million processors. Conventional approaches to programming are unable to scale to such massive concurrency. The software problem addressed by the Green Flash project involves developing new programming models that are designed with million-way concurrency in mind, and exploiting auto-tuning technology to automate the optimization of the software design to operate efficiently on such a massively parallel system.

To meet this challenge, Tony Drummond of CRD and Norm Miller of the Earth Sciences Division are analyzing the code requirements, and Shoaib Kamil, a graduate student in computer science at the University of California, Berkeley (UCB) who is working at NERSC, has been developing an auto-tuning framework for the climate code. This framework automatically extracts sections of the Fortran source code of the climate model and optimizes them for Green Flash and a variety of other architectures, including multicore processors and graphics processors.
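The general idea behind auto-tuning can be sketched in a few lines: run several variants of the same kernel, time each one on the target machine, and keep the fastest. The example below is a toy illustration in Python, not the Green Flash framework, which operates on Fortran source. The stencil kernel, the block-size parameter, and the function names are all assumptions made for the sake of the sketch.

```python
import time


def blocked_sum_of_neighbors(grid, block):
    """Apply a 1-D three-point stencil, visiting the interior in chunks of `block`."""
    n = len(grid)
    out = [0.0] * n
    for start in range(1, n - 1, block):
        stop = min(start + block, n - 1)
        for i in range(start, stop):
            out[i] = grid[i - 1] + grid[i] + grid[i + 1]
    return out


def autotune(kernel, grid, candidates, repeats=3):
    """Time the kernel for each candidate parameter and return the fastest one."""
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(grid, block)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    data = [float(i % 7) for i in range(20_000)]
    best_block, timings = autotune(blocked_sum_of_neighbors, data, [16, 64, 256, 1024])
    print("fastest block size on this machine:", best_block)
```

Every variant computes the same result, so the choice affects only speed. The winning parameter depends on the machine the search runs on, which is why the search is repeated for each architecture rather than fixed in advance.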

Developing Hardware and Software Together
An innovative aspect of the Green Flash research is the hardware/software co-design process, in which early versions of both the processor design and the application code are developed and tested simultaneously. The RAMP emulation platform allows scientists to run the climate code on different hardware configurations and evaluate those designs while they are still on the drawing board. Members of the RAMP consortium on the UC Berkeley campus, including John Wawrzynek and Krste Asanovic (both of whom have joint appointments at NERSC), Greg Gibeling, and Dan Burke, have been working closely with David Donofrio of CRD and the Green Flash hardware team throughout the development process.


Auto-Tuning for Performance and Efficiency: Sample auto-tuning results for currently available processors, from research by Kaushik Datta and Samuel Webb Williams of CRD. The top panels (A) show improvements in performance and scalability for the AMD Opteron processor. The bottom panels (B) show significant improvements in power efficiency for four out of six types of processors. Processors with lots of on-chip parallelism, such as Cell and G80, show enormous potential for energy efficiency if computer scientists can figure out how to program them to exploit this capability. Auto-tuning is part of that programming solution.

At the same time, auto-tuning tools for code generation test different software implementations on each hardware configuration to increase performance, scalability, and power efficiency. Marghoob Mohiyuddin, another UCB graduate student at NERSC, has been working on automating the hardware/software co-design process and has recently demonstrated a 30–50% improvement in power efficiency over conventional approaches to design space exploration. The result will be a combination of hardware and software optimized to solve the cloud modeling problem.

The researchers estimate that the proposed Green Flash supercomputer, using about 20 million embedded microprocessors, would deliver the 1 km cloud model results and cost perhaps $75 million to construct (a more precise figure is one of the project goals). This computer would consume less than 4 megawatts of power and achieve a peak performance of 200 petaflops.
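The figures quoted above can be combined into a few derived numbers. The sketch below does that arithmetic, treating "less than 4 megawatts" as an upper bound and "about 20 million" as exact. The variable and function names are illustrative, and the comparison with the 200-megawatt conventional design is only approximate, since the two estimates were not necessarily made on identical terms.

```python
PEAK_FLOPS = 200e15          # 200 petaflops peak, from the estimate above
SYSTEM_WATTS = 4e6           # "less than 4 megawatts", used here as an upper bound
PROCESSORS = 20e6            # "about 20 million" embedded processors
CONVENTIONAL_WATTS = 200e6   # 200 megawatts quoted for the conventional design


def summarize():
    """Return derived figures of merit as a dictionary."""
    return {
        "gflops_per_watt": PEAK_FLOPS / SYSTEM_WATTS / 1e9,
        "gflops_per_processor": PEAK_FLOPS / PROCESSORS / 1e9,
        "watts_per_processor": SYSTEM_WATTS / PROCESSORS,
        "power_reduction_factor": CONVENTIONAL_WATTS / SYSTEM_WATTS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:g}")
```

Under these assumptions the design works out to roughly 50 gigaflops per watt at peak, about 10 gigaflops and 0.2 watts per processor (the power figure being the whole system's budget divided evenly among processors), and a power draw about 50 times lower than the conventional estimate.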

The hardware goals for the prototype research are fairly simple: produce a hardware prototype of a single Green Flash processor by fall 2009, and an entire node of the system (64 to 128 processors) by fall 2010. But the software goals are more challenging.

“We have open issues that we’re still wrestling with,” Shalf said. “We’ve come up with a solution for how do you program this machine for this particular code, but we haven’t fully explored the question of what is the more generalized way of programming a machine with 20 million processors. What I’d like to get out of this project is some lessons that we learned from examining this one particular case that we can generalize to other codes with similar algorithms on a machine of this scale. Whether or not we build a Green Flash, the answers to these questions will be the most important research challenges for the computer science community for the next decade.”

The Green Flash prototype research is funded by LBNL’s Laboratory Directed Research and Development program.