In the data center world, liquid cooling is becoming king

On Feb 4, 2019

In Iron Man 2, there is a moment when Tony Stark is watching a decades-old film of his deceased father, who tells him “I’m limited by the technology of my time, but one day you’ll figure this out. And when you do, you will change the world.” It’s a work of fiction but the notion expressed is legitimate. The visions and ideas of technologists are frequently well ahead of the technology of their times. Star Trek may have always had it, but it took the rest of us decades to get tablets and e-readers right.

The concept of liquid cooling sits squarely in this category as well. While the idea has been around since the 1960s, it remained a fringe concept when compared to the much cheaper and safer air cooling method. It took another 40-odd years before liquid cooling even started to take off in the 2000s, and then it was mostly confined to PC hobbyists who wanted to overclock their CPUs well beyond the recommended limits set by Intel and AMD.

Today, however, liquid cooling seems to be having a moment. You can buy a liquid cooling system for your PC for under $100, and a whole cottage industry of enterprise and data center vendors (like CoolIT, Asetek, Green Revolution Cooling, and Ebullient, just to name four) is promoting liquid cooling of data center equipment. Liquid cooling is still used primarily in supercomputing, high performance computing (HPC), and other situations involving massive amounts of compute power where CPUs run at almost 100 percent utilization, but such use cases are becoming mainstream.

There are two common types of liquid cooling: direct to chip and immersion. In direct to chip, a heat sink is attached to the CPU just as with a standard air cooler, but instead of a fan there are two tubes connected to it. One tube brings in cold water to cool the heat sink, which is absorbing heat from the CPU, and the other takes the hot water away. The hot water is then cooled and returned to the CPU in a closed loop not unlike the human bloodstream.

With immersion, the hardware is flooded with a liquid bath, which must obviously be non-conductive. Generally, this approach can best be compared to the pools used to cool nuclear reactor rods. Immersion is much more cutting edge and requires much more expensive coolant than direct to chip, which can still use plain old water. On top of that, there is the risk of spillage with immersion. For that reason, direct to chip is much more popular for now.

For one major example, take Alphabet. When Google’s parent company Alphabet introduced its TPU 3.0 (Tensor Processing Unit) AI processors in May 2018, Google CEO Sundar Pichai said the chips are so powerful “that for the first time we’ve had to introduce liquid cooling in our data centers.” The switch was the price Alphabet paid for an eight-fold improvement in performance.

On the flip side, Skybox Datacenters recently announced a massive, 40,000-server supercomputer for oil and gas exploration from DownUnder GeoSolutions (DUG). The system is expected to deliver 250 petaflops of computing power, more than any existing supercomputer, and its liquid cooling system would house the servers in more than 720 tanks, fully submerged in dielectric fluid.

Either way, “Liquid cooling is the cooling of the future and always will be,” said Craig Pennington, vice president of design engineering at data center operator Equinix. “It seems so obvious that’s the right thing to do but no one has done it.”

So how has liquid cooling gone from esoteric science on the fringe of computing to near-mainstream in modern data centers? Like all technologies, it was partially the result of evolution involving trial and error and a lot of engineering. But specifically for liquid cooling, today’s massive data center operators should perhaps thank the early overclockers, who may really be the unsung heroes of liquid cooling.

The IBM System 360 data processing control panel. #VintageChic (Image: H. Armstrong Roberts/ClassicStock/Getty Images)

What we talk about when we talk about liquid cooling

Liquid cooling really entered the popular imagination back in 1964 when IBM explored immersion cooling for the company’s System 360 mainframe (the company’s first family of compatible mainframe computers). The concept was simple: water would be run through a contraption that chilled it to below room temperature, and then the chilled water would be supplied directly to the system. IBM’s ultimate setup used what’s now known as rear door cooling, where a radiator-like device was mounted on the back of a mainframe. Fans drew hot air from the mainframe through this device, where the air was cooled by the water, much like how a radiator cools a car engine.

In the years since, engineers improved upon this basic concept, and two dominant forms of liquid cooling ultimately emerged: immersion and direct contact. Immersion is exactly what it sounds like: the electronics sit in a liquid bath that, for obvious reasons, cannot be water. The liquid must be non-conductive, or dielectric (companies like 3M even engineer fluid specifically for this purpose).

Immersion, though, has a lot of challenges and drawbacks. Because it sits in liquid, the server can only be accessed from the top, so that’s where external ports must be located. A 1U (rack unit) solution is impractical, so you can’t stack the racks. The dielectric fluid, usually mineral oil, is very expensive and can be messy to clean up if it spills. You need special hard drives and will likely have to spend a lot to retrofit a data center. That’s why, as with the DUG supercomputer in Houston mentioned above, immersion is best done in a new data center rather than retrofitted into an old one.

By contrast, direct contact liquid cooling is where a heat sink (or heat exchanger) sits on the chip, just like a regular heat sink. Instead of an attached fan blowing on the heat sink, however, this setup has two water pipe connections: one to bring cool water in to cool the plate, and another to take away the hot water created by contact with the heat plate. This has become the most common form of liquid cooling, adopted by major OEMs like HP Enterprise, Dell EMC, and IBM, as well as cabinet and enclosure makers like Chatsworth Products and Schneider Electric.
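To put rough numbers on that cool-water-in, hot-water-out loop, here is a minimal sketch of the standard steady-state heat balance (heat carried away equals mass flow times specific heat times temperature rise). Nothing in it comes from a vendor: the function names, the 200 W chip, and the 10 degree rise are hypothetical values chosen for illustration, and the water properties are textbook approximations.

```python
"""Back-of-the-envelope heat balance for a direct contact cold plate (illustrative only)."""

# Approximate properties of liquid water near room temperature (assumed values).
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0
WATER_DENSITY_KG_PER_L = 1.0


def required_flow_l_per_min(heat_load_w: float, delta_t_k: float) -> float:
    """Water flow needed to remove heat_load_w while warming by delta_t_k."""
    if heat_load_w < 0:
        raise ValueError("heat load must be non-negative")
    if delta_t_k <= 0:
        raise ValueError("temperature rise must be positive")
    # Energy balance Q = m_dot * c_p * delta_T, solved for the mass flow m_dot.
    mass_flow_kg_per_s = heat_load_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * delta_t_k)
    # Convert kg/s to liters per minute.
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


def outlet_temperature_c(inlet_c: float, heat_load_w: float, flow_l_per_min: float) -> float:
    """Temperature of the water leaving the cold plate for a given flow."""
    if flow_l_per_min <= 0:
        raise ValueError("flow must be positive")
    mass_flow_kg_per_s = flow_l_per_min * WATER_DENSITY_KG_PER_L / 60.0
    # Same balance, solved for the temperature rise across the plate.
    rise_k = heat_load_w / (mass_flow_kg_per_s * WATER_SPECIFIC_HEAT_J_PER_KG_K)
    return inlet_c + rise_k


if __name__ == "__main__":
    # Hypothetical example: a 200 W CPU with the water allowed to warm by 10 degrees.
    flow = required_flow_l_per_min(heat_load_w=200.0, delta_t_k=10.0)
    print(f"Required flow: {flow:.2f} L/min")
```

Under these assumptions, a 200 W chip needs only about 0.29 liters of water per minute. The sketch ignores pump losses, the thermal resistance between chip and plate, and any antifreeze in the mix, all of which a real design would have to account for.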

Direct to chip uses water, though it’s rather particular about the quality of that water. You certainly can’t use unfiltered municipal water. Just look at your faucet or shower head. Who wants a calcium buildup in their servers? At the very least, direct contact liquid cooling requires pure, distilled water, and sometimes water mixed with an antifreeze. Formulating this type of coolant is a precise science in and of itself.

The Intel connection

How did we get from IBM’s radiators to today’s extravagant cooling solutions? Again, thank the overclockers. Around the turn of the century, liquid cooling started catching on with PC overclockers and system builder hobbyists who wanted to run their computers at higher speeds than officially rated. Still, it was a very esoteric art with no standard design. Everyone did their own thing. It required a degree of user assembly and MacGyvering that put Ikea products to shame. Most of the coolers didn’t even fit well in the case.

In early 2004, things changed due to some internal politicking at Intel. An engineer from the company’s Hillsboro, Oregon design center—where most of the company’s chip design work is done, despite Intel HQ being in Santa Clara, California—had been working on a custom cooling project for Intel for several years. The project had cost Intel more than $1 million to develop at that time, and its aim was a liquid cooler for Intel CPUs. Unfortunately, Intel was about to kill the project.

The engineer hoped for a different result. To save his project, he brought the idea to Falcon Northwest, a builder of top-of-the-line systems for gamers in Portland. “The reasoning was as a company, they saw liquid cooling as a tacit endorsement of overclocking, and overclocking back then was verboten,” said Kelt Reeves, president of Falcon Northwest. Intel had logical reasons for this stance. At the time, unscrupulous retailers in Asia were selling PCs overclocked beyond their rated clock speeds with poor cooling, and this somehow became Intel’s problem in the public discourse. Thus, the company opposed overclocking.

But this Oregon engineer thought if he could get a customer and market case for the cooler, Intel would ultimately relent. (What Intel had built also happened to be a far better solution than what was on the market elsewhere, Reeves told Ars.) So after a little internal advocating and negotiating between companies, Intel allowed Falcon to sell the cooling systems, partly because Intel had already produced thousands of them. The only catch? Falcon couldn’t acknowledge Intel’s involvement. Falcon agreed, and soon it became the first PC maker to ship a fully sealed, all-in-one liquid cooler.

That pioneering modern liquid cooling solution wasn’t exactly consumer friendly, Reeves noted. Falcon had to modify cases to make the radiator fit and had to invent a cold plate to cool the water. But over time, CPU cooler makers like Thermaltake and Corsair studied what Intel did and aimed for incremental improvements. From there, server products and vendors like CoolIT and Asetek sprang up to bring liquid cooling specifically to the data center. Some of what they do, like the tubing that will not break, crack, or leak for up to seven years, was eventually licensed to consumer CPU cooler vendors, and this sharing of advancements back and forth has become the norm.

As this market started growing in multiple ways, even Intel eventually changed its tune. It now markets the overclocking capabilities of the K and X series of its CPUs, for instance, and the company doesn’t even bother to provide a stock fan with its top-of-the-line CPUs for gamers.

“[Liquid cooling] is tried and true—everyone on the consumer side doing it,” Reeves said. “Intel stopped shipping stock fans with high end CPUs because they require liquid cooling; it’s been proven by and even blessed by Intel at this point. I don’t think you will find anyone who would say an all-in-one is not reliable enough.”