This post was written by Hasnain Shabbir, Thermal Engineer, Dell Inc.
Why this blog?
I work on the thermal design of Dell servers, especially their thermal control systems, and I find your feedback very useful. It helps me improve our products for the next generation, and a blog like this lets me tell you about our thermal design features.
Since the launch of our PowerEdge 12th generation server lineup, we have heard from customers about temperature concerns in the system; for example, component, surface cover, and exhaust temperatures. To address these concerns, I want to explain the key thermal design philosophy behind our 12th generation servers. This post is intended to give you useful information about temperature levels, the BIOS options for adjusting temperatures, and an explanation of other benefits, like power savings, that are not necessarily intuitive.
What is ‘hot’?
Let’s have a short conversation about temperatures. When we “feel” temperatures, we feel them relative to temperatures with which we are most familiar. The air in many business buildings is around 25C (77F) – comfortable for a human. The surface of a hot cup of coffee can be around 45-55C (113-131F) and feels uncomfortably hot to the touch. Server cover temperatures may reach the same range, leading you to think that is unusual, since you may not ordinarily be in contact with something at that temperature.
As it turns out, CPUs and many other components (like memory, network and storage controller chips) in servers are designed to run reliably at such high temperatures. For example, server CPU temperature limits of 90-100C (194-212F) mean that you can run the processor close to those limits without impacting the reliability of the CPUs. (It is a range because the limit varies across CPU bins.) That last statement is a very important one, since most discussions about hot components boil down to (no pun intended) concerns about reliability.
So if someone says, “the cover of the computer is too hot to touch!” it is relative to what we feel as “normal” in our day-to-day life. We might personally feel really hot at an ambient of 35C (95F), or feel a surface at 50C (122F) to be really hot to the touch, and then translate that to our concerns about a CPU in the server that is running at 90C (194F). But operation at “hotter” temperatures does not make the product less reliable; in fact, the product is designed to run close to specification temperatures within the warranted life of the product.
But why run it so hot?
So the next logical question is: Why does Dell operate the CPUs at such high temperatures? Why not run the CPU or server cooler?
You might be surprised to hear that it is actually easier for me to design a server to run cooler than necessary. Strange, right? But the fans used to cool the CPUs can themselves consume a big chunk of the server operational power. Because you pay for power to run the server and hence the fans, I want to lower your electricity bills by reducing the fan speed as much as I can while allowing the CPUs to operate within their specification temperatures. Furthermore, reducing the fan speed reduces the acoustical noise.
The art is in designing the cooling system to be a minimum burden on the system power consumption. Instead of running fans at full speed all the time, causing top cover temperatures to be closer to what we as humans feel as comfortable, and consuming 20% of server power to cool the system, I optimize the fan speed to meet the CPU specifications to use as low as 3% of server power. That is a significant amount of power savings. The power savings multiply by the number of servers deployed and save recurring power cost for the life of the server. In addition, airflow from the server needs to be handled and conditioned, and this can impose needs on data center air handling, which has its own additional cost. So our effort is to cool (not over cool) the server within thermal specifications of all components in the system with minimum waste of cooling power.
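To put the 20% versus 3% comparison into numbers, here is a rough sketch in Python. Only the two fan shares come from this post. The 400 W non-fan load, the 1,000-server fleet, and the electricity price are made-up values for illustration, and the function names are hypothetical. The sketch also assumes the rest of the server draws the same power in both cases, which is a simplification.

```python
HOURS_PER_YEAR = 24 * 365


def fan_power_watts(non_fan_power_w: float, fan_share: float) -> float:
    """Fan power when fans account for `fan_share` of TOTAL server power.

    If share = fan / (fan + other), then fan = other * share / (1 - share).
    """
    if not 0 <= fan_share < 1:
        raise ValueError("fan_share must be in [0, 1)")
    return non_fan_power_w * fan_share / (1 - fan_share)


def annual_savings_kwh(non_fan_power_w: float,
                       share_before: float,
                       share_after: float,
                       servers: int = 1) -> float:
    """Energy saved per year by lowering the fan share of server power."""
    saved_w = (fan_power_watts(non_fan_power_w, share_before)
               - fan_power_watts(non_fan_power_w, share_after))
    return saved_w * HOURS_PER_YEAR * servers / 1000.0


# Illustrative assumptions, not Dell figures.
NON_FAN_POWER_W = 400.0   # CPUs, memory, drives, and so on
SERVERS = 1000
PRICE_PER_KWH = 0.10      # in your local currency

# The 0.20 and 0.03 fan shares are the ones quoted in this post.
saved_kwh = annual_savings_kwh(NON_FAN_POWER_W, 0.20, 0.03, SERVERS)
print(f"Energy saved per year: {saved_kwh:,.0f} kWh")
print(f"Cost saved per year:   {saved_kwh * PRICE_PER_KWH:,.0f}")
```

Plug in your own server load, fleet size, and electricity rate to get an estimate for your data center. The result does not include the additional savings on data center air handling.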
I am feeling hot in here!
A consequence of running the components hotter to conserve and optimize system power is that average temperatures in and around the system are higher, including exhaust air temperatures. IT professionals may view this negatively because they must physically interact with the servers during service. Although these temperatures are within their appropriate safety and handling limits, and power savings are realized with no risk to system reliability, they are not necessarily ergonomically friendly.
With that in mind, and given that different customers have different requirements, Dell provides additional knobs that allow IT professionals to set higher fan speeds than those mandated by the thermal algorithm. In this way, cooling concerns can be mitigated and handling temperatures can be improved, but at a higher operating cost due to higher fan power consumption. Note that the fans may only be set to higher speeds, not lower, than what the algorithm mandates, so that the systems stay within thermal specifications and system reliability is not put at risk.
Some of you may be concerned that running the fans at higher speeds puts fan reliability at risk, but you should be comforted to know that the fans are reliable for the life of the product even if they are run at full speed all the time. These knobs are available in the BIOS menu and are described on pages 8-10 of the Advanced Thermal Control white paper linked at the end of this post.
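The "higher but never lower" rule for the fan speed knobs can be pictured with a small Python sketch. This is not Dell's firmware or a BIOS interface; the function name and the percent-of-full-speed scale are assumptions chosen only to illustrate the idea that a user setting can raise the fan speed above what the thermal algorithm asks for, but cannot push it below.

```python
from typing import Optional


def effective_fan_speed(algorithm_pct: float,
                        user_pct: Optional[float] = None) -> float:
    """Return the fan speed to apply, as a percent of full speed.

    algorithm_pct: speed the thermal algorithm requires (hypothetical input).
    user_pct: optional user-requested speed; None means no override.
    """
    if user_pct is None:
        return algorithm_pct
    # The algorithm's value acts as a floor; 100% is the ceiling.
    return min(100.0, max(algorithm_pct, user_pct))


print(effective_fan_speed(40.0))        # no override: algorithm decides
print(effective_fan_speed(40.0, 60.0))  # user asks for more: honored
print(effective_fan_speed(40.0, 20.0))  # user asks for less: floor applies
```

In this sketch, asking for a lower speed simply has no effect, which mirrors the point that the system always stays within its thermal specifications.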
So I hope this blog will help you open your hearts to more heat (but greater efficiency) and feel a little cooler in your pockets (saving you recurring operating cost). I will be delighted to receive “hot” as well as “cold” comments on this blog and hope to post more in the future.
I will leave you with some useful links for further reading:
Dell Power & Cooling Technologies website: http://content.dell.com/us/en/enterprise/power-and-cooling-technologies.aspx
Advanced Thermal Control – A white paper: http://www.dell.com/downloads/global/products/pedge/advanced_thermal_control_whitepaper.pdf
Technical specifications for Intel processors: http://ark.intel.com/