Dell™ PowerEdge™ 12th generation servers with two or four processor sockets are Non-Uniform Memory Access (NUMA) capable by default. Memory on these systems is divided into “local” and “remote” memory, according to how close the memory is to the core executing a given thread. Accessing remote memory generally incurs higher latency than accessing local memory, and application performance can suffer if memory is not allocated local to the core(s) running the workload. Therefore, to improve performance, some effort is needed in Linux environments to ensure that applications run on specific sets of cores and use the memory closest to them.
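As a rough illustration of what running an application on a specific set of cores involves, the following Python sketch reads the NUMA topology that Linux exposes under /sys/devices/system/node and restricts the calling process to the cores of one node. The function names (parse_cpulist, cpus_by_node, pin_to_node) and the sysfs_root parameter are assumptions made for this example and do not come from the paper. The sketch controls CPU placement only. It relies on the default Linux policy of allocating memory on the node where the allocating thread is running, and it does not set an explicit memory policy.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8-11" into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # a node without CPUs reports an empty list
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cpus_by_node(sysfs_root="/sys"):
    """Map each NUMA node id to its CPU ids, as reported by sysfs."""
    nodes = {}
    pattern = os.path.join(sysfs_root, "devices/system/node/node[0-9]*/cpulist")
    for path in glob.glob(pattern):
        node_dir = os.path.basename(os.path.dirname(path))  # e.g. "node0"
        with open(path) as f:
            nodes[int(node_dir[len("node"):])] = parse_cpulist(f.read())
    return nodes


def pin_to_node(node, sysfs_root="/sys"):
    """Restrict the calling process to the CPUs of one NUMA node (Linux only)."""
    cpus = cpus_by_node(sysfs_root)[node]
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus


if __name__ == "__main__":
    # Print the topology; call pin_to_node(0) before allocating memory to pin.
    for node, cpus in sorted(cpus_by_node().items()):
        print(f"node{node}: {cpus}")
```

Pinning before the workload allocates its working set matters in this sketch, because pages that were already touched on another node stay there. Where an explicit memory binding is required, a dedicated tool such as numactl is the more usual choice.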
With the correct tools and techniques, considerable performance gains can be achieved for memory-intensive applications that may not be completely NUMA-aware. This white paper showcases these tools and gives examples of their performance impact to illustrate how important fine-tuning NUMA locality can be to overall performance for some workload types. In addition to the performance uplift from correctly affinitizing applications to specific cores and memory, this paper discusses NUMA in relation to the PCIe bus, also known as NUMA I/O.
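To make the NUMA I/O idea concrete, the sketch below looks up which NUMA node a PCIe device is attached to by reading the numa_node attribute that Linux publishes for each PCI device in sysfs. The function name pci_numa_node, the sysfs_root parameter, and the device address used in the comment are hypothetical and chosen only for illustration. The kernel reports -1 when it has no locality information for a device, and the sketch returns None in that case.

```python
import os


def pci_numa_node(pci_address, sysfs_root="/sys"):
    """Return the NUMA node of a PCI device, or None if it is not reported.

    pci_address is the full sysfs name of the device, for example
    "0000:03:00.0" (a hypothetical address used only as an illustration).
    """
    path = os.path.join(sysfs_root, "bus/pci/devices", pci_address, "numa_node")
    try:
        with open(path) as f:
            node = int(f.read().strip())
    except FileNotFoundError:
        return None  # device or attribute not present on this system
    return None if node < 0 else node  # -1 means no locality information


if __name__ == "__main__":
    devices_dir = "/sys/bus/pci/devices"
    if os.path.isdir(devices_dir):
        for address in sorted(os.listdir(devices_dir)):
            print(address, pci_numa_node(address))
```

A workload that drives a particular adapter heavily could then be placed on the cores of the node this function returns, so that both its memory and its I/O device are local.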