This is a document designed to (hopefully) answer some frequently asked
questions about the NUMA architecture.
Frequently Asked Questions:
- What does NUMA stand for?
- OK, so what does Non-Uniform Memory Access really mean?
- What is the difference between NUMA and SMP?
- What is the difference between NUMA and ccNUMA?
- What is a node?
- What is meant by local and remote memory?
- What do you mean by distance?
- Could you give a real-world analogy of the NUMA architecture to help
understand all these terms?
- Why should I use NUMA? What are the benefits of NUMA?
- What are the peculiarities of NUMA?
- What are some alternatives to NUMA?
- Could you give a brief description of the main NUMA architecture
implementations?
Frequently Given Answers:
Last updated: 1/04/02 Any problems, additions, etc., please
send email to this page's
- What does NUMA stand for?
NUMA stands for Non-Uniform Memory Access.
- OK, so what does Non-Uniform Memory Access really mean to me?
Non-Uniform Memory Access means that it takes longer to access some regions
of memory than others, because some regions of memory are on physically
different busses from other regions. Programs that are not NUMA-aware can
perform poorly as a result. NUMA also introduces the concept of local and
remote memory. For a more visual description, please refer to the section on
NUMA architecture implementations. Also, see the real-world analogy for the
NUMA architecture.
- What is the difference between NUMA and SMP?
The NUMA architecture was designed to surpass the scalability limits of the SMP
architecture. With SMP, which stands for Symmetric
Multi-Processing, all memory accesses are posted to the same shared
memory bus. This works fine for a relatively small number of CPUs, but the
problem with the shared bus appears when you have dozens, even hundreds, of
CPUs competing for access to the shared memory bus. NUMA alleviates these
bottlenecks by limiting the number of CPUs on any one memory bus, and
connecting the various nodes by means of a high speed interconnect.
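As a rough illustration of the point above, the following sketch counts how many CPUs share each memory bus in an SMP layout versus a NUMA layout. The function names and the CPU counts (64 CPUs, 4 per bus) are assumptions chosen for the example, not figures from any particular machine.

```python
def numa_layout(total_cpus, cpus_per_bus):
    """Map each memory bus (node) to the CPU ids attached to it."""
    layout = {}
    for cpu in range(total_cpus):
        layout.setdefault(cpu // cpus_per_bus, []).append(cpu)
    return layout


def worst_bus_contention(layout):
    """Largest number of CPUs competing for any single memory bus."""
    return max(len(cpus) for cpus in layout.values())


# SMP is modeled as the special case where every CPU sits on one bus.
smp = numa_layout(total_cpus=64, cpus_per_bus=64)
# Hypothetical NUMA machine: the same 64 CPUs, 4 per bus.
numa = numa_layout(total_cpus=64, cpus_per_bus=4)

print(worst_bus_contention(smp))   # 64
print(worst_bus_contention(numa))  # 4
print(len(numa))                   # 16 nodes joined by the interconnect
```

The model ignores the interconnect entirely; it only shows how limiting the CPUs per bus reduces the number of competitors on any one bus.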
- What is the difference between NUMA and ccNUMA?
The difference is almost nonexistent at this point. ccNUMA stands for
Cache-Coherent NUMA, but NUMA and ccNUMA have really come to be
synonymous. The applications for non-cache coherent NUMA machines are almost
non-existent, and they are a real pain to program for, so unless specifically
stated otherwise, NUMA actually means ccNUMA.
- What is a node?
One of the problems with describing NUMA is that there are many different ways
to implement this technology. This has led to a plethora of "definitions"
for node. A fairly technically correct and also fairly ugly definition
of a node is: a region of memory in which every byte has the same
distance from each CPU. A more common definition is: a block of memory and the
CPUs, I/O, etc. physically on the same bus as the memory. Some architectures
do not have memory, CPUs, and I/O all on the same physical bus, so the second
definition does not truly hold. In many cases, the less technical definition
should be sufficient, but often the technical definition is more correct.
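The technical definition can be sketched in code: group together all memory whose distance to every CPU is identical. This is only a sketch. It works on named memory regions rather than individual bytes, and the region names and distance values are made up for the example.

```python
from collections import defaultdict


def group_into_nodes(region_distances):
    """Group memory regions that have the same distance to every CPU.

    region_distances maps a region name to a list of distances,
    one entry per CPU (index 0 is CPU 0, and so on).
    """
    nodes = defaultdict(list)
    for region, distances in region_distances.items():
        # Regions with an identical distance vector belong to one node.
        nodes[tuple(distances)].append(region)
    return sorted(nodes.values())


# Hypothetical machine with 4 CPUs and 3 memory regions.
regions = {
    "mem0": [1, 1, 2, 2],
    "mem1": [1, 1, 2, 2],
    "mem2": [2, 2, 1, 1],
}
print(group_into_nodes(regions))  # [['mem0', 'mem1'], ['mem2']]
```

Here mem0 and mem1 form one node under the technical definition, even if they were not physically on the same bus.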
- What is meant by local and remote memory?
The terms local memory and remote memory are typically used in
reference to a currently running process. That said, local memory is
typically defined to be the memory that is on the same node as the CPU
currently running the process. Any memory that does not belong to the
node on which the process is currently running is then, by that definition,
remote memory.
Local and remote memory can also be used in reference to things
other than the currently running process. When in interrupt context, there
technically is no currently executing process, but memory on the node containing
the CPU handling the interrupt is still called local memory. Also, you
could use local and remote memory in terms of a disk. For example,
if there were a disk (attached to node 1) doing a DMA, the memory it is reading
or writing would be called remote if it were located on another node
(e.g., node 0).
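The definitions above reduce to one comparison: is the memory on the same node as whatever is accessing it? A minimal sketch, assuming a made-up mapping of CPUs to nodes:

```python
def memory_kind(accessor_node, memory_node):
    """Classify memory as seen from the node of a CPU or device."""
    return "local" if accessor_node == memory_node else "remote"


# Hypothetical layout: CPUs 0-1 on node 0, CPUs 2-3 on node 1.
cpu_node = {0: 0, 1: 0, 2: 1, 3: 1}
page_node = 0  # the page being accessed lives on node 0

print(memory_kind(cpu_node[1], page_node))  # local
print(memory_kind(cpu_node[2], page_node))  # remote

# The same test applies to a disk on node 1 doing DMA to that page.
disk_node = 1
print(memory_kind(disk_node, page_node))    # remote
```

Because the accessor is just a node number, the same function covers a running process, an interrupt handler, or a device doing DMA.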
- What do you mean by distance?
NUMA-based architectures necessarily introduce a notion of distance
between system components (e.g., CPUs, memory, I/O busses, etc.). The metric used
to determine a distance varies, but hops is a popular metric, along with
latency and bandwidth. These terms all mean essentially the same thing that
they do when used in a networking context (mostly because a NUMA machine is not
all that different from a very tightly coupled cluster). So when used to
describe a node, we could say that a particular range of memory is 2
hops (busses) from CPUs 0..3 and SCSI Controller 0. Thus, CPUs 0..3 and the
SCSI Controller are a part of the same node.
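When hops are the metric, distance can be computed the same way as in a network: a breadth-first search over the links between components. The sketch below assumes a hypothetical machine with four nodes connected in a ring; the topology is invented for the example.

```python
from collections import deque


def hop_distances(links, start):
    """Return the hop count from start to every reachable component."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        here = queue.popleft()
        for neighbor in links.get(here, ()):
            if neighbor not in dist:
                # First visit in a breadth-first search is the shortest path.
                dist[neighbor] = dist[here] + 1
                queue.append(neighbor)
    return dist


# Hypothetical ring interconnect: 0 - 1 - 2 - 3 - 0
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(hop_distances(ring, 0))  # {0: 0, 1: 1, 3: 1, 2: 2}
```

Latency or bandwidth could be used instead by attaching weights to the links and running a weighted shortest-path search.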
- Could you give a real-world analogy of the
NUMA architecture to help understand all these terms?
Imagine that you are baking a cake. You have a group of ingredients
(= memory pages) that you need to complete the recipe (= process). Some of the
ingredients you may have in your cabinet (= local memory), but some of the
ingredients you might not have, and have to ask a neighbor for (= remote memory).
The general idea is to try to have as many of the ingredients in your own
cabinet as possible, since this reduces your time and effort in making the cake.
You also have to remember that your cabinets can only hold a fixed amount of
ingredients (= physical nodal memory). If you try to buy more, but you have no
room to store them, you may have to ask your neighbor to keep them in his/her
cabinet until you need them (= local memory full, so allocate pages remotely).
A bit of a strange example, I'll admit, but I think it works. If you have a
better analogy, I'm all ears! ;)
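The last step of the analogy (use your own cabinet first, then the nearest neighbor's) can be written as a small allocation policy. This is a sketch of the idea only, not how any particular operating system allocates pages; the page counts and distances are assumptions for the example.

```python
def allocate_page(free_pages, local_node, distance):
    """Take a page from the local node, else from the nearest node with room.

    free_pages maps node -> number of free pages (updated in place).
    distance[a][b] is the distance from node a to node b.
    Returns the node that supplied the page.
    """
    candidates = sorted(free_pages, key=lambda node: distance[local_node][node])
    for node in candidates:
        if free_pages[node] > 0:
            free_pages[node] -= 1
            return node
    raise MemoryError("no free pages on any node")


# Hypothetical two-node machine: node 0 has 1 free page, node 1 has 2.
free_pages = {0: 1, 1: 2}
distance = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}

print(allocate_page(free_pages, 0, distance))  # 0 (own cabinet)
print(allocate_page(free_pages, 0, distance))  # 1 (cabinet full, ask neighbor)
```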
- Why should I use NUMA? What are the benefits of NUMA?
The main benefit of NUMA is, as mentioned above, scalability. It is extremely
difficult to scale SMP past 8-12 CPUs. At that number of CPUs, the memory bus
is under heavy contention. NUMA is one way of reducing the number of CPUs
competing for access to a shared memory bus. This is accomplished by having
several memory busses and only having a small number of CPUs on each of those
busses. There are other ways of building massively multiprocessor machines,
but this is a NUMA FAQ, so the discussion of other methods is left to other
sources.
- What are the peculiarities of NUMA?
CPU and/or node caches can result in NUMA effects. For example, the CPUs on a
particular node will have a higher bandwidth and/or a lower latency to access
the memory and CPUs on that same node. Due to this, you can see things like
lock starvation under high contention. This is because if CPU x in the node
requests a lock already held by another CPU y in the node, its request will
tend to beat out a request from a remote CPU z.
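The starvation effect can be shown with a deliberately exaggerated toy model. It assumes that a request from the holder's own node always arrives before a remote one, whereas real hardware only tends to favor local requests. The CPU names follow the x, y, z of the text; everything else is an assumption for the example.

```python
def simulate_lock(cpu_node, first_holder, rounds):
    """Count lock acquisitions when every CPU wants the lock all the time."""
    acquisitions = {cpu: 0 for cpu in cpu_node}
    last_won = {cpu: -1 for cpu in cpu_node}
    holder = first_holder
    for step in range(rounds):
        acquisitions[holder] += 1
        last_won[holder] = step
        waiters = [cpu for cpu in cpu_node if cpu != holder]
        # Requests from the holder's node win over remote requests;
        # among equals, the CPU that has waited longest goes first.
        holder = min(
            waiters,
            key=lambda cpu: (cpu_node[cpu] != cpu_node[holder], last_won[cpu]),
        )
    return acquisitions


# CPUs x and y share node 0; CPU z is alone on node 1.
cpu_node = {"x": 0, "y": 0, "z": 1}
print(simulate_lock(cpu_node, first_holder="y", rounds=6))
# {'x': 3, 'y': 3, 'z': 0}
```

In this model the lock bounces between x and y forever and z never acquires it, which is the extreme form of the starvation described above.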
- What are some alternatives to NUMA?
Splitting memory up and (possibly arbitrarily) assigning it to groups of
CPUs can give some performance benefits similar to actual NUMA. A setup like
this would be like a regular NUMA machine where the line between local
and remote memory is blurred, since all the memory is actually on the
same bus. The PowerPC Regatta system is an example of this.
You can achieve some NUMA-like performance by using clusters as well. A cluster
is very similar to a NUMA machine, where each individual machine in the cluster
becomes a node in our virtual NUMA machine. The only real difference is
the nodal latency. In a clustered environment, the latency and bandwidth on the
internodal links are likely to be much worse.
- Could you give a brief description of the main
NUMA architecture implementations?
Sure! The main types are IBM NUMA-Q, Compaq Wildfire, and SGI MIPS64. Click
here for descriptions and diagrams of
the above system types, and also a standard SMP system for comparison.