Anatomy of a CPU

The CPU is often referred to as the brain of a computer and, like the human brain, it consists of several parts that work together to process information. There are parts that take in information, parts that store information, parts that process information, parts that help output information, and more. In today's explainer, we'll go over the key elements that make up a CPU and how they all work together to power your computer.

You should know that this article is part of our Anatomy series, which dissects the technology behind PC components. We also have a dedicated series on CPU design that looks more closely at the CPU design process and how CPUs work internally; it's highly recommended technical reading. This anatomy article will revisit some of the fundamentals from the CPU series, but at a higher level and with additional content.

Compared to previous articles in our anatomy series, this will inevitably be more abstract. If you look into a power supply, you can clearly see the capacitors, transformers and other components. This is simply not possible with a modern CPU, since everything is so small and Intel and AMD do not publicly disclose their designs. Most CPU designs are proprietary, so the topics covered in this article represent the general functions of all CPUs.

TechSpot's Anatomy of Computer Hardware series

You may have a desktop PC at work, at school, or at home. You can use one to prepare tax returns or play the latest games. You may even want to build and optimize computers. But how well do you know the components that make up a PC?

So let's dive in. Every digital system needs some form of central processing unit. Fundamentally, a programmer writes code to get a job done, and the CPU executes that code to produce the intended result. The CPU is also connected to other parts of a system, such as memory and I/O, that keep it supplied with the relevant data, but we won't cover those systems today.

The CPU blueprint: an ISA

When you analyze any CPU, the first thing you'll encounter is the Instruction Set Architecture (ISA). This is the blueprint for how the CPU operates and how all of its internal systems interact with each other. Just as there are many breeds of dog within the same species, there are many different ISAs a CPU can be built on. The two most common types are x86 (in desktops and laptops) and ARM (in embedded and mobile devices).

There are some others like MIPS, RISC-V, and PowerPC that have more niche applications. An ISA specifies which instructions the CPU can process, how it interacts with memory and caches, how the work is divided among the many stages of processing, and much more.

To cover the main parts of a CPU, we'll follow the path an instruction takes as it is executed. Different types of instructions may follow different paths and use different parts of a CPU, but we'll generalize here to cover the biggest pieces. We'll start with the most basic design of a single-core processor and gradually add complexity as we work toward a more modern design.

Control unit and data path

The parts of a CPU can be divided into two: the control unit and the data path. Imagine a train. The engine is what moves the train, but the conductor pulls the levers behind the scenes and controls the different aspects of the engine. A CPU is the same way.

The data path is like the engine and, as the name suggests, is the path along which data flows as it is processed. The data path receives inputs, processes them, and sends them to the right place when they are done. The control unit, like the conductor, tells the data path how to operate. Depending on the instruction, the data path routes signals to different components, turns different parts of the data path on and off, and monitors the status of the CPU.

The instruction cycle – fetch

To begin, our CPU has to figure out which instruction to execute next and move it from memory into the CPU. Instructions are produced by a compiler and are specific to the CPU's ISA. ISAs share the most common types of instructions, such as load, store, add, and subtract, but there are many additional, specialized instruction types unique to each particular ISA. The control unit knows which signals need to be routed where for each type of instruction.

For example, when you run an EXE file on Windows, the code for that program is moved into memory and the CPU is told the address where the first instruction starts. The CPU always maintains an internal register that holds the memory location of the next instruction to be executed. This is called the program counter (PC).

Once it knows where to start, the first step of the instruction cycle is getting that instruction. This moves the instruction from memory into the CPU's instruction register and is known as the fetch stage. Realistically, the instruction is probably already in the CPU's cache, but we'll cover those details in a moment.
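To make the fetch stage concrete, here's a minimal sketch in C of a toy CPU's fetch step. The memory array, word-addressed program counter, and 32-bit instruction width are simplifications invented for this example, not any real ISA:

```c
#include <stdint.h>

#define MEM_WORDS 1024

uint32_t memory[MEM_WORDS];   // program and data live here (simplified)
uint32_t pc = 0;              // program counter: location of the next instruction
uint32_t ir = 0;              // instruction register: holds the fetched instruction

// Fetch stage: copy the next instruction from memory into the CPU
// and advance the program counter to point at the one after it.
void fetch(void) {
    ir = memory[pc];   // in real hardware this request usually hits the L1 cache
    pc += 1;           // word-addressed here; a byte-addressed ISA would add 4
}
```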

The instruction cycle – decode

Once the CPU has an instruction, it needs to figure out exactly what type of instruction it is. This is called the decode stage. Each instruction has a particular set of bits called the opcode that tells the CPU how to interpret it. This is similar to how different file extensions tell a computer how to interpret a file. For example, .jpg and .png are both image files, but they organize data in different ways, so the computer needs to know the type to interpret them correctly.

Depending on how complex the ISA is, the instruction-decode portion of the CPU can itself become complex. An ISA like RISC-V may have only a few dozen instructions, while x86 has thousands. On a typical Intel x86 CPU, the decode process is among the most challenging parts and takes up a lot of space. The most common types of instructions a CPU decodes are memory, arithmetic, and branch instructions.
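To show what decoding actually involves, here's a sketch in C that pulls apart the fields of a 32-bit RISC-V R-type instruction. RISC-V is used here because its encoding is public and simple; this is illustrative, not a full decoder:

```c
#include <stdint.h>

// Fixed fields of a 32-bit RISC-V R-type instruction. Compare this with
// x86, whose variable-length instructions make decoding far more involved.
typedef struct {
    uint32_t opcode;  // bits [6:0]   - major instruction class
    uint32_t rd;      // bits [11:7]  - destination register
    uint32_t funct3;  // bits [14:12] - operation sub-type
    uint32_t rs1;     // bits [19:15] - first source register
    uint32_t rs2;     // bits [24:20] - second source register
    uint32_t funct7;  // bits [31:25] - further qualifies the operation
} DecodedInsn;

DecodedInsn decode(uint32_t insn) {
    DecodedInsn d;
    d.opcode = insn         & 0x7F;
    d.rd     = (insn >> 7)  & 0x1F;
    d.funct3 = (insn >> 12) & 0x07;
    d.rs1    = (insn >> 15) & 0x1F;
    d.rs2    = (insn >> 20) & 0x1F;
    d.funct7 = (insn >> 25) & 0x7F;
    return d;
}
```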

3 main instruction types

A memory instruction might be something like "read the value from memory address 1234 into value A" or "write value B to memory address 5678". An arithmetic instruction might be something like "add value A to value B and store the result into value C". A branch instruction might be something like "execute this code if value C is positive, or execute that code if value C is negative". A typical program might chain these together into something like "add the value at memory address 1234 to the value at memory address 5678 and store it at memory address 4321 if the result is positive, or at address 8765 if the result is negative".
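Written out in C, that chained example looks roughly like this. The array stands in for RAM, and the addresses are the made-up ones from the text:

```c
#include <stdint.h>

int32_t ram[10000];  // stand-in for main memory

void example(void) {
    int32_t a = ram[1234];   // memory instruction: load A
    int32_t b = ram[5678];   // memory instruction: load B
    int32_t c = a + b;       // arithmetic instruction: add
    if (c > 0)               // branch instruction: test the result
        ram[4321] = c;       // store to one address if positive...
    else
        ram[8765] = c;       // ...or the other if negative (or zero)
}
```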

Before we start executing the instruction just decoded, we have to pause for a moment to talk about registers.

A CPU has a few very small but very fast pieces of memory called registers. On a 64-bit CPU, each of these holds 64 bits, and there may be only a few dozen per core. They are used to store values that are currently in use and can be thought of as something like an L0 cache. In the instruction examples above, the values A, B, and C would all be stored in registers.

The ALU

Back to the execute stage. This differs for the three types of instructions we covered above, so we'll take each one individually.

Starting with arithmetic instructions, since they are the easiest to understand: these instructions are fed into an arithmetic logic unit (ALU) for processing. An ALU is a circuit that typically takes two inputs along with a control signal and outputs a result.

Imagine a basic calculator like the one you used in middle school. To perform an operation, you enter the two input numbers and the type of operation you want to perform. The calculator does the computation and outputs the result. In the case of our CPU's ALU, the type of operation is determined by the instruction's opcode, and the control unit sends that to the ALU. Beyond basic arithmetic, ALUs can also perform bitwise operations like AND, OR, NOT, and XOR. The ALU also outputs some status information for the control unit about the calculation it has just completed. This can include things like whether the result was positive, negative, zero, or overflowed.
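Here's a minimal sketch in C of the ALU behavior just described. The operation names and the flag set are invented for this example; a real ALU implements this in combinational logic, not software:

```c
#include <stdint.h>
#include <stdbool.h>

// Control signals a control unit might send to select the operation.
typedef enum { ALU_ADD, ALU_SUB, ALU_AND, ALU_OR, ALU_XOR, ALU_NOT } AluOp;

typedef struct {
    int32_t result;
    bool zero;       // status flags reported back to the control unit
    bool negative;
    bool overflow;
} AluOut;

AluOut alu(int32_t a, int32_t b, AluOp op) {
    AluOut out = {0};
    switch (op) {
        case ALU_ADD:
            out.result = (int32_t)((uint32_t)a + (uint32_t)b);
            // signed overflow: both inputs share a sign the result lacks
            out.overflow = ((a ^ out.result) & (b ^ out.result)) < 0;
            break;
        case ALU_SUB:
            out.result = (int32_t)((uint32_t)a - (uint32_t)b);
            out.overflow = ((a ^ b) & (a ^ out.result)) < 0;
            break;
        case ALU_AND: out.result = a & b; break;
        case ALU_OR:  out.result = a | b; break;
        case ALU_XOR: out.result = a ^ b; break;
        case ALU_NOT: out.result = ~a;    break;
    }
    out.zero     = (out.result == 0);
    out.negative = (out.result < 0);
    return out;
}
```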

An ALU is most commonly associated with arithmetic operations, but it may also be used for memory or branch instructions. For example, the CPU may need to calculate a memory address given as the result of a previous arithmetic operation, or compute the offset to add to the program counter for a branch instruction: something like "if the previous result was negative, jump 20 instructions ahead".

Memory instructions and the memory hierarchy

For memory instructions, we need to understand a concept called the memory hierarchy. This describes the relationship between caches, RAM, and main storage. When a CPU receives a memory instruction for a piece of data it doesn't already have locally in its registers, it goes down the memory hierarchy until it finds it. Most modern CPUs contain three levels of cache: L1, L2, and L3. The first place the CPU checks is the L1 cache, the smallest and fastest of the three. The L1 cache is typically split into a portion for data and a portion for instructions. Remember, instructions need to be fetched from memory just like data.

A typical L1 cache may be a few hundred KB. If the CPU can't find what it's looking for in the L1 cache, it checks the L2 cache, which may be on the order of a few MB. The next step is the L3 cache, which may be a few tens of MB. If the CPU can't find the data it needs in the L3 cache, it goes out to RAM, and finally to main storage. With each step, the available space increases by roughly an order of magnitude, but so does the latency.

Once the CPU finds the data, it is brought up the hierarchy so the CPU can access it quickly if it is needed again. There are a lot of steps here, but this ensures the CPU has fast access to the data it needs. For example, the CPU can read from its internal registers in just a cycle or two, from L1 in a handful of cycles, from L2 in about ten cycles, and from L3 in a few dozen. If it has to go all the way out to RAM or main storage, it may take tens of thousands or even millions of cycles. Depending on the system, each core will likely have its own private L1 cache, may share an L2 with one other core, and share an L3 among groups of four or more cores. We'll talk more about multi-core CPUs later in this article.
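To summarize the hierarchy, here's a small C program that prints the rough capacities and latencies described above. The figures are typical ballpark values for illustration; the exact numbers vary widely between CPU designs:

```c
#include <stdio.h>

// Rough, typical figures matching the description above; real values
// differ considerably between CPU designs and generations.
static const struct {
    const char *level;
    const char *capacity;
    const char *latency;
} hierarchy[] = {
    { "registers",     "a few dozen x 64 bits",  "1-2 cycles"                   },
    { "L1 cache",      "up to a few hundred KB", "a handful of cycles"          },
    { "L2 cache",      "a few MB",               "about ten cycles"             },
    { "L3 cache",      "tens of MB",             "a few dozen cycles"           },
    { "RAM / storage", "GB and beyond",          "tens of thousands of cycles+" },
};

int main(void) {
    for (unsigned i = 0; i < sizeof hierarchy / sizeof hierarchy[0]; i++)
        printf("%-14s %-24s %s\n", hierarchy[i].level,
               hierarchy[i].capacity, hierarchy[i].latency);
    return 0;
}
```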

Branch and jump instructions

The last of the three main instruction types is the branch instruction. Modern programs jump around constantly, and a CPU rarely executes more than a dozen contiguous instructions without a branch. Branch instructions come from programming constructs like if statements, for loops, and return statements. These are all used to interrupt program execution and switch to a different part of the code. There are also jump instructions, which are branch instructions that are always taken.

Conditional branches are especially tricky for a CPU because it may be executing multiple instructions at once and may not determine the outcome of a branch until after it has started on subsequent instructions.

To understand why this is a problem, we need to take another detour and talk about pipelining. Each step in the instruction cycle may take several cycles to complete. That means the ALU would otherwise sit idle while an instruction is being fetched. To maximize a CPU's efficiency, we split the instruction cycle into stages and overlap them, a process called pipelining.

The classic way to understand this is an analogy to doing laundry. You have two loads to wash, and washing and drying each take one hour. You could put the first load in the washer, move it to the dryer when it's done, and only then start the second load. That would take four hours. But if you split the work and start washing the second load while the first one dries, you can finish both loads in three hours. The savings scale with the number of loads. It still takes two hours to finish any single load, but the overlap raises the total throughput from 0.5 loads/hour to 0.67, approaching 1 load/hour as the number of loads grows.

CPUs use the same method to improve instruction throughput. A modern ARM or x86 CPU may have 20+ pipeline stages, which means the core can have 20+ different instructions in flight at once. Each design is unique, but a sample breakdown might be 4 cycles to fetch, 6 cycles to decode, 3 cycles to execute, and 7 cycles to write the results back to memory.
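Here's a small C program that shows the payoff, under the simplifying assumption of a 20-stage pipeline, one instruction entering per cycle, and no stalls. The numbers are illustrative, not measurements of any real CPU:

```c
#include <stdio.h>

#define STAGES 20  // echoes the "20+ stages" mentioned above

int main(void) {
    // Compare cycles needed to run N instructions without and with pipelining.
    for (long n = 1; n <= 1000000; n *= 100) {
        long unpipelined = n * STAGES;       // each insn runs start-to-finish alone
        long pipelined   = STAGES + (n - 1); // fill the pipe once, then 1 per cycle
        printf("%8ld insns: %9ld vs %8ld cycles (%.1fx faster)\n",
               n, unpipelined, pipelined, (double)unpipelined / pipelined);
    }
    return 0;
}
```

A single instruction still takes the full 20 cycles, just as one load of laundry still takes two hours, but throughput approaches one instruction per cycle.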

Coming back to branches, hopefully you can see the problem. If we don't know that an instruction is a branch until cycle 10, we'll have already started executing 9 new instructions that may be invalid if the branch is taken. To get around this issue, CPUs have very complex structures called branch predictors. They use concepts similar to machine learning to guess whether a branch will be taken or not. The intricacies of branch predictors are well beyond the scope of this article, but at a basic level, they track the status of previous branches to learn whether or not an upcoming branch is likely to be taken. Modern branch predictors can be 95% accurate or better.
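The classic starting point for branch prediction is a two-bit saturating counter per branch. Here's a minimal sketch in C of one such counter; real predictors keep tables of these, indexed by branch address, and layer history tracking on top:

```c
#include <stdbool.h>

// Two-bit saturating counter. States: 0 = strongly not-taken,
// 1 = weakly not-taken, 2 = weakly taken, 3 = strongly taken.
static unsigned counter = 2;  // start at "weakly taken"

// Prediction consulted during fetch, before the branch resolves.
bool predict_taken(void) {
    return counter >= 2;
}

// Once the branch resolves, train the counter with the actual outcome.
// Saturation means a single surprise doesn't flip a strong prediction.
void train(bool was_taken) {
    if (was_taken  && counter < 3) counter++;
    if (!was_taken && counter > 0) counter--;
}
```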

As soon as the outcome of the branch is known for sure (it has finished that stage of the pipeline), the program counter is updated and the CPU goes on to execute the next instruction. If the branch was mispredicted, the CPU throws out all the instructions after the branch that it incorrectly started executing and starts again from the correct place.

Out-of-order execution

Now that we know how the three most common types of instructions are executed, let's look at some of a CPU's more advanced features. Virtually all modern processors don't actually execute instructions in the order they are received. A paradigm called out-of-order execution is used to minimize idle time while waiting for other instructions to finish.

If a CPU knows that an upcoming instruction requires data that won't be ready in time, it can switch up the instruction order and bring in an independent instruction from later in the program while it waits. This instruction reordering is an extremely powerful tool, but it's far from the only trick CPUs use.
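Here's a toy illustration in C of the core idea: rather than stalling on the oldest instruction, issue any queued instruction whose inputs are ready. The instruction strings and ready flags are contrived for the example; real hardware tracks readiness per source register:

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *text;
    bool inputs_ready;  // would the instruction's operands be available now?
} Insn;

int main(void) {
    Insn queue[] = {
        { "load A <- [0x1234]", true  },  // issued; result takes a while...
        { "add  C <- A + B",    false },  // ...so this one can't issue yet
        { "sub  F <- D - E",    true  },  // independent: issue it out of order
    };
    // One scheduling pass: issue everything whose inputs are ready.
    // A real scheduler repeats this every cycle as results come back
    // and wake up dependent instructions.
    for (int i = 0; i < 3; i++) {
        if (queue[i].inputs_ready)
            printf("issuing: %s\n", queue[i].text);
        else
            printf("waiting: %s\n", queue[i].text);
    }
    return 0;
}
```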

Another performance-boosting feature is prefetching. If you were to time how long it takes a random instruction to complete from start to finish, you'd find that memory access takes up most of the time. A prefetcher is a unit in the CPU that tries to look ahead at future instructions and the data they will need. If it sees one coming that needs data the CPU doesn't have cached, it reaches out to RAM and fetches that data into the cache. Hence the name: pre-fetch.
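The prefetcher described here is hardware and works automatically, but compilers expose a software analogue of the same idea. As a sketch, GCC and Clang provide the `__builtin_prefetch` hint, which asks the CPU to pull a cache line in before it's needed (it's only a hint, and the distance of 64 elements here is an arbitrary choice for illustration):

```c
// Sum an array while hinting that data further ahead will be needed soon.
// Hardware prefetchers do the equivalent on their own when they spot a
// predictable access pattern like this sequential walk.
long sum_with_prefetch(const int *data, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&data[i + 64]);  // hint: we'll want this shortly
        sum += data[i];
    }
    return sum;
}
```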

Accelerators and the future

Another significant feature now appearing in CPUs is task-specific accelerators. These are circuits whose entire job is to perform one small task as fast as possible. This can include encryption, media encoding, or machine learning.

The CPU can do these things on its own, but it is vastly more efficient to have a dedicated unit for them. A good example of this is integrated graphics versus a dedicated GPU. The CPU can certainly perform the computations required for graphics processing, but having a dedicated unit for them offers orders of magnitude better performance. With the rise of accelerators, the actual core of a CPU may take up only a small fraction of the chip.

The picture below shows an Intel CPU from several years ago. Most of the die area is occupied by cores and cache. The second picture below shows a much newer AMD chip, where most of the space is occupied by components other than the cores.

Going multicore

The final feature to cover is how we can connect a bunch of individual CPUs together to form a multicore CPU. It's not as simple as just putting in multiple copies of the single-core design we talked about earlier. Just as there's no easy way to turn a single-threaded program into a multi-threaded one, the same concept applies to hardware. The problems come from the dependencies between cores.

For a 4-core design, for example, the CPU needs to be able to issue instructions four times as fast. It also needs four separate interfaces to memory. With multiple entities potentially working on the same pieces of data, issues like coherence and consistency must be resolved. If two cores were processing instructions that used the same data, how would you know who had the correct value? What if one core modified the data but it didn't reach the other core in time for its execution? Since they have separate caches that may store overlapping data, complex algorithms and controllers must be used to eliminate these conflicts.
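The same "who has the correct value?" problem shows up directly in software. Here's a short C program, assuming C11 atomics and pthreads, in which two threads (think: two cores) increment a shared counter. The plain increments race and lose updates; the atomic version stays correct because the read-modify-write is made indivisible:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static long plain = 0;          // racy: updates from two threads can be lost
static atomic_long safe = 0;    // atomic: every increment is indivisible

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        plain++;                     // load, add, store: can interleave badly
        atomic_fetch_add(&safe, 1);  // hardware-enforced atomic update
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("plain = %ld (often less than 2000000), atomic = %ld\n",
           plain, atomic_load(&safe));
    return 0;
}
```

Cache coherence hardware keeps each core's view of memory consistent, but as the racy counter shows, software still has to coordinate who updates shared data and when.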

Proper branch prediction also becomes extremely important as the core count of a CPU grows. The more cores executing instructions at once, the higher the chance that one of them is processing a branch instruction, which means the instruction flow may change at any time.

Typically, separate cores process instruction streams from different threads. This helps reduce the dependencies between cores. That's why, if you open Task Manager, you'll often see one core working hard while the others barely work at all. Many programs aren't designed for multithreading. There may also be cases where it's more efficient to have a single core do the work than to pay the overhead penalty of trying to split it up.

Physical design

Most of this article has focused on the architectural design of a CPU, as that's where most of the complexity lies. However, all of this has to be created in the real world, which adds to the complexity.

A clock signal is used to synchronize every component throughout the processor. Modern processors typically run between 3.0 GHz and 5.0 GHz, and that hasn't seemed to change in the past decade. During each of these cycles, the billions of transistors inside a chip switch on and off.

Clocks are crucial to ensuring that, as the pipeline advances, every value shows up at the right time in each stage. The clock determines how many instructions a CPU can process per second. Increasing its frequency through overclocking makes the chip faster, but also increases power consumption and heat output.

Heat is a CPU's worst enemy. As digital electronics heat up, the microscopic transistors can begin to degrade, which can damage a chip if the heat isn't removed. This is why all CPUs come with heat spreaders. The actual silicon die of a CPU may take up only 20% of a physical device's surface area. Increasing the footprint lets the heat spread more evenly to a heatsink. It also allows more pins for connecting to external components.

Modern CPUs can have a thousand or more input and output pins on the back, though a mobile chip may only have a few hundred since most of the computing parts are within the chip. Regardless of the design, around half of the pins are devoted to power delivery and the rest to data communications, including communication with the RAM, chipset, storage, PCIe devices, and more. Since high-performance CPUs can draw a hundred or more amps at full load, they need hundreds of pins to spread out the current draw evenly. The pins are usually gold-plated to improve electrical conductivity. Different manufacturers use different arrangements of pins across their many product lines.

Putting it all together with an example

To wrap up, we'll take a quick look at the design of an Intel Core 2 CPU. This is from 2006, so some parts may be outdated, but details on newer designs aren't publicly available.

Starting at the top, we have the instruction cache and the ITLB. The Translation Lookaside Buffer (TLB) helps the CPU know where to go in memory to find the required instruction. Those instructions are stored in the L1 instruction cache and then sent to a pre-decoder. The x86 architecture is extremely complex and dense, so decoding takes many steps. Meanwhile, the branch predictor and the prefetcher are both looking ahead for potential issues caused by incoming instructions.

From there, the instructions are sent to an instruction queue. Recall how the out-of-order design lets the CPU execute whichever instruction is most ready; this queue holds the current instructions the CPU is considering. Once the CPU determines which instruction would be the best to execute, it is further decoded into micro-operations (micro-ops). While an instruction may describe a complex task for the CPU, micro-ops are granular tasks the CPU can interpret more easily.
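As an illustration of the instruction-to-micro-op split, here's how a single x86 read-modify-write instruction might break down. The exact micro-op encodings are proprietary to Intel and AMD, so this is schematic only:

```c
// One complex x86 instruction:
//
//     add DWORD PTR [rbx], eax     ; memory value += register
//
// might be split by the decoder into three simpler micro-ops:
//
//     uop1: load  tmp <- [rbx]      ; read the memory operand
//     uop2: add   tmp <- tmp + eax  ; the actual arithmetic, in an ALU
//     uop3: store [rbx] <- tmp      ; write the result back
//
// The C equivalent of that single instruction, for reference:
void add_to_memory(int *p, int x) {
    *p += x;   // compiles to a load, an add, and a store on most targets
}
```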

These instructions then go into the register alias table, the reorder buffer (ROB), and the reservation station. The exact functions of these three components are somewhat complex (think graduate-level university course), but they are used in the out-of-order process to help manage dependencies between instructions.

A single "core" actually has many ALUs and memory ports. Incoming operations are placed in the reservation station until an ALU or memory port is available for use. Once the required component is available, the instruction is processed using the L1 data cache. The output results are saved and the CPU can now start with the next command. That's all!

While this article wasn't meant to be a definitive guide to exactly how every CPU works, it should give you a good sense of their inner workings and complexity. Frankly, nobody outside of AMD and Intel knows in full how their CPUs work. Each section of this article represents an entire field of research and development, so the information presented here barely scratches the surface.

Continue reading

To learn more about how the various components covered in this article are structured, read Part 2 of our CPU design series. To learn more about how a CPU is physically built at the transistor and silicon levels, read Part 3.
