Will history repeat itself? Intel wants to make a name for itself in the discrete GPU space with its upcoming Xe-HP GPU series. We look back at Project Larrabee – the last time Intel tried to build a graphics card – to understand how things could play out.
AMD has just taken the CPU performance crown, the new consoles look like minimalist PCs, and Intel is working on a flagship GPU, based on its Xe architecture, to compete with Ampere and Big Navi. 2020 was a bad year for many reasons, but an Intel graphics card on top of everything else? You'd be forgiven for writing it off as another 2020 curiosity, but that's where things get interesting.
Raja Koduri's latest project is not Intel's first attempt at a desktop-class discrete GPU. Fourteen years ago, the CPU manufacturer began work on Project Larrabee, a CPU/GPU hybrid that was meant to revolutionize graphics processing. Larrabee scared the competition to the point that AMD and Nvidia briefly considered a merger.
What happened next? Three years of development, an infamous IDF demo, and then … nothing. Project Larrabee shuffled quietly off the stage, and some of its intellectual property was salvaged for Xeon Phi, a many-core accelerator for HPC and enterprise workloads. Intel went back to integrating small, low-power iGPUs into its processors.
What happened to Larrabee? What went wrong? And why haven't we seen a competitive Intel GPU in over a decade? With discrete Xe-HP GPUs due next year, today we look back at the last time Intel tried to build a graphics card.
Intel Larrabee: What Was That?
While Intel started working on Project Larrabee sometime in 2006, the first official tidbit came at the Intel Developer Forum (IDF) in Beijing in 2007. Pat Gelsinger, now CEO of VMware but then a Senior VP at Intel, had this to say:
“Intel has begun planning products based on a highly parallel, IA-based programmable architecture codenamed ‘Larrabee.’ It will be easily programmable using many existing software tools, and designed to scale to trillions of floating point operations per second (teraflops) of performance. The Larrabee architecture will include enhancements to accelerate applications such as scientific computing, recognition, mining, synthesis, visualization, financial analytics and health applications.”
A highly parallel, programmable architecture that can scale to teraflops of performance: Gelsinger's statement would have been a perfect description of any Nvidia or AMD GPU on the market, except for one key point. Larrabee was programmable and IA (x86)-based.
This meant that its cores, unlike the fixed-function shader units in the GPUs of the day, were functionally similar to general-purpose Intel CPU cores. Larrabee rendered graphics, but it wasn't a GPU in the traditional sense. To understand exactly how Larrabee worked, it helps to first look at how traditional GPUs work.
A GPU, a CPU, or something else? How Larrabee worked
Nvidia and AMD GPUs are massively parallel processors with hundreds (or thousands) of very simple cores built around fixed-function logic. These cores are limited in what they can do, but massive parallelism makes them extremely fast at specialized graphics workloads.

GPUs are very good at rendering graphics. However, the fixed-function nature of their shader cores made it difficult to run non-gaming workloads on them. It also meant that innovation in game technology was often held back by the capabilities of the graphics hardware.
New DirectX or OpenGL graphics features often required entirely new hardware designs. Tessellation, for example, is a DirectX 11 feature that dynamically increases the geometric complexity of on-screen objects. AMD and Nvidia had to implement fixed-function tessellation hardware in their Terascale 2 and Fermi cards to support this new feature.
In contrast, Larrabee was built from a large number of simplified x86-compatible CPU cores, loosely based on the Pentium MMX architecture. Unlike GPUs, CPUs are built around general-purpose, programmable logic: they can handle almost any type of workload. This was meant to be Larrabee's greatest asset. A general-purpose, programmable design like Larrabee could perform tessellation (or any other graphics workload) in software. Larrabee lacked fixed-function hardware for rasterization, interpolation and pixel blending, which on paper meant a performance penalty. However, Larrabee's raw throughput, the flexibility of its x86 cores, and Intel's promised game-specific drivers were supposed to make up for it.
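To make that flexibility concrete, here is a toy sketch of tessellation done entirely in software: each triangle is recursively split at its edge midpoints, the kind of work that GPUs of the era needed dedicated silicon for. This is purely illustrative pseudocode in Python, not Larrabee code, and every name in it is hypothetical.

```python
def midpoint(a, b):
    # Midpoint of two 3D vertices (given as (x, y, z) tuples).
    return tuple((p + q) / 2 for p, q in zip(a, b))

def tessellate(tri, levels):
    """Recursively split a triangle into four at its edge midpoints.

    Returns a list of triangles; each level multiplies the count by 4.
    """
    if levels == 0:
        return [tri]
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    out = []
    for sub in ((a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)):
        out.extend(tessellate(sub, levels - 1))
    return out

tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(len(tessellate(tri, 3)))  # 4**3 = 64 triangles
```

On general-purpose cores, adding a new pipeline stage like this is just more code; on a fixed-function GPU of 2007, it was a new chip.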
Developers would no longer be limited by what the graphics hardware could or couldn't do, opening the door to all kinds of innovation. In theory at least, Larrabee offered the flexibility of a multi-core CPU with teraflop-level raw throughput to match the top GPUs of 2007. In practice, however, Larrabee failed to deliver. Higher core-count configurations scaled poorly, and on traditional raster workloads Larrabee struggled to perform consistently, let alone match AMD and Nvidia GPUs. What exactly drove Intel into this technological dead end?
Cost and Philosophy: The Reasons for Larrabee
Companies like Intel don't invest billions of dollars in new paradigms without a long-term strategic goal in mind. In the mid-2000s, GPUs were gradually becoming more flexible. With the Xbox 360's Xenos GPU, ATI introduced a unified shader architecture. Terascale and Tesla (the Nvidia GeForce 200 series) then brought unified shaders to the PC. GPUs were getting better at general compute workloads, and this worried Intel and other chipmakers. Would GPUs eventually make CPUs obsolete? What could be done to stem the tide? Many chipmakers settled on the same answer: many simplified CPU cores on a single chip.
The PlayStation 3's Cell processor is the best-known result of this thinking. Sony's engineers initially believed that the eight-core Cell would be powerful enough to handle both CPU and graphics workloads on its own. Sony recognized its mistake late in the PlayStation 3's development cycle and fell back on the RSX, a GPU based on Nvidia's GeForce 7800 GTX. Most PlayStation 3 games leaned heavily on the RSX for graphics workloads, which often resulted in poorer performance and image quality compared to the Xbox 360.
Cell's SPUs (Synergistic Processing Units) were used by some first-party studios to help render graphics – especially in Naughty Dog titles like The Last of Us and Uncharted 3. Cell certainly helped, but it clearly wasn't fast enough to handle graphics rendering on its own.
Intel's thinking with Larrabee was much the same. Unlike Cell, however, Larrabee could scale to 24- or 32-core designs. Intel believed that this sheer amount of processing grunt would let Larrabee compete effectively with fixed-function GPU hardware.
Intel's graphics philosophy, however, wasn't the deciding factor. Cost was. Designing a GPU from scratch is extremely complicated, time-consuming, and expensive. A brand-new GPU would have taken years to develop and cost Intel billions of dollars. Worse still, there was no guarantee that it would surpass, or even match, upcoming Nvidia and AMD GPU designs.
Larrabee, in contrast, shrank Intel's existing Pentium MMX architecture down to the 45 nm process node. By reusing a known hardware design, Intel could (in theory) bring a working Larrabee design to market faster, and could set and track performance expectations more easily. Larrabee eventually burned a multi-billion dollar hole in Intel's pockets, but ironically, cost efficiency was one of its original selling points. Larrabee looked revolutionary on paper. So why did it never launch?
What went wrong with Larrabee?
Larrabee was a great idea, but execution matters as much as innovation, and this is where Intel failed. During its four-year life cycle, Larrabee was plagued by miscommunication, a rushed development schedule, and fundamental architectural problems. In retrospect, there were red flags from the start.
In Larrabee's first announcement, gaming wasn't even mentioned as a use case. Almost immediately afterwards, however, Intel began talking up Larrabee's gaming capabilities and driving expectations sky-high. In 2007, Intel was many times bigger than Nvidia and AMD combined. When Intel claimed that Larrabee would be faster than existing GPUs, people took it at face value, given the company's talent pool and resources.
Gaming expectations for Larrabee were raised even further when Intel bought Offset Software, months after acquiring the Havok physics engine. The studio's first game, Project Offset, was unveiled in 2007 with visuals well ahead of its time. Unfortunately, nothing came of the Offset Software acquisition: Intel shut the studio down in 2010, around the time Larrabee was put on hold.
Intel's own gaming performance estimates ran counter to the hype. A 1 GHz Larrabee design with 8 to 25 cores could run F.E.A.R. at 1600×1200 and 60 FPS. That wasn't impressive even by 2007 standards. By quoting Larrabee's performance at 1 GHz instead of the expected shipping frequency, Intel undersold the part's gaming capabilities. In a PC Pro article, an Nvidia engineer mocked that a 2010 Larrabee card would deliver 2006 GPU performance.
Who was Larrabee for? What was it good at? How would it fare against the competition? Thanks to Intel's muddled messaging, none of these questions ever got a clear answer. Communication wasn't the only problem, however. During development, Intel's engineers discovered that Larrabee had serious architectural and design problems.
The GPU that couldn't be scaled
Each Larrabee core was based on a stripped-down, tweaked version of the Pentium MMX architecture. Per-core performance was a fraction of that of Intel's Core 2 parts, but Larrabee was meant to compensate by scaling to 32 or more cores. It was these large Larrabee implementations – with 24 and 32 cores – that Intel compared against Nvidia and AMD GPUs.
The problem was getting those cores to talk to each other and work together efficiently. Intel chose a ring bus to connect the Larrabee cores to each other and to the GDDR5 memory controller: a dual 512-bit interconnect with over 1 TB/s of bandwidth. Thanks to cache coherence and an overabundance of bandwidth, Larrabee scaled relatively well … until you hit 16 cores.
One of the main disadvantages of a ring topology is that data must hop through intermediate nodes on its way to its destination. The more cores you have, the longer the average trip. Caching can mitigate the problem, but only to a point. Intel tried to solve this by using multiple ring buses in its larger Larrabee parts, each serving 8 to 16 cores. Unfortunately, this made the design more complex and did little to solve the scaling problems.
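A back-of-the-envelope model shows why. On a bidirectional ring, a message travels the shorter way around, so the average hop count grows roughly linearly with the core count. The sketch below is illustrative only; it ignores caching, contention and Larrabee's actual routing details.

```python
def avg_hops(n_cores: int) -> float:
    # The distance between two cores on a bidirectional ring is the
    # shorter of the clockwise and counter-clockwise paths.
    distances = [min(d, n_cores - d) for d in range(1, n_cores)]
    return sum(distances) / len(distances)

for n in (8, 16, 32, 48):
    print(f"{n:2d} cores: {avg_hops(n):.2f} average hops")
```

Even in this idealized model, doubling the core count roughly doubles the average hop latency, which is consistent with the scaling wall Intel hit past 16 cores.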
By 2009, Intel was in a catch-22: a 16-core Larrabee design was nowhere near as fast as competing Nvidia and AMD GPUs, while 32- and 48-core designs could close the gap only at twice the power consumption and immense additional cost.
IDF 2009 and Quake 4 Ray-Tracing: Postponing the Conversation
In September 2009, Intel showed off Larrabee running a current game with real-time ray tracing enabled. This was meant to be Larrabee's watershed moment: in-development silicon running Quake 4 with ray tracing turned on. Here was Larrabee hardware powering a real game, using lighting technology that was simply not possible on GPU hardware at the time.
While the Quake 4 demo generated media hype and sparked discussion about real-time ray tracing in games, it glossed over far more serious issues. The Quake 4 ray tracing demo wasn't a traditional DirectX or OpenGL raster workload. It was based on a software renderer that Intel had previously shown running on a Tigerton Xeon setup.
The IDF 2009 demo showed that Larrabee could execute a complex piece of CPU code fairly well. However, it did nothing to answer questions about Larrabee's rasterization performance. In trying to steer the conversation away from rasterization, Intel inadvertently drew attention to it.
Just three months after the IDF demo, Intel announced that Larrabee would be delayed and that the first product would be scaled back to a "software development platform".
A few months later, Intel pulled the plug entirely, stating that it "will not bring a discrete graphics product to market, at least in the short-term." This marked the end of Larrabee as a consumer product. However, the IP the team created lived on in a new, enterprise-focused avatar: the Xeon Phi.
Xeon Phi and the corporate market: the circle is complete
The strangest part of the Larrabee saga is that Intel actually delivered on all of its initial promises. Back in 2007, when Larrabee was first announced, Intel positioned it as a bold new offering for the enterprise and HPC market: a highly parallel, programmable many-core design that could crunch numbers far faster than a conventional CPU.
In 2012, Intel announced the Xeon Phi coprocessor, which did exactly that. The first-generation Xeon Phi parts even shipped as PCIe add-in cards: a die-shrunk, tweaked Larrabee in all but name. Intel kept selling Xeon Phi coprocessors to enterprise and research customers until last year, when the product line was quietly discontinued.
Today, with Intel back at work on a discrete GPU architecture, there are definitely lessons to be learned here, along with pointers to Intel's long-term strategy with Xe.
Learn From Larrabee: Where Does Xe Go From Here?
Intel Xe is fundamentally different from Larrabee. For starters, Intel now has years of experience building and supporting modern GPUs. Since HD Graphics 4000, Intel has invested considerable resources in building increasingly capable GPUs backed by a fairly robust software stack.
Xe builds on over a decade of that experience, and it shows. The top-end Intel Xe-LP GPU in Tiger Lake configurations matches or beats entry-level discrete GPUs from AMD and Nvidia, and it manages this while sharing a 28 W power budget with four Tiger Lake CPU cores. Inconsistent performance from game to game indicates that Intel's driver stack still needs work, but by and large, Xe-LP holds its own against AMD and Nvidia's entry-level offerings. Xe (and previous-generation Intel iGPUs) also leverage a number of Larrabee innovations, including tile-based rendering and variable-width SIMD: Intel's Larrabee R&D wasn't wasted.
While Xe-LP proves that Intel can build a decent, efficient mobile graphics chip, the real question is how Xe-HPG, the discrete desktop variant, will perform. Efficient, low-power GPUs don't always scale up into 4K gaming flagships. If they did, Imagination's highly efficient PowerVR chips would be giving Nvidia and AMD a run for their money.
Based on Intel's statements so far, Xe-HPG should offer feature parity with modern AMD and Nvidia GPUs. That means hardware ray tracing and full support for the rest of the DirectX 12 Ultimate feature set. Intel has also talked about using MCM (multi-chip module) packaging to scale the performance of future Xe parts. By packing multiple GPU dies into a single MCM package, future Xe designs could scale performance well beyond what we see today with Ampere and Big Navi.
The competition isn't standing still, however. AMD already uses MCM packaging in its chiplet-based Zen CPU designs, and Nvidia's next-generation "Hopper" GPUs are expected to use the technology to maximize performance as well.
So the question isn't really whether Intel can build a great discrete GPU – it probably can. The question is whether, in a rapidly evolving hardware space, Xe-HPG can hold its own against upcoming Nvidia and AMD GPUs.
This is where Intel can learn from Larrabee: clear communication and expectation management are crucial. Intel also needs to set and keep realistic development schedules; a two-year slip could set Xe back a generation or more. Finally, it needs to focus on maturing its driver stack: a powerful Xe-HP GPU won't count for much if it's held back by spotty drivers.
Will Xe usher in a new era of Intel graphics dominance? Or will it go the way of Larrabee? We'll only know for sure in the coming months, when the "Baap of All" launches.