Intel Core i7 Presentation Page: 1
 
Nehalem Wafer
Intel reveal some more Nehalem information at their recent Core i7 presentation
 
 
Recently OC3D received an invitation to a presentation Intel were planning to hold in the Hilton at Heathrow's Terminal 4, and I was the lucky one who got to go.
 
Before I get down to the nitty gritty, a lot of the information that I am allowed to publish at this time has already been floating around the net, but at least now it has been confirmed by Intel.  There are a couple of new bits though, so stick with it and take a read.  As for the stuff I'm not allowed to publish, well that's a lot more interesting and for another day.
 
At the end of the day, I took away with me a memory stick full of PowerPoint slides from the presentation and lots of promo pictures of Nehalem wafers and cores.  But rather than use those glorified images, I've decided to stick with the real pictures and slides I took on the day throughout this article.
 
It was a long old day.  At about 10:15, after a little hanging around, things started off with a presentation on the architectural details of Nehalem by Chief Architect Ronak Singhal, one of four lead architects who worked to create Nehalem.  If I'm totally honest, quite a lot of it had me scratching my head a little.  I'm going to try not to bore you with too much techno-babble and keep to the major points I think might be of interest.
 
Ronak has been working for Intel since 1997, and back then he was working on the Pentium 4 processor.  He said he had been working on Nehalem since 2003 and went on to explain that there will be two Nehalem CPUs released in Q4 of 2008, one being a mainstream quad-core chip and the other being a server product.  Also mentioned was that Intel intend to launch their octo-core processor towards the latter part of Q4 2009.  Bit of a wait, eh?
 
 
Tick Tock Tick Tock Tick........
 
Tick-Tock Development Model
 
The first slides of the presentation showed information on Intel's "Tick-Tock Development Model", which basically shows that they look to release a new architecture every two years.  These are the "Tock" releases, i.e. Merom (Core 2) of old, Nehalem (Core i7) of today and Sandy Bridge of the future.  Staggered in between are Intel's "Tick" releases, which are essentially the same architecture but built on a smaller manufacturing process, i.e. Penryn.  Westmere is set to be the future "Tick" to Nehalem's "Tock".
 
Obviously "Tock" stages take much longer to create than the "Tick" stages.  This means that Nehalem was being worked on quite a time before Penryn, and interestingly, improvements being designed for the next architectural "Tock" can make their way into a  "Tick" release, but on a smaller scale.
 
 
 
Branch Prediction
 
One of the improvements Ronak spoke about was branch prediction.  While this technology was present in Merom and improved on a little in Penryn, it has moved on quite a lot further in Nehalem and really makes a difference.  In Nehalem, Intel have introduced a second-level branch predictor per core.  This new branch predictor helps out the normal one in the processor pipeline, supporting it in a similar fashion to the way an L2 cache supports an L1 cache.
 
Nehalem Die
This second-level predictor has a much larger set of history data it can call upon to help predict branches, but while being larger is a good thing, it also makes it quite a lot slower than the first-level predictor.
 
The first-level predictor runs in pretty much the same way as it always has, predicting branches as best it can.  However, in Nehalem the second-level predictor will also be simultaneously scanning branches.  When the first-level predictor makes a prediction based on the type of branch but doesn't have enough historical data to make a highly accurate prediction, the second-level predictor can jump in, and as it has a larger history window to predict from, it can find mispredicts on the fly and correct them without too much of a penalty.
 
Intel gave an example of the second-level branch predictor making significant improvements in applications with very large code sizes, i.e. database applications and, to a certain extent, some games.  In short, if the second-level predictor catches something the first-level predictor missed or mispredicted, the bad branch is thrown out of the pipeline, which saves time on wrong calculations and, in turn, power.
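To make the idea a little more concrete, here is a toy sketch in Python of a two-level prediction scheme: a small, fast predictor with a short history backed by a slower one with a much longer history that can override it.  It is purely illustrative and assumes nothing about Intel's actual predictor internals; the branch pattern, history lengths and override rule are all made up for the example.

```python
# Toy model of two-level branch prediction: a small, fast first-level predictor
# backed by a slower second-level predictor with a much longer history.
# Illustrative only - this is NOT Intel's actual Nehalem design.

from collections import defaultdict

class SaturatingPredictor:
    """2-bit saturating counters indexed by the last `history_bits` outcomes."""
    def __init__(self, history_bits):
        self.history_bits = history_bits
        self.history = 0
        self.counters = defaultdict(lambda: 2)   # start weakly "taken"

    def _key(self):
        return self.history & ((1 << self.history_bits) - 1)

    def predict(self):
        return self.counters[self._key()] >= 2   # True = predict taken

    def update(self, taken):
        key = self._key()
        if taken:
            self.counters[key] = min(3, self.counters[key] + 1)
        else:
            self.counters[key] = max(0, self.counters[key] - 1)
        self.history = (self.history << 1) | int(taken)

# First level: short history, fast.  Second level: longer history, slower,
# but able to correct the first level when its larger window disagrees.
l1 = SaturatingPredictor(history_bits=4)
l2 = SaturatingPredictor(history_bits=12)

# A branch pattern whose period is too long for the first level to capture.
pattern = [True, True, True, True, True, False] * 200

l1_correct = combined_correct = 0
for outcome in pattern:
    p1, p2 = l1.predict(), l2.predict()
    final = p2 if p1 != p2 else p1      # second level overrides on disagreement
    l1_correct += (p1 == outcome)
    combined_correct += (final == outcome)
    l1.update(outcome)
    l2.update(outcome)

print(f"first level alone : {l1_correct / len(pattern):.1%} correct")
print(f"with second level : {combined_correct / len(pattern):.1%} correct")
```

Running it shows the short-history predictor stuck at around 83% on this pattern while the combined scheme approaches 100%, which is the gist of what the extra history buys you.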
 
Now for some Cache & Hyper Threading on the next page


Intel Core i7 Presentation Page: 2
Nehalem's Cache Structure

Core / Uncore

Intel have implemented quite a few changes to the cache structure in Nehalem compared to Penryn, including an all-new L3 cache.
 
Something new that Intel is bringing to us with this modular design, shown on the slide to the right, is the "uncore".  In short, everything other than the cores and their own cache is in the "uncore", such as the integrated memory controller, QPI links and the shared L3 cache.  All of the components in the "uncore" are completely modular and so can be scaled according to the section of the market each chip is being aimed at.  Intel can add or remove cores, QPI links, integrated graphics (which Intel say will come in late 2009) and they could even add another integrated memory controller if they so wish.
 
 
 
Nehalem Cache Structure
The 64KB L1 cache is made up of a 32KB instruction cache and a 32KB data cache, which is the same as we see in Penryn.  However, the L2 cache is a totally new design compared to what we see in the Core 2 CPUs of today.  Each core receives 256KB of unified cache that is extremely low latency and scales well to keep extra load off the L3 cache.  While this 256KB is a lot smaller than Penryn's L2, it is much quicker, taking only 10 cycles from load to get data out of the L2 cache.  Each core within a Nehalem CPU will have its L1 & L2 cache integrated within the core itself.
 
The L3 cache that is coming with Nehalem is totally new to Intel, and is very similar in design to the L3 in AMD's Phenom CPUs.  It is an inclusive cache that Intel can scale according to how many cores are in any given processor.  An inclusive cache means that ALL of the data residing in the L1 or L2 caches within each core will also reside within the L3 cache.  What this achieves is better performance, and in turn lower power consumption, because any core knows that if it can't find the information it's looking for within the L3 cache, then it doesn't exist within any core's L1 and L2 caches.  This will help lower "core snoop traffic", which will become more of a problem as CPUs comprise more and more cores.
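A toy Python sketch of that snoop-filtering property is below.  The addresses and cache contents are arbitrary, and it ignores real-world details like associativity and eviction; it just shows why an L3 miss means no core snoops are needed.

```python
# Toy illustration of why an inclusive L3 acts as a snoop filter: every line in
# any core's L1/L2 is also present in the L3, so an L3 miss guarantees the line
# is in no core's private caches.  Addresses and sizes are arbitrary.

private_caches = {            # per-core L1/L2 contents (cache line addresses)
    0: {0x1000, 0x2000},
    1: {0x3000},
    2: set(),
    3: {0x2000, 0x4000},
}

# Inclusive L3: union of all private caches, plus lines only the L3 holds.
l3 = set().union(*private_caches.values()) | {0x5000, 0x6000}

def lookup(requesting_core, line):
    """Work out where the data lives without snooping unnecessarily."""
    if line in private_caches[requesting_core]:
        return "hit in own L1/L2"
    if line not in l3:
        # Inclusion property: not in L3 => not in ANY core's L1/L2.
        return "miss everywhere, go to memory (no core snoop needed)"
    owners = [c for c, lines in private_caches.items() if line in lines]
    return f"hit in L3, snoop core(s) {owners}" if owners else "hit in L3 only"

print(lookup(0, 0x2000))   # hit in own L1/L2
print(lookup(1, 0x4000))   # hit in L3, snoop core(s) [3]
print(lookup(2, 0x9000))   # miss everywhere, no snoop traffic
```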
 
 
Hyper Threading
 
The launch of Nehalem brings the return of Hyper Threading (also known as simultaneous multi-threading (SMT)).
 
With Nehalem, Intel is a hell of a lot more prepared for the implementation of HT than it was when it was last present in Intel's Pentium 4 processors.  This is largely due to the massive memory bandwidth and larger caches available which aid in getting data to the core faster and more predictably.
 
An operating system will see an HT-enabled processor as multiple processors, i.e. a quad-core would show up as 8 cores when in reality it's 4 physical cores running 8 threads.  The OS would then proceed to send 8 threads of instructions to the CPU.
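If you want to see this for yourself, the little snippet below reports what the OS exposes versus the physical core count.  It assumes an SMT-capable machine and that the third-party psutil package is installed; the example figures in the comments are for a hypothetical HT quad-core.

```python
# Compare the logical processor count (what the scheduler sees) with the
# physical core count.  Requires the third-party psutil package.

import os
import psutil   # pip install psutil

logical = os.cpu_count()                     # e.g. 8 on an HT-enabled quad-core
physical = psutil.cpu_count(logical=False)   # e.g. 4 (may be None on some platforms)

print(f"logical processors : {logical}")
print(f"physical cores     : {physical}")
if logical and physical:
    print(f"threads per core   : {logical // physical}")
```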

Intel felt that with Nehalem the time was right to re-implement HT, not only for the reasons above, but because at this point in time there are a lot more applications that can actually take advantage of this technology.
 
Intel are also very happy to use this technology again, and to build on it in the future, due to its performance/die size ratio.  Its performance gain percentage is actually larger than the percentage of die real estate it inhabits.  Ronak explained that in general, when implementing a technology, they hope for a 1:1 ratio between the gains and the die area it consumes.  He also said that using HT was much more power efficient than adding an entire core.  The inclusion of HT on Nehalem only takes up roughly 5-10% of the die area, and the gains seen in supported applications are generally higher than this figure.
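As a quick back-of-the-envelope illustration of that argument: the 5-10% die area figure comes from the presentation, but the 20% performance gain below is an invented number for a well-threaded workload, not something Intel quoted.

```python
# Rough sketch of the "gain vs die area" argument.  The die area figure is the
# midpoint of the 5-10% quoted in the presentation; the performance gain is a
# made-up illustrative number, not an Intel figure.

die_area_cost = 0.075      # ~7.5% of the die spent on HT
performance_gain = 0.20    # hypothetical speed-up on an HT-friendly workload

ratio = performance_gain / die_area_cost
print(f"performance gained per unit of die area spent: {ratio:.1f}x")
print("anything above 1.0x beats the 1:1 target Ronak mentioned")
```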
 
 
One other thing I feel I should mention is that Ronak explained that heavily bandwidth-hungry applications may not see a gain from HT at all.  This is because the bandwidth is already saturated by the data from 4 cores, and adding more threads could actually be detrimental to performance.
 
Here is a little slide showing the performance of HT, with HT disabled as the 0% baseline and the bars showing the gains with HT enabled.
 
HT Performance Chart
 
 
On the next page we look at the IMC, QPI & Power Management


Intel Core i7 Presentation Page: 3
Nehalem Chip
Integrated Memory Controller (IMC) & Quick Path Interconnect (QPI)
 
One of the massive changes that comes with Nehalem is the on-chip DDR3 integrated memory controller, which is located in the "uncore".  While AMD has been using an IMC for some time, Intel have taken it one step further with their triple-channel memory controller, which offers massively increased bandwidth.
 
When the first Nehalem CPUs are released, they will feature the triple-channel DDR3 memory controller, which means that in order to get maximum bandwidth from the platform you will need to run three DDR3 memory modules.  Intel confirmed that upon release there will be triple-channel kits available from leading memory vendors.

The Nehalem IMC is pretty scalable too: besides offering massively high bandwidth and extremely low latencies, the number of memory channels can be varied, both buffered and unbuffered memory is supported, and memory speeds can be adjusted, all based on the market segment the processor is aimed at.  We should expect lower-end, less expensive dual-channel parts at some point in the future, but no timescale was provided.
 
Also, at launch the IMC on Nehalem will only officially support PC3-8500 (DDR3-1066) memory, but we were told that this is going to increase with time.  This can of course be overclocked, and we were shown some slides giving performance indicators, but sadly I'm not allowed to show them (sorry).  Also worth a mention is that there will be NO DDR2 support at all.
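For the curious, here is the peak theoretical bandwidth of that launch configuration worked out in a quick Python snippet.  It simply reconstructs the figures from the DDR3-1066 spec (the ~8.5GB/s per channel that gives PC3-8500 its name), so treat it as a theoretical ceiling rather than anything Intel quoted.

```python
# Peak theoretical bandwidth of three channels of PC3-8500 (DDR3-1066).

transfers_per_sec = 1066e6    # DDR3-1066: ~1066 million transfers per second
bytes_per_transfer = 8        # 64-bit (8-byte) memory channel
channels = 3                  # Nehalem's triple-channel IMC at launch

per_channel = transfers_per_sec * bytes_per_transfer / 1e9
total = per_channel * channels
print(f"per channel         : {per_channel:.1f} GB/s")   # ~8.5 GB/s
print(f"triple-channel peak : {total:.1f} GB/s")          # ~25.6 GB/s
```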
 
 
QPI Slide
Another necessary addition that Intel have been talking about for a while is the move away from the front-side bus architecture to their new Quick Path Interconnect (QPI).  QPI was previously known as the Common System Interface (CSI) and is Intel's answer to AMD's HyperTransport.
 
The QPI on Nehalem is a point-to-point, direct-connect architecture that will transmit data from socket to socket as well as from the CPU to the chipset.  The QPI will scale according to the segment each CPU is targeted at and also the number of CPUs per platform.  As the number of CPUs goes up, so does the number of QPI links, as shown in the bottom picture in the slide to the left.
 
One of the reasons it was necessary to move to QPI was the above-mentioned IMC.  QPI is also a requirement for efficient chip-to-chip communication where one CPU needs to access data that is stored in memory attached to another CPU's memory controller.
 
Each QPI link is bi-directional, supporting 6.4 GT/s (gigatransfers per second).  Each link is 2 bytes wide, so you get 12.8GB/s of bandwidth per link in each direction, which equates to a total of 25.6GB/s of bandwidth on a single QPI link.
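Those numbers fall straight out of the link width and transfer rate; the small sketch below just reconstructs them.

```python
# Reconstructing the QPI bandwidth figures quoted above: 6.4 GT/s per link,
# 2 bytes of payload per transfer in each direction, bi-directional links.

transfers_per_sec = 6.4e9     # 6.4 GT/s
bytes_per_transfer = 2        # 2-byte (16-bit) data payload per direction

one_way = transfers_per_sec * bytes_per_transfer / 1e9
both_ways = one_way * 2       # links carry data in both directions at once

print(f"per direction  : {one_way:.1f} GB/s")    # 12.8 GB/s
print(f"total per link : {both_ways:.1f} GB/s")  # 25.6 GB/s
```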

The top of the line Nehalem processors (i.e. Extreme) will have two QPI links while mainstream versions will only have one.
 
 
Power Management

PCU Slide
Up until now, most of what I have written about isn't exactly new information and has been floating around for a little while.  However, something that only came to light within the last month or so was the implementation of the power consumption and regulation logic in the processor.

What Intel revealed was that rather than using simple algorithms for switching off the power planes of the new Nehalem cores as in previous CPUs, the Core i7 will feature a complete on-die microcontroller called the Power Control Unit (PCU).  This chip consists of more than a million transistors, which, for comparison, is somewhere in the ballpark of the transistor count of the Intel 486 microprocessor!
 
This new controller, which has its own embedded firmware, is responsible for managing the power states of each core on a CPU and takes readings of temperature, current, power and operating system requests.
 
Each Nehalem core has its own Phase-Locked Loop (PLL), which basically means each core can be clocked independently, similarly to AMD's fairly new Phenom line of processors.  Also similar to Phenom is the fact that each core runs off the same core voltage.  But this is where I'll stop comparing Nehalem to Phenom, as the difference is that Intel have implemented their integrated power gates.
 
When creating the power gate, Intel's architects had to work very closely with their manufacturing engineers to create a material that would be suitable to act as a barrier between any core and its source of voltage as seen in the slide below.  At the time I couldn't really see what the slide was showing, and even now I struggle. However, it's relevant so it should be here.
 
PCU Slide 3             PCU Slide 2

What the power gate brings to the table is that while the CPU is still using a single power plane/core voltage, it can totally shut off (or very nearly) any individual core by stopping the voltage getting to it during deep sleep states.  This differs from the current situation, in which all cores have to run at the same voltage, and this applies to both Intel and AMD CPUs.  At present, if one or more cores are active in a CPU then the other cores can't have their voltage lowered, which means the idle cores will still be leaking power while not in use.
 
In a little more detail, the power gates allow any number of cores on a CPU to be working at normal voltages, while any cores idling can have their power shut off completely which in turn reduces idle core power leakage dramatically.  It was asked why this hasn't been seen on previous processors and Ronak said that there has never been a suitable and reliable material to use as a gate until now.
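To put a rough (and entirely invented) number on why that matters, here is a crude model of idle-core leakage with and without per-core power gates.  Intel gave no leakage figures at the presentation, so the wattage below is purely illustrative.

```python
# Crude model of idle-core leakage with and without per-core power gates.
# All numbers are invented for illustration; Intel quoted no leakage figures.

LEAKAGE_PER_IDLE_CORE_W = 3.0   # hypothetical leakage of an idle but powered core
cores = 4
active_cores = 1                # e.g. a single-threaded workload

# Without power gates: idle cores stay on the shared voltage plane and leak.
leakage_without_gates = (cores - active_cores) * LEAKAGE_PER_IDLE_CORE_W

# With power gates: idle cores are cut off from the voltage source almost
# entirely, so their leakage drops to (near) zero.
leakage_with_gates = 0.0

print(f"idle-core leakage without power gates: {leakage_without_gates:.1f} W")
print(f"idle-core leakage with power gates   : {leakage_with_gates:.1f} W")
```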
 
This is just the basics of what the CPU offers. There is a lot more to the CPU and power gates than I can explain in this article, and to fully explain it would take up several more pages and probably bore you half to death.  So moving on.....
 
On to Turbo Mode and final thoughts


Intel Core i7 Presentation Page: 4
Turbo Mode
Wafer, SSD, CPU and Die
 
Turbo Mode is a feature that has been spoken about quite a lot recently, but there have been many mixed claims about just how it works in Nehalem.  Whilst it made its debut with mobile Penryn, it never really got a chance to actually work.  What it was designed to do was this: if, for instance, you had a dual-core mobile Penryn CPU running a single-threaded application, leaving one core totally idle and the chip running below the TDP it was designed for, then Turbo Mode would aim to increase the clock speed of the active core.  The reason this didn't really work was that operating systems (Vista in particular) kept bouncing the single-threaded load between the cores, leaving Turbo Mode unable to kick in for any length of time.
 
In Nehalem, this feature has been refined to work a whole lot better, thanks in large part to the PCU.  The idea is pretty straightforward: if you have a quad-core CPU and only two of the cores are active, then as long as the CPU detects that the heat levels are OK and the power levels are under the TDP, the two idle cores can be shut down and the two remaining active cores will be overclocked.
 
Turbo Mode can also come into effect even if all four cores are active, so long as the CPU detects that heat and power levels are under their set limits.  In this case all four cores would be given a boost, as per the slide below (bottom right).  All Nehalem processors will at least be able to go up a single clock step (133MHz) in Turbo Mode, even if all cores are active, just as long as the CPU detects that the TDP hasn't been exceeded.
 
Turbo Mode         Turbo Mode 2
  
At present the level of overclock isn't very significant, and for now will more than likely be around 266MHz.  Intel do, however, have large ambitions for Turbo Mode, and we should expect to see higher boosts in the future.
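As a worked example of those clock steps (using a hypothetical base clock, since final launch speeds weren't confirmed), one or two 133MHz bins look like this:

```python
# Worked example of Turbo Mode clock steps: each step ("bin") is one 133MHz
# multiplier increment.  The base clock here is hypothetical, not a confirmed
# launch speed.

BCLK_MHZ = 133
base_clock_mhz = 20 * BCLK_MHZ   # hypothetical ~2.66GHz base frequency

for bins in (1, 2):              # 1 bin with all cores loaded, ~2 bins quoted above
    boosted = base_clock_mhz + bins * BCLK_MHZ
    print(f"+{bins} bin(s): {base_clock_mhz} MHz -> {boosted} MHz "
          f"(+{bins * BCLK_MHZ} MHz)")
```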
 
Intel also claim that the CPU is aware of the conditions it's running in.  For instance, if your case is very cool or you have water-cooling, then the CPU will recognise that it is well under its TDP and potentially push the clocks quite a bit higher.  It remains to be seen whether this will be the case on the first Nehalem CPUs released, and from what I can gather it won't be.
 
While this feature might not excite most of us that much, for the not-so-avid overclocker it could provide a very welcome extra performance boost with absolutely no effort.
 
For the rest of us, Intel confirmed that Turbo Mode can just be turned off in the BIOS which made me happy.
 
 
Final Thoughts
 
While Penryn was well received (especially the dual-cores), it didn't really give any major leaps in performance, but with AMD's Barcelona / Phenom CPU not really challenging Intel, it was a welcome release regardless. But even then, a lot of people were just looking forward to Nehalem and hoping it was going to provide the next giant leap forward in CPU performance.  Now we are getting very close to the release of Nehalem and the anticipation is building.
 
I find myself one of the lucky ones to have witnessed first hand what Nehalem can do, and I must say I was very impressed. But with precise performance numbers still under NDA, I can't pass on any of the detailed information that I saw.
 
That said, I'm still going to give you a rough idea.  Gaming at present seemed to offer gains of anything from 8% to 40%.  This is very vague I know, but it's the best I can do.  The place where Nehalem really excelled, however, was in encoding and rendering.  I saw a Nehalem CPU ray tracing on the fly and doing an amazingly good job of it.  Encoding in programs such as Adobe Premiere Pro and Sony Vegas 8 saw gains of anywhere up to 60% over a comparable Core 2 CPU.
 
So you'll need a new motherboard and CPU (obviously), and perhaps some new memory, but if you're running well-threaded applications then Nehalem will knock your socks off.
 
I'm just looking forward to Nehalem's release, so one of our reviewers can get their mitts on a sample to put through its paces and share with you what it can do.
 
 
Let us know whether you will be one of the early adopters of Nehalem in our forum.