AMD's Pre-Zen Interconnect: Testing Trinity's Northbridge (chipsandcheese.com)
107 points by zdw 4 days ago | 19 comments
bee_rider 17 hours ago [-]
It is probably good that Chips and Cheese stays technical and objective. But, this was from the pre-Zen bad days of AMD, right? I wonder if a “where’d it all go wrong” post would be more interesting. Or, maybe more optimistically, “how’d this set things up for Zen.”
hajile 15 hours ago [-]
The interconnect didn't have very much to do with what went wrong or setting up for Zen.

AMD made a killer design with Athlon64 that should have taken over the entire industry and made them the largest hardware company on the planet. Instead, Intel leveraged their market position to make it economically infeasible for computer manufacturers to buy AMD chips.

AMD was out of money, which limited its options. Dennard scaling had just failed, but Moore's Law was still in effect and multithreading was hyped as the future of everything. This made a strong argument for lots of smaller cores, and the most area-efficient way to do this was sharing less-used resources, which led AMD to bet big on small-core CMT.

At the same time, AMD's ATI division was under pressure to make a new, flexible GPU design (which became GCN) while fighting the cult of Nvidia (even knowingly shipping massive numbers of defective chips and then having a worse GPU than GCN wasn't enough for Nvidia to lose market dominance).

The interconnect was a lower-priority redesign, so they slapped a bandaid on it and pushed the redesign down the road.

ahartmetz 15 hours ago [-]
I think you're being too nice about Bulldozer. It really was a big fat unforced error. Approximately no one wants to buy a CPU that's significantly slower than the last one at common (single core) tasks.

Today, Intel is still selling more CPUs than AMD in most market segments even though they are usually worse.

ahartmetz 10 hours ago [-]
(I did some research. Apparently, Bulldozer was initially supposed to be a fairly "fat" core single-threaded, just with more multithreading. Something didn't work out, and the crisis was managed by scaling back things that really hurt performance.)

https://www.osnews.com/story/135785/bulldozer-amds-crash-mod...

Zardoz84 14 hours ago [-]
They bet too soon on having a high number of cores. And the later evolutions of that microarchitecture weren't bad.

From a proud ex-user of an FX-8370E

Tuna-Fish 14 hours ago [-]
Lower single core throughput with more cores is just always a bad bet. Existing software runs better on faster cores, people buy hardware to run existing software, and people write software to run well on hardware that other people already own.

The reason for AMD's resurgence right now is not that they have more cores, but that they have better cores. If they had even faster cores, and fewer of them per die, they'd be selling even better.

ahartmetz 13 hours ago [-]
Well. I got the very first Ryzen model, the 1800X, because it had twice the number of cores that Intel was selling, and they were just a few percent slower per core. If they had been 40% slower, I would have passed.

My most important workload - compiling C++ - is atypically parallel, but even there, single-core is important, too.

reginald78 12 hours ago [-]
I think AMD was both lucky and good. They came out with a forward-thinking design that could bring them back from the brink, but I'm not sure their stuff would have sold if Intel hadn't left them an opening. Most important was Intel's failure to execute on 10nm; GlobalFoundries' 14nm wouldn't have compared as favorably to 10nm even with more cores. And since Intel was stubbornly refusing to sell more cores on anything but their expensive HEDT platform, there was a market segment being neglected.
ahartmetz 12 hours ago [-]
Zen was actually, in a way, a conservative design. They specifically mentioned "performance compatibility" with Intel - they couldn't be weirdly slow in some workloads because of an exotic design. You could say that they had the luxury of not having to aim for the moon because Intel was pretty much parked on the ground, with a failing new process and no ambition to increase core count. Against the original SV ethos, against Andy Grove, just plain old greed and complacency.
mlinhares 14 hours ago [-]
Athlon64 should have been the wake-up call for Intel to focus on engineering to beat AMD, but they decided they would bully the market into a worse product forever.
Tuna-Fish 14 hours ago [-]
... It was exactly that. In the following decade, Intel comprehensively beat AMD. But then they let up, and started spending all their money on idiotic acquisitions instead.
adgjlsfhk1 7 hours ago [-]
Yes and no. A lot of why they comprehensively beat AMD was because they illegally bribed Dell and the other main PC makers to not use AMD CPUs.
bcrl 5 hours ago [-]
The Core CPUs were excellent designs that didn't sacrifice performance while hitting a much lower power target than the P4. Intel was saved by the fact that the design team in Israel didn't go down the path of the P4 aiming for 10 GHz and then wildly missing the mark due to power constraints.

AMD's Bulldozer was their attempt at a P4 style core: increase clocks at all costs. Again, aiming for ridiculous clock rates without considering the power cost was a mistake. However, some of the design techniques AMD came up with to hit higher clock speeds live on in today's Ryzen designs (just as ideas from the P4 live on in today's Intel CPUs).

Tabula made the same design mistake in their FPGA fabric: the thinking was that registers are cheap and memory blocks can be run at ridiculous clock speeds in order to share, multiplex and reuse transistors across LUTs. Great in theory if it wasn't for the ridiculous cost of power and the complexity of the software to make it work.

The power wall is real. Not every hardware design team makes the right choices early in a design when estimating power use and constraining the design space appropriately. The difference is that the tools used to estimate power consumption of a hardware design today are far better than they were 20 years ago as a direct consequence of these (and other) failures.

toast0 15 hours ago [-]
That's probably in their Bulldozer article [1]. But this article is about memory access on their APUs; you just have to accept the CPU was what it was, no need to dwell on it here.

[1] https://chipsandcheese.com/p/bulldozer-amds-crash-modernizat...

freeqaz 15 hours ago [-]
Something in the article that I had to look up that might bother others. He uses the term 'DCT' in this sentence, but it's never defined in the article. AFAIK it stands for 'DRAM Memory Controller', but that could be an LLM hallucination. Running a web search defines it as Discrete Cosine Transform. :P

> "AMD’s BIOS and Kernel Developer’s Guide (BKDG) indicates there’s a 4-bit read pointer for a sideband signal FIFO between the GMC and DCT, so the “Garlic” link may have a queue with up to 16 entries."

Should maybe swap DCT in for MCT (memory controller)?
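
On the quoted BKDG detail, the arithmetic is that a 4-bit pointer can address 2^4 = 16 slots, which is where "up to 16 entries" comes from. A minimal sketch of that idea in Python follows, assuming a plain ring buffer with wrapping pointers. The class name `SidebandFifo`, the separate occupancy counter, and the stall-on-full behavior are all assumptions for illustration, not anything the BKDG or the article specifies.

```python
class SidebandFifo:
    """Toy ring buffer indexed by fixed-width wrapping pointers (hypothetical model)."""

    def __init__(self, pointer_bits: int = 4):
        self.depth = 1 << pointer_bits   # 4 bits -> 16 addressable slots
        self.mask = self.depth - 1       # used to wrap pointers back to 0
        self.slots = [None] * self.depth
        self.read_ptr = 0
        self.write_ptr = 0
        self.count = 0                   # explicit occupancy counter (an assumption)

    def push(self, item) -> bool:
        """Enqueue an item; return False if the queue is full."""
        if self.count == self.depth:
            return False                 # full: the producer would have to stall
        self.slots[self.write_ptr] = item
        self.write_ptr = (self.write_ptr + 1) & self.mask
        self.count += 1
        return True

    def pop(self):
        """Dequeue the oldest item; raise IndexError if the queue is empty."""
        if self.count == 0:
            raise IndexError("pop from empty FIFO")
        item = self.slots[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) & self.mask
        self.count -= 1
        return item
```

Real hardware may track full/empty differently (for example with an extra wrap bit on each pointer), in which case the usable depth for a given pointer width could differ, so "up to 16" is the safe reading.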

dcminter 15 hours ago [-]
DCT is explicitly "DRAM Controller" in the referenced "BIOS and Kernel Developer’s Guide" - see definitions table on p23 of https://www.amd.com/content/dam/amd/en/documents/archived-te...
jauntywundrkind 15 hours ago [-]
It's wild how much extra work was done to avoid coherency, yet share memory.

Ok, there's the first part, the Garlic bus, which gives the GPU its own access to the DRAM request controller, instead of going through the CPU's memory controller.

Since the GPU is mostly going to miss, it's great that it's not wasting energy trying to go to the CPU's cache. But it means if you do want to share memory now you need a whole other access path for the GPU to read from the CPU memory, even though it's literally the same RAM (but maybe different cache). So, add a new Onion link, that lets the GPU go through the crossbar, and get handled by the memory controller. And this one is slower.
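
To make the two paths concrete, here is a rough Python sketch of the routing decision as described above: traffic that does not need to see CPU caches takes Garlic straight to the DRAM controller, and traffic that has to stay coherent with the CPU takes Onion through the crossbar. The names `Path`, `GpuAccess`, `needs_cpu_coherency`, and `choose_path` are hypothetical and only illustrate the split; the real hardware selects the path by other means that are not modeled here.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    GARLIC = "garlic"  # direct GPU link to the DRAM controller, skips CPU caches
    ONION = "onion"    # through the northbridge crossbar, stays coherent with the CPU


@dataclass(frozen=True)
class GpuAccess:
    address: int
    needs_cpu_coherency: bool  # hypothetical flag standing in for the real mechanism


def choose_path(access: GpuAccess) -> Path:
    """Pick the link a GPU memory access would take in this simplified model."""
    return Path.ONION if access.needs_cpu_coherency else Path.GARLIC


# Typical GPU traffic (textures, render targets) never touches CPU-owned data.
texture_read = GpuAccess(address=0x1000, needs_cpu_coherency=False)
# A buffer shared with the CPU has to observe the CPU's cached writes.
shared_read = GpuAccess(address=0x2000, needs_cpu_coherency=True)

print(choose_path(texture_read).value)
print(choose_path(shared_read).value)
```

The point of the split is that the common case pays nothing for coherency, at the cost of a second, slower path for the shared case.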

Infinity Fabric seems conceptually so much easier for keeping things in sync. But the work to snoop the bus and maintain coherency has to be a pretty massive effort.

It's so, so different a thing, but I wonder how AMD deals with coherency (or not?) across the 6 Memory Cache Dies (MCDs) in the 7900 XTX GPU. Having separate chips whose job is to be cache and DRAM controller must need at least some understanding of who has what memory. That has to be wild.

One other comment, on:

> modern games struggle or won’t launch at all on Trinity, so I’ve selected a few older workloads

I wonder how many more games would run under Linux? There's an absurd amount of work still going into the radeonsi driver. The driver just switched to the newer ACO compiler pipeline by default last December, for example. That said, Trinity (2012) is using a TeraScale3 (gfx4) GPU from 2010. This is old! But the improvements have been ongoing, in a way commercial systems would be unlikely to ever see; there have been so many wins over such a long time. Not a compatibility change, but getting multi-threaded driver support (2017) also comes to mind as a big leap! https://www.phoronix.com/news/RadeonSI-ACO-Default-Pre-GFX10 https://www.phoronix.com/news/RadeonSI-G3D-Threads https://www.google.com/search?q=site%3Aphoronix.com+radeonsi

I wonder how granular the breakdown/fallback modes are for running. I suspect that if there's an unsupportable feature somewhere in the graphics pipeline, the whole pipeline will usually need to fall back to CPU rendering, but perhaps there's some ability to fill in some GPU features via the CPU while running most of the pipeline on the GPU (and not having the latency destroy everything, perhaps by using that Onion link/cacheable host memory)?

hajile 15 hours ago [-]
Redesigning their interconnect stuff for both GPU and CPU then implementing and validating would have been a massive expense and would have added additional time to ship.

With the company facing bankruptcy, I'd imagine that a small team hacking together the different GPU and CPU interconnects was cheaper and faster than designing a whole new interconnect and coherency then implementing and testing it everywhere.

toast0 15 hours ago [-]
> It's wild how much extra work was done to avoid coherency, yet share memory.

Having separate, non-coherent memory is status quo for GPUs. Bringing the GPU onto the die means you've got to share the path to memory, but access patterns are different.

Designing for the typical case where the addresses used are distinct is totally reasonable; it's not wild at all. After that works, you can try to make shared use faster, too, but from the article, that didn't really happen in this design; the features are there, but the bandwidth isn't.