This is Part 3 of a series. Part 1 can be found here, Part 2 can be found here, and Part 4 can be found here.
Hexagon:
Next up is the Qualcomm Hexagon NPU which supplants the Qualcomm Hexagon Processor, which itself supplants the Qualcomm Hexagon DSP. This isn’t to mock the unit, it looks to be a really solid piece of work, just one that has evolved over time with significant new functionality. OK, we are actually mocking the marketing people but wouldn’t you? Yeah we thought so. Let’s take a brief look at the evolution of the Qualcomm Hexagon in the SDX2.
Hexagons of the distant past
There are three parts of the Hexagon, the scalar unit, the vector unit, and the matrix unit, all of which are optimized for scalar, vector, and matrix operations respectively, hence the names. Bet you thought we were going somewhere else with that one, right? But seriously the name change from DSP to Processor to NPU came with the additions of the vector and matrix units. The mathematically literate among you will be sure to notice that there is a difference of one in the year numbers, why is the 2025 Hexagon not on this chart?
The new Hexagon is more
Yes the 2025 Hexagon is still called an NPU but internally it is NPU6, a new architecture. There isn’t a major change from the high level view, it is just more, more threads, more capabilities, and wider almost everything. Before you start counting blocks in a blurry picture, save your time, we will go through the numbers in a bit. The Hexagon also has its own command processor and runs its own realtime OS just like the GPU.
Now for the numbers. It all starts with the scalar unit which as you can see above has traditionally supported six threads. It now supports 12, and yes the image above shows all of them clearly-ish. Qualcomm says Hexagon can do SMT across six threads which implies there is a second ‘core’ and each supports half the threads, but that is mostly semantics from a user perspective. Each thread is a 4-wide VLIW architecture, and supports multi-level branch prediction, user-mode DMA to prevent latency and overhead from switching modes, and hardware synchronization.
There are two master ports, basically one for each ‘core’, and the block can do 64b DMA because of the rather large models we are dealing with now. Interestingly while the X2 Hexagon can do 64b DMAs, the cores are still 32b internally just like the X1 which could only do 32b DMAs. That is going to take a bit of interesting math but it works so, well, it works. But why do you need twice as many, wider scalar threads with a claimed 143% more throughput?
Hexagon doubles the vector thread count
Because the vector unit now has eight threads, basically one scalar thread to control each vector thread. Control may be a bit of a strong word, it feeds it and gently shepherds it to a mutually acceptable and beneficial numerical conclusion. Basically one scalar thread per vector thread plus four. Why four? Clearly the six threads of the previous generation were enough for its four vector threads plus two. There is a reason for this and we will come to it later.
Back to the vector unit, each of the engines can now process four 128B, big B, SIMD vectors and does FP8 and BF16 on top of what it did before. Gasp I hear you gasp, that is a lot of data to grind through a cycle, and you would be right. Qualcomm claims a +143% vector throughput increase this generation which matches the scalar unit increase exactly. This one isn’t coincidence, and we won’t make the obvious joke here.
That brings us to the last big bit, the matrix unit. The X1 had a matrix unit too but this one is bigger and more capable. This new version supports 2-bit weights but doesn’t support the FP2 data type like Intel GPUs. No that isn’t a joke, I cracked it to an Intel architect and he pointed out that it is actually supported in hardware. Egg on face there, eh, but I have a new and better joke about 2-bit weights that I will only tell in person. Anyway the matrix unit has its own weight and activation caches because not doing so would blow out its efficiency. It is on a separate power rail too, and can access the vector tightly coupled memory directly.
Back to scalar threads, the matrix unit needs one as well, probably two to be honest, and that leaves two or three to do all of the other work the Hexagon needs to do. In the last architecture it was 6:4:1 scalar:vector:matrix threads, now it is 12:8:1 but everything is bigger, more and faster. In short the scalar threads are mostly used to control the wide math units downstream from them. Plus at six threads per scalar core, if you wanted to do 10 you would need to do a lot of work on the scalar units for no real gain.
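For those who like their thread budgets spelled out, here is a minimal sketch of the arithmetic above. It assumes one scalar thread feeds each vector thread and that the matrix unit takes one or two, as described in the text. The function name is ours for illustration, not anything from Qualcomm.

```python
def leftover_scalar_threads(scalar: int, vector: int, matrix_feeders: int) -> int:
    """Scalar threads left for general work after feeding the math units.

    Assumes one scalar thread per vector thread, plus `matrix_feeders`
    scalar threads dedicated to the matrix unit (an assumption from the text).
    """
    leftover = scalar - vector - matrix_feeders
    if leftover < 0:
        raise ValueError("not enough scalar threads to feed every unit")
    return leftover


# Thread counts from the article: 6:4:1 last generation, 12:8:1 now.
generations = {"previous": (6, 4), "new": (12, 8)}

for name, (scalar, vector) in generations.items():
    for matrix_feeders in (1, 2):  # "one as well, probably two"
        spare = leftover_scalar_threads(scalar, vector, matrix_feeders)
        print(f"{name}: {scalar} scalar, {vector} vector, "
              f"{matrix_feeders} for matrix -> {spare} spare")
```

With the new 12:8:1 layout this leaves three spare threads if the matrix unit takes one and two if it takes two, which is the ‘two or three’ mentioned above.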
Hexagon resource utilization by AI model
If you look at the graph above, you will get eye strain. Should you want to avoid this, we can explain it pretty simply. The X axis is a distribution of 300 oft used model types, the Y axis is the percentage of time the Hexagon is waiting on a particular unit. As you can see the matrix unit is used most but the others are well represented too. Every architect wants to minimize waits in their device which is why Qualcomm built in the flexibility in the scalar and vector units, if one isn’t being utilized, those resources can be pointed in another direction.
To feed all this there is 127% more bus bandwidth to the Hexagon, bigger L2 caches, and a more powerful DMA unit. It also has its own memory processor so it can sort of grind through long complex jobs with minimal CPU supervision. If you think about how complex some of these matrix and tensor operations can be, you don’t want to have a CPU micro-managing the entire process from afar do you? You want as much of the work the Hexagon does to be a closed loop, fire and forget type of affair. And it is, say, ‘thanks scalar threads’. And it is faster, a claimed 80TOPS vs the 45TOPS in the X1. Mission accomplished, now if only there was useful software for it, but that isn’t Qualcomm’s fault.
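Percent-more claims and x-times claims are easy to mix up, so here is a small sketch that puts the quoted numbers on the same footing. The figures are the ones claimed in this article, the helper names are ours and purely illustrative.

```python
def multiplier_from_percent_increase(pct: float) -> float:
    """Convert a '+N%' claim into an 'X times the old value' multiplier."""
    return 1.0 + pct / 100.0


def percent_increase(old: float, new: float) -> float:
    """Convert an old/new pair into a '+N%' figure."""
    return (new - old) / old * 100.0


# Claimed generational gains quoted in the article, in percent.
claims = {
    "scalar throughput": 143,
    "vector throughput": 143,
    "bus bandwidth": 127,
}

for name, pct in claims.items():
    print(f"{name}: +{pct}% is {multiplier_from_percent_increase(pct):.2f}x")

# Claimed TOPS: 45 on the X1, 80 on the X2.
x1_tops, x2_tops = 45, 80
print(f"TOPS: {x1_tops} -> {x2_tops} is "
      f"+{percent_increase(x1_tops, x2_tops):.0f}% ({x2_tops / x1_tops:.2f}x)")
```

Put that way, +143% is 2.43x and +127% is 2.27x, while 45TOPS to 80TOPS is roughly 1.78x, or about +78%.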
Guardian:
Next up is the Qualcomm Guardian technology. It is the hardware management that the X1 systems lacked, Microsoft and OEMs lied about, and enterprises needed. This is a good thing, and a very bad thing, depending on what parts of it you look at. Whatever the case it can locate and track your PC, lock and wipe it, and manage it. Basically Intel vPro with the addition of a cellular modem, which is a good thing. And very bad too. Plus it is locked to Windows so insecurity is mandatory.
Guardian looks good on paper
The good is obvious, these are desperately needed features for any enterprise, anyone who buys a device without the management side for a fleet is, well, incompetent. Qualcomm knew this and has addressed the problem with the X2, and seems to have addressed it well. The one thing we will point out is that the services Qualcomm offers have no compatibility with anything any sane enterprise has currently deployed. If you buy an X2 device, you will have to provide your IT ops with a brand new secondary console to manage a subset of their fleet. They love that kind of thing, just ask them. That said it is far less of an issue than not having any way to do basic device management, so let’s call it a net good thing here.
The bad is twofold. First if you have Guardian, you need either a 5G modem, X75 for the SDXE2, or a 4G IOT modem for basic messaging and GPS, WiFi is supported but it lacks location. Fair enough. Guardian communicates with a Qualcomm server and back end for obvious reasons, you need a common point of contact to relay messages. Guardian also works over WiFi, RFC 1149 (and we presume RFC 2549 but it wasn’t explicitly stated), and nearly any other carrier that can get the packets to the back end, but as mentioned earlier, some features may be lacking without a cell modem. This is a good thing right?
Not in SemiAccurate’s opinion. Why? Two big issues, cost and having a third party involved in your security. Let’s look at cost first, starting with the modem. An added modem adds cost but you can pick the really ‘cheap’ 4G IOT modem if you don’t want a 5G modem in your PC. Given the uselessness of 5G in a Windows PC, (Note: Qualcomm people, you don’t pay for your service directly, most of humanity does and it is just too overpriced for mere mortals. That nicely explains how sales of 4G/5G PCs have NOT taken off in the market.) the 4G modem, or better yet no cellular modem is probably a better bet. Basically why have an always on attack surface in your PC, a Windows box no less? It is just dumb and because of what it does, you can’t turn it off. No that isn’t a joke, if you could turn it off, it would kinda defeat the purpose of lock and wipe for a stolen laptop.
But on top of that, the back end costs money to run, and Qualcomm expects to be paid for that. Fair enough, they do the work, they deserve the money. Or the market failure, but let’s be glass half-full people here. The problem is that the business model is that a fee of about $20 a year will be charged to the OEM, not the user. The OEM will roll it into the price of the system, plus the cost of the modem.
The user pays for it whether they use it or not. And they are vulnerable because you can’t turn it off. So you can’t say no, you are upcharged a fairly sizable chunk of the BoM, and you have no say in it. Assuming the X2, unlike the X1, actually works this time around, this is a deal breaker, a bad idea at a high price. Let’s assume three years up front, so $60 to the OEMs, multiplied by whatever margins they take, and you have a double digit percentage of your MSRP for forced security holes.
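To make the cost math concrete, here is a rough sketch of how the fee and modem roll up to the sticker price. Only the roughly $20 a year fee and the three year assumption come from the text above. The modem cost, OEM markup, and MSRP below are hypothetical placeholders we picked for illustration, so plug in your own numbers.

```python
def guardian_share_of_msrp(
    fee_per_year: float,
    years: int,
    modem_cost: float,
    oem_markup: float,
    msrp: float,
) -> float:
    """Percent of MSRP attributable to the Guardian fee plus modem.

    The BoM adder is the prepaid service fee plus the modem cost, and the
    OEM is assumed to multiply that adder by `oem_markup` at retail.
    """
    bom_adder = fee_per_year * years + modem_cost
    retail_adder = bom_adder * oem_markup
    return retail_adder / msrp * 100.0


# From the article: about $20 per year, three years paid up front.
FEE_PER_YEAR = 20.0
YEARS = 3

# Hypothetical inputs, NOT from the article or from Qualcomm.
MODEM_COST = 30.0   # assumed cost of a 'cheap' 4G IOT modem
OEM_MARKUP = 1.5    # assumed BoM-to-retail multiplier
MSRP = 1000.0       # assumed system price

share = guardian_share_of_msrp(FEE_PER_YEAR, YEARS, MODEM_COST, OEM_MARKUP, MSRP)
print(f"Guardian and modem: {share:.1f}% of a ${MSRP:.0f} system")
```

Where the result lands depends heavily on the assumed modem cost, markup, and system price, so treat the output as a way to sanity check the claim against your own numbers, not as a measured figure.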
What is a nice idea on paper should be avoided at all costs at the checkout line. If a system has a Guardian logo on it, just say no. Intel’s vPro doesn’t have these issues, and their PCs actually work right. But back to the half-full thing, at least Qualcomm knows there is a problem and is trying to do something about it. Yay? Want to bet high end X2 Elite Extreme systems will only come with a modem and the attendant security flaws?
Then there is the deal breaker, Microsoft’s Pluton ‘security’ block. All we can say is that a third party remotely accessible CPU that can snoop any part of your PC silently is unacceptable. The fact that it can be arbitrarily updated with, well, anything, is also unacceptable. Qualcomm didn’t comment on the block and Microsoft lies about it through omission, at least according to three sources SemiAccurate talked to who have implemented the block in released products. Intel did it right, AMD and Qualcomm didn’t, and so any device that has Pluton from those vendors should be treated as insecurable and vulnerable. Throw in the aforementioned always on cell modem and you have a party! S|A