Hot Chips Fujitsu has developed the blueprints for its powerful Arm-based processor, the A64FX, the brains of its Post-K supercomputer.
The designs were shown on Tuesday at a gathering of semiconductor engineers in Silicon Valley. The Post-K is a 1,000-petaFLOPS design (an exascale machine) that will replace Japan's SPARC64-based K supercomputer. It is due to go online in 2021, and has just passed a series of tests showing that the processors work, to some extent at least.
Post-K is expected to be the fastest publicly known supercomputer in the world by the time it is fully up and running, and to draw 30 to 40MW. Today, the top slot is held by the US government's Summit machine, which uses IBM POWER9 and Nvidia Volta GV100 processors, along with Mellanox networking gear, to hit 188 petaFLOPS.
Crucially, it is an exascale Arm-compatible supercomputer, a major milestone for a CPU architecture that is famous for being in virtually all phones, hard drives, smart cards and other embedded electronics, and that dreams of running laptops and servers.
So what does a supercomputer Arm processor designed by Fujitsu look like? This is what we learned from Fujitsu's Toshio Yoshida at the Hot Chips engineering conference in Santa Clara: the A64FX has 8.8 billion 7nm FinFET transistors in a package with 594 pins, and 48 CPU cores plus four management cores. Each chip has a total of 32GB of high-bandwidth memory (HBM2), 16 PCIe 3.0 lanes and a total memory bandwidth of 1,024GB/s, and achieves at least 2.7 teraFLOPS.
The 52 CPU cores are split into four clusters of 12 main cores plus one management core. Each cluster has 8GB of HBM2 rated at 256GB/s and 8MB of shared L2 cache. Cache coherence is maintained across the clusters and the entire chip.
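The per-chip totals follow directly from the per-cluster figures. A quick sketch of that arithmetic, using only the numbers quoted above (the constant and function names are ours, not Fujitsu's):

```python
# Per-cluster figures quoted for the A64FX. All names here are illustrative.
CLUSTERS = 4
COMPUTE_CORES_PER_CLUSTER = 12
MANAGEMENT_CORES_PER_CLUSTER = 1
HBM2_GB_PER_CLUSTER = 8
HBM2_GB_PER_S_PER_CLUSTER = 256
L2_MB_PER_CLUSTER = 8


def chip_totals():
    """Scale the per-cluster figures up to a whole chip."""
    compute = CLUSTERS * COMPUTE_CORES_PER_CLUSTER
    management = CLUSTERS * MANAGEMENT_CORES_PER_CLUSTER
    return {
        "compute_cores": compute,
        "management_cores": management,
        "total_cores": compute + management,
        "hbm2_gb": CLUSTERS * HBM2_GB_PER_CLUSTER,
        "hbm2_gb_per_s": CLUSTERS * HBM2_GB_PER_S_PER_CLUSTER,
        "l2_mb": CLUSTERS * L2_MB_PER_CLUSTER,
    }


if __name__ == "__main__":
    for name, value in chip_totals().items():
        print(f"{name}: {value}")
```

The results line up with the headline numbers: 48 plus four cores, 32GB of HBM2 and 1,024GB/s of memory bandwidth.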
The chips are interconnected via Fujitsu's second-generation Tofu mesh-torus-like network. This interconnect can shift data in and out of each processor chip through 10 ports, each with two lanes that max out at 28Gbps apiece.
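For a sense of scale, the port and lane counts multiply out as follows. This is a back-of-the-envelope sketch that assumes the 28Gbps figure is a raw per-lane signalling rate in one direction, and it ignores any encoding or protocol overhead, so usable bandwidth will be lower than the number it prints.

```python
# Figures quoted for the Tofu links; the names are illustrative only.
PORTS = 10
LANES_PER_PORT = 2
GBITS_PER_LANE = 28  # assumption: raw per-lane rate, one direction


def raw_interconnect_gbits():
    """Aggregate raw link rate per chip in gigabits per second."""
    return PORTS * LANES_PER_PORT * GBITS_PER_LANE


def raw_interconnect_gbytes():
    """The same aggregate expressed in gigabytes per second."""
    return raw_interconnect_gbits() / 8


if __name__ == "__main__":
    print(raw_interconnect_gbits(), "Gbps raw")
    print(raw_interconnect_gbytes(), "GB/s raw")
```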
The cache hierarchy and speeds of the A64FX, for the 12 compute cores and one management core per cluster, four clusters to a chip … Source: Fujitsu
The CPU cores are 64-bit only (there is no 32-bit mode) and they use the Armv8.2-A instruction set. The chip supports Arm's 512-bit-wide SIMD Scalable Vector Extension (SVE), which we have previously described in detail. That means the chips can crunch vector and matrix calculations in hardware, a must for supercomputer and machine-learning applications. It also supports 16 and 8-bit integer math, which is useful for AI inference code, as well as the usual floating-point precisions (FP16, 32 and 64).
The A64FX is a superscalar, out-of-order-execution beast and, we're told, the first Armv8.2-A SVE design. People who have programmed in 32 and 64-bit Arm assembly know that the architecture has fixed-width instructions, one operation per instruction, in the classic RISC school of thought. Interestingly, in implementing SVE, the A64FX has an instruction prefix for its four-operand fused multiply-add (FMA4) instruction, an incredibly useful operation, which reminds this vulture of x86 instruction prefixes.
To perform the calculation r0 = r3 + r1 * r2, you use two instructions that are fused in the pre-decode stage and executed as a single operation, even though they are fetched as two instructions. These are:
    movprfx r0, r3         ; prefix next instruction
    fma3 r0, r0, r1, r2    ; r0 = r3 + r1 * r2, the r3 is substituted
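To see why the pair is equivalent to a single four-operand multiply-add, here is a toy register-file model. It is a sketch only: the function names mirror the pseudo-mnemonics in the listing above, the register file is a plain dictionary, and nothing here reflects how the A64FX actually decodes or fuses instructions.

```python
def movprfx(regs, dst, src):
    """Prefix step: copy src into dst so the next instruction may overwrite dst."""
    regs[dst] = regs[src]


def fma3(regs, dst, acc, a, b):
    """Three-source multiply-add: dst = acc + a * b."""
    regs[dst] = regs[acc] + regs[a] * regs[b]


def run_pair(regs):
    """Execute the two instructions from the listing one after the other."""
    movprfx(regs, "r0", "r3")
    fma3(regs, "r0", "r0", "r1", "r2")
    return regs["r0"]


def run_fused(regs):
    """What the fused operation computes in one step: r0 = r3 + r1 * r2."""
    regs["r0"] = regs["r3"] + regs["r1"] * regs["r2"]
    return regs["r0"]


if __name__ == "__main__":
    start = {"r0": 0, "r1": 3, "r2": 4, "r3": 10}
    print(run_pair(dict(start)))   # 22
    print(run_fused(dict(start)))  # 22
```

Either way r0 ends up holding r3 + r1 * r2 while r1, r2 and r3 are left untouched, which is the point of the prefix: the destination no longer has to double as the accumulator.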
The execution unit of each CPU core can perform two 512-bit SIMD operations at once. The input data is packed into 512 bits and crunched in one go, much like Intel's AVX-512 operations on its server parts. So you can feed in four 8-bit values and four corresponding 8-bit coefficients or weights, which are multiplied to give four products that are then added to a 32-bit offset and written out to a register.
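The 8-bit multiply-and-add described above can be sketched in a few lines. This is an illustrative model, not Fujitsu's implementation: it assumes signed 8-bit inputs, a signed 32-bit result that wraps on overflow, and 16 independent 32-bit lanes per 512-bit vector (512 / 32). The function names are made up for the example.

```python
def to_int32(x):
    """Wrap a Python integer to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def dot4_accumulate(values, weights, offset):
    """Multiply four 8-bit values by four 8-bit weights and add a 32-bit offset."""
    if len(values) != 4 or len(weights) != 4:
        raise ValueError("expected exactly four values and four weights")
    for v in (*values, *weights):
        if not -128 <= v <= 127:
            raise ValueError("inputs must fit in a signed 8-bit integer")
    return to_int32(offset + sum(v * w for v, w in zip(values, weights)))


def simd_dot(values, weights, offsets):
    """Model one 512-bit operation as 16 lanes, each doing dot4_accumulate."""
    lanes = 512 // 32
    if len(values) != 4 * lanes or len(weights) != 4 * lanes or len(offsets) != lanes:
        raise ValueError("expected 64 values, 64 weights and 16 offsets")
    return [
        dot4_accumulate(values[4 * i:4 * i + 4], weights[4 * i:4 * i + 4], offsets[i])
        for i in range(lanes)
    ]


if __name__ == "__main__":
    print(dot4_accumulate([1, 2, 3, 4], [5, 6, 7, 8], 10))  # 80
```

In the single-lane example, 1*5 + 2*6 + 3*7 + 4*8 is 70, and adding the offset of 10 gives 80.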
Fujitsu reckons the A64FX can hit 21.6 TOPS (trillions of, or tera, operations per second) with 8-bit integer math; 10.8 TOPS with 16-bit integers; 5.4 TOPS with 32 bits; and 2.7 TOPS with 64 bits, all with integer arithmetic. Generally speaking, the A64FX is said to be at least 2.5 times faster than Fujitsu's previous supercomputer processor, the SPARC64 XIfx, on HPC and AI work.
Nvidia's P4 and P40 accelerators for servers clock in at 22 and 47 TOPS with 8-bit integers, for what it's worth.
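The A64FX figures double each time the element width halves, which is what you would expect if twice as many elements fit into a fixed-width vector. A small sketch of that relationship, taking the quoted 64-bit figure as the baseline (the scaling rule is our reading of the numbers, not a formula from Fujitsu):

```python
BASE_WIDTH_BITS = 64
BASE_TOPS = 2.7  # quoted figure for 64-bit elements


def peak_tops(element_bits):
    """Scale the 64-bit figure by how many more elements fit per vector."""
    if element_bits not in (8, 16, 32, 64):
        raise ValueError("element width must be 8, 16, 32 or 64 bits")
    return BASE_TOPS * (BASE_WIDTH_BITS // element_bits)


if __name__ == "__main__":
    for bits in (64, 32, 16, 8):
        print(f"{bits}-bit: {peak_tops(bits):.1f} TOPS")
```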
The L1 cache has a combined gather mechanism that can fetch consecutive elements in arrays and copy them into a register. For example, you can use this to pull eight bytes from memory into one 64-bit register, with each byte slotted into its own byte position in the register. The per-core, four-way, 64KB L1 data cache is read by the execution engines at 230GB/s and written to at 115GB/s. The shared L2 cache supplies data at 115GB/s and accepts it at 57GB/s.
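In software terms, a gather of this kind looks roughly like the sketch below. It only illustrates the end result, that is, eight bytes picked out of memory and placed at successive byte positions of a 64-bit value. It assumes byte position 0 is the least significant byte, and it says nothing about how the A64FX's cache hardware actually services the request. The function name is hypothetical.

```python
def gather_bytes(memory, indices):
    """Pack memory[indices[i]] into byte position i of a 64-bit value."""
    if len(indices) != 8:
        raise ValueError("a 64-bit register holds exactly eight bytes")
    register = 0
    for position, index in enumerate(indices):
        register |= memory[index] << (8 * position)
    return register


if __name__ == "__main__":
    memory = bytes(range(16))
    print(f"{gather_bytes(memory, [0, 1, 2, 3, 4, 5, 6, 7]):#018x}")
    print(f"{gather_bytes(memory, [1, 3, 5, 7, 9, 11, 13, 15]):#018x}")
```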
Pipeline stages of the A64FX … Source: Fujitsu
Power consumption is monitored and controlled per chip on a millisecond basis, and per core down to the nanosecond. Fujitsu claims its A64FX has mainframe-class resilience, with ECC or duplication on all caches, parity checks within the execution units, instruction retry when an error is detected, error recovery on the Tofu interconnect links, and 128,000 error checkers in total across the chip.
The whole shebang runs Linux, with a Lustre-based distributed file system and non-volatile memory to speed up file I/O. The toolchain includes C, C++ and Fortran compilers, MPI, OpenMP, debuggers, and other utilities and languages.
You'll notice there are no third-party accelerators: it's pure Arm, done Fujitsu's way. The goal is to design a chip that runs supercomputer-style applications (simulations, analysis of scientific experiments, machine learning and other number crunching) with higher performance per watt than general-purpose CPUs.
Yoshida unfortunately would not talk about clock speeds or individual chip power usage. The machine is still a few years from completion, and not all specifications and implementation details have been nailed down or revealed. "We will continue to develop Arm processors," he told the conference. Despite the delays, Fujitsu has not been put off Arm big iron.
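As a rough illustration of the parity-check-then-retry idea, here is a generic sketch. It is a textbook even-parity check wrapped in a retry loop, not a description of Fujitsu's circuitry. The flaky adder that corrupts its first result exists only to exercise the retry path, and every name is hypothetical.

```python
def parity_bit(word):
    """Even-parity bit: 1 when the word has an odd number of set bits."""
    return bin(word).count("1") & 1


def passes_parity(word, stored_parity):
    """True when the word still matches the parity recorded for it."""
    return parity_bit(word) == stored_parity


def execute_with_retry(operation, max_attempts=3):
    """Re-run operation() until its result passes the parity check."""
    for attempt in range(1, max_attempts + 1):
        word, stored_parity = operation()
        if passes_parity(word, stored_parity):
            return word, attempt
    raise RuntimeError("result failed parity on every attempt")


def make_flaky_adder(a, b):
    """Build an adder whose first result suffers a simulated single-bit flip."""
    calls = {"count": 0}

    def operation():
        calls["count"] += 1
        result = a + b
        stored_parity = parity_bit(result)
        if calls["count"] == 1:
            result ^= 1  # flip one bit after parity was recorded
        return result, stored_parity

    return operation


if __name__ == "__main__":
    print(execute_with_retry(make_flaky_adder(2, 3)))  # (5, 2)
```

A single parity bit catches any odd number of flipped bits but cannot say which bit went wrong, which is why detection is paired with a retry rather than a correction.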
And yes, maybe you can play Crysis on it. ®