Eighty years ago, as a result of computational work done in the mid-1940s and presented in June 1945 in the 101-page First Draft of a Report on the EDVAC, John von Neumann developed the theory behind computational devices that could be fed programs and data while the device was running. Given the manufacturing technology available at the time, von Neumann proposed a practical, manufacturable, modular computer design based on the modularity of the stored-program computer, in which program instructions and data are held in the same memory. Known as the von Neumann architecture – or sometimes the Princeton architecture – this model became the standard design paradigm for computers and continues to dominate the architecture of almost all of today’s microprocessors … as such, this architecture that was a great idea eighty years ago, a breakthrough for computing machines that were manufacturable in the 1940s, has now become the von Neumann bottleneck: the shared buses between the CPU, memory and I/O can move only a limited amount of data at a time, so even mitigations such as caches, scratchpad memory, CPU stacks or hierarchical memory architectures generally still force waits and lower throughput.

Given that our manufacturing technologies have advanced significantly … or, more correctly, given that we can now design and develop new manufacturing technologies to realize nearly any design that can be contemplated within the molecular constraints operating on atomic interactions at the nanoscale … what are the NEW compute paradigms we can use to break the von Neumann bottleneck?

Some of these paradigms are relatively simple, or at least represent practical, very achievable steps that can take us a very large part of the way past the von Neumann bottleneck. For example, Processing-In-Memory (PIM) architectures break most of the sequential von Neumann bottleneck by integrating bit-serial processors directly within memory.
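
To make the bit-serial style concrete, here is a minimal Verilog sketch of the kind of one-bit-per-cycle adder that PIM designs typically replicate beside each memory column. The module and signal names are illustrative, not taken from any particular PIM product:

```verilog
// Minimal bit-serial adder: produces one result bit per clock cycle.
// A PIM fabric replicates an element like this per memory column,
// streaming operands out of the array least-significant bit first.
module bitserial_adder (
    input  wire clk,
    input  wire rst,    // clears the stored carry between operations
    input  wire a_bit,  // operand A, LSB-first
    input  wire b_bit,  // operand B, LSB-first
    output wire sum_bit // result bit, same LSB-first order
);
    reg carry;

    assign sum_bit = a_bit ^ b_bit ^ carry;

    always @(posedge clk) begin
        if (rst)
            carry <= 1'b0;
        else
            carry <= (a_bit & b_bit) | (a_bit & carry) | (b_bit & carry);
    end
endmodule
```

An N-bit add therefore takes N cycles per column, but with thousands of columns operating in lockstep the aggregate throughput is enormous – and no operand ever has to leave the memory array.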

Moreover, it’s not JUST about putting processing into memory … it’s also about structuring the hardware for particular types of sensing and computation, with the immediate, neuro-like response of sensing incorporated directly into computational or neurocognitive systems, i.e. as with the distributed neurology of the sense of smell. The increasing demand for more efficient, more intelligent IoT devices, and for speedier AI built directly into IoT sensor clouds and robotic swarms, highlights the importance of synthesizing, designing, optimizing and then building specifically-tailored Large Language Model (LLM) accelerators. Generally, optimizing new hardware like LLM accelerators involves synthesizing a Verilog-HDL hardware design and then simulating hardware models at the Register-Transfer Level (RTL) using ModelSim or another HDL simulator, in order to optimize the performance of a proposed architecture well before it is committed to actual manufacture.
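
As a sketch of what that RTL simulation step looks like in practice, the following testbench exercises the bit-serial adder above and could be run unmodified in ModelSim or any other Verilog simulator. The test values are arbitrary, chosen only to illustrate the workflow:

```verilog
// Testbench sketch: drives an 8-bit add through the bit-serial adder
// at the RTL level and checks the streamed-out result.
`timescale 1ns/1ps
module tb_bitserial_adder;
    reg clk = 0, rst = 1;
    reg a_bit = 0, b_bit = 0;
    wire sum_bit;

    reg [7:0] a = 8'd100, b = 8'd57;  // expect 100 + 57 = 157
    reg [7:0] sum = 8'd0;
    integer i;

    bitserial_adder dut (.clk(clk), .rst(rst),
                         .a_bit(a_bit), .b_bit(b_bit), .sum_bit(sum_bit));

    always #5 clk = ~clk;  // free-running clock

    initial begin
        @(negedge clk) rst = 0;
        for (i = 0; i < 8; i = i + 1) begin
            @(negedge clk);          // present the next operand bits
            a_bit = a[i];
            b_bit = b[i];
            @(posedge clk);          // capture the sum bit before the
            sum[i] = sum_bit;        // carry register updates
        end
        $display("sum = %0d (expected 157)", sum);
        $finish;
    end
endmodule
```

It is exactly this simulate-measure-revise loop, scaled up from a toy adder to a full accelerator model, that lets a proposed architecture be optimized before any silicon is committed.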

There’s significant demand driving this capability now that LLMs have emerged as such a gigantic component of future Artificial Intelligence and the Internet of Things (AIoT) strategies, which aim to integrate natural language processing (NLP) capabilities more directly into various IoT applications. Computation on general-purpose graphics processing units (GPUs) inflicts unstructured, chaotic, even reckless demand for I/O bandwidth, transferring intermediate calculation results back and forth between memories and processing units – thus, GPUs not specifically designed to address the self-attention module of LLM transformers result in sub-optimal LLM accelerator designs. The self-attention module is the most compute-intensive sub-structure, occupying well over two-thirds of the operations in prevailing transformer-based LLM architectures. Given the priority of this need, work has begun on an open-source self-attention accelerator, AttentionLego, for constructing spatially expandable LLM processors incorporating PIM-based matrix-vector multiplication with a look-up-table-based Softmax approximation … which will require empirically evaluating the performance of the Softmax function … and entering the realm of Verilog-HDL synthesis and simulation.
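
AttentionLego’s actual table contents and interfaces are not reproduced here, so the following is only a hedged sketch of the general look-up-table idea: replace the expensive exp() in Softmax with a small ROM indexed by a quantized attention score, then normalize by the running sum of the looked-up values. Widths, table size, and the placeholder curve are all invented for illustration:

```verilog
// Sketch of a look-up-table exp() stage for a Softmax approximation.
// A quantized attention score indexes a ROM of precomputed exp values;
// dividing each entry by the running sum of entries completes Softmax.
module softmax_exp_lut (
    input  wire        clk,
    input  wire [4:0]  score_q,   // attention score, quantized to 5 bits
    output reg  [15:0] exp_q      // exp(score) in 16-bit fixed point
);
    reg [15:0] lut [0:31];
    integer i;

    // Fill the table at elaboration time. A real design would load a
    // table generated offline from exp() over the expected score range
    // (e.g. via $readmemh); this curve is a placeholder, not real exp().
    initial begin
        for (i = 0; i < 32; i = i + 1)
            lut[i] = 16'd1 << (i / 2);
    end

    always @(posedge clk)
        exp_q <= lut[score_q];
endmodule
```

Whether the normalization divide is itself a second (reciprocal) table or a shared divider, and how much accuracy the quantization costs, are precisely the questions that the empirical Verilog-HDL evaluation mentioned above is meant to settle.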

It’s not just self-attention accelerators – there are PLENTY OF OTHER NEEDS driving greater use of these hardware synthesis and simulation toolchains. For example, several recently proposed hardware architectures modify a Field Programmable Gate Array’s (FPGA) Block RAM (BRAM) architecture to form a next-generation reconfigurable PIM fabric overlay. PiCaSO, a Processor in/near Memory Scalable and Fast Overlay, is representative of such a PIM overlay for FPGA BRAM. PiCaSO achieves up to 80% of the peak throughput of custom PIM overlay designs, with 2.56× shorter latency and 25% – 43% better BRAM memory utilization efficiency. Several key features of the PiCaSO overlay can be integrated into custom PIM designs to further improve their throughput by 18%, latency by 19.5%, and memory efficiency by 6.2%.

If the previous examples of a self-attention accelerator and PIM overlay fabric designs were not sufficient to convince us that we need to be more knowledgeable about hardware synthesis and simulation toolchains, a new type of FPGA building block, the integrated Compute RAM block, has been proposed to provide highly-parallel processing-in-memory (PIM) by combining computation and storage capabilities INTEGRATED into one block. The configurable building blocks of currently available FPGAs – Logic Blocks (LBs), Digital Signal Processing (DSP) slices, and Block RAMs (BRAMs) – make FPGAs efficient hardware AI accelerators … but communication between these blocks must happen through an interconnect fabric consisting of switching elements spread throughout the FPGA. Compute RAM promises to provide the storage, the computation capability and the control logic integrated into one block. There are many advantages of using integrated Compute RAM FPGA building blocks:

Energy efficiency/Higher frequencies of operation – Because the computation happens inside the memory block, no wire or switching energy is spent sending data to and from distant compute units, and data movement between the various blocks on the FPGA is significantly reduced. This reduces power consumption and increases energy efficiency. Another impact of the reduced dependence on the FPGA interconnect is that designs can now operate at higher clock frequencies, thereby speeding up applications.

Configurability/Programmability – Any custom operation at any custom precision can be supported by a Compute RAM block; no hardware with hardcoded support for specific operations and a fixed set of precisions is involved. To perform a different operation, or to use a different precision, only the instruction sequence needs to be modified, either at FPGA configuration time or at execution time. Changing the instruction sequence at execution time makes Compute RAMs programmable in a software-like manner (see the sketch after this list of advantages).

Greater Bandwidth – Using a Compute RAM for compute removes the bandwidth limitations of a BRAM that stem from its fixed geometry and limited number of ports. In a Compute RAM there are as many operations in flight at a time as there are columns – a 512-column array, for example, can have 512 bit-serial operations progressing simultaneously. Users no longer need to split data across multiple BRAMs, using only a few rows of each, just to obtain more bandwidth. Now the array can be fully utilized, and the total area of implementing a circuit is greatly reduced.

Greater Compute Density – Using Compute RAMs reduces the area needed to implement a given circuit. Compared to a plain BRAM, a Compute RAM carries the area overhead of its instruction memory, controller and peripheral logic; however, this overhead is smaller than the combined area of the BRAM, DSP slice and several LBs required to realize the same computation on a baseline FPGA. This also reduces power consumption and, more importantly, means larger circuits can now fit on the same FPGA chip. Adding Compute RAMs to FPGAs thus increases the compute density of the FPGA (GOPS/mm²).
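
To illustrate the programmability point above, here is an illustrative-only Verilog sketch of a Compute RAM style sequencer: the operation the block performs is defined by a small instruction memory rather than hardcoded datapath logic. The opcode encoding, field layout, and the 4-bit add microprogram below are all invented for illustration; real Compute RAM instruction sets differ:

```verilog
// Sketch: the block's behavior is a microprogram, so rewriting the
// instruction memory (at configuration or execution time) changes the
// operation or the precision with no hardware change.
module compute_ram_sequencer (
    input  wire       clk,
    input  wire       rst,
    input  wire       we,        // rewrite the program at run time...
    input  wire [3:0] waddr,     // ...to change operation or precision
    input  wire [7:0] wdata,
    output reg  [7:0] micro_op   // drives row decoders / bitline logic
);
    reg [7:0] imem [0:15];       // 16-entry instruction memory
    reg [3:0] pc;

    // Invented encoding: {op[1:0], row_a[2:0], row_b[2:0]}
    // op 2'b01 = read two rows, write their bit-serial sum back.
    initial begin
        imem[0] = {2'b01, 3'd0, 3'd4};  // bit 0: A (row 0) + B (row 4)
        imem[1] = {2'b01, 3'd1, 3'd5};  // bit 1
        imem[2] = {2'b01, 3'd2, 3'd6};  // bit 2
        imem[3] = {2'b01, 3'd3, 3'd7};  // bit 3: a 4-bit add in 4 steps
        // An 8-bit add is the same loop over 8 instructions: precision
        // is simply the length of the sequence.
    end

    always @(posedge clk) begin
        if (we)
            imem[waddr] <= wdata;       // software-like reprogramming
        if (rst)
            pc <= 4'd0;
        else begin
            micro_op <= imem[pc];
            pc <= pc + 4'd1;
        end
    end
endmodule
```

The design choice to expose precision as sequence length, rather than as hardwired datapath width, is what lets one physical block serve 4-bit quantized inference and 32-bit accumulation alike.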

Of course, greater integration means more specifically-tailored hardware … which ultimately propels us toward a vision of accelerated technological obsolescence, naturally accompanied by an arms race in improving both our manufacturing capabilities and our hardware synthesis and simulation toolchains … and the continual fog of innovation around the globe and into space will persist, much like the water vapor cycle: evaporating with warmth over the oceans, raining out, eroding, and running off back into the sea.