Seija - Part 1. Fundamentals of Asynchronous Logic Design

written by Yukiko on 2023-12-18

The STG of the Register File in AMULET-1

Introduction

I. Asynchronous Circuits

Dichotomy of Quantization in the Time Domain

Overview of the Organization of Logic Pipelines in Async. and Sync. Digital Circuits [8]

An asynchronous circuit is a sequential digital logic circuit that does not use a global clock circuit or signal generator to synchronize its components. Instead, the components are driven by a handshaking circuit, which indicates the completion of a set of operations. [1, 2] Due to the elimination of a global clock, asynchronous circuits have the following advantages: [2]

No global clock: Getting a clock into a chip and distributing it in a skew-free way is very taxing in modern chip design. It usually requires layers of PLLs, DLLs, and high-power clock buffers to condition the clock. [3]
The clock period can be flexible: The timing of asynchronous logic is arbitrary and local. With the help of completion detection circuits, the circuit can acknowledge at the exact moment the operation finishes. By comparison, synchronous circuits share a global clock whose period is set by the worst-case evaluation time of the slowest stage in the circuit.
Effectively, the clock and switching noise are randomly modulated: In the RF spectrum, the switching noise and clock energy are spread out like noise instead of being concentrated in a clear tone. This makes the chip more secure against EMI probing attacks and power-line analysis attacks.
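
To make the flexible-period argument concrete, here is a minimal Python sketch. It is not taken from the cited references: the delay range, the handshake overhead, and the function names are all hypothetical, chosen only to compare a worst-case global clock against ideal completion detection on a single stage.

```python
import random

def synchronous_time(delays, worst_case):
    """Total time when every operation occupies a clock period sized for the worst case."""
    return len(delays) * worst_case

def self_timed_time(delays, handshake_overhead):
    """Total time when each operation takes its actual delay plus a fixed handshake cost."""
    return sum(d + handshake_overhead for d in delays)

# Hypothetical numbers in arbitrary time units, not measurements.
BEST, WORST, OVERHEAD = 2.0, 10.0, 1.0
rng = random.Random(0)
delays = [rng.uniform(BEST, WORST) for _ in range(1000)]

t_sync = synchronous_time(delays, WORST)
t_async = self_timed_time(delays, OVERHEAD)
print(f"synchronous: {t_sync:.0f}, self-timed: {t_async:.0f}")
```

If every operation happens to hit the worst case, the handshake overhead makes the self-timed version slower, so the benefit depends on how data-dependent the delay really is.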

However, as is evident from their market share, asynchronous circuits have seen very limited use in practical chip designs. Here are some possible explanations:

As will be shown in the later parts of this report, asynchronous circuits are based on different theories and techniques. Many of them are not compatible with their easier-to-understand synchronous counterparts.
Hard to simulate: Asynchronous circuits must be simulated using event-based, time-based simulators that capture gate delay accurately. Compared to cycle-based simulators like Verilator [5], those simulators are typically 100x slower or worse. [4]
Higher speed is not always desirable: The gates in an asynchronous system tend to run at a much faster speed than their synchronous counterparts. This was valuable in the Intel Pentium era (circa 1990 to 2000) because, back then, processor performance was still heavily bottlenecked by logic evaluation speed and clock frequency. Today, the bottleneck has shifted to DRAM/SRAM access speed and power dissipation limits. Improving logic speed in advanced nodes is not only of little benefit but, in some cases, may have adverse effects by generating too much heat. [6]
Due to the aforementioned difficulties, the asynchronous flow is nearly completely unsupported by contemporary commercial EDA tools.

To summarize, asynchronous circuit design is not, and probably never will be, mainstream. However, I personally believe that understanding async techniques and using them strategically in certain building blocks can yield a truly optimal system design.

Another note on the higher-speed implication: I believe async circuits may still be relevant in trailing-edge CMOS technologies and non-CMOS processes such as thin-film-transistor technologies [7]. In those technologies, logic speed is still heavily bottlenecked by slow transistors, so a higher clock frequency remains desirable for the industry.

II. Quasi-Delay Insensitive (QDI)

A circuit that operates "correctly" with positive, bounded but unknown delays in wires as well as in gates is classified as delay-insensitive (DI). However, barely any useful circuits can be derived under this restriction. A more useful type of circuit called “Quasi-Delay-Insensitive” (QDI) relaxes the restriction a bit by allowing isochronic forks. [2]

The following is a diagram showing such a fork structure and the criteria it needs to meet to be qualified as QDI. It’s also possible to further relax the restriction to create new circuit families like “Scalable-Delay-Insensitive” (SDI) [9], but it is out of the scope of this writing.

Isochronic Fork

Isochronic Requirements of QDI and SDI Circuits [9]

III. Generalized Methodology of Implementing an Asynchronous System

People tend to associate asynchronous circuit design with “pipeline” type structures. This is evident from the fact that most instructional materials [12] illustrate asynchronous circuits using examples like the one shown below:

PS0 Style Dynamic Pipeline

"PS0-Style" Dynamic Pipeline [12, 13]

However, it turns out that this is a very early and primitive way of describing async data flow. There is a strong disconnect between this kind of dataflow and practical ones like the register file in the AMULET1 microprocessor:

AMULET1 Register File

Organization of the Asynchronous Register File in AMULET1, Representing a More Practical Case [14]

It turns out that it’s possible to implement complex function blocks with only a couple of gates. To achieve this, a different technique needs to be employed.

In more recent publications, Petri Nets (PN) and Signal Transition Graphs (STG) are the most common and practical ways to describe complex asynchronous control structures. To illustrate this, I present a simple controller that takes data from external memory, performs some operations on it, and writes it back. When devising this circuit, the designer has to take into account that it is undetermined whether the data calculation or the address calculation will finish first.

Example 1 - PN

Example: A PN of Simple Async Memory Operations

Example 1 - STG

The same Diagram is Represented in STG. The Ack & Req Signals are Explicitly Labeled
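
As a rough companion to the example, the token game of such a net can be sketched in a few lines of Python. The place and transition names below are hypothetical and are not read off the figures. The sketch also assumes a 1-safe net (at most one token per place), so a marking can be stored as a set of marked places.

```python
# Hypothetical net: transition name -> (input places, output places).
NET = {
    "read":      ({"idle"}, {"addr_pending", "data_pending"}),   # fork
    "addr_calc": ({"addr_pending"}, {"addr_ready"}),
    "data_calc": ({"data_pending"}, {"data_ready"}),
    "write":     ({"addr_ready", "data_ready"}, {"idle"}),       # join
}

def enabled(marking, transition):
    """A transition is enabled when all of its input places hold a token."""
    inputs, _ = NET[transition]
    return inputs <= marking

def fire(marking, transition):
    """Consume the input tokens and produce the output tokens."""
    if not enabled(marking, transition):
        raise ValueError(f"{transition} is not enabled in {sorted(marking)}")
    inputs, outputs = NET[transition]
    return frozenset((marking - inputs) | outputs)

def run(sequence, marking=frozenset({"idle"})):
    """Fire a sequence of transitions and return the final marking."""
    for transition in sequence:
        marking = fire(marking, transition)
    return marking

# Either calculation may finish first; both orders return to the idle marking.
print(run(["read", "addr_calc", "data_calc", "write"]))
print(run(["read", "data_calc", "addr_calc", "write"]))
```

The "write" transition acts as the join: it cannot fire until both calculations have deposited their tokens, regardless of which one finished first.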

By the end of the 1990s, a number of mathematical techniques had been developed to synthesize PN and STG dataflow models into QDI circuits; methods of abstraction, analysis, formal verification, and circuit synthesis had already been established. [10, 11] With the help of those tools, I was able to get this result:

Example 1 - Circuit

The Control Circuit Represented by the STG Is Converted into Technology-Mapped Circuit Elements Using MPSat.

Example 1 - System

An Illustration of the Position of Controller and Functional Units in a General Asynchronous System

Implementation: Muller-C Element

I. Why do we need Muller-C?

The Muller C-element is a latch that synchronizes the phases of its inputs. A symbol for a 2-input C-element and its timing diagram are shown in the figures below. Initially, all the signals are in the low state. When both inputs in1 and in2 go high, the output out also switches to logical 1. It stays in this state until both inputs go low, at which stage the output switches to logical 0. [1, 2]

C - Timing

Timing and edge relation diagram of a C-Element
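
The behavior described above can be captured in a small behavioral model. This is only an illustrative Python sketch with zero gate delay; the function names and the input sequence are made up for the example.

```python
def c_element(in1, in2, previous_out):
    """2-input Muller C-element: follow the inputs when they agree, otherwise hold."""
    if in1 == in2:
        return in1
    return previous_out

def simulate(input_pairs, out=0):
    """Apply a sequence of (in1, in2) values and record the output after each step."""
    trace = []
    for in1, in2 in input_pairs:
        out = c_element(in1, in2, out)
        trace.append(out)
    return trace

# in1 rises first, then in2; in1 falls first, then in2.
waveform = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(simulate(waveform))  # [0, 0, 1, 1, 0]
```

The output only changes on the steps where both inputs agree, which is the "wait for both events" behavior used to join concurrent branches.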

In the original publication, when constructing the ILLIAC II computer, Dr. Muller described the element as a “Cumulative State” (C-State) holder. [3] From a modern perspective, the C-element represents a node at which branched events join, so it is also referred to as a “rendezvous” or “join” operation in some publications. [4] This element made larger, delay-insensitive asynchronous circuits possible because non-ideal timing jitter and skew can now be compensated at the gate level to ensure isochronicity.

In QDI implementations, it’s very common to find C-elements in synthesized circuits. Sometimes, it can be implicit. For example, the following circuit is a “complex-gate” implementation of a buck DC-DC controller from the Workcraft examples:

C - Buck STG

STG of a 3-state Buck Timing Controller [5]

C - Buck CG

“Complex Gate” Implementation of the STG, Generated using the MPSat Backend

Note that this circuit actually contains an implicit C-element, because the complex gate at the top has a feedback loop. The same circuit can be broken down into a “standard-C” implementation, as shown below. [6] In this version, the C-element is made explicit.

C - Buck SC

“Standard-C” Implementation of the same STG, Generated using the MPSat Backend

II. Composite Muller-C

In essence, the Muller C-element is a special kind of latch and can be broken down into smaller elements. To ensure correct operation, the circuit must be able to traverse the state space (i.e. the Karnaugh map) without generating any glitches during the transition. This requirement makes it a bit tricky to implement with sub-gates: very often, special constructs and additional gates need to be added to ensure glitch-free operation.

C - Soviet

A Dual-Rail C-Element Constructed Using Only NAND Gates [7]
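
For comparison, a commonly cited single-rail decomposition (not the NAND-only circuit in the figure) builds the C-element from a 3-input majority function with the output fed back: out = in1·in2 + out·(in1 + in2). The Python sketch below only checks that this next-state function matches the behavioral definition for every input and state combination. It models no gate delays, so it says nothing about whether a particular gate-level mapping is glitch-free.

```python
from itertools import product

def c_behavioral(in1, in2, out):
    """Reference behavior: follow the inputs when they agree, otherwise hold."""
    return in1 if in1 == in2 else out

def c_majority(in1, in2, out):
    """Majority-with-feedback form: out = in1*in2 + out*(in1 + in2)."""
    return (in1 & in2) | (out & (in1 | in2))

# Exhaustively compare the two next-state functions.
mismatches = [
    (in1, in2, out)
    for in1, in2, out in product((0, 1), repeat=3)
    if c_behavioral(in1, in2, out) != c_majority(in1, in2, out)
]
print("mismatches:", mismatches)  # mismatches: []
```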

III. Custom-Built Muller-C

For this project, I decided to implement the Muller C-element using a more traditional transistor-level approach. The layout is drawn in MAGIC VLSI using the GF180 open-source PDK.

C - Custom Schematic

Two Popular Gate-Level Implementations [8, 9]

C - Custom Layout

Implementation of the circuit in GF180 9-Track StdCell Format

Implementation: Combinational Logic

I. The Optimal Timing Problem

Function units whose evaluation time is strongly data-dependent are very common on the critical path that dictates the throughput of the system. Historically, the biggest advantage of asynchronous circuit design has come from self-clocking based on completion detection: instead of waiting for the slowest possible scenario of a combinational block, the clock edge is delivered at the moment the computation is done, yielding the fastest timing possible.

Optimal Timing

Illustration of a Problem with Conventional Pipelines: The Clock Period is Dictated by the Slowest case of the Slowest stage. [4]
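
A classic instance of data-dependent evaluation time is a ripple-carry adder, where the delay roughly tracks the longest run of bit positions a carry has to travel through. The Python sketch below is an illustration under that simplified delay model, not something taken from the cited work. The 32-bit width and the random operands are arbitrary choices.

```python
import random

def longest_carry_chain(a, b, width):
    """Length of the longest carry ripple when adding a and b (simplified model)."""
    longest = run = 0
    for i in range(width):
        a_bit, b_bit = (a >> i) & 1, (b >> i) & 1
        if a_bit & b_bit:            # generate: a new carry starts here
            run = 1
        elif (a_bit ^ b_bit) and run:  # propagate: an existing carry moves on
            run += 1
        else:                        # kill, or nothing to propagate
            run = 0
        longest = max(longest, run)
    return longest

WIDTH = 32
rng = random.Random(0)
chains = [
    longest_carry_chain(rng.getrandbits(WIDTH), rng.getrandbits(WIDTH), WIDTH)
    for _ in range(10_000)
]
average_chain = sum(chains) / len(chains)
print(f"worst case: {WIDTH}, average on random operands: {average_chain:.1f}")
```

A synchronous design has to budget for the full-width chain on every cycle, whereas a self-timed adder with completion detection only pays for the chain that actually occurs.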

II. Completion Detection

The discussion above relies on “completion detection”, which seems to imply some special kind of logic, such as dual-rail logic that naturally indicates when the evaluation is done [5], or a circuit that monitors the current-flow waveform to determine whether the evaluation is done [3]. To make implementation easier, there are two less-than-ideal but general methods that can be applied to any combinational block: fixed delay and steady detection. [6]

The fixed-delay technique has been demonstrated in an asynchronous FPGA project [1], where most non-critical combinational operations are implemented using fixed delay.
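
To show why dual-rail logic "naturally" signals completion, here is a small Python sketch of a dual-rail completion detector. It assumes the usual encoding where each bit uses two wires, (1, 0) for logic 1, (0, 1) for logic 0, and (0, 0) as the empty spacer, and it assumes the illegal code (1, 1) never occurs. The function names are hypothetical.

```python
SPACER = (0, 0)  # "no data yet" code word

def encode(bit):
    """Dual-rail encode one bit as (true_rail, false_rail)."""
    return (1, 0) if bit else (0, 1)

def bit_valid(pair):
    """Per-bit detector: in hardware this is an OR of the two rails."""
    true_rail, false_rail = pair
    return (true_rail | false_rail) == 1

def completion(word, previous_done):
    """Word-level detector, behaving like a C-element over the per-bit detectors."""
    if all(bit_valid(pair) for pair in word):
        return 1              # every bit has resolved: evaluation is done
    if all(pair == SPACER for pair in word):
        return 0              # every bit has returned to the spacer
    return previous_done      # partially valid: hold the previous value

done = 0
for word in (
    [SPACER, SPACER],         # idle
    [encode(1), SPACER],      # only one bit has resolved so far
    [encode(1), encode(0)],   # all bits valid
    [SPACER, encode(0)],      # resetting, not yet empty
    [SPACER, SPACER],         # back to idle
):
    done = completion(word, done)
    print(word, "->", done)
```

The fixed-delay method replaces this detector with a delay line sized for the worst case, which is simpler but gives up the data-dependent speedup.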

Note that there are countless ways to obtain the completion signal, and the detector often needs to be custom-made case by case. Completion detection is also not restricted to combinational logic circuits. A very common example is the controller of a SAR ADC: to check whether the clocked comparator in the loop has finished comparing the input, the completion detector needs to determine whether the two outputs have diverged. [7] Other examples include asynchronous SRAMs and DRAMs. [8]

SAR Timing

Completion Detection Problem in SAR ADCs [7]
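
The "have the outputs diverged" check can be sketched as follows. This Python snippet is only an illustration under a textbook assumption that is not stated in the text: the comparator's output difference regenerates exponentially with a time constant tau. The threshold, tau, and function names are hypothetical.

```python
import math

def has_diverged(out_p, out_n, v_threshold=0.5):
    """Completion check: the two comparator outputs differ by at least the threshold."""
    return abs(out_p - out_n) >= v_threshold

def decision_time(v_in_diff, tau=1.0, v_threshold=0.5):
    """Time for an exponentially growing difference to reach the threshold."""
    if v_in_diff == 0:
        return math.inf       # ideal metastable point: never resolves in this model
    return max(0.0, tau * math.log(v_threshold / abs(v_in_diff)))

# Smaller input differences take longer to resolve.
for v_in in (0.25, 0.05, 0.001):
    print(f"input difference {v_in}: decision time {decision_time(v_in):.2f} tau")
```

Under this model the comparison time depends on the input, which is why a completion detector can be preferable to a fixed time budget for every bit decision.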

Completion Detectors

An Illustration of Different Ways to Generate the Completion Signal

References

Introduction

[1] Wikipedia contributors. "Asynchronous circuit." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 30 Oct. 2023. Web. 15 Dec. 2023.
[2] J. Sparsø, S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective, Springer 2001
[3] S. Li, A. Krishnakumar, E. Helder, R. Nicholson, and V. Jia, "Clock generation for a 32nm server processor with scalable cores," 2011 IEEE International Solid-State Circuits Conference, San Francisco, CA, USA, 2011, pp. 82-83, doi: 10.1109/ISSCC.2011.5746229.
[4] S. Herbst, et al., An Open-Source Framework for FPGA Emulation of Analog/Mixed-Signal Integrated Circuit Designs, IEEE TCAD-ICS, 2022
[5] https://www.veripool.org/verilator/
[6] E. Sicard, S. D. Bendhia, Advanced CMOS Cell Design, McGraw-Hill Companies, 2007
[7] J. Biggs, et al., A natively flexible 32-bit Arm microprocessor, Nature, 2021
[8] E. Yahya, Modélisation, Analyse et Optimisation des Performances des Circuits Asynchrones Multi-Protocoles, Micro et Nano Electronique, Thesis, 2009
[9] A. Takamura, et al., TITAC–2: An asynchronous 32-bit microprocessor based on Scalable-Delay-Insensitive model, S2CID, 1997
[10] Workcraft Document - Backend Tools
[11] J. Cortadella, et al., Petrify: a tool for manipulating concurrent specifications and synthesis of asynchronous controllers, IEICE TIS, 1997
[12] S. M. Nowick, M. Singh, High-Performance Asynchronous Pipelines: An Overview, IEEE DTC, 2011
[13] T. E. Williams, M. A. Horowitz, A Zero-Overhead Self-Timed 160-ns 54-b CMOS Divider, JSSC 1991
[14] N. C. Paver, et al., Register locking in an asynchronous microprocessor, VLSICP 1992

Muller-C Element

[1] N-P Nguyen, et al., Design and analysis of a robust genetic Muller C-element, Journal of Theoretical Biology, 2010
[2] Workcraft Document - Design of C-element
[3] Muller, David E, Theory of asynchronous circuits, Report (University of Illinois at Urbana-Champaign. Dept. of Computer Science) no. 66, 1955
[4] M. Moreira, et al., Impact of C-Elements in Asynchronous Circuits, ISQED 2012
[5] Workcraft Document - Design of basic buck controller
[6] V. Khomenko, Derivation of Monotonic Covers for Standard-C Implementation Using STG Unfoldings, ASYNC 2008
[7] V. Varshavskiy, et al., Functional completeness in the class of semimodular circuits, Soviet Journal of Computer and Systems Sciences, vol. 23, no. 6, pp. 70-80, 1985.
[8] Wikipedia contributors. "C-element." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 30 Nov. 2023. Web. 15 Dec. 2023.
[9] J. Sparsø, S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective, Springer 2001

Combinational Logic

[1] C. LaFrieda, B. Hill, and R. Manohar. An Asynchronous FPGA with Two-Phase Enable-Scaled Routing, ASYNC 2010
[2] R. Manohar. Reconfigurable Asynchronous Logic, CICC 2006
[3] O. C. Akgun, et al., Design of completion detection circuits for self-timed systems operating in subthreshold regime, PRIME 2007
[4] K. J. Nowka, High-Performance CMOS System Design Using Wave Pipelining, Stanford CSL-TR-96-693, 1996
[5] T. E. Williams, M. A. Horowitz, A Zero-Overhead Self-Timed 160-ns 54-b CMOS Divider, JSSC 1991
[6] P. Srivastava, Completion Detection in Asynchronous Circuits: Toward Solution of Clock-Related Design Challenges, Springer 2022
[7] Cho, et al., A Two-Channel Asynchronous SAR ADC With Metastable-Then-Set Algorithm, Trans. VLSI 2012
[8] V. Khomenko, et al., Formal Design and Verification of an Asynchronous SRAM Controller, ACSD, 2017