
Q. Hardware Construction Languages - Do we need them?
A. Yes - and we need to build other familiar abstractions on them.

We show how a number of useful HDL and higher-level programming applications can readily be built on Chisel. Bluespec lovers will approve that we support a Bluespec-like style but with the oft-discussed multi-cycle method invocations.

Contents

Implementing SAFL as a Chisel Shim

Implementing BlueSpec as a Chisel Shim

Implementing RTL as a Chisel Shim

Implementing Occam/Handel-C as a Chisel Shim

Composite Demo

Downloads

Hardware Construction Languages - Do we need them?

Hardware Construction Languages - Do we need them? SLIDES (PDF)

A Hardware Construction Language is here defined as a programming language for programs that 'print out' the circuit diagram, or netlist, for a hardware design. Such languages can vary greatly in their expressive power. Of interest are those that ensure the generated netlist is not only syntactically correct but also semantically correct. The latter implies that design rules, such as that no two gate outputs are directly connected, are obeyed, or that the widths of two busses being multiplexed are the same. Two notable recent players are Chisel and HardCaml. An earlier language was Lava. All three make great use of the standard combinators and paradigms found in functional programming languages, such as ML and Haskell. Bluespec, another recent language, also leverages this paradigm, but differs in a couple of ways. The first difference is that it reimplements the functional language itself with its own syntax and parser, whereas the other three are DSLs (domain-specific languages) that are embedded in a powerful host language: Chisel is embedded in Scala, Lava was embedded in Haskell and HardCaml is embedded in OCaml. We shall cover the second Bluespec difference later on.

The two everyday RTL languages, Verilog and VHDL, embody 'generate statements' and 'generate variables', which provide the hardware construction aspect just defined. These languages have staged execution (or embody metaprogramming), where part of the program is run at compile time and the remainder at run time. The compile-time execution is known in the field as 'elaboration'. All generate statements and genvars exist only during elaboration and only influence the amount of hardware generated. The generated hardware, when switched on and clocked, provides the run-time execution. (VHDL differs slightly from Verilog in that, for some VHDL designs, it can be a little tricky to work out whether a function is executed at run time or at compile time, but generally it is obvious.)

The execution of the combinators and functional programming we find in the hardware construction languages already mentioned is constrained to the elaboration stage. They all generate a finite amount of hardware with static allocation of user variables to registers: there is no run-time dynamic storage allocation. Nonetheless, both stages of execution potentially embody 'if' statements and other control-flow statements such as case or switch. When one of these statements alters which others get executed, we have 'data-directed control flow'. None of Lava, HardCaml or Chisel provides data-directed control flow, although they do support relatively usable synthesis of multiplexor trees from nestable constructs like Chisel's 'when' statement. A full projection of the 'if' statements from the source code to the hardware requires a run-time program counter per thread that follows the control flow of the part of the program that survives beyond elaboration. In these hardware construction languages the user must manually instantiate a finite-state machine to get this effect, whereas for HLS (high-level synthesis) tools, such as LegUp and Kiwi, this is their natural behaviour. Older Verilog compilers would support state-machine inference from the Verilog's control flow, but this was discontinued (or at least made to flag warnings) when the synthesisable subset was defined.

In this talk, Greaves will illustrate the essential differences between these styles and discuss the motivation for separate languages, which is a surprising departure when one considers that a major selling point of Verilog and VHDL was they almost seamlessly integrated simulation, generative constructs and algorithm expression. He will also discuss the need for higher-level concurrency primitives and how they might be provided in hardware construction languages.

David Greaves, PhD, MIET, is a Senior Lecturer at the University of Cambridge with interests in compiler and hardware design. He has considerable industrial experience at the CTO/Chief Scientist level and has led the design of many hardware systems in areas such as professional audio and broadband access.

His current research area is tooling for high-level simulation and energy instrumentation for system-on-chip based on transaction-level modelling (TLM). These enable new hardware architectures to be rapidly explored under real-world workloads, including accelerators for scientific workloads.

Talk Overview

  1. VHDL and Verilog worked since they incorporate simulation and synthesis and are 'resource aware', meaning that the engineer knows what he is going to get as a circuit.

  2. Hardware generation languages in the style of Lava and HardCaml have a lot of expressive power, using combinators and so on for generating elaborate circuits, but can really only serve niche applications owing to the lack of data-dependent control flow. Moreover, when generating logic containing registers they contribute nothing, since the data is unqualified (no valid tag) and there is no hazard control, which means a small change in the design can inadvertently resolve RaW hazards the other way, which is most likely not what is wanted.

  3. Chisel again has all the expressive power for hardware generation of Lava and its descendants, owing to being embedded in Scala. Scala has all the power typically found in functional languages but has the advantage that its syntax looks like familiar imperative languages in the Algol/Modula/Java/C# tradition. Chisel inherits this expressivity for hardware generation, and the primitives that are embedded are richer than Lava's, owing to the data-dependent conditional forms such as the 'when' clauses. Having these as a first-class construct is more natural than always having to use a predicated expression (aka a multiplexor) as in Lava. But the transactional paradigm is totally lost in (the current dialect of) Chisel, with the tutorial examples manually manipulating the handshake wires of every interface.

  4. HardCaml is essentially the same as Chisel, but is built on OCaml instead of Scala. OCaml is not as rich as Scala for DSL embedding.

  5. The SAFL-style function definitions of hardware from Mycroft and Sharp are elegant owing to being purely functional, yet allow the user or the compiler to flexibly and safely trade execution time for silicon area. But they are severely limited in practical applicability, owing to the RAM being a very important component in real hardware and the imperative array not being a first-class aspect of a pure functional language. Ditto Dan Ghica's 'Geometry of Synthesis IV: Compiling Affine Recursion into Static Hardware', although that adds a stack.

  6. Bluespec again has elaborate, Lava-like generation capabilities, but most of the elegance arising from the guarded atomic actions is lost unless one sticks to the FIFO interface throughout, since there is still no hazard control for registers and RAMs. But at least Bluespec really emphasises transactional interfaces (i.e. those with flow control) and tagged data storage (i.e. with a valid bit). This is a very useful raising of abstraction compared with conventional HDL, which has no manifestation of whether a register is live.

  7. What you really want is set out in the next item. (We might also make an aside comparing an FSM sub-language with guarded atomic actions, and could digress further into the CSP-based tools Occam/Handel-C and Esterel hardware ... but probably not.) The point is that all of these paradigms can easily be constructed on top of the infrastructure offered by a standard HCL 'power platform'.

  8. What you really want is
           a) yes, the powerful expressivity in your construction sub-language
           b) yes, the tagged live data and transactional interfaces of Bluespec
           c) yes, the amenability for manual or automatic refolding arising from pure languages
           d) yes, RAMs, but with some explicit manifestation of the 'state edge' in the language that constrains the order of reads
     and writes, so that optimisations and folding can be automated. It could also assert (undecidable) properties over address
     aliases. This would be a language feature not so far seen, but it is akin to a connected mesh of barrier instructions, where
     each is tagged as read, write, partial write or associative update (such as increment), and where the mesh is inferred within
     a single-threaded imperative subprogram according to the normal semantics. I think I will sketch out some examples ...
       ... the applicative monad versus the full monad is what we need ...
    

Citations and External Links

Hardware Construction Languages - Do we need them? SLIDES (PDF)

Chisel: Constructing Hardware in a Scala Embedded Language.

BCS OSHUG — Open Source Hardware User Group

'The CSYN Verilog Compiler' DJ Greaves. At the International Workshop on Field Programmable Logic. In LNCS Volume 975, pp 198-207. ISBN 3540 60294-1. Oxford, September 1995. PDF. CSYN Manual PDF.

SAFL: Higher-Level Techniques for Hardware Description and Synthesis; ICALP 2000, Alan Mycroft, Richard Sharp. Related Materials.

http://www.csl.cornell.edu/~cbatten/pdfs/lockhart-hgls-reproduce2014.pdf


TNDJG:008: Transactional Design Expression (Bluespec/SAFL/TLM) Using Chisel HDL/HCL

David J Greaves, University of Cambridge, Computer Laboratory


In this article we demonstrate various extensions to the Chisel hardware construction system that facilitate transactional design expression that is amenable to folding. In a further article we will (or might) demonstrate how this can be used as a basis for automated time/space trade-offs in various implementations.

Unpublished draft blogspot article

The register transfer languages VHDL and Verilog have enjoyed great success for hardware and testbench description. They remain the 'kingpin' between the front and back design flows in hardware design and chip manufacture. They are essentially equivalent in expressivity and are interchangeable. Their major strength was that they incorporated simulation and synthesis and are 'resource aware' meaning that the engineer knows what he's going to get as a circuit.

But both VHDL and Verilog are low-level languages in two important ways: 1) there is no representation of whether a register contains a live value, and 2) all handshaking wires and associated protocol logic are explicitly manifest.

Transaction-level modelling (TLM) sits above RTL in both respects: flow control can be implicit and liveness is explicit. By flow control, we mean that a value or set of values is only transferred between two components when both are ready for the exchange. When we say a register is live, we mean it currently contains a value that will affect the outcome of the computation.
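As a minimal sketch of the liveness idea (the names here are our own, not from any library), a live-tagged value can be modelled in the Chisel 2 dialect as a bundle pairing the payload with a tag bit:

class TaggedValue(w : Int) extends Bundle
{
  val live = Bool()            // does the payload currently matter?
  val data = UInt(width = w)   // the payload itself
}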

Net-level ports, interfaces and protocols

A hardware design consists of a number of modules interconnected by wires known as 'nets' (short for networks).

The interconnections between modules are typically structured as mating interfaces. An interface nominally consists of a number of terminals but these have no physical manifestation.

In a modern design flow, the protocol at an interface is specified once in a master file that is imported for the synthesis of each module that sports it.

A clock domain is a set of modules and a clock generator. Within a synchronous clock domain all flip-flops have their clocks commoned.

Chisel Baseline Facilities

Chisel is a hardware construction language embedded in Scala as a DSL (domain-specific language). It consists of a library package of some 15 Kloc of Scala that can generate Verilog RTL, SystemC, or a C model for its own fast simulator.

Chisel defines the product type 'Bundle' that acts like a Pascal RECORD or a C struct to group a number of named fields as a new type. These are used for the net-level baseline I/O facilities. For example, the GCD example in the Chisel Tutorial starts with the following Bundle definition that wraps up all of its I/O:

val io = new Bundle 
  { val e  = Bool(INPUT)         // Enable handshake wire - start the computation
    val a  = UInt(INPUT,  16)    // input arg a
    val b  = UInt(INPUT,  16)    // input arg b

    val v  = Bool(OUTPUT)        // Ready handshake wire - the result is ready
    val z  = UInt(OUTPUT, 16)    // result return value.
  } 

The above signature embodies the typical handshake pair of a request input and a ready output. If this were a Bluespec interface it could instead be defined as:

interface GCD;
   method UInt#(16) execute(UInt#(16) a, UInt#(16) b);
endinterface

which does not explicitly mention the handshake wires, but the generated hardware would be the same. In both languages the clock and reset wires are implicit and so are also present on the generated component. (The execute identifier is not a keyword: multiple methods can be defined in a Bluespec interface and each method must have a name).

Wiring is required between instantiated modules. In Chisel, for simple one-to-one connections, this is provided via an overload of the ':=' operator, such that when one Bundle is assigned to another, all of the wires in one are connected to those of the other.

Multiplexed wiring is also supported, by storing a number of interfaces in a Chisel 'Vec' and indexing it, as sketched below.
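A small sketch (Chisel 2 dialect; the names are our own): ':=' gives a direct connection, while a Vec index gives multiplexed selection.

class MuxedWiring extends Module {
  val io = new Bundle {
    val sel = UInt(INPUT, 2)
    val out = UInt(OUTPUT, 16)
  }
  val ports = Vec.fill(4){ UInt(width = 16) }    // four 16-bit buses
  for (i <- 0 until 4) ports(i) := UInt(i * 10)  // drive each one (elaboration-time loop)
  io.out := ports(io.sel)                        // run-time index elaborates to a multiplexor
}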

Adding TLM handshake as a shim above Chisel?

Handshaking wires can either be mostly explicit in the hardware construction language, perhaps wrapped up in a library, or else largely invisible and automatically generated as a core part of the compilation process.

Chisel provides the 'Valid' and 'Decoupled' I/O facilities in its library, but here we build our own to show roughly how this works and to examine any shortcomings. Our naive first step is to extend the standard bundle with a TLM prefix that includes the primary handshaking wires, as follows:

  class TLM_port extends Bundle
  {
    val req = Bool(INPUT)    // Start handshake: load fresh arguments
    val ack = Bool(OUTPUT)   // Completion handshake: the result is valid
  }

  class GCD_tlm extends Module {
    val io = new TLM_port {
      val a  = UInt(INPUT,  16)
      val b  = UInt(INPUT,  16)
      val z  = UInt(OUTPUT, 16)
    }
    val x  = Reg(UInt())
    val y  = Reg(UInt())
    when   (x > y) { x := x - y }            // Euclid's subtractive step
    unless (x > y) { y := y - x }
    when (io.req) { x := io.a; y := io.b }   // Load new arguments on request
    io.z := x
    io.ack := y === UInt(0)                  // Done when y has reached zero
  }

We can refine the caller and callee code by adding methods to the TLM port for blocking call and return and sugaring up the argument passing as follows:

   class TLM_port_base extends Bundle
   { val req = Bool(INPUT)
     val ack = Bool(OUTPUT)
   }

   class TLM_port2(rvw :Int, a1w :Int, a2w:Int) extends TLM_port_base
   { val rv   = UInt(OUTPUT, rvw)
     val arg1 = UInt(INPUT, a1w)
     val arg2 = UInt(INPUT, a2w)

     def b_return (return_cond : Bool, r : Chisel.Data) : Unit =
     { ack := return_cond
       rv := r
     }

     def b_call2 (call_cond : Bool, a1 : Chisel.Data, a2 : Chisel.Data) : Unit =
     {
       arg1 := a1
       arg2 := a2
       req := call_cond
     }
   }

   class GCD_tlm extends Module // The GCD example as a TLM target.
   { val io = new TLM_port2(16, 16, 16)
     val x  = Reg(UInt())
     val y  = Reg(UInt())
     when   (x > y) { x := x - y }
     unless (x > y) { y := y - x }
     when (io.req) { x := io.arg1; y := io.arg2 }
     io.b_return (y === UInt(0), x)
   }

   class GCD_caller_tlm extends Module // A simple TLM wrapper that does nothing.
   { val io = new TLM_port2(16, 16, 16)
     val gcd_unit = Module(new GCD_tlm())
     gcd_unit.io.b_call2(io.req, io.arg1, io.arg2)
     io.b_return(gcd_unit.io.ack, gcd_unit.io.rv) // Yuck - ugly reference to handshake wires.
   }

   class GCD_tlm_tests(c: GCD_caller_tlm) extends Tester(c)  // The old test bench
   { val (a, b, z) = (64, 48, 16) // Test vectors
     do {
       val first = if (t == 0) 1 else 0;
       poke(c.io.arg1, a)  // This stimulus call could also be TLM-ised...
       poke(c.io.arg2, b)
       poke(c.io.req, first)
       step(1)
     } while (t <= 1 || peek(c.io.ack) == 0)
     expect(c.io.rv, z)
   }

With this simple starting approach, the use of b_call2 and b_return is a little ugly, and we can explore using Scala DSL features to make this nicer. More importantly, the explicit reference to the gcd_unit.io.ack flow-control wire in the last line of GCD_caller_tlm is really ugly. We would perhaps find it more natural if the last line of the module were simply not run until the previous line had completed. This would then be behavioural hardware description, which I will enlarge on below.

Note also that the expressivity of a fully-synthesisable TLM coding style instantly gives rise to structural hazards. These hazards occur when a hardware component is busy with a computation and so a new one cannot be started. For example, if we try to sum a pair of GCD results, computed by a single unit, with

  val sum = gcd_unit1(a, b) + gcd_unit1(c, d)

then, even if the unit is initially non-busy, one of the two executions must run after the other. Compiling code of this complexity requires high-level synthesis, as provided by Kiwi or LegUp: multi-cycle schedules are needed. If the operations can take varying amounts of time, as is the case for GCD, then dynamic schedules are required. The simplest approach in this category is SAFL, where the arguments are evaluated left to right without exotic scheduling. The Bluespec approach is even simpler: it only supports method calls that complete in one clock cycle, which means that, when multiple clock cycles are needed for internal evaluation, separate methods are required to collect the results from the instances later. But Bluespec does make this very easy with a wide variety of FIFOs and server-farm frameworks in its standard library.
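For illustration, here is a rough, hedged sketch of the two-phase schedule that must be synthesised when a single instance serves both calls. It drives the TLM_port2 flavour of GCD_tlm defined earlier; the exact cycle-level req/ack protocol is glossed over and the names are our own.

class SumViaOneGCD extends Module {
  val io = new Bundle {
    val a = UInt(INPUT, 16);  val b = UInt(INPUT, 16)
    val c = UInt(INPUT, 16);  val d = UInt(INPUT, 16)
    val start = Bool(INPUT)
    val sum   = UInt(OUTPUT, 16)
    val done  = Bool(OUTPUT)
  }
  val gcd   = Module(new GCD_tlm())       // the single shared instance
  val phase = Reg(init = UInt(0, 2))      // 0: idle, 1: first call, 2: second call
  val hold  = Reg(UInt(width = 16))       // holding register for the first result

  gcd.io.req  := Bool(false)
  gcd.io.arg1 := Mux(phase === UInt(0), io.a, io.c)  // route arguments by phase
  gcd.io.arg2 := Mux(phase === UInt(0), io.b, io.d)

  when (io.start && phase === UInt(0)) {   // issue gcd(a, b)
    gcd.io.req := Bool(true); phase := UInt(1)
  }
  when (phase === UInt(1) && gcd.io.ack) { // capture it, then issue gcd(c, d)
    hold := gcd.io.rv
    gcd.io.req := Bool(true); phase := UInt(2)
  }
  io.done := phase === UInt(2) && gcd.io.ack
  io.sum  := hold + gcd.io.rv              // second result summed combinationally
}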

TODO - add text on dynamic dispatch ... an indexed array of TLM interfaces ...

TODO - add text on live-tagged registers and FIFOs


Staged Behavioural Design Expression

Given that Chisel hardware components can now be wrapped up as functions or method invocations on state-bearing object instances, where synchronisation is represented by the call and return of a nominal execution thread, we are led directly to consider the same form of expression within a function or method body: namely, a nominal program counter that steps through the statements of the body in program order. The issue is that there are potentially two stages of execution, each with its own program counter: the one at compile time that 'generates' the hardware, and a potential second one that nominally operates at run time and is reflected in the hardware's behaviour. The former can be called the generative or elaborative flow. The latter is called the behavioural flow.

The essential semantic difference between hardware construction languages (like Lava and Chisel) and standard RTL logic synthesis is that hardware construction languages (mostly) lack behavioural flow. For those systems that support it, behavioural flow is converted at compile time to a static circuit (a process also known as behavioural elaboration or symbolic evaluation) by a collation process that keeps track of assignments to variables and array locations. This process is exactly the procedure used to formally define imperative languages using denotational semantics. The most distinctive differentiator of behavioural description in HDLs is the presence of an infinite loop: hardware generally lasts forever, repeating the same behaviour over and over. HCLs, on the other hand, execute once, with all loops exiting at compile time.
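The distinction is easy to see in a fragment like the following sketch (Chisel 2 dialect, our own names): the Scala 'if' is resolved during elaboration and leaves no trace in the netlist, whereas the Chisel 'when' survives into the circuit as a multiplexor.

class TwoStages(fancy : Boolean) extends Module {
  val io = new Bundle {
    val sel = Bool(INPUT)
    val out = UInt(OUTPUT, 16)
  }
  if (fancy) {                              // generative flow: decided at compile time
    io.out := UInt(1234)
  } else {
    when (io.sel) { io.out := UInt(1) }     // behavioural residue: a run-time mux
    .otherwise    { io.out := UInt(2) }
  }
}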

As a concrete example, in Chisel, the following Scala fragment creates a static-priority arbiter with n inputs and n outputs. By extending the Module class it becomes a separate hardware component (Verilog module) in the RTL output from Chisel. All Chisel modules have an I/O Bundle which commonly has a fixed set of terminals, but in this example all of the I/O pairs are generated by the higher-order functions map and fold, both of which are fully executed at compile time (elaboration time). The '&', '|' and '!' symbols are overloaded to generate real AND, OR and NOT gates and the 'new Bool' generates the wire from one level to the next.

   class genericPriEncoder(n_inputs : Int) extends Module
   {
     val io  = new Bundle {  }   // Terminals are added to this Bundle dynamically below
     val terms = (0 until n_inputs).map (n => ("req" + n, "grant" + n))

     terms.foldLeft (Bool(false)){ case (sofar, (in, out)) =>
        val (req, grant) = (Bool(INPUT), Bool(OUTPUT))
        io.elements += ((in, req))
        io.elements += ((out, grant))
        grant := req & !sofar     // Grant only when no higher-priority request holds
        val next = new Bool
        next := sofar | req       // Propagate the 'already granted' flag onwards
        next
     }
   }

(For those familiar with Bluespec, or who have heard that it is a variant of Haskell: it is Bluespec's generative flow that borrows many constructs from Haskell. As the generative flow proceeds, constructs are added to the monadic environment; these are essentially print statements of rules and finite-state machines that are later compiled to a hardware circuit. The finite-state machines are not unlike standard RTL constructs, where the current state is like a run-time program counter. The rules, on the other hand, describe circuits that can require quite a lot of ancillary logic to check whether they can fire, and a packing transform folds multiple rules that can potentially fire in the same clock cycle, assigning overlapping left-hand sides using the same collation process found in standard RTL.)

A DSL Extension to Chisel for Behavioural RTL expression.

The main feature of behavioural register transfer language (RTL) is a thread of execution inside an infinite (or eternal) process loop. The behaviour of the hardware mirrors the behaviour of the thread. Synthesis standards and house styles in individual companies may restrict the coding styles used, but in general, any variable can be updated in any number of places in the loop and the loop can be explicitly paused at any point waiting for the next clock edge. The most recent update to each variable at that point is presented to the D-inputs of the registers and they take on that value at the clock edge. In Verilog, the pause is typically denoted with '@(posedge clk)'. Chisel provides an analogous 'step(n)' method in its testbench stimulus language but does not natively support such a construct in its synthesisable subset. Synthesisable code extends Chisel.Module but the step method is unbound inside Module. To avoid confusion, we defined 'STEP()' and 'STEP(n)' to pause for 1 and n clock cycles respectively.

Turning now to the process of finding the most recent assignment to a variable, the packing transform, we find Chisel already fully supports this with its 'when/else/otherwise' and 'switch' statements. However, when a succession of these is placed in a Module body (or nested inside each other), they behave in parallel, with only the WaW (write-after-write) hazards being resolved in serial order. Of course, this contrasts with everyday imperative programming, where they act in series. The same difference, with respect to values on the right-hand sides, appears between Verilog's two assignment operators, blocking and non-blocking.

Packing Transform

In conventional imperative languages we expect to be able to freely read and write variables and for the value read always to be the most recently written value. Of course, HLS (high-level synthesis) tools, such as LegUp and Kiwi, implement this fully.

In Verilog and VHDL RTL we can also get the same effect. In Verilog we use the blocking update operator (=) in behavioural blocks; in VHDL we assign to variables instead of VHDL signals. This style can be used for sequential logic or for combinational logic. For combinational logic in Verilog we use the fully-sensitive @* syntax, where an always thread is triggered by any change on its support variables. For example:

     x1 = (din) ? 1:2;
     x2 = x1 + 1;
     x1 = 100;

After the packing transform (aka update resolution) compilation to hardware, this becomes

     x1 <= 100;
     x2 <= ((din) ? 1:2) + 1;

We can try to write this in raw Chisel as follows, where the two branches are for combinational and sequential logic respectively:

class PackingTest extends Module {
  val io  = new Bundle { 
   val mon = UInt(OUTPUT, 32) 
   val foo = Bool(INPUT)
   }

  if (...)
  {
     // Combinational logic test 
     val w1 = UInt(width = 32) // Declare a 32-bit bus
     val w2 = UInt(width = 32) // and another one.
     w1 := UInt(1) 
     when (io.foo) { w1 :=  UInt(2) } 
     w2 := w1 + UInt(5)
     w1 := UInt(100)
     io.mon := w2
     // Chisel gives the error that the wire is reassigned.
  }
  else
  {
     // Sequential logic test 
     val w1 = Reg(init=UInt(10, 32)) // Define 32-bit registers
     val w2 = Reg(init=UInt(10, 32)) // with initial value 10.
     w1 := UInt(1) 
     when (io.foo) { w1 :=  UInt(2) } 
     w2 := w1 + UInt(5)
     w1 := UInt(100)

     io.mon := w2
     // Chisel gives 10, 15, 105, 105, ... as the output sequence, regardless of foo.
  }
}

The combinational form is not allowed. Chisel gives:

[error] Test8.scala:35: reassignment to Wire /*? in < (class PackingTest)>*/ 
    Chisel.UInt(OUTPUT, width=None, connect to 1 inputs: (0x20[Chisel.Literal] in PackingTest)) with inputs 0x20 RHS: /*? in < (class PackingTest)>*/ 
    Chisel.UInt(OUTPUT, width=None, connect to 1 inputs: (0x1[Chisel.Literal] in PackingTest)) in class PackingTest

The sequential form is allowed but has different behaviour from the blocking assignments of Verilog: Chisel gives 10, 15, 105, 105, ... as the output sequence, regardless of foo. This demonstrates that the second write to w1 correctly takes precedence and that the value picked up for w1 on the r.h.s. is the value from the previous cycle, which is what we would get using Verilog's non-blocking assignment operator (<=) (or VHDL's signal assignments).

Adding blocking assignments to Chisel

So a useful facility to add to Chisel is standard imperative programming that is transformed to hardware in this well-established manner. The Chisel native operator ':=' gives the non-blocking Verilog semantics for sequential logic. We follow Verilog's design pattern and define a second operator, '<-=', for blocking assignments. Currently, as can be seen below, we also have to wrap the declarations of the nets to be assigned in this manner with a 'Pwrap' prefix, but this could be avoided by folding our extension into the main Chisel distribution.

Our example now looks like this (note we have not included an 'always' statement here):

class PackingTest extends Module {
  val io  = new Bundle {
   val mon = UInt(OUTPUT, 32)
   val foo = Bool(INPUT)
   }
  if (...)  // Combinational blocking assignments test
  {
     val w1 = Pwrap(UInt(width = 32)) // Declare a combinational 32-bit bus
     val w2 = Pwrap(UInt(width = 32)) // and another one.
     w1 <-= UInt(1)
     when (io.foo) { w1 <-=  UInt(2) }
     w2 <-= w1 + Pwrap(UInt(5))
     w1 <-= UInt(100)
     io.mon := w2
  }
  else      // Sequential blocking assignments test
  {
     val w1 = Pwrap(UInt(10, 32)) // Define a sequential register
     val w2 = Pwrap(UInt(40, 32)) // and another one.
     w1 <-= UInt(1)
     when (io.foo) { w1 <-=  UInt(2) }
     w2 <-= w1 + UInt(5)
     w1 <-= UInt(100)
     io.mon := w2
  }
}

Note, in the above example, we could have freely used either the Chisel HCL 'when' statement for conditional assignment or else our new 'IF' family of control-flow statements. The difference arises when we have multiple 'STEP' calls, where the control flow between run-time PC states is directed by the run-time 'IF' statements. Both are staged to execution time, whereas if we had used Scala's own 'if' it would be evaluated one way or the other at elaboration (compile) time.

For scalars within a basic block this is easy, but for arrays/vectors whose subscripts are not known at compile time it is possible to accidentally generate a lot of hardware this way (see the sketch at the end of this subsection).

Owing to the lack of an 'always' keyword, the above design style uses what, in a little-known corner of the Verilog language, is called 'procedural continuous assigns'...
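Returning to the array point above, here is a sketch (our own names) of how a single run-time subscript elaborates into per-element write enables plus a read multiplexor, which is where the accidental hardware comes from:

class VecWriteDemo extends Module {
  val io = new Bundle {
    val we    = Bool(INPUT)
    val addr  = UInt(INPUT, 4)
    val wdata = UInt(INPUT, 8)
    val rdata = UInt(OUTPUT, 8)
  }
  val v = Vec.fill(16){ Reg(UInt(width = 8)) }  // sixteen 8-bit registers
  when (io.we) { v(io.addr) := io.wdata }       // 16 decoded write enables
  io.rdata := v(io.addr)                        // 16-way read multiplexor
}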

Adding RTL-style eternal process loops to Chisel

Owing to Scala's wonderful extensibility, it is rather easy to introduce some 'always' syntax that introduces a run-time process loop. The keyword 'always' is used in Verilog to define a thread that loops infinitely: it is short for 'initial while (1)'. The necessary Scala is:

object always
{
  def apply(codeBlock : => Unit) =   // call-by-name: the block is captured for staging, not run immediately
  {
    ... csyn-style code
  }
}

Also, to enable Scala's own if statement still to be freely used during the elaboration phase, we defined our own IF, ELIF and ELSE constructs for use inside always blocks for run-time, data-dependent control flow.

This now enables us to write the following RTL-style module. It generates an alternating pattern of 1 and 2 on its output, interspersed with the number 3 for two cycles after the 2 when the input is asserted.

class RtlExample extends Module {
  val io  = new Bundle {  
    val din = Bool(INPUT)
    val mon = UInt(OUTPUT, 3) 
  }
  always {
     io.mon := UInt(1)
     STEP(1)
     io.mon := UInt(2)
     STEP(1)
     IF (io.din) { io.mon := UInt(3); STEP(2); }
  }
}

With a suitable stimulus on the din input, a simulation of the generated RTL gives the expected waveform on the mon output.

The apply method for always contains an RTL compiler along the lines of CSYN and many other software-to-hardware compilers. We use a one-hot encoding of the program counter, where each PC state starts with a STEP call. At the hardware level, each region of code has an activate input that makes it run. Using the same mechanisms that Chisel already provides for its 'when' statement, the side effects of assignments are enabled only when the region is active. The 'IF' statements fork and join the activate path around the conditional code. FPGA tools or Design Compiler will re-encode the state machine into a better form for the target technology, so there is probably little point putting more optimisation effort into our extensions. A sketch of the sort of circuit this produces follows.
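This is a hand-expanded, hedged sketch of roughly the circuit the always compiler prints for RtlExample above; the names are our own and the real generator's output will differ in detail.

class RtlExampleExpanded extends Module {
  val io = new Bundle {
    val din = Bool(INPUT)
    val mon = UInt(OUTPUT, 3)
  }
  // One flip-flop per PC state, one-hot encoded; s0 holds after reset.
  val s0 = Reg(init = Bool(true))    // io.mon := 1
  val s1 = Reg(init = Bool(false))   // io.mon := 2, sample din
  val s2 = Reg(init = Bool(false))   // first cycle of the IF body
  val s3 = Reg(init = Bool(false))   // second cycle of the IF body
  s0 := s3 || (s1 && !io.din)        // wrap around, or skip the IF
  s1 := s0
  s2 := s1 && io.din
  s3 := s2
  io.mon := UInt(0)                  // default (never selected: exactly one state is hot)
  when (s0) { io.mon := UInt(1) }
  when (s1) { io.mon := UInt(2) }
  when (s2 || s3) { io.mon := UInt(3) }
}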

All Chisel loops are run at compile time by default, and there is no behavioural 'IF' statement apart from the 'when/else/otherwise' and 'switch' statements, which predicate assignments rather than directing a run-time thread of control over nested blocks of general source code with arbitrary side effects. But, owing to the powerful features of Scala, extending the language with logic synthesis is straightforward, provided a small amount of new syntax is allowable. C-to-gates-like imperative programs, SAFL-style functional programs, Bluespec and Handel-C can all be embedded in Scala as DSL (domain-specific language) quotations and exploit the Chisel power platform for hardware generation as their backend.

Miscellaneous Text for Tidy Up ...

Systems vary a lot in terms of how manifest the separation of the compile-time and run-time flows is. In Verilog and VHDL, the 'generate' keyword is inserted in the syntax of variable declarations and certain loop constructs to insist that a loop is completely run at compile time, with fresh hardware generated for each iteration. Otherwise loops are left to run time (or not allowed in strict synthesis subsets). However, just by looking at the body of a function definition in VHDL or Verilog one cannot readily tell whether that function is being executed by a compile-time generate or expresses run-time behaviour.

...

Chisel already supports conditional update to registers in the way it infers multiplexors from 'when' statements --- it could be only a small enhancement to expand the scope of these conditionals to cover blocks of general code, eliminating the difference between hardware construction languages and regular logic-synthesis HDLs.


A DSL Extension to Chisel that Generates SAFL-like Transactional Circuits

SAFL, the Statically-Allocated Functional Language (Mycroft and Sharp), is a hardware synthesis language that uses a subset of ML syntax; in particular, any recursion used must be tail recursion, which ensures that finite circuits are generated. (Dan Ghica's 'Geometry of Synthesis IV: Compiling Affine Recursion into Static Hardware' is also functional but adds a stack to the generated hardware to support recursion.) SAFL is here classed as a synthesis language, since the function applications and the 'if' statements are projected through to the execution phase. It uses its own compiler and, in contrast to HardCaml, is not a DSL embedded in ML. The language was described as 'resource aware', meaning that engineers know exactly how much hardware they will generate with each statement they write. Being functional, it is highly amenable to time/space folding since, without imperative updates, there are no RaW hazards that might get re-ordered.

In the SAFL style, two baseline rules control the amount of hardware generated: each function definition elaborates to exactly one hardware block, and every call to that function shares that single block, with contending calls serialised in time by an arbiter.

Fold/unfold in SAFL style

Example:
   fun cmult x  y = 
      let ans_re = x.re*y.re - x.im*y.im
      let ans_im = x.im*y.re + x.re*y.im
      in (ans_re, ans_im) // 4 multipliers, 2 adders.

A function replicator, such as UF, enables control of time/space folding, giving a fresh
copy of a function.

  let use_time  = g(cmult a b, cmult c d)             // 4 multipliers, 2 adders + resources for g

  let use_space = g(cmult a b, (UF cmult) c d)         // 8 multipliers, 4 adders + resources for g

Server farms are a natural extension, provided everything remains stateless.

Problems arise with registers and RAMs, which suffer from WaW hazards etc, owing to imperative writes.

Folding requires preservation of the 'state edge' ordering.

Chisel has an underlying datatype for generated hardware called Bits. Chisel also provides the 'Valid' and 'Decoupled' interface wrappers in its library that add valid and ready signals to an interface. But these wrappers do not automate the wiring inside a leaf component; they only help with inter-module wiring. For a SAFL-style portion of a circuit description, every expression needs to be associated with handshake wires.

Accordingly, we implemented a guardedBits class that adds handshake wires to every subexpression in an expression. One might think this would force us to redo all of the work in the Chisel library that already overloads every arithmetic and logic operator but, owing to Scala's facility of 'implicit functions' to provide automatic coercion from one form to another when appropriate, this has not been a problem so far. For instance, a constant expression, such as UInt(42), can be freely used as a guarded expression, provided the appropriate implicit coercion has been imported with 'import SAFL._' at the head of the file. When applied to a simple constant, the implicit coercion adds an 'always ready' acknowledge output signal that is wired to logic 1 and a request input signal that is simply ignored.
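A minimal sketch of such a coercion (illustrative names, not the real library's): the lifted value carries an always-asserted acknowledge and a request net that nothing consumes.

import scala.language.implicitConversions

class GuardedUInt(val data : UInt, val ack : Bool, val req : Bool)

object SAFLImplicitsSketch
{
  implicit def liftToGuarded(d : UInt) : GuardedUInt =
    new GuardedUInt(d, Bool(true), Bool())   // ack tied high; req left unconnected
}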

SAFL-style operators, including simple primitives like addition, are treated as function calls. This is already how Chisel works. SAFL uses strict call-by-value, so a function call can only execute its body when all of its input arguments are ready. Inputs are acknowledged when the function application is acknowledged ...

By defining a transactional bundle that extends the net-level bundle of Chisel, and performing operations on that, we can enrich the language so that the compiler performs automatic synthesis of handshaking nets. But we do not wish to completely lose the 'resource-aware' aspect where the user has complete control over how much hardware gets generated for a given program or coding style. To start with, we will stick to strict SAFL elaboration rules to retain control over the time/space fold.

A Baseline TLM callable Module - One method, Always Ready, No Handshakes Really Needed

To wrap up a simple Chisel function so that its I/O is fully transactional we can use the following style:

class Foo_AX extends Module
{
  val io = new TLM_bundle_lc
  {
     tlmBind1(fax _)
  }
  def fax(x:UInt) = x + UInt(10)
  // Note fax returns immediately, but, most importantly,  owing to our handshake
  // wires, blocking methods are also supported.
}

The function fax has been registered as a callable method of the module Foo_AX. It can now be called as 'x1' in the following fragment. It could also be called as 'fax' owing to Scala's introspection, but we will use 'x1' for now to denote the unique one-argument TLM-callable method of that module.

Aside: tlmBind installs a method for later calling. The method body may be blocking, in which case it has Decoupled-style handshake nets, or it may be non-blocking and assumed to execute within a clock cycle or with a fixed pipeline latency, in which case it is assumed to be fully pipelined and hence ready to accept new parameters on every clock cycle without further handshaking. To support a future optimising scheduler, an estimated latency can also be supplied for the blocking instance.

  val target_a = Module(new Foo_AX())
  val target_b = Module(new Bar_BX())
  val ... = target_b.io.x2(target_a.io.x1(e1),  target_a.io.x1(e2)) 

The above fragment illustrates a key aspect of SAFL semantics: target_a is called twice, potentially in parallel, before target_b's TLM method of arity two is invoked. SAFL is strictly call-by-value. The two executions of target_a are serialised in the time domain. A holding register is quietly generated to store the output of the first execution; it is visible in the generated RTL.

If we manually add a second instance of Foo_AX, then SAFL will evaluate the two arguments of Bar_BX in parallel, and BX can run when both have completed.

  val target_a1 = Module(new Foo_AX())
  val target_a2 = Module(new Foo_AX()) // Second, manual instance of AX.
  val target_b = Module(new Bar_BX())
  val ... = target_b.io.x2(target_a1.io.x1(e1),  target_a2.io.x1(e2)) 

Each call site has its own req and valid pair. A priority encoder selects one contending requester for service. There are various possible semantics, but in the richest, the request is forwarded to all argument expressions in parallel. If the valids come back in differing clock cycles, then the results from the earlier ones are registered locally and their requests de-asserted. This ensures that resources that may be holding up completion of the other arguments are freed, avoiding deadlock. AlwaysReady expressions are ignored in this consideration. Registering completely constant values is a waste that should be cleaned up by constant-folding optimisation in the back-end tools, but this may not happen for FPGA, where the unavoidable global/power-on reset has an observable consequence. The AlwaysReady flag therefore enables this cleanup to be done inside the Chisel/SAFL system without relying on the backend, and it usefully simplifies the circuit, which helps with debugging owing to less complex handshake wires.

class Targer_AX extends Module
{
  val io = new TLM_bundle_lc
  { // Register TLM callable function with one pipeline delay.
     tlmBind_a1(ax_fun _, 1)
  }
  def ax_fun(x:UInt) = Reg(UInt(width = 32), x + UInt(10))
}

class Targer_DX extends Module
{
  val io = new TLM_bundle_lc
  { // Register TLM callable dyadic function with two pipeline delays.
     tlmBind_a2(dx_fun _, 2)
  }
  def dx_fun(x:UInt, y:UInt) = Reg(Reg(UInt(width = 32), x + y))
}

  val unit_a = Module(new Targer_AX())
  val unit_b = Module(new Targer_BX())
  val unit_d = Module(new Targer_DX())

// Dyadic - single use test
//   val answer = unit_d.io.run2(unit_a.io.run1(arg1K), unit_b.io.run1(arg2K))

// Dyadic - reuse of same component AX
   val answer = unit_d.io.run2(unit_a.io.run1(arg1K), unit_a.io.run1(arg2K))


// Invoke and downconvert to unguarded for rest of design
   val answer1 = SAFLImplicitx.ex_drop_chisel_data_from_guarded(answer)
   io.z := answer1
   io.v := answer.isValid()

In this example, AX is a method with a fixed delay of one pipeline stage, and DX has two stages of delay. These would normally encapsulate much richer computations than the simple additions used for illustration. The timing diagram shows the answer 820 finally being generated after the single instance of AX is used in turn for each sub-expression. The top-level request is invoked by the ex_drop_chisel_data_from_guarded method, which here is explicitly called but could be inserted by Scala's implicit mechanism.

Variable Delays - Not fully pipelined

The alternative design style for callable targets is illustrated in this example, where the callbacks can have variable delay and use a start/busy handshake. Note that fixed latency does not imply fully pipelined: for instance, a multiplier might always take 4 clock cycles but internally use a shift-and-add paradigm with accumulators dedicated to the current operation. A new operation cannot start while it is busy.

In this coding style we supply -1 for the fixed latency operand to the tlmBind methods, thereby informing the SAFL/TLM subsystem not to generate its own shift register between the request and valid handshake wires.

class GCDunit extends Module {
  val io  = new TLM_bundle {
     tlmBind_a2(gcd_fun _, -1)
  }
  
  def gcd_fun(arg_a :UInt, arg_b :UInt):UInt = {
     val x = Reg(UInt())
     val y = Reg(UInt())
     val p = Reg(init=Bool(false))

     when (io.getReq() && !p) {
       x := arg_a
       y := arg_b
       p := Bool(true)
     }
  
     when (p) {
       when (x > y)  { x := y; y := x }
       .otherwise    { y := y - x }
     }

     when (!io.getReq()) { p := Bool(false) }
     io.getValid() := (y === Bits(0) && p)
     x
  }
}


Going Further ... adding overloads and multiple methods per module

TODO

UF Replication Operator - Provide a fresh unique copy/instance of a component

As is now apparent, in our SAFL-on-Chisel implementation we do not need a special uniquify combinator. Instead, the number of instances of a callable target is controlled, in the obvious way, by the number of times Scala's new operator is applied to the target class. When different instances are specified in a parallel context, they are invoked in parallel, but when an instance reference is repeated, the operation is automatically serialised, despite being in a potentially parallel context.

Loops in the original SAFL were always expressed as tail-recursive calls, but by combining the styles presented in this blog ... here we also support imperative programming ... TODO


Adding an Occam/Handel-C-like Extension to Chisel

Handel-C is a parallel programming language for hardware design based on Occam and CSP. Process loops are again used to define eternal hardware. It uses explicit message passing between processes over a flat, global namespace of channels. Shared variables are banned, giving a pureness and an amenability to time/space folding, as also found in Erlang.

A notable feature of Handel-C is the PAR block constructor, which places statements in parallel. There is also the SEQ constructor, which provides conventional sequential execution of statements, like a begin/end construct in RTL processes and other imperative programming languages. Power arises from the ability to arbitrarily nest these blocks inside each other. Handel-C originally had a very strict and easy-to-implement approach to projecting these statements into hardware: each leaf assignment statement, to a local variable or output channel, consumed exactly one clock cycle. PAR blocks nested inside other PAR blocks have no effect and can be flattened, and the same applies for SEQ blocks nested inside SEQ blocks. Therefore, without loss of generality, the outer process loop can be considered always to start in SEQ mode and the RTL process-loop compilation strategy is followed. As explained above, hardware generation is then a matter of using a one-hot encoded program counter for each SEQ step. When a PAR block is encountered, the surrounding SEQ block's activity net is wired to the activate inputs of all sub-blocks within the PAR statement. A hardware barrier is placed at the exit from the PAR block that waits for the last-arriving thread before proceeding with the next SEQ statement. The hardware barrier follows the same design paradigm as the all-inputs-ready block for the SAFL TLM arguments.

Our implementation, being based on the RTL elaborator, implements the RTL assignment packing transform, where more than one update is made per clock cycle. This is an optimisation with respect to standard Handel-C which could be disabled when desired.

Handel-C has blocking send and receive primitive operators, each of which takes a named channel as an argument. The output operation is dyadic, also taking the value to be sent; the input operator is monadic and is a leaf expression. The conventional syntax uses the exclamation mark for output and the question mark for input. For embedding in Scala we used straightforward overloading of the channel class, with 'chan.send(v)' for output and 'chan()' for input. The channels behave as FIFOs and, in our implementation, their depth is set in the constructor.

We defined the class HChan to serve as Handel-C's channels. It is a wrapper around the Decoupled Queue FIFO found in Chisel's ChiselUtil library. Unlike the Chisel queues, our channels do not have net-level connections in their I/O bundle, since enqueue and dequeue are method calls.
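Here is a hedged sketch of what HChan might look like; this is our own reconstruction, with the thread-stalling logic (which belongs to the always elaborator) only indicated in comments.

class HChan[T <: Data](gen : T, depth : Int)
{
  val q = Module(new Queue(gen, depth))  // library FIFO from ChiselUtil

  def send(v : T) : Unit =               // blocking send upcall
  { q.io.enq.bits  := v
    q.io.enq.valid := Bool(true)
    // The behavioural elaborator must also stall the calling thread
    // while !q.io.enq.ready.
  }

  def apply() : T =                      // blocking receive upcall
  { q.io.deq.ready := Bool(true)
    // ... and stall the calling thread while !q.io.deq.valid.
    q.io.deq.bits
  }
}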

Here is an example of Handel-C coding style using just one channel, written in one process and read in a second process. No 'STEP()' calls are required in the always blocks, owing to the I/O operations being blocking. The body of the sending process is a SEQ block, as that is the default for processes. It first sends a literal number on the channel and then sends vv, incrementing vv in parallel. In strict Occam, doing the increment in parallel saves a clock cycle, whereas our packing transform, when in its default enabled condition, will do this anyway, so the PAR construct makes no difference in this tiny example. The variable 'vv' could be operated on by both processes using this coding style, leading to potential race conditions; better style would be to restrict the scope of all variables by defining them inside the always blocks.

class HandelExample extends Module {
  val io  = new Bundle {  val mon = UInt(OUTPUT, 3) }
  val porta = new HChan(UInt(32), 1)
  val vv = Reg(init = UInt(0, 32))   // 32-bit register, initial value 0
  always {
     porta.send(UInt(4))
     PAR { porta.send(vv)
           vv := vv + UInt(1)
         }
  }
  always {
     io.mon := porta() & UInt(7);
  }
}

Handel-C style Synchronous Channels.

The neat features are to allow the channels to span Chisel bundle boundaries and to block the behavioural thread until the read or write is ready to proceed.

The channels are slightly richer than the TLM calling allowed in SAFL and Bluespec. The extension in Handel-C is the upcall: a handle on a Handel-C-like channel is passed into a module. The channel itself is instantiated at or above the lowest common parent of the modules that make operations on the channel, and a reference is passed down to each customer module. The customers make upcalls (send or receive) on that reference from the body of their local imperative code. This code must be inside an 'always'/'SEQTOP' thread, and that thread blocks if the channel operation cannot proceed. A sketch follows.
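This sketch reuses the HChan and always constructs from earlier sections; the example names are our own.

class SenderProc(ch : HChan[UInt]) extends Module {
  val io = new Bundle {  }
  always {
     ch.send(UInt(42))            // blocks until the receiver is ready
  }
}

class ReceiverProc(ch : HChan[UInt]) extends Module {
  val io = new Bundle { val mon = UInt(OUTPUT, 32) }
  always {
     io.mon := ch()               // blocks until data is available
  }
}

class ChannelTop extends Module {
  val io = new Bundle { val mon = UInt(OUTPUT, 32) }
  val ch = new HChan(UInt(width = 32), 1)  // owned by the common parent
  val tx = Module(new SenderProc(ch))      // references passed down
  val rx = Module(new ReceiverProc(ch))
  io.mon := rx.io.mon
}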

To implement the channels, as in the SAFL and Bluespec styles, we could use a walk of the abstract syntax tree to collect all the implicit guards for when an expression or side-effecting command is ready to run. All the relevant operators are already overloaded in Chisel to build an in-heap AST. But we can also put mechanisms in the implicit functions that convert a guarded expression to a normal one (Chisel.Bits) and where the downcast function drops a guard we push it as a new conjunct in a global guard collection that is flushed at each advance in the behavioural state machine for the thread.

...


Adding a Bluespec-like Extension to Chisel

The same infrastructure, with a little tweaking, can also provide the guarded atomic updates of Bluespec logic synthesis.

Although SAFL and Bluespec both automatically generate bi-directional handshake wires for every callable method, the time-domain semantics are completely different. All proactive behaviour in Bluespec originates from rules: all code is enclosed in the body of a rule, or in the body of a method that is called from a rule, directly or indirectly via further methods.

If a method is registered in a TLM_bundle with the bluespecBind callback, the Bluespec semantics are used. Standard Bluespec embodies a scheduler that packs operations into a clock cycle but does not make multi-cycle schedules. Hence all method calls must be non-blocking, so that execution never extends beyond one clock cycle.

Bluespec associates every expression with an implicit guard that is the conjunction of the implicit guards of all the sub-expressions within it. Further guards arise from language-intrinsic conditions, which include that a register can be written at most once in a clock cycle and that a rule can fire at most once in a clock cycle. A method call is only executable when the implicit guard of the method itself holds, together with the guards of all the argument expressions passed in. Hence, in the following fragment, in regular Bluespec, the guards for e1 and e2 must hold simultaneously, together with the guards for all the resources used in the bodies of the three methods mentioned, before the rule which embodies this code can fire.

  val ... = target_b.io.x2(target_a1.io.x1(e1),  target_a2.io.x1(e2)) 

A Bluespec rule is an unordered list of commands to be fired in parallel (atomically), plus optionally an explicit rule guard that is conjoined with the implicit guards of the commands. The principal idea of Bluespec is that a rule fires atomically, and this has a natural parallel in the generated synchronous hardware, where all flip-flop master sections within a clock domain are simultaneously applied to their slaves. The updated values become visible atomically.

GCD in Bluespec:

// http://csg.csail.mit.edu/6.375
// A simple example: Euclid's algorithm for computing the Greatest Common Divisor (GCD):
module mkGCD (I_GCD);
  Reg#(int) x <- mkRegU;
  Reg#(int) y <- mkReg(0);
  
  rule swap ((x > y) &&  (y != 0));
       x <= y;  y <= x;
  endrule
  
  rule subtract ((x <= y) && (y != 0));
     y <= y - x;
  endrule
  
  method Action start(int a, int b) if (y==0);
     x <= a;  y <= b;
  endmethod
  
  method int result() if (y==0);
    return x;
  endmethod
endmodule

Bluespec-like rules can be embedded in Scala as follows. We used a similar method-binding scheme to the one used for SAFL. Likewise, a bi-directional pair of handshake wires is automatically generated in the same way, but their associated protocol is different: SAFL's request/grant handshakes provide resource sharing over multiple clock cycles, whereas, for Bluespec, the upwards ready nets contribute to a conjunction over all the resources used in a rule and the downwards request net is a broadcast to all those resources to fire at once. In both systems, the downwards nets control multiplexors that route the arguments to the resources. We defined the Scala DSL extension 'rule' and extended the Chisel I/O Bundle with suitable methods for registering Bluespec methods. The Chisel 'when' conditional assignments are already semantically correct for Bluespec's conditional execution owing to their parallel composition. We arrive at the following concrete syntax:

class GCDunitBSV extends Module {
  val io  = new BSV_bundle {
     bsvBind_a2v(gcd_start _) // 2 args, void return
     bsvBind_a0(gcd_result _) // 0 args, data return.
  }
  val x = Reg(outType=UInt(32))
  val y = Reg(outType=UInt(32))

  rule ("swap") ((x > y) & (y != UInt(0)))
     {
       x := y
       y := x
     }

  rule ("subtract") ((x <= y) & (y != UInt(0)))
    {
      y := y - x;
    }

  def gcd_start(arg_a :UInt, arg_b :UInt) =
  {
     x := arg_a
     y := arg_b
  }

  def gcd_result() = x
}

We expected that the introspection features of Chisel (a sugaring of the Java Reflection API) would enable us to explore the abstract syntax tree of the arguments used in a rule and its descendant methods to form the implicit guard. Indeed, Chisel already relies on introspection: for instance, when it infers the names for registers and other structural resources. Moreover, all of the operators in these expressions are already overloaded by Chisel's DSL to build the hardware circuit as an abstract syntax tree, and Chisel provides tree-folding walker methods for such trees. So the conjunction of such guards is readily computed and ANDed with the explicit guard of the rule (and with the implicit guard of the explicit guard!).
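For instance, once the walk has collected the guards, the firing condition is a one-line fold (a sketch with assumed types):

// Fold a collected list of implicit guards into a single firing condition.
def conjunction(guards : Seq[Bool]) : Bool =
  guards.foldLeft (Bool(true)) { (sofar, g) => sofar && g }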

Bluespec supports a number of pragmas that can annotate rules and methods to alter the behaviour of its global scheduler. For instance, Bluespec provides two alternative semantics for reverse-multiplexing the intrinsic guards of sub-expressions in different branches of a conditional expression: split or combined. Also, Bluespec implements a packing transform on complete rules, running many in the same clock cycle; this follows the same principle as the one we implemented for RTL. Further work on the Bluespec layer is needed to implement these features, but simple designs, like the GCD illustrated, work fine without them.

Bluespec with Multi-Cycle Rules and Mixing SAFL and Bluespec

Since we have associated the calling protocol with the way the method is registered, our calling syntax is neutral as to whether Bluespec, Decoupled or SAFL semantics are used. We can combine them freely in application code, but what happens? Let us first consider a general extension to Bluespec that involves multi-cycle schedules, since this will be needed if SAFL's multi-cycle method bodies are to be run from within a Bluespec rule. We can then consider the remaining issues.

TODO ... further text

A DSL Extension to Chisel that Provides HLS

We have shown how manual control over the time/space trade-off is possible using the UF operator. This is fine for simple systems, where hand-crafting is a good approach. But when operations can take a variable number of clock cycles, such as reading from cached DRAM or floating-point addition, relying on static, hand-crafted schedules is no longer appealing. Automated assignment of work to clock cycles is the remit of High-Level Synthesis (HLS). The number of instances of structural components, such as DRAM banks and floating-point units, still needs to be broadly bounded by the engineer, but their precise assignment to operations in the input program is left to the compiler.

Perhaps a future blog entry ...

Postfix: Logic Minimisation

We need support for the 'don't care' X value; then we can leave this essentially to the back-end tools.

Sorry if this re-opens a closed issue, but I wonder if people tend to forget the important role of don't cares in logic synthesis? We teach first year students all about Karnaugh maps, but then forget to assign don't care in switch statements for the don't care cases. Chances for logic optimisation are then lost. So I'd be interested to know whether assigning a '?' in the otherwise case of a Chisel switch is recommended and whether it propagates through to the generated RTL so the back end tools do the right thing.

https://github.com/ucb-bar/chisel/issues/407



(C) 2015 DJ GREAVES. END


Additional Resources for MemoCode Presentation

The paper "Layering RTL, SAFL, Handel-C and Bluespec Constructs on Chisel HCL" will be presented at Memocode 2015 in Austin.

Standard Chisel GCD Example

The Chisel distro contains the following GCD implementation in its tutorial.

This example uses Chisel's 'Decoupled' interface paradigm for its input arguments: the addition of valid and ready handshake flags alongside the data wires (the 'Valid' wrapper on the output adds just the valid flag). Chisel has quite powerful connection primitives for inter-module plumbing. But, unlike SAFL or BSV, the current Chisel distro does nothing to automate the wiring up of the handshakes in end sources or sinks. Hence, the code below has the manually-added flip-flop 'p' and additional lines of code to operate on it and on the valid flags.

class RealGCD extends Module {
  val io  = new Bundle {
    val in  = Decoupled(new RealGCDInput()).flip()
    val out = Valid(Bits(width = 16))
  }
  
  val x = Reg(UInt())
  val y = Reg(UInt())
  val p = Reg(init=Bool(false))
  
  io.in.ready := !p
  
  when (io.in.valid && !p) {
    x := io.in.bits.a
    y := io.in.bits.b
    p := Bool(true)
  }

  when (p) {
    when (x > y)  { x := y; y := x }
    .otherwise    { y := y - x }
  }

  io.out.bits  := x
  io.out.valid := y === Bits(0) && p
  when (io.out.valid) {
    p := Bool(false)
  }
}

Bluespec GCD Demo

The following Bluespec version of GCD was found online:
// http://csg.csail.mit.edu/6.375
// A simple example Euclid's algorithm for computing the Greatest Common Divisor (GCD):

module mkGCD (I_GCD);
  Reg#(int) x <- mkRegU;
  Reg#(int) y <- mkReg(0);

  rule swap ((x > y) &&  (y != 0));
       x <= y;  y <= x;
  endrule

  rule subtract ((x <= y) && (y != 0));
     y <= y - x;
  endrule

  method Action start(int a, int b) if (y==0);
     x <= a;  y <= b;
  endmethod

  method int result() if (y==0);
    return x;
  endmethod
endmodule

It was manually coded as Bluespec-like rules using the Build-on-Chisel library as follows, with separate put and get methods to provide the arguments and retrieve the results:

class GCDunitBSV extends Module {
  val x = Reg(init=UInt(192, 32)) // Give it some initial work
  val y = Reg(init=UInt(36, 32)) // to do before it is ready to serve.


  rule ("swap") ((x > y) & (y != UInt(0))) {
       x := y
       y := x
     }

  rule ("subtract") ((x <= y) & (y != UInt(0))) {
      y := y - x;
    }

   // A put
   def gcd_start(arg_a :UInt, arg_b :UInt) = WHENV (y === UInt(0)) {
     x := arg_a
     y := arg_b
   }

   // A get
  def gcd_result():UInt =  WHEN(y === UInt(0)) { x }

  val io = new BSV_bundle
  {
     bsvBind_a2v(gcd_start _) // 2 args, void return
     bsvBind_a0(gcd_result _) // 0 args, data return
  }
}

The generated RTL is here: Test7Caller.v.

Block diagram of GCD unit with Bluespec-style handshake nets for each method.

A plot of the simulation shows it running its initial hardcoded GCD(192, 36) before accepting the three tests from the Chisel example testbench.

Example Bluespec Scheduler Required

The following fragment invokes the GCD component using a pair of Bluespec-style rules. The implicit guards associated with the GCD method bodies control when the invoking rules may fire.

// Rules that invoke methods on instantiated components:
class GCDspan extends Module {

  val a0 = Reg(init=UInt(10, 32)) 
  val a1 = Reg(init=UInt(20, 32))
  val sums = Reg(init=UInt(0, 32))
  val gcd2 = Module(new GCDunitBSV())

  rule ("runtest") (a0 < UInt(100)) {
       gcd2.io.run2(a0, a1)
       a0 := a0 + UInt(31)
     }

  rule ("tallyout") { 
      sums := sums + gcd2.io.run0();
    }

   // A get
  def tally_result():UInt =  { sums }

  val io = new BSV_bundle 
  {
     bsvBind_a0(tally_result _) // 0 args, data return
  }
}

The above example does not require explicit Bluespec scheduling, since there is no resource shared between rules such that it is potentially non-deterministic which rule should operate on it next.

An example where a scheduler is needed is here:

... TBD


Composite Experiment 1

Interworking demonstration: four styles combined relatively seamlessly, except for SAFL/BSV interworking, which requires transactors.

In this composite set-up, we show all four styles interacting. The example system checks that calls to GCD are commutative in their arguments by trying them both ways around.

  1. The GCD component is implemented in Bluespec style. The argument and result are conveyed in atomic method invocations in different clock cycles. Bluespec methods have guards regarding when they can be called, but do not block while being called.
  2. The GCD component is invoked twice from a section coded in the SAFL style. SAFL function calls are blocking, with the 'thread' not returning until the answer is ready. Therefore a 'transactor' shim is required to convert between the blocking SAFL style and the non-blocking Bluespec style. The SAFL-style code is written in a parallelisable form, but only one instance of the GCD unit is provided, so it is invoked twice in the time domain, with the first answer cached in an automatically generated holding register.
  3. The work and result collection from the SAFL section is implemented by a behavioural-RTL style that is also blocking and where the thread pauses at multiple STEP points.
  4. The results are conveyed to a Handel-C printing process over a blocking Handel-C channel.

Downloads

Greaves Build-on-Chisel library download: cbg-bog.zip (includes the examples).


End of page.