View on GitHub

Dexter's Blog

📎 Reading 'Spectre is here to stay'

14 Mar 2019

Spectre is here to stay: An analysis of side-channels and speculative execution

This paper addresses three open problems made more relevant since Spectre:

This paper begins by introducing a formal model for CPU architectures in order to study spectre. They distinguish the observable state of an architecture (what a developer might see from its API) from the micro architecture (extra state kept by the CPU). They then study the model and a class of attacks known as side-channel attacks. Whenever state is exposed from the micro architecture, there is a possibility of an attacker to gain information they should not have. For example, speculative execution can cause data to be read into a cache. Based on timing, an attacker might gain some information about the, perhaps confidential, data.

A main result from their study is that in most programming languages with timers, speculative vulnerabilities on most modern CPUs allow for the construction of a (well-typed) function with signature read(address: int, bit: int) → bit. Frightening.

The paper then describes possible mitigations and their runtime impact. The authors experimented with mitigations in the V8 JavaScript engine. For instance, one mitigation was to augment every branch with an LFENCE instruction (as suggested by Intel). This led to a 2.8x slowdown on the Octane JavaScript benchmark. Another mitigation, retpoline, led to a 1.52x slowdown on Octane. Unfortunately, none of the mitigations completely protect against Spectre.

📎 Reading 'A Generalised Solution to Distributed Consensus'

21 Feb 2019

A Generalised Solution to Distributed Consensus

This paper generalizes Paxos and Fast Paxos into a model of write-once registers and provides three consensus algorithms for this model.

What caught my eye in this paper is this quote from the first page

This paper aims to improve performance by offering a generalised solution allowing engineers the flexibility to choose their own trade-offs according to the needs of their particular application and deployment environment.

This jibes with the goal of reflective consistency. Developers using reflective consistency can choose their own tradeoffs on the correctness/time curve. They can sacrifice the consistency of their system to improve its reactiveness.

Instead of the correctness/time curve, each of the alternative algorithms proposed in this paper fall somewhere on a curve balancing tradeoffs between Paxos and Fast Paxos.

Generalized Distributed Consensus

The authors reframe distributed consensus into the processing of write-once registers. They separate processes into servers, storing values, and clients, getting and putting values. They state that solving consensus means ensuring all of the following:

Non-triviality - All output values must have been the input value of a client.

Agreement - All clients that output a value must output the same value.

Progress - All clients must eventually output a value if the system is reliable and synchronous for a sufficient period.

Their general solution organizes write-once registers into the following taxonomy

They then go on to define a general algorithm to establish consensus. I could not keep up with the discussion of the algorithm, but it seemed well-grounded. In the end they offer proofs for their algorithms but passingly mention that one goal of this work is to create a general enough model for distributed consensus so that the correctness of its algorithms become obvious. Sounds tough to me, but what a noble goal!


Modern distributed consensus often means Raft. I’m curious if this could generalize raft in addition to Paxos and Fast Paxos.

It’d be interesting to see a benchmark comparing the different example consensus algorithms proposed in section 6.

📎 Reflective Consistency

14 Jan 2019

Reflective consistency is a consistency model which exposes anomaly statistics to the user. The anomalies are defined by some underlying consistency model. For example, a system could run under eventual consistency but track anomalies as defined by linearizability. Statistics on these anomalies can then be used by the system or the user of such a system. One example of using the statistics is for modifying the underlying consistency model.

Example: Cassandra with Reflective Consistency

Cassandra offers many consistency levels1 under which every read and write can be performed. A user could request Cassandra to perform all writes under a strong consistency level, such as requiring a quorum of replicas to acknowledge every write, and all reads under a relatively weaker consistency level.

Using the anomaly statistics, a user could modify their use of Cassandra based on anomalies. A user could perform all writes under the same, weak consistency level as reads, and when anomalies start to increase more than the user wants they could direct Cassandra to use the stronger consistency level until anomalies being to decrease.

Statistics could even be tracked per key, per replica, or per anything. This gives the user fine or coarse grained control.

📎 Example of Type-Based Optimizations in Factor

30 Jun 2018

Factor is dynamically typed, yet with words such as declare and TYPED: we can guide the compiler to generate more efficient code.

Consider the following two words and their definitions:

: add-untyped ( x y -- z ) + ;
TYPED: add-typed ( x: fixnum y: fixnum -- z: fixnum ) + ;

We will use the disassemble word for viewing assembly. Let’s now compare the difference between add-untyped and add-typed. First, add-untyped:

00007f134c061820: 8905da27f4fe    mov [rip-0x10bd826], eax
00007f134c061826: 8905d427f4fe    mov [rip-0x10bd82c], eax
00007f134c06182c: 488d1d05000000  lea rbx, [rip+0x5]
00007f134c061833: e9e8f115ff      jmp 0x7f134b1c0a20 (+)

Very simple, this jumps to the generic word + which will dynamically dispatch to the appropriate + definition for fixnum. And now add-typed:

00007f134c041c90: 89056a23f6fe    mov [rip-0x109dc96], eax
00007f134c041c96: 53              push rbx
00007f134c041c97: 498b06          mov rax, [r14]
00007f134c041c9a: 4983ee08        sub r14, 0x8
00007f134c041c9e: 4983c708        add r15, 0x8
00007f134c041ca2: 498907          mov [r15], rax
00007f134c041ca5: e8d65dfefe      call 0x7f134b027a80 ([ \ integer>fixnum-strict ~array~ 0 ~array~ inline-cache-miss ])
00007f134c041caa: 498b07          mov rax, [r15]
00007f134c041cad: 4983c608        add r14, 0x8
00007f134c041cb1: 4983ef08        sub r15, 0x8
00007f134c041cb5: 498906          mov [r14], rax
00007f134c041cb8: e8c35dfefe      call 0x7f134b027a80 ([ \ integer>fixnum-strict ~array~ 0 ~array~ inline-cache-miss ])
00007f134c041cbd: 89053d23f6fe    mov [rip-0x109dcc3], eax
00007f134c041cc3: 5b              pop rbx
00007f134c041cc4: 488d1d05000000  lea rbx, [rip+0x5]
00007f134c041ccb: e950ffffff      jmp 0x7f134c041c20 (( typed add-typed ))

This first tries to coerce the two inputs to fixnum and then jumps to an address marked as (( typed add-typed )). The assembly for (( typed add-typed )) is

00007f134c041c20: 8905da23f6fe    mov [rip-0x109dc26], eax
00007f134c041c26: 53              push rbx
00007f134c041c27: 498b06          mov rax, [r14]
00007f134c041c2a: 498b5ef8        mov rbx, [r14-0x8]
00007f134c041c2e: 4801c3          add rbx, rax
00007f134c041c31: 0f801a000000    jo 0x7f134c041c51 (( typed add-typed ) + 0x31)
00007f134c041c37: 4983ee08        sub r14, 0x8
00007f134c041c3b: 49891e          mov [r14], rbx
00007f134c041c3e: 8905bc23f6fe    mov [rip-0x109dc44], eax
00007f134c041c44: 5b              pop rbx
00007f134c041c45: 488d1d05000000  lea rbx, [rip+0x5]
00007f134c041c4c: e96f5dfefe      jmp 0x7f134b0279c0 ([ \ integer>fixnum-strict ~array~ 0 ~array~...)
00007f134c041c51: e80a1a45ff      call 0x7f134b493660 (fixnum+overflow)
00007f134c041c56: 8905a423f6fe    mov [rip-0x109dc5c], eax
00007f134c041c5c: 5b              pop rbx
00007f134c041c5d: 488d1d05000000  lea rbx, [rip+0x5]
00007f134c041c64: e9575dfefe      jmp 0x7f134b0279c0 ([ \ integer>fixnum-strict ~array~ 0 ~array~...)

This will do an add instruction directly and then check for overflow. No dynamic dispatch involved.

Let’s test these out with some input values. We’ll look at the assembly for [1 2 add-untyped] and then [1 2 add-typed]. First, using add-untyped:

00007f134c148370: 89058abce5fe      mov [rip-0x11a4376], eax
00007f134c148376: 4983c610          add r14, 0x10
00007f134c14837a: 49c70620000000    mov qword [r14], 0x20
00007f134c148381: 49c746f810000000  mov qword [r14-0x8], 0x10
00007f134c148389: 890571bce5fe      mov [rip-0x11a438f], eax
00007f134c14838f: 488d1d05000000    lea rbx, [rip+0x5]
00007f134c148396: e98594f1ff        jmp 0x7f134c061820 (add-untyped)
00007f134c14839b: 0000              add [rax], al
00007f134c14839d: 0000              add [rax], al
00007f134c14839f: 00                invalid

As expected, this jumps to add-untyped. And now using add-typed:

00007f134c146850: 8905aad7e5fe      mov [rip-0x11a2856], eax
00007f134c146856: 4983c610          add r14, 0x10
00007f134c14685a: 49c70620000000    mov qword [r14], 0x20
00007f134c146861: 49c746f810000000  mov qword [r14-0x8], 0x10
00007f134c146869: 890591d7e5fe      mov [rip-0x11a286f], eax
00007f134c14686f: 488d1d05000000    lea rbx, [rip+0x5]
00007f134c146876: e9a5b3efff        jmp 0x7f134c041c20 (( typed add-typed ))
00007f134c14687b: 0000              add [rax], al
00007f134c14687d: 0000              add [rax], al
00007f134c14687f: 00                invalid

This directly jumps to (( typed add-typed )) which will directly do the add instruction. The compiler safely skipped the type coercion as it knows the inputs are both fixnum.

📎 A Short Guide to Adding a Keyword to Go

13 Jan 2018

Go is a programming language.

Sometimes code looks cleaner when variable declarations are listed after code that uses them. Imagine writing Go in this style.

$ cat test.go
package main

import (

func main() {

  andwhere a = 1
  andwhere f = func(a int) string { return strconv.Itoa(a+1) }
$ ./bin/go run test.go

Below are the modifications I made to the Go compiler.

--- a/src/cmd/compile/internal/syntax/tokens.go
+++ b/src/cmd/compile/internal/syntax/tokens.go
@@ -46,6 +46,7 @@ const (
+       _Andwhere
@@ -118,6 +119,7 @@ var tokstrings = [...]string{
        _Continue:    "continue",
        _Default:     "default",
        _Defer:       "defer",
+       _Andwhere:    "andwhere",
        _Else:        "else",
        _Fallthrough: "fallthrough",
        _For:         "for",
--- a/src/cmd/compile/internal/syntax/nodes.go
+++ b/src/cmd/compile/internal/syntax/nodes.go
@@ -371,6 +371,11 @@ type (

+       Andwhere struct {
+               S *DeclStmt
+               stmt
+       }
        ReturnStmt struct {
                Results Expr // nil means no explicit return values
--- a/src/cmd/compile/internal/syntax/parser.go
+++ b/src/cmd/compile/internal/syntax/parser.go
@@ -2034,6 +2053,11 @@ func (p *parser) stmtOrNil() Stmt {
        case _Go, _Defer:
                return p.callStmt()

+       case _Andwhere:
+               s := new(Andwhere)
+               s.S = p.declStmt(p.varDecl)
+               return s
        case _Goto:
                s := new(BranchStmt)
                s.pos = p.pos()
@@ -542,9 +543,27 @@ func (p *parser) funcDeclOrNil() *FuncDecl {
        f.Pragma = p.pragma

+       // Lift all Andwhere to the top
+       if f.Body != nil {
+               f.Body.List = lift_andwheres(f.Body.List)
+       }
        return f

+func lift_andwheres(list []Stmt) []Stmt {
+       for i, x := range list {
+               switch x.(type) {
+               case *Andwhere:
+                       removed := append(list[:i], list[i+1:]...)
+                       modified := x.(*Andwhere).S
+                       prepended := append([]Stmt{modified}, removed...)
+                       return lift_andwheres(prepended)
+               }
+       }
+       return list
 func (p *parser) funcBody() *BlockStmt {
        errcnt := p.errcnt

📎 A Short Guide to Creating Babel Plugins

12 Jan 2018

Babel is a JavaScript compiler with plugin support.

Imagine a Babel plugin that transforms all string literals into "babel". The source below does that.

module.exports = function () {
  return {
    visitor: {
      StringLiteral: function a(path) {
        path.node.value = 'babel';

Here’s how to use it:

$ npm install babel babel-cli
$ ./node_modules/.bin/babel --plugins ./a.js
a = 'hi';
a = 'babel';

It works with files too:

$ cat b.js
var s = 'start';

function go() {
  var e = 'end';
  return s + e;

$ ./node_modules/.bin/babel --plugins ./a.js ./b.js
var s = 'babel';

function go() {
  var e = 'babel';
  return s + e;

$ ./node_modules/.bin/babel --plugins ./a.js ./b.js | node

Now imagine the same babel plugin that, in addition, transforms every variable declaration in a function initialized to '*' to be abstracted as a function argument. The source below does that.

module.exports = function (babel) {
  t = babel.types

  return {
    visitor: {
      StringLiteral: function a(path) {
        path.node.value = 'babel';
      VariableDeclaration: function a(path) {
        var var_name = path.node.declarations[0];
        var var_val = path.node.declarations[0].init.value;
        if (var_val === '*') {
          path.node.declarations[0].init = t.identifier(var_name);

Here’s how to use it:

$ cat c.js
function go(t) {
  var c = 'hi';
  var a = '*';
  for (var i = 0; i < t; ++i) {
    console.log(c + ' ' + a);

go(5, 'stranger')
$ ./node_modules/.bin/babel --plugins ./a.js ./c.js | node
$ ./node_modules/.bin/babel --plugins ./a.js ./c.js
function go(t, a) {
  var c = 'babel';
  var a = a;
  for (var i = 0; i < t; ++i) {
    console.log(c + 'babel' + a);

go(5, 'babel');

📎 Exploring Loop Perforation Opportunities in OCaml Programs

09 Aug 2016

Approximate computing is sacrificing accuracy in the hopes of reducing resource usage. Loop perforation is an approximate programming technique with a simple story: skip loop iterations whenever we can. This simple idea is why loop perforation is one of the simplest approximate programming techniques to understand and apply to existing programs. Computations often use loops to aggregate values or to repeat subcomputations until a convergence. Often, it is acceptable to simply skip some of the values in an aggregation or to stop a computation before it has a chance to converge.

This summer I’ve been working on a handful of projects at OCaml Labs in the pleasant city of Cambridge, UK. Part of my work has been in creating a tool for loop perforation.


aperf (opam, github) is a research quality tool for exploring loop perforation opportunities in OCaml programs.

Aperf takes a user-prepared source file, instructions on how to build the program, and a way to evaluate its accuracy, and will try to find a good balance between high speedup and low accuracy loss. It does this by testing perforation configurations one by one. A perforation configuration is a perforation percentage for each for loop which the user requests for testing. A computation with four perforation hooks will have configurations like [0.13; 1.0; 0.50; 0.75], etc. Aperf searches for configurations to try either by exhaustively enumerating all possibilities or by pseudo-smartly using optimization techniques.

Aperf offers hooks into its system in the form of ppx annotations and Aperf module functions in a source file. A user annotates for loops as

for i = 1 to 10 do
  work ()
done [@perforate]

Upon request, aperf can store the amount of perforation into a reference variable:

let p = ref 1. in
for i = 1 to 10 do
  work ()
done [@perforatenote p]

This could be useful for extrapolating an answer. Right now, the only Aperf module function used for hooking perforation is Aperf.array_iter_approx which has the same type as Array.iter. If this tool makes it out of the research stage then we will add more perforation hooks. (They are easy to add, thanks to ppx_metaquot.)


Right now we only support exhaustive search and hill climbing. Better strategies, including interacting with nlopt and its ocaml bindings, could be a future direction.

Our hill climbing uses the fitness function described in Managing Performance vs. Accuracy Trade-offs with Loop Perforation1:

The b parameter (which we also call the accuracy loss bound) is for controlling the maximum acceptable accuracy loss as well as giving more weight to the accuracy part of the computation. In other words, a perforation solution must work harder at maintaining low accuracy loss than in speeding up the computation.


OCaml programs don’t usually use loops. This is what initially led us to offer Aperf.array_iter_approx

Programs must have a way to calculate the accuracy of perforated answers. This is a huge limitation in practice because usually this means the user of aperf must knowing the answer ahead of time. In a lot of domains this is unfeasible. Finding applicable domains is an ongoing problem in approximate computing.



Summing $m$ out of $n$ numbers sounds boring but it’s a good place to start. This problem is interesting because there are theoretical bounds to compare results to. Hoeffding’s Inequality tells us what we can expect for absolute error. The following graph gives maximum absolute error, with confidence of 95%, when keeping and summing $m$ variables out of 100 where each variable is drawn uniformly from the range $[0,100{,}000]$.

theoretical absolute bounds

We can also give the same graph with percentage bounds (assuming the result is 5,000,000).

theoretical percent bounds

Our code for summing is a little awkward because aperf cannot currently work with most computations. The best option we can fit our computation into is Aperf.array_iter_approx. We also use [@perforatenote perforated] which hooks into aperf and saves the used perforation to the variable perforated. We use this to extrapolate the final summation value. The final, awkward code:

Aperf.array_iter_approx (fun a -> answer := !answer + a) arr [@perforatenote perforated] ;
answer := int_of_float ((float_of_int !answer) *. (1. /. !perforated)) ;


We perform initial exploration using 30 input sets of 100 numbers between 0 and 100,000. Below is a graph of the accuracy loss of an exhaustive search of perforation while keeping $x$ out of the 100 numbers.

exhaustive training results

Note that we are well below the theoretical loss. Below is a graph of the accuracy loss of a hill climbing run of the perforation with an accuracy loss bound of 30%.

explore training results

Our exploration finds that a perforation of 15% is adequate. This means that out of 100 numbers we are only summing 15 of them. From this we get a speedup of 3.6x and an accuracy loss of 11%.

It specifically chose 15% because it had a good combination of high speedup and low accuracy loss. Among the other configurations the system tried was 35%. It had a speedup of 2.3x and an accuracy loss of 9.5%. Why wasn’t it chosen? The program saw that with just a 1.5% further sacrifice of accuracy the system was able to decrease run time by over one factor. Another configuration the system tried was 6% which has an impressive speedup of 7.9x at an accuracy loss of 16%. The system, as programmed now, did not want to sacrifice an extra 5% accuracy loss for the extra speed. Perhaps if aperf supported user defined climbing strategies then someone could program a strategy which accepted that tradeoff and explored further.

If we require an accuracy loss bound of 5% then aperf finds a perforation of 50% to be the best result which gives a speedup of 1.3x and an accuracy loss of 3.5%.


We test the two perforation results from the 30% accuracy bound and the 5% accuracy bound on 10 testing sets each of 30 input sets of 100 numbers between 0 and 100,000. This is to test the generalization of the perforation—we want the perforation to work equally well on data it wasn’t trained for. Over 10 runs, with perforation 15% the average accuracy loss is 11.8% and with perforation 50% the average accuracy loss is 5.2%.

We then run the same experiment but with inputs drawn from the range $[0, 1{,}000{,}000]$. This is to test the scalability of the perforation—we want the perforation to work well for larger ranges of data than those it was trained for. Over 10 runs, with perforation 15% the average accuracy loss is 10.4% and with perforation 50% the average accuracy loss is 4.8%.

It looks like the perforation is generalizing and scaling nicely. Of course this is a simple problem and we’d like to explore some results from a more complex domain.


We perforate two loops in K-means

  1. the loop over the points when finding new centers
  2. the number of iterations of the step algorithm


We initially train on 30 inputs of 1,000 2-dimensional points where values in each dimension range from 0 to 500. Below is a graph of the speedup vs. accuracy loss of an exhaustive search of perforation.

exhaustive training results

Below is a graph of the speedup vs. accuracy loss of a hill climbing run of the perforation with an accuracy loss bound of 30%.

explore training results

The result aperf picked in the end was 44% perforation for the points loop and 33% perforation for the step algorithm. This configuration had a speedup of 2.8x and an accuracy loss of 15.5%. If the user wishes to move upwards into the 3x speedup range then there are some results there that offer 17% accuracy loss and above as seen on the graph. Again, this is all about efficiently exploring the perforation configuration space. Different strategies could be a future feature.


We test on 30 inputs of 1,000 2-dimensional points where values in each dimension range from 0 to 1,000. This tests whether the perforation scales with larger ranges on the dimensions. Over 10 runs, the average speedup is 2.8x and the average accuracy loss is 16.8%.

We also test on 30 inputs of 10,000 2-dimensional points with the same dimension ranges. This tests if the perforation scales with the number of points in each input set. Over 10 runs, the average speedup is 3.2x and the average accuracy loss is 14.3%.

Finally, we test on 30 inputs of 10,000 2-dimensional points where values in each dimension range from 0 to 5,000. This tests if the perforation scales with larger ranges in the dimensions as well as the number of points in each input set. Over 30 runs, the average speedup is 3.2x and the average accuracy loss is 13.0%.

The perforation seems to be scaling nicely.


We should test more computations with aperf. I’m excited to see how the community may use this tool and I’m willing to work with anyone who wants to use aperf.

📎 The Difficulty in a General Approximate Computing Framework

15 Jul 2016

Approximate computing is sacrificing accuracy in the hopes of reducing resource usage.

Application specific techniques in this space require too much domain specific knowledge to be used as generic solutions for approximate programming.

I will go through some common techniques in approximate computing, why they don’t work well as generic solutions, and then offer a potential future direction.

Statically Bounding Error

One commonly sought after goal in approximate computing is to statically bound the absolute error of a computation. The most applied approach for accomplishing this is to statically recognize patterns of computation and use statistical techniques to derive bounds1 2.

For example, we can use Hoeffding’s Inequality to approximate the summation of variables. Unfortunately it’s use is constrained in that the variables it sums must be independent and bounded by some interval $[a,b]$. If we meet these constraints then we can give the following error bound, with a probability of at least $1-\epsilon$, on a summation of only $m$ numbers out of an original $n$-number summation:

With further use of statistical techniques, it is possible to derive static absolute error bounds when calculating mean, argmin-sum, and ratio. However, depending on which statistical technique you use, you might have to prove or assume i.i.d.-ness of random variables, that sequences of random variables are random walks, the distribution of random variables, or other properties. These are all very domain dependent properties which don’t fit well into a generic programming framework. Further, the patterns of computation these techniques work for (i.e. sum, mean, argmin-sum, and ratio) are very low level; it’s unclear whether these techniques can scale to high-level algorithms without running into very advanced statistical techniques. For example, how would the error bounds on a sum help you give error bounds on a Fourier transform?

Reasoning and Proving

There has been at least one effort in offering programmers the chance to reason and prove properties of approximate programs3. The dream is that you would be able to prove more complex properties about more complex calculations than the pattern recognition tools offer.

For example, the referenced paper gives an example of reasoning over a program which skips locks to speed up execution time. The program is 14 lines. The reasoning, done in Coq, statically bounds the error on the calculation. The proof is 223 lines (not counting comments, blank lines, or the almost 5k lines of Coq which make up the library which powers the proof). I am not pointing out the line counts because I do not believe in this approach. On the contrary, I do believe in this approach—I just don’t think it’s feasible to expect an average programmer today to reason about their approximate programs using this framework. Maybe in a few years; not now.

A Hands on Approach

I do not believe that static analysis via pattern detection is a good fit for generic approximate computing. I do believe that proving and reasoning could become a good generic solution if and when proof automation becomes more mainstream. I believe the closest we can get today to a generic solution is to offer programmers a toolbox which allows them to explore the effects of approximate programming techniques in a hands on manner.

I’ve started writing a user-guided loop perforation system called aperf. The code is on github. Right now it only supports loop perforation. Loop perforation is when some iterations of a process are dropped in the name of approximate computing. The goal is to drop as much as possible while still maintaining an acceptable error/resource tradeoff. aperf takes an annotated source program as input and automatically searches the error/time tradeoff space for a pareto curve. A detailed report of the perforation options are returned the user.