Profile Lucas Sta Maria

I'm a third-year Computer Science and Mathematics student at Northeastern University interested in programming languages and developer tooling.

(λx.x x) (λx.x x)

Brainfart

Brainfuck is an esoteric programming language that consists of 8 characters that make its instruction set. It is Turing-complete, using an array of cells to store information. A single pointer accesses these cells. The 8 instructions provide the following actions:

General Implementation Details

Some implementations have cells with max values of 255, where incrementing at 255 causes the value to wrap around to zero (and decrementing at 0 wraps to 255). Some also do the same for the cell array, forming a cyclical structure.

My implementation used unsigned 32-bit integers, meaning each cell can store massive numbers. I also disallowed wrapping for both pointer location and value, emitting errors when performed. Further, my cell array dynamically expands as the pointer moves further to the right.

This implementation detail allows for flexible storage of large information, such as unicode characters, and many of them because of the array's reallocation to expand. However, a certain number of brainfuck algorithms rely on wrapping values. Brainfart does not implement wrapping, and even if it did, it would take 2^32 - 1 of + instructions to wrap around in the positive direction.

Tokens

A Token is a struct that contains an enum TokenType, as well as the line and column of where the token is located. Storing the line and column allows for more explicit error reporting. Each instruction is mapped to a TokenType, and if the character isn't a valid instruction, it is simply ignored. This allows for comments to be added.

enum TokenType {
    PointInc,
    PointDec,
    ValInc,
    ValDec,
    Output,
    Input,
    IfZero,
    IfNonZero
}

The lexer takes every character from the brainfuck program and maps it to a Token with its corresponding TokenType.

Optimizing with Exprs

It is simple to run a brainfuck program using only the vector of Token – in fact, that's what version 1.0.0 did. However, we can observe in certain programs that there is a lot of repetition. Particularly in programs with many [ and ] (which I shall refer to as "loop blocks") and consecutive adds/subtracts.

There is obviously an avenue for optimization here: we can introduce preprocessing time that merges these consecutive instructions together. Following that, we will observe an increase in efficiency, for adding x once is faster than adding one x times. This is especially apparent in programs that involve a lot of loop blocks.

To achieve this optimization, I introduced a struct Expr and its enum ExprType. There is also a struct LoopBlock, which stores the expressions enclosed within the brackets, including other loop blocks.

enum ExprType {
    Set(u32),
    Add(u32),
    Sub(u32),
    MoveRight(u32),
    MoveLeft(u32),
    Output(u32),
    Input(u32),
    LoopBlock(Box<LoopBlock>),
}

struct LoopBlock {
    exprs: Vec<Expr>,
}

Most of the ExprType members store the number of consecutive occurences. Brainfart's parser converts +++++ into an Expr with ExprType::Add(5). This allows for us to add the contained value in constant time to the current cell.

Noticeably, in ExprType, we have the member LoopBlock(Box<LoopBlock>) that refers to the LoopBlock struct. We Box it (allocate it to the heap) so that the compiler can determine the size of an ExprType, since there are cyclical references between LoopBlock, ExprType, and Expr.

The parser converts all the Token structs into Expr structs, and all we have to do is evaluate those Expr structs.

Our ProgState (program state, or context) is a struct that comprises simply of a vector of u32 integers and an index that represents the pointer. Evaluating the expressions in order mutates the ProgState. Since an Expr is comprised of several ~Token~s, it does so considerably efficiently.

Benchmark

I ran a simple test on the unoptimized vs optimized versions of the Brainfart interpreter.

# optimized
$ time bft test.bf
bft test.bf  9.92s user 0.00s system 99% cpu 9.929 total

# unoptimized
$ time bft test.bf
bft test.bf  687.40s user 0.30s system 99% cpu 11:29.18 total

test.bf

++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
+++++++++++[>+++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++[>++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
+++++++++++++++++[>+
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++[>++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
++++++++++++++++++++
+++<-]<-]<-]<-]

While obviously it's unlikely to encounter such a brainfuck program with that many consecutive adds and nested loop blocks, it does demonstrate how parsing optimizes such programs.

In programs, however, that consist of mostly nonconsecutive instructions and scarce loop blocks, parsing does have an overhead in efficiency, although mostly negligible.