per: composable forward-passing processor functions

October 31, 2013

The other day I tried implementing SICP streams in JavaScript as part of a fun project I’m tinkering with. I noticed that (unsurprisingly, given how they work) they generate a lot of memory garbage during the main loop of whatever operation you’re doing, and the overhead of this can actually become significant.

So I blogged them, sighed to myself, and then ripped them out of my project. What should I use instead? I want a combination of high performance and convenience. Also I miss generators, which of course I can’t (yet) assume are available on my target platforms, as they include browsers dating as far back as the ancient IE9 (ask your grandmother) and Chrome 30 without the about:flags experimental JavaScript features enabled.

My building block will be a function of this shape, which I’ll refer to as a processor:

function timesTwo(emit, value) {
    emit(value * 2); // this part is optional
}

It takes two parameters: the first is a function to which it can pass values forward, and the second is a value for it to process. So you have to call it to give it a value, and it may or may not emit values onwards to wherever you tell it.

Of course it can emit nothing, or many values:

function triplicate(emit, value) {
    emit(value);
    emit(value);
    emit(value);
}

If you’re into Ruby this will be very familiar to you as a way of representing a sequence of values, and if you implement this kind of function it’s nice and easy, like using yield in a true generator.

The difference here from the Ruby pattern is that we’re discussing a pure function, rather than a method on an object. So we take a second argument as an input that can be used to determine what we emit. For example, we could assume our values will be arrays, and “flatten” them:

function flatten(emit, value) {
    value.forEach(emit); // JS array's forEach fits perfectly
}

If you called this three times, passing an array each time, emit would receive all the elements of the three arrays as one flat stream (one call per element), with no indication of which array each element originally came from.
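
To make that concrete, here's a minimal sketch that uses a simple collecting function as the emit target:

var results = [];
function collect(value) {
    results.push(value);
}

flatten(collect, [1, 2]);
flatten(collect, [3]);
flatten(collect, [4, 5, 6]);

// results is now [1, 2, 3, 4, 5, 6]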

Alternatively we can ignore the second parameter (not even bothering to declare it) and so define a pure source of data that emits its sequence of values when called:

function chatter(emit) {
    emit('hi');
    emit('whatever');
    emit('bye');
}

So far, so patterny. But the pain with these kinds of building blocks is that they don’t directly compose. An intermediate processor like flatten wants two arguments: where to emit its output and what value to process. But any preceding step just wants a function called emit that accepts one argument: a value.

We can take any two processors and turn them into a single processor that chains the two (put on your higher-order functional programming spectacles now):

function compose(first, second) {
    return function(emit, value) {
        return first(function(firstValue) {
            return second(emit, firstValue);
        }, value);
    };
}

(Note: I pass back the return value through the layers because it has a use that we’ll go into later.)

See how it works? compose creates and returns a new function. It weaves first and second together. The new wrapper receives the outer emit, and that is given to second, so that’s where the final results will be sent. The first function is passed another new function to emit to, which is where we call second.
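
To see it do its thing, we can compose the two processors we already have (a tiny sketch):

var tripleDouble = compose(triplicate, timesTwo);

tripleDouble(function(value) { console.log(value); }, 5);
// logs 10 three times: triplicate emits 5, 5, 5,
// and timesTwo doubles each one on its way to the outer emit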

That’s the beautiful version. But it means that we have to allocate a function object every time a value is passed in. And let’s be crazy C++ programmers for a moment and assume that’s Just Not Good Enough. We could rewrite compose to be butt-ugly and yet still do what we want, without doing any dynamic allocation after initial setup:

function compose(first, second) {
    var secondEmit;
    function firstEmit(firstVal) {
        return second(secondEmit, firstVal);
    }
    return function(emit, value) {
        secondEmit = emit;
        return first(firstEmit, value);            
    };
}

Yes, EEWWW indeed. We use a local variable, secondEmit, in which we stash the outer emit as soon as we know it. And we create a firstEmit function once, so we can reuse it.

In simple scenarios this will behave the same as the beautiful version. But not always:

function troubleMaker(emit, value) {
    setInterval(function() { emit(value); }, 100);
}

var doubleTrouble = compose(troubleMaker, timesTwo);

doubleTrouble(function(value) { console.log(value); }, 3);
doubleTrouble(function(value) { document.write(value); }, 4);

Now we have the value 6 being printed to the console and the value 8 being appended to the document. Except… not if we use the second version of compose, because then the second call to doubleTrouble would redirect both streams to the new destination (by updating that secondEmit local variable). What a mess.

Fortunately, we’re not C++ programmers! Phew! This doesn’t mean that we don’t care about performance. It just means that we do some measuring before going crazy about imaginary overhead. And on V8 I find that the “faster” version is approximately… 1% faster. Screw that. Let’s stick with the beautiful, easy-to-predict version.

The one other interesting feature of the pattern is what we can use the return value for: to signal that the receiver doesn’t want to receive any more data. To do this they return true. So our flatten example should actually look like this:

function flatten(emit, value) {
    return value.some(emit);
}

That way, we stop looping as soon as the receiver signals it has had enough (because that’s what a JavaScript array’s some method does: it quits and returns true as soon as the function you gave it returns true).
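
To picture the signal propagating, here's a sketch with a hypothetical receiver that declares itself full after two values:

var received = [];
function fussyReceiver(value) {
    received.push(value);
    return received.length >= 2; // true means "I'm full, stop sending"
}

flatten(fussyReceiver, [10, 20, 30, 40]);
// received is [10, 20]: some() stopped as soon as the receiver returned true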

So that’s the pattern. What more is there to say? Well, although writing a processor (or pure value generator) is very easy, because you just write imperative code and call emit whenever you like, it’s not so convenient to use them as components. Yes, we have compose to efficiently pipeline processors together, but there are a lot of “standard” things we tend to do on sequences where it would be nice not to have to write boilerplate code. Especially when writing tests (shudder).

To make this really easy and succinct, I’ve cooked up a little library called per:

npm install per

(Source code)

It’s reminiscent of jQuery, in that it puts a lightweight wrapper around something so that we can call useful operations on it. Most operations return another instance of the wrapper, so calls can be chained. The only entry point into the whole library is a single function called per, so in node (or browserify or webmake) you’d say:

var per = require('per');

In the browser you could just load per.js via a script tag and then you get a global per. Then you can wrap functions like this:

var numbers = per(function(emit) {
    for (var n = 0; n < 100; n++) {
        emit(n);
    }
});

The simplest methods on a per are ways to capture and return the emitted values:

var f = numbers.first(),    // f == 0
    l = numbers.last(),     // l == 99
    a = numbers.all();      // array [0... 99]

The function wrapped by a per is available in a property called forEach, so named because for a simple generator you can treat it a little like an array:

numbers.forEach(function(value) {
    console.log(value);
});

To compose functions, there is a per method which is exactly like the compose function we figured out above:

var odds = numbers.per(function(emit, value) {
                           if (value % 2) {
                               emit(value);
                           }
                       });

The above pattern is an example of a filter, which evaluates an expression to decide whether or not to forward a value. This is such an obvious pattern that we should have a built-in formalization of it:

var odds = numbers.filter(function(value) {
                              return value % 2
                          });

For extra brevity we support passing a string expression in terms of x:

var odds = numbers.filter('x%2');

Another common pattern is to transform a value and pass it on, which is captured by the map method:

var evens = numbers.filter('x%2').map('x-1');

How can we use the resulting combination? As always it has a forEach property, and because numbers doesn’t need a value parameter we only need to pass it a function to emit to (or we can use first, last or all).
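
For example, with the evens chain above, something like this should work:

evens.forEach(function(value) {
    console.log(value); // 0, 2, 4 ... 98
});

console.log(evens.first()); // 0
console.log(evens.last());  // 98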

There are operators skip and take that work like their corresponding Linq operators:

odds.skip(3).all()          // [7, 9, 11, 13...]
odds.skip(2).take(3).all()  // [5, 7, 9]

These are examples of stateful transformers, because they have an internal counter that controls their behaviour.
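
To make “stateful” concrete, here's a rough sketch of how a skip-like transformer could be written by hand (illustrative, not per's actual source):

function makeSkip(count) {
    var seen = 0; // the internal counter: this is the state
    return function(emit, value) {
        if (seen++ < count) {
            return; // swallow the first `count` values
        }
        return emit(value);
    };
}

console.log(numbers.per(makeSkip(3)).first()); // 3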

We can also construct an initial per by passing an array:

var a = per([1, 2, 3, 4, 5]);

console.log(a.first()); // 1
console.log(a.last()); // 5

You can even pass per a simple value and it will act as a function that merely emits that one value. For example that value might be something complex, such as a whole document that you’re going to pass through several parsing stages.
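
So, if I'm reading the wrapper right, something like this should hold (hypothetical values, just for illustration):

var doc = per('a whole document');

console.log(doc.first()); // 'a whole document'
console.log(doc.all());   // ['a whole document']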

The above examples all use a fairly simple structure: the first call to per provides an initial value (or stream of values), then there is a chain of transformations, then the final step collects the results. This is fine if you want to process a stream of data in a single quick operation.

But what if you have a source of data that occasionally delivers you a value, and you want to send it through a pipeline of transformers? That is, rather than a single deluge you have more of a drip, drip, drip of information. You can prepare a chain of transformers:

var p = per(foo1).per(foo2)
                 .map(bar1)
                 .filter(bar2)
                 .listen(function(value) {
                     console.log(value);
                 });

listen is used to inject a function into the pipeline that just receives values, but otherwise has no effect on the stream. So in fact it’s like per, except that the emit(value) call happens automatically.
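
In other words, listen(fn) should behave roughly like this hand-written equivalent (a sketch, not per's actual source):

function listenSketch(fn) {
    return function(emit, value) {
        fn(value);          // the listener just receives the value
        return emit(value); // and the value passes through unchanged
    };
}

// so p.listen(fn) acts like p.per(listenSketch(fn))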

So now you need a way to pass values into the front, for which you can use submit:

p.submit(5);
p.submit(41);
p.submit('hi');

Sometimes you need to direct the flow of information through multiple separate sections simultaneously. This is where multicast comes in handy:

var p = per(foo1).per(foo2).multicast(
    per(foo1).filter('!(x%2)').take(72).into(results1),
    per(foo2).filter('x%2').take(68).into(results2)
);

You can pass as many arguments as you like to multicast (they can be instances of per or plain transformer functions) and they will each receive every value. By the way, multicast is built on listen, so it doesn’t affect the stream, except that when all the recipients have returned true to indicate that they are full, then multicast itself will do the same.
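
Here's a rough sketch of that idea (again, illustrative rather than per's actual source): each value is offered to every recipient that isn't yet full, and the combined transformer only reports itself full when all of them are:

function multicastSketch(recipients) {
    var full = recipients.map(function() { return false; });
    function ignore() {} // the recipients' own emits go nowhere
    return function(emit, value) {
        var allFull = true;
        recipients.forEach(function(recipient, i) {
            if (!full[i] && recipient(ignore, value)) {
                full[i] = true; // this recipient returned true: it's full
            }
            allFull = allFull && full[i];
        });
        emit(value);    // listen-like: the stream itself is unaffected
        return allFull; // only signal "stop" when every recipient is full
    };
}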

But really the core idea is to break up a complex multi-stage task into functions like this:

function transformer(emit, value) {
    // do something interesting and maybe call emit
    // a few times, passing forward values to the next stage
}

The per library merely provides some common ways to stitch such functions together.


SICP-style Streams in JavaScript

October 29, 2013

In the not-famous-enough book Structure and Interpretation of Computer Programs (Abelson & Sussman, or “The Wizard book”) we learn about streams.

A stream is a tempting variation on the old school Lisp style of linked list. To get a plain old list, we can set up objects like this:

var a = {
    value: 'apple',
    next: null
};

var b = {
    value: 'banana',
    next: a
};

var c = {
    value: 'cantaloupe',
    next: b
};

So here our whole list is represented by c, and we can loop through it and print all the fruits:

for (var i = c; i != null; i = i.next) {
    console.log(i.value);
}

So far, so boring. The idea with a stream is very simple. Instead of storing the next object in the next property, we store a function that, if called, will return the next object. That is, we make it lazy. Note that our loop would still look much the same:

for (var i = c; i != null; i = i.next()) {
    console.log(i.value);
}

The only difference is we call next() instead of just reading it. And to set up the objects we’d have to say:

var a = {
    value: 'apple',
    next: function() { return null; }
};

var b = {
    value: 'banana',
    next: function() { return a; }
};

var c = {
    value: 'cantaloupe',
    next: function() { return b; }
};

So far, so pointless. But the value of this does not come from silly hand-built examples. In real software you would use this to generate streams from other data sources, or from other streams. It’s like Linq-to-objects in C#, but the foundations are actually more purely functional, because even the iteration process involves only immutable objects, and so everything is repeatable, nothing is destroyed merely by using it. Part-way through a stream you can stash the current node, and come back to it later. It will still represent “the rest of the stream”, even though you already used it once.
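
To make that last point concrete with the hand-built fruits above:

var rest = c.next(); // stash a node part-way through: 'banana' onwards

for (var i = rest; i != null; i = i.next()) {
    console.log(i.value); // banana, apple
}

// rest was not consumed by that loop, so we can replay it:
for (var j = rest; j != null; j = j.next()) {
    console.log(j.value); // banana, apple again
}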

It is this extreme level of generality that persuaded me to try using streams in a real JavaScript library. I want to write a rich text editor for HTML Canvas (more on that in a later post, hopefully). So I would have streams of characters, streams of words, streams of lines, etc. It seemed to fit, and also I have a week off work and it’s fun to re-invent the wheel.

I start with an object representing the empty stream. This is nicer than using null, because I want to provide member functions on streams. If you had to check whether a stream was null before calling methods on it, that would suck mightily.

var empty = {};

function getEmpty() {
    return empty;
}

Then we need a way to make a non-empty stream:

function create(value, next) {
    return Object.create(empty, {
        value: { value: value },
        next: { value: next || getEmpty }
    });
}

It uses the empty stream as its prototype, and adds immutable properties for value and the next function. If no next function is passed, we substitute getEmpty. So calling create('banana') would make a stream of just one item.
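
For example:

var banana = create('banana');

banana.value;            // 'banana'
banana.next() === empty; // true, because no next function was supplied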

One very handy building block is range:

var range = function(start, limit) {
    return start >= limit ? empty : create(start, function() {
        return range(start + 1, limit);
    });
};

Note the pattern, as it is typical: the next function works by calling the outer function with the arguments needed to make it do the next step. And you may be thinking – AHGGHGH! Stack overflow! But no: as long as we loop through the stream using our for-loop pattern, the stack will not get arbitrarily deep.
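
To see it, we can walk a range with the for-loop pattern, using empty as the end marker instead of null:

for (var s = range(0, 5); s !== empty; s = s.next()) {
    console.log(s.value); // 0, 1, 2, 3, 4
}
// each next() call returns before the following one is made,
// so the stack stays shallow no matter how long the stream is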

Here’s a favourite of mine, so often forgotten about:

var unfold = function(seed, increment, terminator) {
    return create(seed, function() {
        var next = increment(seed);
        return next === terminator ? empty :
            unfold(next, increment, terminator);
    });
};

You call it with a seed value, which becomes the first value of the stream, and also an increment function that knows how to get from one value to the next, and a terminator value that would be returned by the increment function when it has no more values. So in fact you could implement range in terms of unfold:

var range = function(start, limit) {
    return unfold(start, function(v) { return v + 1; }, limit);
};

It can also turn a traditional linked list into a stream:

var fromList = function(front) {
    return unfold(front, function(i) { return i.next; }, null);
};

Groovy! Now we have several ways to originate a stream, so let’s add some methods. Recall that empty is the prototype for streams, so:

empty.forEach = function(each) {
    for (var s = this; s !== empty; s = s.next()) {
        each(s.value);
    }
};

Nothing to it! And we can use forEach to get a stream into an array:

empty.toArray = function() {
    var ar = [];
    this.forEach(function(i) { ar.push(i); });
    return ar;
};

Of course, how could we live without the awesome power of map?

empty.map = function(mapFunc) {
    var self = this;
    return self === empty ? empty : create(mapFunc(self.value), function() {
        return self.next().map(mapFunc);
    });
};
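
For example:

range(0, 3).map(function(x) { return x * 10; }).toArray();
// [0, 10, 20]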

Again, that lazy-recursive pattern. And now we can very easily implement converting an array into a stream:

var fromArray = function(ar) {
    return range(0, ar.length).map(function(i) {
        return ar[i];
    });
};

How about concat? Well, this has a slight wrinkle in that if the argument is a function, I treat it as a lazy way to get the second sequence:

empty.concat = function(other) {
    function next(item) {
        return item === empty
            ? (typeof other === 'function' ? other() : other)
            : create(item.value, function() { return next(item.next()); });
    }
    return next(this);
};
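
Both forms in action:

range(0, 3).concat(range(3, 6)).toArray();
// [0, 1, 2, 3, 4, 5]

// or lazily, so the second stream isn't even built until it's reached:
range(0, 3).concat(function() { return range(3, 6); }).toArray();
// [0, 1, 2, 3, 4, 5] again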

And with concat we can easily implement the holy grail of methods, bind (known as SelectMany in Linq and flatMap in Scala):

empty.bind = function(bindFunc) {
    var self = this;
    return self === empty ? empty : bindFunc(self.value).concat(function() {
        return self.next().bind(bindFunc);
    });
};

Think that one through – it’s a mind-bender. The bindFunc returns a sub-stream for each item in the outer stream, and we join them all together. So:

assertEqual(

    // ordinary array of numbers
    [1, 2, 3, 4, 5, 6, 7, 8, 9],

    // making that same array in an interesting way
    Stream.fromArray(
        [[1, 2, 3], [4], [5, 6], [], [7], [], [], [8, 9]]
    ).bind(function(ar) {
        return Stream.fromArray(ar);
    }).toArray()

);

Anyway, I wrote my rich text layout engine using this stream foundation, and (as I like to do with these things) I set up an animation loop and watched it repeatedly carry out the entire word-break and line-wrap process from scratch in every frame, to see what frame rate I could get. Sadly, according to the browsers’ profilers, the runtime was spending a LOT of time creating and throwing away temporary objects, collecting garbage and all the other housekeeping tasks that I’d set for it just so I could use this cool stream concept. Interestingly, in terms of this raw crunching through objects, IE 10 was faster than Chrome 30. But I know that by using a simpler basic abstraction it would be much faster in both browsers.

How do I know? Well, because I found that I could speed up my program very easily by caching the stream of words in an ordinary array. And guess what… I could just use arrays in the first place. I am only scanning forward through streams and I definitely want to cache all my intermediate results. So I may as well just build arrays. (Even though I haven’t started the rewrite yet, I know it will be way faster because of what the profilers told me).

So, for now, we say: farewell, streams, we hardly knew ye.