Tellumar Kampiva: Why I Won't Use core.async In My Next Project

Why I Won't Use core.async in Production Code Again

In late 2015, I finally jumped on the bandwagon and started using core.async. It had been hyped for a long time, and I had watched several talks and tutorials that told a story which made total sense to me.

In this article, I will describe some of the pitfalls that await new users of core.async. It certainly feels like I fell into all of them.

This article was written in April 2017, after some 18 months of working with core.async.

The Project

The tool, for which I already had a prototype up and running, was a little server that waited for incoming HTTP requests. When one came in, it would talk to a few other servers, gather some information and finally POST a result back to the initiator of the request. All servers had nice JSON HTTP APIs and just needed some glue to connect them.

Having a queue in this seemed like a good idea. It should drop old requests when new ones come in. No back-pressure here.
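A minimal sketch of such a queue, assuming org.clojure/core.async is on the classpath. The buffer size of 2 and the request names are made up for illustration; the point is only that a channel backed by a sliding buffer never blocks the producer and silently evicts the oldest item when full.

```clojure
(require '[clojure.core.async :as a])

;; A channel backed by a sliding buffer: puts always succeed immediately,
;; and the oldest buffered item is dropped once the buffer is full.
(def requests (a/chan (a/sliding-buffer 2)))

;; Three puts into a buffer of size 2: none of them blocks.
(doseq [r [:r1 :r2 :r3]]
  (a/>!! requests r))

;; Close the channel and collect whatever is still buffered.
(a/close! requests)
(def remaining (a/<!! (a/into [] requests)))
;; remaining is [:r2 :r3]; :r1 was dropped without any notification
```

Note that nothing tells the producer or the consumer that :r1 was lost, which matters later when the queue needs to be monitored.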

Enter core.async

As already said, there was a working prototype. Adding a queue was the next step. Having read about core.async a lot before, plus all the talks, it wasn't really too hard to add it.

In fact, it took only an hour or two, touching two or three namespaces, to change things over to channels and go-blocks, and it almost worked on the first try. No incremental compilation in between.

I was pretty impressed by how approachable core.async turned out to be.

And I also liked how separate parts of the program were now just sitting between two channels and doing their thing. It felt like it would be easy to extend and reason about once it grew into something bigger.

Exceptions

Once on a staging system, the program bumped into the normal problems. Network hiccups here, servers temporarily shut down there, and the inevitable encoding errors.

The real problem was that I did not see any of this, neither in the logs nor on STDOUT. The same was true during development.

The reason for this is that go-loops will silently discard your exceptions and just exit the loop, basically turning your program off.

I totally get the reasoning behind this, and I certainly do not know any better way of doing it. I read many blog posts and articles on this with Stuart Sierra's Clojure Do's: Uncaught Exceptions probably being the most prominent one.

The gist is:

Catch all exceptions in all of your go-blocks and core.async/threads.
Create a default handler for uncaught exceptions.
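The two rules above can be sketched roughly as follows, assuming core.async is on the classpath. The names start-worker, handle and on-error are hypothetical, and printing to stderr stands in for whatever logging library is in use.

```clojure
(require '[clojure.core.async :as a])

;; Rule 2: a last line of defence for anything that escapes a thread.
(Thread/setDefaultUncaughtExceptionHandler
  (reify Thread$UncaughtExceptionHandler
    (uncaughtException [_ thread ex]
      (binding [*out* *err*]
        (println "Uncaught exception on" (.getName thread) ":" (.getMessage ex))))))

;; Rule 1: catch inside the loop so that one bad message cannot end it.
;; `handle` and `on-error` are hypothetical callbacks supplied by the caller.
(defn start-worker [in handle on-error]
  (a/go-loop []
    (when-some [msg (a/<! in)]
      (try
        (handle msg)
        (catch Throwable t
          (on-error t msg)))
      (recur))))

;; Usage: the worker keeps going after :bad throws.
(def handled (atom []))
(def failed (atom []))
(def in (a/chan 10))
(def worker
  (start-worker in
                (fn [msg]
                  (if (= msg :bad)
                    (throw (ex-info "boom" {:msg msg}))
                    (swap! handled conj msg)))
                (fn [_ msg] (swap! failed conj msg))))

(doseq [m [:a :bad :b]]
  (a/>!! in m))
(a/close! in)
(a/<!! worker) ; the go-loop's channel closes once `in` is drained
```

The try sits inside the loop body and the recur outside the try, so a failure is reported per message and the loop itself survives.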

It took me way too long to get to a state where I (apparently) had all areas covered.

When To Create Channels

I ended up writing two alternative approaches to setting up channels in my application.

The first one created a network of channels at program start and then just piped stuff through the channels. It was easy to set up, even if a bit awkward because I had to make sure that different parts of the software knew about the channels of other parts that they wanted to communicate with.
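A rough sketch of what such a startup-time network can look like, assuming core.async is on the classpath. This is not the original code: stage, start-system, gather and summarize are hypothetical names, and the buffer sizes are arbitrary.

```clojure
(require '[clojure.core.async :as a])

;; One processing step sitting between two channels.
;; Deliberately no try/catch: if `f` throws, this loop ends without closing
;; `out`, and everything downstream simply stops receiving values.
(defn stage [f in out]
  (a/go-loop []
    (if-some [v (a/<! in)]
      (do (a/>! out (f v))
          (recur))
      (a/close! out))))

;; All channels are created once, at program start, and wired together.
;; `gather` and `summarize` are hypothetical processing functions.
(defn start-system [gather summarize]
  (let [requests (a/chan (a/sliding-buffer 10))
        gathered (a/chan 10)
        results  (a/chan 10)]
    (stage gather requests gathered)
    (stage summarize gathered results)
    {:requests requests :results results}))

;; Usage with two trivial stand-in functions.
(def system (start-system #(assoc % :info "gathered")
                          #(assoc % :done true)))
(a/>!! (:requests system) {:id 1})
(def result (a/<!! (:results system)))
```

Every part that wants to talk to another part needs a reference to the right channel, which is why the map returned by start-system tends to grow as the program does.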

Again, probably reasonable, it just felt weird to me. Also, this was not very stable, as it relied heavily on all channels and go-loops being very robust.

That's why I tried an alternative approach which would set up all the necessary channels only when a request came in. This relied less on my catching every possible exception.

When searching and asking around, I could find no authoritative information on which was the better approach. That made me less comfortable with what I had done, because maybe I was just programming against the concepts and ideas of core.async without knowing it.
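One way the per-request idea can be sketched, assuming core.async is on the classpath. The function handle-request and its process argument are hypothetical; the sketch uses a/thread because the work consists of blocking HTTP calls, and a/thread returns a fresh channel that receives the body's result.

```clojure
(require '[clojure.core.async :as a])

;; Each request gets its own short-lived channel (the one a/thread returns).
;; A failure is turned into a value, so it can only affect this one request.
;; `process` is a hypothetical function doing the actual work.
(defn handle-request [process req]
  (a/thread
    (try
      {:ok (process req)}
      (catch Exception e
        {:error (.getMessage e)}))))

;; Usage: one successful and one failing request.
(def good (a/<!! (handle-request inc 1)))
(def bad  (a/<!! (handle-request (fn [_] (throw (ex-info "boom" {}))) 1)))
```

Nothing long-lived can die here, but the trade-off is that there is no longer a standing network of channels to reason about.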

I ended up removing all channels but one and this one I set up at the beginning of the program. It only required one exception handler, so I felt comfortable with it. But all the neat flow of data through my program logic was now gone.

Kind of an anti-climax.

I had now spent quite some time on this and only had one channel remaining.

Internal Information

Now, the remaining central channel was the program's big queue. It was important. Whenever the number of items in the queue grew, it was a hint that some of the other servers the program talked to were too slow.

The queue should be monitored. And more importantly, I wanted to get a big warning log message should there ever be an item dropped.

But the internals of the buffers inside the channels are considered just that: internal.

My first implementation tried to accept it and did some counting and logging on both ends of the channel. But that was neither elegant nor robust.

In the end, I wrote my own buffer implementation inspired by SlidingBuffer but with more bells and whistles. Oh, and I access the internals all the time, just to count the items in the queue.
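A minimal sketch of that kind of buffer, not the original implementation. It assumes a core.async version whose internal Buffer protocol has the methods full?, remove!, add!* and close-buf!, and it depends on the clojure.core.async.impl.protocols namespace, which is internal and may change between releases. CountingSlidingBuffer and on-drop are hypothetical names.

```clojure
(require '[clojure.core.async :as a]
         '[clojure.core.async.impl.protocols :as impl])
(import 'java.util.LinkedList)

;; Modelled on core.async's SlidingBuffer, plus a callback for dropped items.
;; `on-drop` runs while the channel holds its lock, so it should be cheap.
(deftype CountingSlidingBuffer [^LinkedList buf ^long n on-drop]
  impl/UnblockingBuffer            ; marker: puts never block on this buffer
  impl/Buffer
  (full? [_] false)
  (remove! [_] (.removeLast buf))
  (add!* [this item]
    (when (= (.size buf) n)
      (on-drop (.removeLast buf))) ; evict the oldest item and report it
    (.addFirst buf item)
    this)
  (close-buf! [_] nil)
  clojure.lang.Counted
  (count [_] (.size buf)))         ; unsynchronized read: fine for monitoring only

(defn counting-sliding-buffer [n on-drop]
  (CountingSlidingBuffer. (LinkedList.) n on-drop))

;; Usage: keep a reference to the buffer so it can be inspected later.
(def dropped (atom []))
(def buf (counting-sliding-buffer 2 #(swap! dropped conj %)))
(def queue (a/chan buf))

(doseq [r [:r1 :r2 :r3]]
  (a/>!! queue r))
```

After the three puts, (count buf) gives the current queue length and @dropped holds the evicted item, which is where a warning log message could be emitted.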

Maybe that is the right way to go, maybe not. To me, it felt like the only thing left I got easily from core.async, the queue, was now my own implementation. I could have done that with a vector of futures in the first place.

Testing

I lost two full days until I figured out why the fake HTTP routes I set up with clj-http-fake did not work.

Of course, that does not mean core.async is bad; it just shows how stupid I can be at times. Since the library was new to me, I searched in that area first. Only when I started to trace the namespace of that library did I find out that the routes did not make it into my threads. That finally got me going in the right direction. Tim Baldridge kindly pointed me to this blog post discussing the intricacies of dynamic Var binding in Clojure. The solution is simple: use with-fake-routes instead of with-global-fake-routes, even though the latter is advertised as being the one for a multi-threaded environment.

Another reason it took so long was that I started writing those tests when my program was in a state where I felt I needed them: I had changed too much at once to be sure of myself anymore. Yeah, classic, I know. Programmer hubris.

But all my stupidity aside, there were some subtle things at work that took me a very long time to figure out. And to a certain degree they were caused by core.async.
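The general mechanism behind this kind of surprise can be shown without clj-http-fake at all. The sketch below assumes core.async is on the classpath and uses a hypothetical dynamic var, *routes*; it does not reproduce the original bug, it only shows which kinds of threads see a thread-local binding.

```clojure
(require '[clojure.core.async :as a])

;; *routes* is a hypothetical stand-in for a library's dynamic var.
(def ^:dynamic *routes* :real)

;; A plain Java thread does not inherit Clojure's thread-local bindings.
(defn seen-on-raw-thread []
  (let [p (promise)]
    (doto (Thread. (fn [] (deliver p *routes*)))
      (.start)
      (.join))
    @p))

;; What each kind of thread sees while *routes* is rebound to :fake.
(def results
  (binding [*routes* :fake]
    {:same-thread  *routes*
     :future       @(future *routes*)           ; future conveys bindings
     :async-thread (a/<!! (a/thread *routes*))  ; so does core.async's thread
     :raw-thread   (seen-on-raw-thread)}))      ; sees the root value
```

Only the raw thread falls back to the root value, so whether a fake "makes it into" a thread depends on both how the fake is installed and how the thread was started.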

In the next few days I will have to write some tests in a way that they gather the information created by happy functions in cheerful go-loops on some joyful threads somewhere, sometime in the JVM. I do not expect too much fun ahead.
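One common building block for such tests is a blocking take with a timeout, so that a go-loop that died silently fails the test instead of hanging it. A minimal sketch, assuming core.async is on the classpath; take-or-timeout and the :timed-out marker are hypothetical names.

```clojure
(require '[clojure.core.async :as a])

;; Wait for a value on `ch`, but give up after `ms` milliseconds.
;; Returns :timed-out on timeout; a closed channel still yields nil.
(defn take-or-timeout [ch ms]
  (let [[v port] (a/alts!! [ch (a/timeout ms)])]
    (if (= port ch) v :timed-out)))

;; Usage: one channel that gets a value, one that never does.
(def results (a/chan 1))
(a/go (a/>! results {:status 200}))

(def observed (take-or-timeout results 1000))
(def silent   (take-or-timeout (a/chan) 50))
```

The timeout values are arbitrary; in a real test suite they are a trade-off between flaky tests and slow failures.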

Now What?

It is possible that this program is just not a good match for core.async.

I think I could have written the queue quite easily with a few managed futures or something similar. Some of the problems simply stem from the concurrent nature of the program, yes, I know. But it would have meant that I mostly knew the code I had written. I think it would have made my debugging easier. Maybe I am wrong.

My gut feeling is that I had to know just too much about too many aspects of Clojure to be able to use core.async successfully. As easy as it was to get it up and running, it was just as hard to get it to run robustly.
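To give an idea of how small the queue part alone can be without core.async, here is a sketch built on an atom holding a persistent queue. It covers only the bounded, drop-oldest queue, not the workers that would consume it. It assumes Clojure 1.9 or later for swap-vals!, items must not be nil, and enqueue! and dequeue! are hypothetical names.

```clojure
;; The whole queue is one immutable value inside an atom, so its length
;; is always visible with (count @queue).
(def queue (atom clojure.lang.PersistentQueue/EMPTY))

;; Adds `item`; if the queue already held `limit` items, the oldest one is
;; removed and returned so the caller can log it. Returns nil otherwise.
(defn enqueue! [q limit item]
  (let [[before _] (swap-vals! q
                               (fn [items]
                                 (let [items' (conj items item)]
                                   (if (> (count items') limit)
                                     (pop items')
                                     items'))))]
    (when (= (count before) limit)
      (peek before))))

;; Removes and returns the oldest item, or nil if the queue is empty.
(defn dequeue! [q]
  (let [[before _] (swap-vals! q pop)]
    (peek before)))

;; Usage with a limit of 2.
(def drops (mapv #(enqueue! queue 2 %) [:r1 :r2 :r3]))
```

Here the dropped item comes back as a return value, so the warning log message needs no access to any internals.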

And maybe in my next project I'll think that I know all about it now and use it again. Who knows.