Rewriting IRewritable
I got married yesterday! This post is about my C# generic programming library Sawmill. Have a read of my earlier post for an introduction.
How things were
I recently made a substantial change to Sawmill’s core IRewritable
interface. Here’s what it used to look like:
interface IRewritable<T> where T : IRewritable<T>
{
<T> GetChildren();
ChildrenSetChildren(Children<T> newChildren);
T RewriteChildren(Func<T, T> transformer);
T }
It’s a pleasingly simple interface with two core operations — GetChildren
returns the current object’s immediate children and SetChildren
replaces them. But there are also some warts.
Wart 1: Children<T>
Children<T>
is a custom struct, not a standard collection type:
struct Children<T>
{
public NumberOfChildren NumberOfChildren { get; }
public T First { get; }
public T Second { get; }
public ImmutableList<T> Many { get; }
}
enum NumberOfChildren
{
, One, Two, Many
None}
I wanted to avoid boxing when the object has a small number of children (which is fairly common in practice). So GetChildren
returns a Children<T>
, passing up to two children on the stack and the rest in an ImmutableList
. The NumberOfChildren
property tells the library how many of the struct’s fields are filled in.
This custom collection type is an extra hurdle to understanding IRewritable
’s API. It also makes certain parts of Sawmill’s implementation more complex — many internal methods have to switch
on the NumberOfChildren
they’re working with and do the same work in four different ways. It’s also relatively large for a struct
(at least 16 bytes and probably more, depending on your processor architecture) so there is a marginal performance cost associated with copying it around.
Wart 2: Collections
In fact, having GetChildren
return a collection at all is problematic, because it forces implementations to allocate memory. If your object has three or more children (and they’re not already stored in an ImmutableList
) then you have to create a new ImmutableList
whenever GetChildren
is called:
class ThreeChildren : IRewritable<ThreeChildren>
{
private readonly ThreeChildren _child1;
private readonly ThreeChildren _child2;
private readonly ThreeChildren _child3;
public Children<ThreeChildren> GetChildren()
=> new Children<ThreeChildren>(
.Many,
Children.Create(_child1, _child2, _child3)
ImmutableList);
public ThreeChildren SetChildren(Children<ThreeChildren> newChildren)
=> new ThreeChildren(
.Many[0],
newChildren.Many[1],
newChildren.Many[2]
newChildren);
}
When Sawmill traverses a tree it typically calls GetChildren()
for every node in the tree. That’s a lot of throwaway ImmutableList
s!
One way to avoid creating all this garbage might be to redesign IRewritable
to be buffer-oriented.
interface IRewritable<T> where T : IRewritable<T>
{
int CountChildren();
void GetChildren(T[] buffer);
SetChildren(T[] newChildren);
T }
With this design, Sawmill passes an array into GetChildren
(after calling CountChildren
to find out how big the array needs to be) and asks the object to copy its children into the array. Implementations of IRewritable
no longer have to allocate memory for their return value. The memory is allocated (and hopefully reused) by the library. But this API is less safe — if an IRewritable
implementation stores a reference to the buffer then it could get unexpectedly mutated.
class Bad : IRewritable<Bad>
{
private readonly Bad[] _children;
public int CountChildren() => _children.Length;
public void GetChildren(Bad[] buffer) => _children.CopyTo(buffer);
public Bad SetChildren(Bad[] newChildren) => new Bad(newChildren); // BUG
}
This is not a “pit of success” API — SetChildren
looks sensible but could go wrong if anyone else has a reference to the array. Sawmill would have to allocate a new array for each GetChildren
/SetChildren
call in order to be safe. So this design would end up allocating just as much as the ImmutableList
version.
(Keep the buffer idea in your head, though, because we’ll be coming back to it in a minute.)
Wart 3: RewriteChildren
RewriteChildren
, which applies a transformation function to the object’s immediate children (in other words, a one-level Rewrite
), has a sensible implementation in terms of the other two methods: get the children, transform them, and put them back.
public static T DefaultRewriteChildren<T>(this T value, Func<T, T> transformer)
: IRewritable<T>
where T {
var children = value.GetChildren();
var newChildren = children.Select(transformer);
return value.SetChildren(newChildren);
}
(The real DefaultRewriteChildren
was a bit more complex than this, because it tried to avoid calling SetChildren
if transformer
didn’t actually change anything.)
In fact, I expect that most IRewritable
s would just delegate RewriteChildren
to DefaultRewriteChildren
. So why did I put RewriteChildren
on the interface? It’s because of Wart 2 — GetChildren
can be expensive because it might have to create an ImmutableList
. (And applying transformer
to that ImmutableList
using eg ConvertAll
would have to create another one.) So DefaultRewriteChildren
is slow; the idea of RewriteChildren
was for objects to transform their children directly, without going via an ImmutableList
.
class ThreeChildren : IRewritable<ThreeChildren>
{
// ...
public ThreeChildren RewriteChildren(Func<ThreeChildren, ThreeChildren> f)
=> new ThreeChildren(f(_child1), f(_child2), f(_child3)); // no ImmutableList
}
The extra method makes IRewritable
harder to understand and harder to implement. Implementing RewriteChildren
(without using DefaultRewriteChildren
) can be tricky; if the transformer function doesn’t change anything you should avoid recreating the object.
The new way
I’m pleased to say that IRewritable
’s new design addresses all three of these warts. Remember the “buffer” idea from earlier?
interface IRewritable<T> where T : IRewritable<T>
{
int CountChildren();
void GetChildren(Span<T> buffer);
SetChildren(ReadOnlySpan<T> newChildren);
T }
This design is almost identical to the one I outlined earlier — it uses Span
s instead of arrays. (For the uninitiated, a Span
is basically a “slice” of an array. It can also be backed by unmanaged memory, though, making it more flexible than ArraySegment
.)
CountChildren
tells Sawmill how much space is needed to copy the children.GetChildren
copies the current object’s immediate children intobuffer
.SetChildren
creates a copy of the current object with its immediate children replaced.
Here’s the important difference. Unlike an array, a Span
can’t be stored on the heap. Span
is defined using the ref
keyword, which tells the compiler to check that Span
s are confined to the stack. If you pass a Span
into a method as an argument, you can be confident that the Span
won’t leave the scope of that method’s stack frame (just like ref
).
So Span
thoroughly solves the safety issue with the array-oriented API. I don’t have to worry about mutating a Span
which got stored inside a client object, because the Span
can’t be stored! This allows Sawmill to safely reuse the memory behind a Span
, rather than allocating a new array for each GetChildren
/SetChildren
call.
So Sawmill’s methods like RewriteChildren
can be implemented without allocating memory. In this example I’m using ArrayPool
to avoid creating a new array for every RewriteChildren
call.
public static T RewriteChildren<T>(this T value, Func<T, T> transformer)
: IRewritable<T>
where T {
var count = value.CountChildren();
var array = ArrayPool<T>.Shared.Rent(count);
// ArrayPool can return arrays bigger than you asked for
var span = array.AsSpan().Slice(count);
.GetChildren(span);
value
for (var i = 0; i < span.Length; i++)
{
[i] = transformer(span[i]);
span}
var result = value.SetChildren(newChildren);
<T>.Shared.Return(array);
ArrayPoolreturn result;
}
The main reason for having RewriteChildren
be a method on the interface was to avoid allocating memory (for the GetChildren
calls). So we don’t need it on the interface any more — this extension method serves as a single universal implementation. Likewise, Children<T>
’s purpose was also to avoid allocating memory, so we can do away with it too.
Relieving ArrayPool
pressure
There’s one operational problem with this implementation: it can end up renting a large number of arrays from the pool. Rewrite
, which applies a transformation function to every node in a tree (not just one layer), is implemented something like this:
public static T Rewrite(this T value, Func<T, T> transformer) where T : IRewritable<T>
=> transformer(t.RewriteChildren(child => child.Rewrite(transformer)));
The lambda which is passed to RewriteChildren
contains a recursive call to Rewrite
. Let’s think through the operational behaviour of Rewrite
:
Rewrite
callsRewriteChildren
RewriteChildren
rents an array from the array pool and callsGetChildren
RewriteChildren
calls thechild => child.Rewrite(transformer)
lambda function for each child- The lambda function recursively calls
Rewrite
; steps 1-3 are repeated until you encounter a node with no children RewriteChildren
callsSetChildren
and returns its array to the array poolRewriteChildren
returns and step 5 is repeated as you return up the call stack
Since steps 1-3 are repeated before step 5 happens, you can end up renting many arrays (a number equal to the height of the tree) before returning any of them to the pool. So the array pool could run out of arrays!
To fix this problem, we want to rent a small number of large arrays from the array pool, rather than a large number of small ones. We can lean on the fact that each array only lives as long as a single method — the array is Rent
ed at the start of RewriteChildren
and then Return
ed at the end. The memory usage is stack-shaped.
So here’s the plan. We’re going to rent a large array from the pool at Rewrite
’s beginning, and RewriteChildren
will take a chunk from that array each time it’s called. Each chunk will be freed up before any previously-allocated chunks are freed.
public static T Rewrite<T>(this T value, Func<T, T> transformer)
: IRewritable<T>
where T {
using (var chunks = new ChunkStack<T>())
{
Go(T x)
T => transformer(t.RewriteChildrenInternal(Go, chunks));
return Go(value);
}
}
public static T RewriteChildren<T>(this T value, Func<T, T> transformer)
: IRewritable<T>
where T {
using (var chunks = new ChunkStack<T>())
{
return RewriteChildrenInternal(t, transformer, chunks);
}
}
private static T RewriteChildrenInternal<T>(
this T value,
<T, T> transformer,
Func<T> chunks
ChunkStack) where T : IRewritable<T>
{
var count = value.CountChildren();
var span = chunks.Allocate(count);
.GetChildren(span);
value
for (var i = 0; i < span.Length; i++)
{
[i] = transformer(span[i]);
span}
var result = value.SetChildren(newChildren);
.Free(span);
chunksreturn result;
}
ChunkStack
contains an array and a count of how much of that array is in use. Allocate
and Free
increase and decrease that count.
class ChunkStack<T> : IDisposable
{
private T[] _array = ArrayPool<T>.Shared.Rent(512);
private int _used = 0;
public Span<T> Allocate(int count)
{
var span = _array.Slice(_used, count);
+= count;
_used return span;
}
public void Free(Span<T> span)
{
-= span.Length;
_used }
public void Dispose()
{
if (_array != null)
{
<T>.Shared.Return(_array);
ArrayPool= null;
_array }
}
}
I’ve glossed over an important part of ChunkStack
’s implementation: what happens when the array fills up? The real implementation manages a collection of “regions”, taking a new region from the array pool when existing regions fill up.
In the pathological case of a very large tree, this version of Rewrite
can still exhaust the array pool, but it’ll happen much less quickly.
Hackalloc
I mentioned earlier that a Span
is not necessarily backed by an array. Span
represents a contiguous block of memory, with no assumptions about where that memory is or how it’s managed. A Span
could be a slice of an array, or it could be a chunk of unmanaged memory, or it could be an area of the stack. Under the hood it’s just a pointer; Span
doesn’t care exactly where the pointer points.
In fact, C# has built in support for that last case. The stackalloc
keyword works like C’s alloca
: it carves out a chunk of memory directly in the current stack frame, which becomes invalid when the current method returns. Until recently, stackalloc
was only available in an unsafe
context, but today it’s available in safe code thanks to Span
.
public void StackallocExample()
{
// allocate space for three ints in the current stack frame
<int> myInts = stackalloc int[3];
Span[0] = 123;
myInts[1] = 456;
myInts[2] = myInts[0] + myInts[1];
myInts.WriteLine(myInts[2]); // prints 579
Console}
I want to use stackalloc
to avoid taking memory from the ChunkStack
when an object has a small number (say, 4) of children. This is fairly common in practice. Stack memory tends to be marginally faster than heap memory because it’s more likely to be in the processor cache, so this may have a modest performance benefit as well as relieving pressure on the array pool. If there are more than 4 children we can just fall back on the ChunkStack
.
Here’s the new RewriteChildrenInternal
:
private static T RewriteChildrenInternal<T>(
this T value,
<T, T> transformer,
Func<T> chunks
ChunkStack) where T : IRewritable<T>
{
var count = value.CountChildren();
<T> span = stackalloc T[4];
Spanif (count > 4)
{
= chunks.Allocate(count);
span }
.GetChildren(span);
value
for (var i = 0; i < span.Length; i++)
{
[i] = transformer(span[i]);
span}
var result = value.SetChildren(newChildren);
if (count > 4)
{
.Free(span);
chunks}
return result;
}
Sadly this doesn’t work. The compiler complains about the stackalloc T
line: “Cannot take the address of, get the size of, or declare a pointer to a managed type (‘T’)”. Basically, the CLR doesn’t support stackalloc
with reference types — you can only use stackalloc
with primitives or structs containing primitives. (A type parameter T
might be a reference type, so you still can’t use it with stackalloc
.) Under the hood, stackalloc
is untyped; the garbage collector doesn’t know how to follow pointers that are stored in stackalloc
ed memory because it doesn’t even know there are pointers there.
I still think the idea’s a good one, though. Can we unsafely hack it up?
I’m going to use the following struct
as a “poor man’s stackalloc[4]
”:
struct Four<T>
{
public T First;
public T Second;
public T Third;
public T Fourth;
}
A variable of type Four<T>
has enough room for four T
s — so when the variable is a local variable (in an ordinary method) it’s functionally equivalent to a stackalloc T[4]
. We won’t be using the First
, Second
, Third
and Fourth
properties directly — we’ll be (unsafely) addressing them relative to the start of the struct. In this example I’m using System.Runtime.CompilerServices.Unsafe
to address Third
by looking 2 elements beyond First
:
var four = new Four<T>();
ref T third = ref Unsafe.Add(ref four.First, 2);
.True(Unsafe.AreSame(ref four.Third, ref third)); Assert
The plan is to create a Span
whose pointer refers to the start of a Four<T>
on the stack. span[0]
will address four.First
, span[1]
will address Second
, and so on. My first idea to implement this was to use System.Runtime.CompilerServices.Unsafe
to coerce a ref Four<T>
to an unmanaged pointer, and then put that in a Span
:
unsafe
{
var four = new Four<T>();
void* ptr = Unsafe.AsPointer(ref four);
var span = new Span<T>(ptr, 4);
}
Sadly the Span
constructor throws an exception when T
is a reference type. At this point I went for a poke around in the .NET source code. I wanted to know how array.AsSpan()
works. I found an internal constructor which takes a ref T
. We can illictly call that constructor using reflection, although of course we want to avoid the performance costs of reflection. So the actual plan is to use runtime code generation to call the internal Span
constructor. Ordinarily I’d use Expression
to do this runtime code generation, but Expression
doesn’t support ref
parameters, so we have to write the IL by hand.
private static class SpanFactory<T>
{
private delegate Span<T> SpanCtor(ref T value, int length);
private static readonly SpanCtor _spanCtor;
static SpanFactory()
{
var ctor = typeof(Span<T>)
.GetConstructors(BindingFlags.NonPublic | BindingFlags.Instance)
.Single(c =>
.GetParameters().Length == 2
c&& c.GetParameters()[0].ParameterType.IsByRef
);
var method = new DynamicMethod(
"",
typeof(Span<T>),
new[] { typeof(T).MakeByRefType(), typeof(int) }
);
var il = method.GetILGenerator();
.Emit(OpCodes.Ldarg_0);
il.Emit(OpCodes.Ldarg_1);
il.Emit(OpCodes.Newobj, ctor);
il.Emit(OpCodes.Ret);
il
= (SpanCtor)method.CreateDelegate(typeof(SpanCtor));
_spanCtor }
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Span<T> Create(ref T value, int length)
=> _spanCtor(ref value, length);
}
Obviously relying on BCL internals like this is risky. The internal constructor could be removed, or changed to work differently, in which case my code could stop working or even segfault. That said, I think the likelihood of the internal constructor changing is quite low in this case.
Update
On .NET Core, there is an officially-supported API to do this:
MemoryMarshal.CreateSpan
. I didn’t know it existed at the time that I wrote this.
There are also risks associated with mixing pointers and references like this. You have to be very careful that the Span
doesn’t live longer than the Four
it points to. That means the Four
has to be discarded at the end of the method along with the Span
, and it has to be stored in a “real” local variable, not in temporary storage on the evaluation stack. I’ll address that by mentioning the variable (as a parameter to a non-inlined “keep-alive” method) at the end of the method.
You also need to be certain that the Four
is stored on the stack and not the heap. Data stored on the heap is liable to get moved by the garbage collector, which would invalidate the pointer inside the Span
. Beware that local variables are not always safe from being moved! Methods containing await
s, yield
s, and lambdas are liable to store their local variables on the heap, so if RewriteChildrenInternal
were not an ordinary method this hack would not be safe.
Here’s the final implementation of RewriteChildrenInternal
. var four = new Four<T>();
allocates space for four T
s in RewriteChildrenInternal
’s stack frame. Then, when I call GetSpan
, I’m passing in the address of the start of that Four
using the ref
keyword. GetSpan
returns a Span
which either points at the start of four
or at a chunk taken from the ChunkStack
, depending on how many children we need to store. ReleaseSpan
returns the Span
to the ChunkStack
if it came from there, and the KeepAlive
call ensures the four
isn’t deallocated too early.
private static T RewriteChildrenInternal<T>(
this T value,
<T, T> transformer,
Func<T> chunks
ChunkStack) where T : IRewritable<T>
{
var count = value.CountChildren();
var four = new Four<T>();
var span = GetSpan(count, chunks, ref four);
.GetChildren(span);
value
for (var i = 0; i < span.Length; i++)
{
[i] = transformer(span[i]);
span}
var result = value.SetChildren(newChildren);
ReleaseSpan(span, chunks);
KeepAlive(ref four);
return result;
}
private static Span<T> GetSpan<T>(int count, ChunkStack<T> chunks, ref Four<T> four)
{
if (count == 0)
{
return new Span<T>();
}
else if (count <= 4)
{
return SpanFactory<T>.Create(ref four.First, count);
}
else
{
return chunks.Allocate(count);
}
}
private static void ReleaseSpan<T>(Span<T> span, ChunkStack<T> chunks)
{
if (span.Length > 4)
{
.Free(span);
chunks}
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static void KeepAlive<T>(ref Four<T> four)
{
}
As far as I know, the designers of Span
were thinking primarily about applications such as serialisation and parsing — the sort of low-level code you’d find in a high performance web server. But Span
also really shines in this high-level library of recursion patterns. Its guarantees about storage proved crucial to the safety of my IRewritable
abstraction, but I’m also leaning on its flexibility to implement that abstraction as efficiently as possible.
Sawmill version 3.0 is now available on Nuget, and you can read all of this code in the GitHub repo.