Memoization and CAFs
Memoization is a technique, central to dynamic programming, in which intermediate results are saved and later reused. Many string and graph algorithms make use of it. Calculating the Fibonacci sequence, instances of the knapsack problem, and many bioinformatics algorithms are classic problems that are solved efficiently with dynamic programming. A classic example in Haskell is the algorithm for the nth Fibonacci number, of which one variant is the following:
-- file: fib.hs

fib_mem :: Int -> Integer
fib_mem = (map fib [0..] !!)
  where
    fib 0 = 1
    fib 1 = 1
    fib n = fib_mem (n-2) + fib_mem (n-1)
Try it with a reasonable input size (10000) to confirm it does memoize the intermediate numbers. Lookup time grows linearly with the index, though, so a linked list is not a very appropriate data structure here. But let's ignore that for the time being and focus on what actually enables the values of this function to be memoized.
Looking at the top level, fib_mem looks like a normal function that takes input, does a computation, returns a result, and forgets everything it did with regard to its internal state. But in reality, fib_mem will memoize the results of all inputs it will ever be called with during its lifetime. So if fib_mem is defined at the top level, the results will persist in memory over the lifetime of the program itself!
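One way to watch the memoization happen is to instrument the function with Debug.Trace. The following is a minimal sketch and not part of fib.hs; the name fib_mem_traced is hypothetical, and the trace messages go to stderr in an order that depends on evaluation order.

```haskell
import Debug.Trace (trace)

-- Same shape as fib_mem, but every element computed by the recursive
-- case announces itself the first time it is evaluated.
fib_mem_traced :: Int -> Integer
fib_mem_traced = (map fib [0..] !!)
  where
    fib :: Int -> Integer
    fib 0 = 1
    fib 1 = 1
    fib n = trace ("computing fib " ++ show n) $
            fib_mem_traced (n-2) + fib_mem_traced (n-1)
```

Loaded into GHCi, the first evaluation of fib_mem_traced 5 should print one message for each index from 2 to 5, while a second evaluation should print nothing, because the list elements have already been evaluated.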
The short story of why memoization takes place in fib_mem is that in Haskell functions exist at the same level as normal values such as integers and characters; that is, they are all values. Because fib_mem is defined without binding its parameter (it is written point-free), the parameter does not occur in the function body, and the body, including the list map fib [0..], can be reduced irrespective of the parameter value. Compare fib_mem to this fib_mem_arg:
fib_mem_arg :: Int -> Integer
fib_mem_arg x = map fib [0..] !! x
  where
    fib 0 = 1
    fib 1 = 1
    fib n = fib_mem_arg (n-2) + fib_mem_arg (n-1)
Running fib_mem_arg with anything but very small arguments, one can confirm it does no memoization. Even though we can see that map fib [0..] does not depend on the argument and could be memoized, it will not be, because applying a function to an argument creates a new expression that cannot implicitly have pointers to expressions from previous function applications. The same holds for lambda abstractions, so this fib_mem_lambda is similarly stateless:
fib_mem_lambda :: Int -> Integer
fib_mem_lambda = \x -> map fib [0..] !! x
  where
    fib 0 = 1
    fib 1 = 1
    fib n = fib_mem_lambda (n-2) + fib_mem_lambda (n-1)
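The difference becomes easier to see when the list is given a name. The sketch below is an illustration only: the names fib_shared, fib_unshared, and table are hypothetical, and the behavior described in the comments assumes compilation without optimizations.

```haskell
-- The table is bound outside the lambda, so it is created once and
-- shared by every application of fib_shared.
fib_shared :: Int -> Integer
fib_shared = let table = map fib [0..] in \x -> table !! x
  where
    fib :: Int -> Integer
    fib 0 = 1
    fib 1 = 1
    fib n = fib_shared (n-2) + fib_shared (n-1)

-- The table is bound inside the lambda, so every application builds
-- a fresh table and nothing is reused between calls.
fib_unshared :: Int -> Integer
fib_unshared = \x -> let table = map fib [0..] in table !! x
  where
    fib :: Int -> Integer
    fib 0 = 1
    fib 1 = 1
    fib n = fib_unshared (n-2) + fib_unshared (n-1)
```

The two definitions compute the same values; only the position of the let relative to the lambda differs, and that position decides whether the list survives between calls.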
With optimizations, both fib_mem_arg and fib_mem_lambda will get rewritten into a form similar to fib_mem. So in simple cases, the compiler will conveniently fix our mistakes, but sometimes it is necessary to reorder complex computations so that different parts are memoized correctly.
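As a hedged illustration of such reordering, consider a two-argument function whose lookup table depends only on the first argument. The functions power_naive, power_shared, and powers_of_two are hypothetical examples, not taken from fib.hs, and the comments describe what one would expect without optimizations.

```haskell
-- Both parameters are bound on the left-hand side, so the table is local
-- to each full application and is rebuilt on every call.
power_naive :: Integer -> Int -> Integer
power_naive base k = table !! k
  where table = iterate (* base) 1

-- Only base is bound before the table; the lambda over k closes over it.
-- Calls made through one partial application share a single table.
power_shared :: Integer -> Int -> Integer
power_shared base = \k -> table !! k
  where table = iterate (* base) 1

-- Naming the partial application keeps its table alive between calls.
powers_of_two :: Int -> Integer
powers_of_two = power_shared 2
```

Repeated calls to powers_of_two reuse the already computed prefix of the table, whereas every call to power_naive 2 k starts from scratch.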
Tip
Be wary of memoization and compiler optimizations. GHC performs aggressive inlining (explained in the section Inlining and stream fusion) as a routine optimization, so it's very likely that values (and functions) get recalculated more often than was intended.
Constant applicative form
The formal difference between fib_mem and the others is that fib_mem is something called a constant applicative form, or CAF for short. The compact definition of a CAF is as follows: a supercombinator that is not a lambda abstraction. We already covered the not-a-lambda-abstraction part, but what is a supercombinator?
A supercombinator is either a constant, say 1.5 or ['a'..'z'], or a combinator (a lambda abstraction with no free variables) whose subexpressions are supercombinators. These are all supercombinators:
\n -> 1 + n
\f n -> f 1 n
\f -> f 1 . (\g n -> g 2 n)
But this one is not a supercombinator:
\f g -> f 1 . (\n -> g 2 n)
This is because g occurs free in the inner lambda abstraction \n -> g 2 n: it is bound by the outer lambda, so the inner abstraction is not itself a combinator.
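A non-supercombinator can usually be turned into one by lambda lifting, that is, by passing the offending free variable as an extra parameter. The sketch below applies this to the expression above; the names not_super and lifted, and the Int-specialised type signatures, are assumptions added so that the definitions compile on their own.

```haskell
-- The inner lambda mentions g, which is bound by the outer lambda,
-- so the inner lambda has a free variable.
not_super :: (Int -> b -> c) -> (Int -> a -> b) -> a -> c
not_super = \f g -> f 1 . (\n -> g 2 n)

-- After lambda lifting, the inner lambda receives h as a parameter and
-- is applied to g, so it no longer has any free variables.
lifted :: (Int -> b -> c) -> (Int -> a -> b) -> a -> c
lifted = \f g -> f 1 . (\h n -> h 2 n) g
```

Both definitions denote the same function; for example, applying either to (+), (*), and 5 computes 1 + (2 * 5).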
CAFs are constant in the sense that they contain no free variables, which guarantees that all thunks a CAF references directly are also constants. In effect, the constant subvalues are part of the value, so they are automatically memoized within the value itself.
A top-level [Int], say, is just as valid a value as the fib_mem function for holding references to other values. You should pay attention to CAFs in your code because memoized values are space leaks when the memoization was unintended. All code that allocates lots of memory should be wrapped in functions that take one or more parameters.
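The following sketch contrasts the two shapes. The names squares_caf and squares_upto are hypothetical, and the comments assume compilation without optimizations; with optimizations enabled, GHC's full-laziness transformation may float constant subexpressions out of a function body and turn them back into CAFs, which the -fno-full-laziness flag disables.

```haskell
-- A top-level CAF: once evaluated, the list stays reachable through
-- squares_caf and keeps occupying memory.
squares_caf :: [Int]
squares_caf = map (\x -> x * x) [1 .. 100000]

-- Wrapped in a function that takes a parameter: every call builds a
-- fresh list, which can be garbage collected once it has been consumed.
squares_upto :: Int -> [Int]
squares_upto n = map (\x -> x * x) [1 .. n]
```

A consumer such as sum (squares_upto 100000) can then run without the whole list being retained afterwards, whereas sum squares_caf leaves the fully evaluated list attached to the top-level name.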