%-----------------------------------------------------------------------------
%
% Name:         containers.tex
%
% Purpose:      Paper for Haskell Symposium 2010
%
% Author:       Milan Straka <fox@ucw.cz>
%
% Created:      18 May 2010
%
%-----------------------------------------------------------------------------

\documentclass[preprint,authoryear]{sigplanconf}

\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{alltt}

\renewcommand{\O}{{\cal O}}
\DeclareFontShape{OT1}{cmtt}{bx}{n}{<5><6><7><8><9><10><10.95><12><14.4><17.28><20.74><24.88>cmttb10}{}
\newenvironment{citemize}{\begin{list}{$\bullet$}{\topsep=\smallskipamount\itemsep=0pt}}{\end{list}}
\newcommand{\pkg}[1]{{\sc #1}}

\begin{document}

\conferenceinfo{Haskell Symposium '10}{30th September 2010, Baltimore MD.}
\copyrightyear{2010}
\copyrightdata{[to be supplied]}

\titlebanner{Draft}
\preprintfooter{Draft}

\title{The performance of Haskell \pkg{containers} package}
%\subtitle{}

\authorinfo{Milan Straka}
           {Charles University in Prague}
           {fox@ucw.cz}
\maketitle

\begin{abstract}
In this paper, we perform a~thorough performance analysis of the
\pkg{containers} package, a~de facto standard Haskell containers library,
comparing it to the most of existing alternatives. We then significantly
improve its performance, making it comparable to the best implementations
available. Additionally, we describe a~new persistent data structure based on
hashing, which offers the best performance out of available data structures
containing {\tt String}s and {\tt ByteString}s.
\end{abstract}

\keywords
Haskell, containers, data structures, benchmarking

\section{Introduction}

In almost every computer language there are libraries providing various data
structures, an important tool of a~programmer. Programmers benefit from
well written libraries, because these libraries
\begin{citemize}
  \item free the programmer from repeated data structure implementation and allow
  them to focus on the high level development,
  \item prevent bugs in the data structure implementation,
  \item can provide high performance.
\end{citemize}
For some languages, standardized data structure libraries exist (STL for
C++~\cite{stl}, Java Collections Framework, .NET System.Collections, etc.),
which provide common and effective option in many cases.


Being the only data structure package coming with GHC
and the Haskell Platform (the standard Haskell development environment),
the \pkg{containers} package has become a~``standard'' data structure library for
Haskell programmers. It is used by almost every third package on the HackageDB
(674 out of 2083, 21st May 2010), which is a~public collection of packages
released by Haskell community.

The \pkg{containers} package contains the implementations of
\begin{citemize}
  \item {\em sets} of elements (the elements must be comparable),
  \item {\em maps} of key and value pairs (the keys must be comparable),
  \item ordered {\em sequences} of any elements,
  \item {\em trees} and {\em graphs}.
\end{citemize}
All data structures in this package work persistently, ie. they can be
shared~\cite{persistence}.

Our decision to compare and improve the \pkg{containers} package was
motivated not only by the wide accessibility of the package, but also by our
intention to replace the GHC internal data structures with the \pkg{containers}
package. Therefore we wanted to confirm that the performance offered by the
package is the best possible, both for small and big volume of data stored in
the structure, and possibly to improve it.

\begin{citemize}
  \item We present the first comprehensive performance measurements of the
  widely-used \pkg{containers} package, including head-to-head comparisons
  against half a~dozen other popular container libraries
  (Section~\ref{sec:benchmarking}).

  \item We describe optimisations to containers that improve the performance of
  {\tt IntSet} by 10-20\% and the performance of {\tt Set} by 30-50\% in common
  cases (Section~\ref{sec:improving}).

  \item We describe a~new data structure that uses hashing to improve
  performance.  Hash tables are usually thought of as mutable structures, but
  our new data structure is fully persistent.  Compared to other optimised
  containers, performance is improved up to three times for string elements
  (Section~\ref{sec:hashset}).
\end{citemize}

\section{The \pkg{containers} package}

In this section we describe the data structures available in the
\pkg{containers} package. We tried to cover the basic and most frequent usage,
for the eventual performance boost to be worthwhile.
Focusing on basic usage is beneficial for the sake of comparison
too, as the basic functionality is offered by nearly all implementations.

\subsection{Sets and maps}

A~{\em set} is any data structure providing operations {\tt empty},
{\tt member}, {\tt insert}, {\tt delete} and {\tt union}
as listed on Figure~\ref{fig:set}. Real implementations certainly offer richer
interface, but for our purposes we will be interested only in these methods.
\begin{figure}[h]
  \begin{alltt}
  \textbf{data} \textit{Set e}
  \textit{empty}   ::           Set e
  \textit{member}  ::  Ord e => e -> Set e -> Bool
  \textit{insert}  ::  Ord e => e -> Set e -> Set e
  \textit{delete}  ::  Ord e => e -> Set e -> Set e
  \textit{union}   ::  Ord e => Set e -> Set e -> Set e\end{alltt}
  \caption{A~{\em set} implementation provided by the \pkg{containers} package}
  \label{fig:set}
\end{figure}

A~{\em map} from keys to values is a~set of pairs (key, value), which are
compared using the key only. To prevent duplication we discuss only
sets from now on, but everything applies to maps too\footnote{In reality it
works the other way around -- a~set is a~special case of map that has no
associated value for a~key. We could use a~{\tt Map e ()}, where {\tt ()} is
a~unit type with only one value, as a~{\tt Set e}. But the unit values would
still take space, which is why a~{\tt Set e} is provided.}.

We were of course interested in the performance of a~sequence of {\tt member}
operations, in the performance of a~sequence of {\tt insert} operations,
a~sequence of {\tt delete} operations and a~sequence of {\tt union} operations.
Additionaly, we were interested in a~{\em tree union} operation: given a~tree
with values in the leaves, the tree union build a~set by performing {\tt union}
operations in the inner vertices of the tree.

The tree union benchmark was of particular importance to us, as in GHC a~set is
often built in such a~manner. We therefore tried to improve the {\tt union}
performance in this setting, even if it did not help in the ``merge two very large
sets'' case.

\subsection{Intsets}

A~set of {\tt Int}s is used so frequently, that the \pkg{containers} package
offers a~specialized implementation. By an {\em intset} we therefore mean
a~specialized implementation of a~set of {\tt Int}s\footnote{During one GHC
compilation, the internal intmap operations take 5-15 times more than the map operations
(depending on the code generator used), which we measured with the GHC-head on
26th~March 2010.}. It should of course be faster than a~regular set of {\tt
Int}s, otherwise there would be no point in using it.

\subsection{Sequences}

The \pkg{containers} package also provides an implementation of a~{\em
sequence} of elements called a~{\tt Seq} with operations listed on
Figure~\ref{fig:seq}.
\begin{figure}[h]
  \begin{alltt}
  \textbf{data} \textit{Seq a}
  \textbf{data} ViewL a = EmptyL | a :< (Seq a)
  \textbf{data} ViewR a = EmptyR | (Seq a) :> a
  \textit{empty}  :: Seq a
  \textit{(<|)}   :: a -> Seq a -> Seq a
  \textit{(|>)}   :: Seq a -> a -> Seq a
  \textit{viewl}  :: Seq a -> ViewL a
  \textit{viewr}  :: Seq a -> ViewR a
  \textit{index}  :: Seq a -> Int -> a
  \textit{update} :: Int -> a -> Seq a -> Seq a\end{alltt}
  \caption{An implementation of a {\em sequence} of elements provided by the
           \pkg{containers} package}
  \label{fig:seq}
\end{figure}
A~{\tt Seq} is similar to a~list, but elements can be added ({\tt<|} and {\tt|>}) and
removed ({\tt viewl} and {\tt viewr}) to the front and also to the back in
constant time, allowing to use this structure as a~{\em double-ended queue}.
Elements can be also indexed and updated in logarithmic time and two sequences
can be concatenated also in logarithmic time.

We were interested in the benchmark of repeated {\tt index} operation and also
in a~benchmarked of repeated {\tt update} operation.

Because there are no other implementations of queues and deques in standard Haskell
libraries, a~{\tt Seq} is frequently used as a~queue. That is why we were also
interested in the comparison of a~{\tt Seq} and queue and deque
implementations, which could potentially be much faster, as they do not allow
access to the elements in the middle.

\subsection{The rest of the \pkg{containers} package}

The \pkg{containers} package also contains a~data type of a~multi-way tree.
Aside from the definition of this type, it contains only trivial methods (folds),
so there is no point in benchmarking those.

The last data structure offered by the package is a~graph, which is build on
top of \pkg{array} package, and some simple graph algorithms. We perform no
graph benchmarks, as the most similar \pkg{fgl} package is very different in
design. We only describe some simple performance improvements.

\section{The benchmarks}
\label{sec:benchmarking}

To benchmark a~program written in a~language performing lazy evaluation is
a~tricky business. Luckily there are powerful benchmarking frameworks
available.  We used the \pkg{criterion} package for benchmarking and
\pkg{progression} package for running the benchmarks of different
implementations and grouping the results together.

All benchmarks were performed on a~dedicated machine with Intel Xeon processor
and 4GB RAM, using 32-bit GHC 6.12.2. We tried to benchmark all different
implementations on the HackageDB. The list of packages used, together with
their versions, can be found in Appendix~\ref{sec:pkglist}.

\medskip

The benchmarking process works by calling a~benchmarked method on given input
data and forcing the evaluation of the result. The evaluation forcing can be
done conveniently using a~\pkg{deepseq} package. But as the representation of
the data structures is usually hidden from its users, we could not provide {\tt
NFData} instances directly and had to resort to a~fold which performs an
evaluation of all elements in the structure.

Because the benchmarked method can take only microseconds to execute, the
benchmarking framework repeats the execution of a~method until it takes
reasonable time (imagine 50ms) and then divides the elapsed time by the number
of iterations.

This process is repeated 100 times to get the whole distribution of the time
needed, and the mean and confidence interval are produced.

The results are displayed as graphs, one for each benchmark (Figures~\ref{fig:bench-setsI}
to~\ref{fig:bench-hashset-union}). One implementation
is chosen as a~baseline and the elapsed times are normalized with respect to the
selected baseline. For each implementation and input, the mean time of 100 iterations is
displayed, together with 95\% confidence interval (which is usually not visible
on the graphs as it is nearly identical to the mean). For every implementation
a~geometric mean of all times is computed and displayed in the legend. The
implementations except for the baseline are ordered according to this mean.

Each benchmark consists of several inputs. The size of input data is always
measured in binary logarithms (so the input of size 10 contains 1024 elements).
This size is always the first part of description of the input, which is displayed on
the $x$ axis. The input elements are of type {\tt Int} unless stated otherwise
({\tt String}s and {\tt ByteString}s will be used with the {\tt HashSet} in
Section~\ref{sec:hashset}). Where any order or elements in the input data could
be used, we tried ascending and random order ({\tt asc} and {\tt rnd} in the
description of the input) to fully test the data structure behaviour.

All graphs presented here are available on the author's website {\tt
http://fox.ucw.cz/papers/containers/}, together with the numerical data.

\subsection{Sets}

As the only element operation available is a~comparison, nearly all
implementations use some kind of balanced search tree. We will not describe the
algorithms used, but will provide references for interested readers.

We benchmarked the following set implementations:
\begin{citemize}
  \item {\tt Set} and {\tt Map} from \pkg{containers} package, which uses
  bounded balance trees~\cite{adams},
  \item {\tt FiniteMap} from GHC 6.12.2 sources, which also uses bounded
  balance trees~\cite{adams},
  \item {\tt AVL} from \pkg{AvlTree} package, which uses well-known AVL trees~\cite{avl},
  \item {\tt AVL} from \pkg{TreeStructures} package, which we denote as {\tt AVL2}
  in the benchmarks,
  \item {\tt RBSet} implemented by us which uses well-known red-black trees~\cite{rbtree}.
\end{citemize}

\noindent We performed these benchmarks:
\begin{citemize}
  \item {\em lookup benchmark}: perform a~{\tt member} operation on every
  element of the given set, either in ascending order ({\tt asc} in
  the input descripton) or in random order ({\tt rnd} in the input
  description),

  \item {\em insert benchmark}: build a~set by sequentially calling {\tt
  insert}, either in ascending ({\tt asc} in the input descripton) or random
  order of elements ({\tt rnd} in the input description),

  \item {\em delete benchmark}: sequentially {\tt delete} all elements of
  a~given set, either in ascending ({\tt asc} in the input descripton) or
  random order of elements ({\tt rnd} in the input description),

  \item {\em union benchmark}: perform a~{\tt union} of two sets of given sizes
  (the sizes are the first and second part of input description).  The input
  description {\tt asc} means the elements in one set are all smaller than the
  elements in the other set. The description {\tt e\char`_o} stands for an
  input where one set consists of the even numbers and the other of odd
  numbers. The last option {\tt mix} represents an input with elements grouped
  in continuous runs, and there runs are split between the two sets.

  \item {\em tree union benchmark}: given a~tree with elements in the leaves,
  perform {\tt} union on all internal vertices to get one resulting set. The
  input description {\tt asc} and {\tt rnd} specify the order of the elements
  in the leaves. The shape of the tree is specified by the last letter of the
  input description. The letter {\tt b} stands for perfectly balanced binary tree,
  {\tt u} denotes unbalanced binary tree (one son is six times the size of the
  other son) and {\tt p} stands for a~centipede, see Figure~\ref{fig:centipede}.
\end{citemize}

\begin{figure}[t]
  \begin{center}\includegraphics{centipede.eps}\end{center}
  \caption{A~tree called the {\em centipede}.}
  \label{fig:centipede}
\end{figure}

The results of the benchmarks are plotted on Figures~\ref{fig:bench-setsI}
and~\ref{fig:bench-setsII}. The performance of the {\tt Set} is comparable to the {\tt FiniteMap}, but it is much worse
than {\tt AVL} and {\tt RBSet}, which are implemented with speed being the first design goal.
This makes a~lot of space for improvements of the {\tt Set} implementation to make it comparable
to the {\tt AVL} and {\tt RBSet}. We describe such improvements in Section~\ref{sec:improving}.

\begin{figure}
  \begin{center}
    \includegraphics{graphs/map/lookup.eps}
    \includegraphics{graphs/map/insert.eps}
    \includegraphics{graphs/map/delete.eps}
  \end{center}
  \caption{Benchmark of sets operations I}
  \label{fig:bench-setsI}
\end{figure}
\begin{figure}
  \begin{center}
    \includegraphics{graphs/map/union.eps}
    \includegraphics{graphs/map/treeunion.eps}
  \end{center}
  \caption{Benchmark of sets operations II}
  \label{fig:bench-setsII}
\end{figure}

\subsection{Intsets}

The purpose of an intset implementations is to outperform a~set of
\texttt{Int}s. This can be achieved by allowing other operations on
\texttt{Int}s in addition to a~comparison.  All mentioned implementations
exploit the fact that an \texttt{Int} is a~sequence of 32 or 64 bits.

We have benchmarked following intset implementations:
\begin{citemize}
  \item {\tt IntSet} from \pkg{containers} package which implements big-endian
  Patricia trees~\cite{intset},
  \item {\tt UniqueFM} from GHC 6.12.2 sources which implements also big-endian
  Patricia trees,
  \item {\tt PatriciaLoMap} from {\tt EdisonCore} package, called {\tt
  EdisonMap} in the benchmark, which implements little-endian Patricia
  trees~\cite{intset}.
\end{citemize}
We also include ordinary {\tt Set}~{\tt Int} from the \pkg{containers} package in the
benchmarks. The problem with this implementation is that its methods work with
any comparable type and thus do not use integer comparison directly. This is
a~considerable performance loss and therefore we also include another set
implementation based on {\tt Set}~{\tt Int}, called {\tt SetInlined}, which inline
benchmarked methods to the call sites. This allows every call to be specialized
at the cost of code growth.

\begin{figure}
  \begin{center}
    \includegraphics{graphs/intmap/lookup.eps}
    \includegraphics{graphs/intmap/insert.eps}
    \includegraphics{graphs/intmap/delete.eps}
  \end{center}
  \caption{Benchmark of intsets operations I}
  \label{fig:bench-intsetsI}
\end{figure}
\begin{figure}
  \begin{center}
    \includegraphics{graphs/intmap/union.eps}
    \includegraphics{graphs/intmap/treeunion.eps}
  \end{center}
  \caption{Benchmark of intsets operations II}
  \label{fig:bench-intsetsII}
\end{figure}

The benchmarks performed are the same as in the case of generic set
implementations. The results can be found on Figures~\ref{fig:bench-intsetsI}
and~\ref{fig:bench-intsetsII}.

The {\tt IntSet} outperforms all the presented implementations, except for the
lookup benchmark, where the {\tt UniqueFM} performs 10\% faster. The {\tt
IntSet} is considerably faster than a~{\tt Set}~{\tt Int}, especially in the
tree union benchmark, where it runs more than four times faster.

We were still able to improve the effective {\tt IntSet} implementation
performance. We describe these improvements in Section~\ref{sec:improving}.

\subsection{Sequence}

Because the {\tt Seq} is a~multifunctional data structure, we compared it to
various implementations in various benchmarks.

The {\em queue} benchmark consists of two phases, at first a~certain number of
elements is added to the queue (the number of the elements added is the first
part of the input description) and then some of the previously added elements
are removed from the queue (the second part of the input description). We also
tried mixing the additions and deletions, but there were hardly any
differences in performance, so we do not present these.

In the queue benchmark we tested the following implementations:

\begin{citemize}
  \item {\tt Seq} from the \pkg{containers} package, which implements
  2-3 finger trees annotated with sizes~\cite{fingertrees},
  \item {\tt Trivial}, which is a~non-persistent queue with amortized bounds,
  described in Section 5.2 of~\cite{okasaki},
  \item {\tt Amortized}, which is a~persistent queue with amortized bounds,
  described in Section 6.3.2 of~\cite{okasaki},
  \item {\tt Realtime}, which is a~persistent queue with worst-case bounds,
  described in Section 7.2 of~\cite{okasaki},
  \item {\tt Ed\char`_Trivial}, {\tt Ed\char`_Amortized} and {\tt
  Ed\char`_Seq} from the \pkg{EdisonCore}~package, which implement the same
  algorithms as {\tt Trivial}, {\tt Amortized} and {\tt Seq}, respectively.
\end{citemize}

The results are displayed on Figure~\ref{fig:bench-queue}. The {\tt Ed\char`_Seq} is
missing, as it was roughly 20 times slower than the {\tt Seq}
implementation. Because the {\tt Trivial} queue implementation is not persistent
(cannot be shared), we do not consider it to be a~practical alternative.
That means the {\tt Seq} implementation is only 50\% slower than the
fastest queue implementation available. That is a~solid result, considering the
additional functionality it provides.

\begin{figure}
  \begin{center}
    \includegraphics{graphs/queue/queue.eps}
  \end{center}
  \caption{Benchmark of queue operations}
  \label{fig:bench-queue}
\end{figure}

\medskip

The {\em index} and {\em update} benchmark perform a sequence of {\tt index}
and {\tt update} operations, respectively, one for each element in the
structure (the size of this structure is in the input description).
In~these benchmarks we used the following implementations:
\begin{citemize}
  \item {\tt Seq} from the \pkg{containers} package,
  \item {\tt Array} from the \pkg{array} package for the index benchmark only,
  \item {\tt RandList} from the \pkg{random-access-list} package, which
  implements the skew binary random-access list from Section 9.3
  of~\cite{okasaki},
  \item {\tt Ed\char`_RandList} from the \pkg{EdisonCore} package, which
  implements the same algorithm,
  \item {\tt Ed\char`_BinRandList} from the \pkg{EdisonCore} package, which
  implements bootstrapped binary random-access list from Section 10.1.2
  of~\cite{okasaki},
  \item {\tt Ed\char`_Seq} from the \pkg{EdisonCore} package,
  \item {\tt IntMap} from the \pkg{containers} package.
\end{citemize}

The results are presented on Figure~\ref{fig:bench-sequences}. The {\tt
Ed\char`_Seq} is once again not present, because it was 10-20 times slower than
the {\tt Seq}.  The {\tt IntMap} was used as a~map from the {\tt Int} indexes
to the desired values. Despite the surplus indexes, it outperformed most of the
other implementations. The {\tt Array} is present only in the lookup benchmark,
because the whole array has to be copied when modified and thus the {\tt
update} operation modifying only one element is very ineffective.

\begin{figure}
  \begin{center}
    \includegraphics{graphs/ral/index.eps}
    \includegraphics{graphs/ral/update.eps}
  \end{center}
  \caption{Benchmark of sequence operations}
  \label{fig:bench-sequences}
\end{figure}

The {\tt Seq} is not the fastest queue nor random access list, but it excels
when both these qualities are required. For comparison, when an {\tt IntMap} is
used in the queue benchmark, it is 2.5-times slower than {\tt Seq}, and {\tt
Ed\char`_RandList} and {\tt Ed\char`_BinRandList} are 5-times and 7-times
slower, respectively.

\section{Improving the \pkg{containers} performance}
\label{sec:improving}

There are several methods of improving an existing code.

The simplest is probably the ``look and see'' method -- after carefully exploring
the properties of the implementation (practically ``staring at the source code for
some time'') some obvious deficiencies can be found.

As an example, consider the following definitions:\\
\llap{}{\tt~~~~\textbf{data} Tree a~= Node a~(Forest a)}\\
\llap{}{\tt~~~~\textbf{type} Forest a~= [Tree a]}\\
In the {\tt Data.Graph} module, function for pre-order and post-order
{\tt Tree} traversal are provided. The reader is welcome to consider what is wrong about {\em both}
of these implementations:

\begin{verbatim}
  preorder            :: Tree a -> [a]
  preorder (Node a ts) = a : preorderF ts
  preorderF           :: Forest a -> [a]
  preorderF ts         = concat (map preorder ts)

  postorder            :: Tree a -> [a]
  postorder (Node a ts) = postorderF ts ++ [a]
  postorderF           :: Forest a -> [a]
  postorderF ts         = concat (map postorder ts)
\end{verbatim}

The {\tt postorder} case is straightforward -- the list concatenation is linear in the length
of the first list, so the time complexity of {\tt postorder} performed on a~path
is quadratic.

The {\tt preorder} is a~bit more challenging -- the {\tt concat} takes the time
of the length of all but the last list given. This also results in quadratic
behaviour, for example when the {\tt preorder} is executed on a~centipede
(Figure~\ref{fig:centipede}). The same mistake is also present in the {\tt
postorder} function.

It is trivial to reimplement both these functions to have linear time
complexity.

However, potential performance improvements are usually not found merely by
examining the source code. Another method is to use profiling to see which part
of the code takes long to execute and which would be beneficial to improve.

Having two implementations, we can also examine why one is faster. In the
simplest case it can be done at the level of Haskell sources. But if the reason
for different performance is not apparent, we can inspect the differences at the
level of Core Haskell~\cite{core} using for example the {\tt -ddump-stranal}
GHC flag, which shows the results of strictness analysis. If this is not enough, we can
examine the C{\tt-}{\tt-} code~\cite{cmm} using the {\tt -ddump-cmm} GHC flag. We had to
resort to analysis on all these levels when improving the performance of the \pkg{containers}.

We now briefly describe the changes we made to improve the performance and
present the benchmark results of the new implementations. The patches
are available on the author's website {\tt http://fox.ucw.cz/papers/containers/}
and will soon be submitted for inclusion to the upstream.

\subsection{Sets}

\begin{figure}
  \begin{center}
    \includegraphics{graphs/mymap/lookup.eps}
    \includegraphics{graphs/mymap/insert.eps}
    \includegraphics{graphs/mymap/delete.eps}
  \end{center}
  \caption{Benchmark of improved sets operations I}
  \label{fig:bench-mysetsI}
\end{figure}
\begin{figure}
  \begin{center}
    \includegraphics{graphs/mymap/union.eps}
    \includegraphics{graphs/mymap/treeunion.eps}
  \end{center}
  \caption{Benchmark of improved sets operations II}
  \label{fig:bench-mysetsII}
\end{figure}

We decided not to change the algorithm, but only to improve the existing
implementation, as such a~profound change did not promise to yield significant
performance benefit. We made the following improvements:

\begin{itemize}
  \item As already mentioned, the methods of a~{\tt Set} works for any
  comparable type and therefore use generic comparison method. That
  hurts performance in case of methods which spend a~lot of time comparing
  the elements, like {\tt member} or {\tt insert}. We inline these methods to
  the call site (only the pieces of the tree navigation, the rebalancing code
  is not duplicated to keep the code growth at minimum).

  \item When balancing a~node, the function {\tt balance} checked the balancing
  condition and called one of the four rotating functions, which rebuilt the
  tree using smart constructors. This resulted in a~repeated pattern matching,
  which was unnecessary. We rewrote the {\tt balance} function to contain all
  the logic and to use as few pattern matches as possible. That resulted in
  significant performance improvements in all {\tt Set} methods that modify
  a~given set.

  \item When a~recursive method access its parameter at different recursion
  levels, Haskell usually has to check that it is evaluated each time it is
  accessed. For a~{\tt member} or {\tt insert}, that is a~measurable slowdown.
  We rewrote these methods so that they evaluate the parameter at most once.
  To illustrate, we changed the original {\tt member} method
  \vspace{-.8\baselineskip}\begin{alltt}
  \textit{member :: Ord a => a -> Set a -> Bool}
  member x t = \textbf{case} t of Tip -> False
                         Bin _ y l r ->
                           case compare x y of
                             LT -> member x l
                             GT -> member x r
                             EQ -> True
  \end{alltt}\vspace{-1.7\baselineskip}to the following
  \vspace{-.8\baselineskip}\begin{alltt}
  member _ Tip = False
  member x t = x `seq` member' t \textbf{where}
    member' Tip = False
    member' (Bin _ y l r) = \textbf{case} compare x y of
                              LT -> member' l
                              GT -> member' r
                              EQ -> True
  \end{alltt}\vspace{-1.7\baselineskip}

  \item We improved the {\tt union} to handle small cases -- merging a~set of
  size one is the same as inserting that one element. We achieved that by adding the
  following cases to the definition of a {\tt union}:
  \vspace{-.8\baselineskip}\begin{alltt}
  union (Bin _ x Tip Tip) t = insert x t
  union t (Bin _ x Tip Tip) = insertR\footnote{The \texttt{insertR} method works just like an \texttt{insert}, but it does not insert the element if it is already present in the set.} x t
  \end{alltt}\vspace{-1.7\baselineskip}
  That helped significantly in the tree union benchmark. We tried to use this rule on sets of size 2 and
  3, but the performance did not improve further.

  \item In the {\tt union} method, a~comparison with a~possibly infinite
  element must be performed. That was originally done by supplying a~comparison
  function, which was constant for the infinite bound. Supplying a~value {\tt
  Maybe elem} with infinity represented as {\tt Nothing} improved the
  performance notably. To demonstrate the changes, consider the {\tt filterGt} method,
  which keeps in the set only the elements greater than the given bound (which could
  be $-\infty$):
  \vspace{-.8\baselineskip}\begin{alltt}
  \textit{filterGt :: (a -> Ordering) -> Set a -> Set a}
  filterGt _   Tip           = Tip
  filterGt cmp (Bin _ x l r) = \textbf{case} cmp x of
    LT -> join x (filterGt cmp l) r
    GT -> filterGt cmp r
    EQ -> r
  \end{alltt}\vspace{-1.7\baselineskip} We altered it to:
  \vspace{-.8\baselineskip}\begin{alltt}
  filterGt Nothing t = t
  filterGt (Just b) t = b `seq` filter' t \textbf{where}
    filter' Tip = Tip
    filter' (Bin _ x l r) = \textbf{case} compare b x of
      LT -> join x (filter' l) r
      GT -> filter' r
      EQ -> r
  \end{alltt}\vspace{-1.7\baselineskip}

\end{itemize}

The results are displayed on Figures~\ref{fig:bench-mysetsI}
and~\ref{fig:bench-mysetsII}. The improved implementations are called {\tt
NewSet} and {\tt NewMap}. We were able to reach the {\tt AVL} implementation
performance, except for the union benchmark. Yet we outperformed it on the tree
union benchmark, which was our objective.

Note that using the existing {\tt AVL} implementation as a~{\tt Map} is not
trivial, because it does not allow to implement all the functionality
of a~{\tt Map} efficiently (notably {\tt elemAt}, {\tt deleteAt} etc.).

\subsection{IntSets}

\begin{figure}
  \begin{center}
    \includegraphics{graphs/myintmap/lookup.eps}
    \includegraphics{graphs/myintmap/insert.eps}
    \includegraphics{graphs/myintmap/delete.eps}
  \end{center}
  \caption{Benchmark of improved intsets operations I}
  \label{fig:bench-myintsetsI}
\end{figure}
\begin{figure}
  \begin{center}
    \includegraphics{graphs/myintmap/union.eps}
    \includegraphics{graphs/myintmap/treeunion.eps}
  \end{center}
  \caption{Benchmark of improved intsets operations II}
  \label{fig:bench-myintsetsII}
\end{figure}

The {\tt IntSet} implementation was already extensively tuned and difficult to
improve. We performed only minor optimizations:

\begin{itemize}
  \item As with the {\tt Set}s, some recursive functions checked whether the
  parameters were evaluated multiple times. We made sure it is done at most
  once.  Because some functions were already strict in the key, it was enough
  to add the {\tt seq} calls to appropriate places. This improved the {\tt
  lookup} function significantly.

  \item The implementation uses a~function {\tt maskW}.  When {\tt m} contains
  exactly one bit set, the {\tt maskW i m} should return only the values of
  bits of {\tt i} than are higher than the bit set in {\tt m}:

  \begin{center}\begin{tabular}{c|l}
    {\tt m}        &{\tt 0\dots 010\dots 0}\\
    {\tt i}        &{\tt a\dots abc\dots c}\\
    {\tt maskW i m}&{\tt a\dots a00\dots 0}\\
  \end{tabular}\end{center}

  This method is defined as\\
  \rlap{}{\tt~~maskW i m = i .\&. (complement (m-1) `xor` m)}\\
  But there are other effective alternatives, for example:\\
  \rlap{}{\tt~~maskW i m = i .\&. (-m - m)}\\
  \rlap{}{\tt~~maskW i m = i .\&. (m * complement 1)}\\
  The last one is (unexpectedly for us) the best and caused the 10\% speedup in
  all benchmarked methods.
\end{itemize}

The results are presented on Figures~\ref{fig:bench-myintsetsI}
and~\ref{fig:bench-myintsetsII}, the improved implementations are called {\tt
NewIntSet} and {\tt NewIntMap}. The {\tt NewIntSet} is now 10\% faster
in all benchmarks and even 20\% faster in the lookup benchmark. The speedup
of {\tt NewIntMap} is a~bit smaller.

\section{New set and map implementation based on hashing}
\label{sec:hashset}

When a~comparison of two elements is expensive, using a~tree representation for
a~set can be ineffective, because the comparison has to be performed multiple
times. For example, when considering {\tt String}s, one could use a~hash table
(Section 6.4 of~\cite{knuth3}), which tries to guess the position of an element
in the set and performs only one comparison if it guessed right. Another
alternative is a~trie (Section 6.3 of~\cite{knuth3}), which can also be
implemented using a~ternary search tree (\cite{ternary}), which compares only
subparts of an element.

The problem with a~hash table is that it is usually built using an array, but
there is no available implementation of an array that could be shared, ie. be
persistent.

However, we saw that an {\tt IntMap} can be used as an array with reasonable
performance. We used this fact and implemented a~{\tt HashSet elem} as\\
\rlap{}{\tt~~\textbf{data} HashSet elem = HS (IntMap (Set elem))}.\\
The {\tt HashSet} is therefore an {\tt IntMap} indexed by the hash value of an
element. In the {\tt IntMap}, there is a~{\tt Set elem} containing elements with the
same hash value (this set will be of size one if there are no hash collisions).
A~{\tt HashMap} can be implemented in the same way as\\
\rlap{}{\tt~~\textbf{data} HashMap key val = HM (IntMap (Map key val))}.

This data structure is quite simple to implement, using the methods of an {\tt
IntMap} and a~{\tt Set} or a~{\tt Map}. It offers a~subset of {\tt IntMap}
interface, which does not depend on the elements being stored in an {\tt
IntMap} in ascending order (the elements are stored in ascending order of the
hash value only). Namely, we do not provide {\tt toAscList} (users can use
{\tt sort . toList}), {\tt split}, and the methods working with the minimum
and maximum element ({\tt findMin}, {\tt findMax} and others). Moreover,
the folds and maps are performed in unspecified element order.

We uploaded our implementation to the HackageDB as a~package called \pkg{hashmap}.

We performed the same lookup, insert and delete benchmark on the {\tt HashSet}
as on the {\tt Set} and {\tt IntSet}. We used the original unimproved
implementation of the \pkg{containers} package -- the performance of the {\tt
HashSet} will improve once the improvements from Section~\ref{sec:improving}
are incorporated.

\begin{figure}
  \begin{center}
    \includegraphics{graphs/hash/lookup-int.eps}
    \includegraphics{graphs/hash/insert-int.eps}
    \includegraphics{graphs/hash/delete-int.eps}
  \end{center}
  \caption{Benchmark of hashset operations on {\tt Int}s}
  \label{fig:bench-hashset-int}
\end{figure}

The performance of a~{\tt HashSet} when using elements of type {\tt Int} is
displayed on Figure~\ref{fig:bench-hashset-int}. It is worse than an {\tt
IntSet}, because it uses an {\tt IntMap} which is it self slower than an {\tt
IntSet}, and also uses an additional {\tt Set}.

The {\tt HashSet} should be beneficial when the comparison of the set elements
is expensive. We therefore benchmarked it with {\tt String}s and {\tt
ByteString}s elements. We compared the {\tt HashSet} implementation to all
alternatives present on the HackageDB (mostly trie-like data structures):
\begin{citemize}
  \item {\tt ListTrie} and {\tt PatriciaTrie} from the \pkg{list-tries} package
  implementing trie and Patricia trie (Section 6.3 of~\cite{knuth3}),
  \item {\tt BStrTrie} from the \pkg{bytestring-trie} package, which is
  speciazed for {\tt ByteString}s and (like {\tt IntSet}) implements the
  big-endian Patricia tree~\cite{intset},
  \item {\tt StringSet} from the \pkg{TernaryTrees} package, which implements
  ternary search tree (\cite{ternary}) specialized for the elements of type
  {\tt String},
  \item {\tt TernaryTrie} from {\tt EdisonCore} also implementing ternary
  search tree.
\end{citemize}

\begin{figure}
  \begin{center}
    \includegraphics{graphs/hash/lookup-str.eps}
    \includegraphics{graphs/hash/insert-str.eps}
    \includegraphics{graphs/hash/delete-str.eps}
  \end{center}
  \caption{Benchmark of hashset operations on {\tt String}s}
  \label{fig:bench-hashset-str}
\end{figure}
\begin{figure}
  \begin{center}
    \includegraphics{graphs/hash/lookup-bst.eps}
    \includegraphics{graphs/hash/insert-bst.eps}
    \includegraphics{graphs/hash/delete-bst.eps}
  \end{center}
  \caption{Benchmark of hashset operations on {\tt ByteString}s}
  \label{fig:bench-hashset-bst}
\end{figure}

The results are presented on Figures~\ref{fig:bench-hashset-str} and~\ref{fig:bench-hashset-bst}.
The length of the strings used in the benchmarks is the last number in the
input description. We used random-generated strings of small letters ({\tt rnd}
in the input description) and also a~consecutive ascending sequence of strings
({\tt asc} in the input description). In the latter case the strings have
a~long common prefix of {\tt a}'s. The {\tt ListTrie} is not present in the
benchmark results because it was 5-10 times slower than the {\tt HashSet}.

The {\tt HashSetNoC} is the same as the {\tt HashSet}, only the computation of
a~hash value of a~{\tt ByteString} is done in Haskell and not in C. There is
quite significant slowdown in the case Haskell generating the hashing code. We
discussed this with the GHC developers and were informed that the problem
should be solved using the new LLVM backend~\cite{llvm}.

We also performed the union benchmark.  We generated a sequence of elements
(its length is the first part of the input description) and created two sets of
the same size, one from the elements on the positions and the other from
the elements on odd positions. Then we performed a {\tt union} of those
sets.  The results for {\tt Int}, {\tt String} and {\tt ByteString} elements
are presented on Figure~\ref{fig:bench-hashset-union}.

The performance of a~{\tt HashSet} is superior to trie structures, even
those specialised for the {\tt String} or {\tt ByteString} elements. As
mentioned, the performance will improve even more with the enhancements of the
\pkg{containers} package.

\section{Conclusions and further work}

We have undertaken a~thorough performance analysis of the \pkg{containers}
package, comparing it to the most of existing alternatives found on the HackageDB.
These measurements are interesting of its own accord, because they allow existing data
structure implementations to be compared.

Using the benchmark results and code profiling, we significantly improved the
performance of the \pkg{containers} package, making it comparable to the best
implementations available. We will submit our patches for inclusion to the
upstream shortly.

Inspired by the benchmark results we also implemented a~new persistent data
structure based on hashing, which offers the best performance out of available
set implementations with {\tt String} and {\tt ByteString} elements, but
should perform well for any element type whose comparison is expensive. This
data structure is now available on the HackageDB.

\medskip

Improving a~library's performance is an unending process. Certainly the
\pkg{containers} package could be improved even further and more its
methods could be benchmarked and improved. Even though it would be very unsatisfactory,
part of the package could also we rewritten to C, which would definitely result in
further improvements.

\acks

I would like to express my sincere gratitude to Simon Peyton Jones for his
supervision and guidance during my internship in Microsoft Research Labs.
Our discussions were always very intriguing and motivating.

\appendix
\section{The list of referenced HackageDB packages}
\label{sec:pkglist}

All packages mentioned in this paper can be found on the HackageDB, which is
a~public collection of packages released by the Haskell community. The list
of HackageDB packages currently resides at {\tt http://hackage.haskell.org/}.

We used the following packages in the benchmarks:

\begin{center}
  \begin{tabular}{c|l}
Package name&Version used\\\hline
\pkg{array}&0.3.0.0\\
\pkg{AvlTree}&4.2\\
\pkg{bytestring-trie}&0.1.4\\
\pkg{containers}&0.3.0.0\\
\pkg{criterion}&0.5.0.0\\
\pkg{deepseq}&1.1.0.0\\
\pkg{EdisonCore}&1.2.1.3\\
\pkg{hashmap}&1.0.0.2\\
\pkg{list-tries}&0.2\\
\pkg{progression}&0.3\\
\pkg{random-access-list}&0.2\\
\pkg{TernaryTrees}&0.1.3.4\\
\pkg{TreeStructures}&0.0.2\\
  \end{tabular}
\end{center}

We also benchmarked internal data structures of GHC compiler. These can be
found in the sources of GHC 6.12.2, namely as files {\tt FiniteMap.hs}
and {\tt UniqFM.hs} in the {\tt compiler/utils} directory.

\begin{figure}
  \begin{center}
    \includegraphics{graphs/hash/union-int.eps}
    \includegraphics{graphs/hash/union-str.eps}
    \includegraphics{graphs/hash/union-bst.eps}
  \end{center}
  \caption{Benchmark of {\tt union} operation on hashset}
  \label{fig:bench-hashset-union}
\end{figure}

%\section{HOWTO}
%What do we want to say:
%\begin{itemize}
%  \item we have benchmarked the current containers package
%    \begin{itemize}
%      \item trees with several existing packages + GHC's one,
%        with emphasis on unions
%      \item {\tt Sequence} with several queues
%    \end{itemize}
%  \item improved the containers performance
%    \begin{itemize}
%      \item mostly Data.Map
%      \item smaller gains in Data.IntMap
%    \end{itemize}
%  \item implemented and benchmarked HashMap
%\end{itemize}
%
%\section*{Howto}
%\begin{itemize}
%  \item here is a problem
%  \item it is interesting
%  \item it is unsolved
%  \item here is the idea
%  \item my idea works
%  \item here is how the idea compares to other peoples approaches
%\end{itemize}
%
%\begin{itemize}
%  \item Abstract (4 sentences)
%    1. State the problem
%    2. Say why it's an interesting problem
%    3. Say what your solution achieves
%    4. Say what follows from your solution
%  \item Introduction (1 page) (.5 page)
%    1. Describe the problem
%    2. State your contributions
%  \item Problem (1 page) (.5 page)
%  \item My idea (2 pages) (1 page)
%  \item The details (5 pages) (2.5 pages)
%Problem 1: describing alternative
%approaches gets between the reader
%and your idea
%
%Problem 2: the reader knows nothing
%about the problem yet; so your (carefully
%trimmed) description of various technical
%trade-offs is absolutely incomprehensible
%
%Concentrate single-mindedly on a narrative that
%\begin{itemize}
%  \item Describes the problem, and why it is interesting
%  \item Describes your idea
%  \item Defends your idea, showing how it solves the problem, and filling out the details
%\end{itemize}
%On the way, cite relevant work in passing, but defer
%discussion to the end
%  \item Related work (1-2 pages) (.5-1 page)
%  \item Conclusions and further work (0.5 pages) (.25-.5 page)
%\end{itemize}

\softraggedright
\bibliographystyle{abbrvnat}
\bibliography{containers}

% The bibliography should be embedded for final submission, like this:
%\begin{thebibliography}{}
%\softraggedright
%
%\bibitem[Smith(1999)]{smith02}
%P. Q. Smith, and X. Y. Jones. ...reference text...
%
%\end{thebibliography}

\end{document}