\documentclass[a4paper,12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage[superscript]{cite}
\usepackage{color}

\usepackage{listings}
\definecolor{light-gray}{gray}{0.95}
\lstdefinestyle{normal}{
  basicstyle=\footnotesize\ttfamily,
  numbers=none,
  backgroundcolor=\color{light-gray},
  showspaces=false,
  showstringspaces=false,
  showtabs=false,
  tabsize=2,
  captionpos=b,
  escapechar=¤,
  frame=single,
  framesep=0.2cm,
  xleftmargin=2mm,
  xrightmargin=2mm,
  breaklines=true,
  emph={}
}
\lstset{style=normal}
\lstdefinestyle{haskell}{ language=Haskell }
\lstset{style=haskell}

\title{HQMp3 -- MP3 decoder in Haskell}
\author{Tobias Olausson \\ \texttt{\small{olaussot@student.chalmers.se}} \and 
        Anders Karlsson \\ \texttt{\small{andekar@student.chalmers.se}}
}
\date{ \rule{0.8 \linewidth}{0.5mm} \\[3mm]
       University of Gothenburg \\ 
       \small{\today}
}

\begin{document}
\maketitle

\begin{abstract}
    This project shows how one can parse and decode binary formats in
    Haskell, showcased by implementing an MP3 decoder. The decoder is written
    entirely in Haskell and features its own library for handling binary
    formats as well as a fast huffman decoder.
\end{abstract}

\tableofcontents



\section{Motivation}
    MP3 decoders exist today in most languages; however, there is no fast one
    written in Haskell. There is one written by Björn Edström \cite{bjorn},
    but it is very slow and its bitwise operations are poorly implemented
    (sorry Björn).

\section{Background}
    \subsection{The MP3 format}
       The MP3 format is probably the most well-known and widespread format for
       encoding audio. It was engineered by the MPEG group as part of the MPEG-1
       standard ISO-11172 which was published in 1993. As part of this standard,
       the document describing MP3 is ISO-11172-3 \cite{wikimp3,wikimpeg1}.

       The data in an mp3 file is divided into frames. Each frame consists of
       a header, side info, and main data. We distinguish between physical
       frames and logical frames. Physical frames are all data between two
       headers (unsorted), whereas a logical frame is a header and its
       associated frame data (sorted). The distance in bits between two
       consecutive physical frames is constant once the bitrate and sample
       rate are known. However, there might not be enough data to fill all
       those bits, in which case data for the \textit{next} frame is inserted
       before its frame. This implies that one needs to keep track of two
       consecutive frames at a time in order to know where data begins and
       ends.
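
       For concreteness, the frame length can be computed as follows. This is
       a small illustrative sketch (the function name is ours); the decoder
       performs the same calculation, in bits, when it reads a header.

```haskell
-- Physical frame length in bytes for an MPEG-1 Layer III frame:
-- 144 * bitrate / samplerate, plus one byte when the padding bit is set.
frameLength :: Int -> Int -> Bool -> Int
frameLength bitrate sampleRate padded =
    144 * bitrate `div` sampleRate + (if padded then 1 else 0)
```

       For example, 128 kbit/s audio at 44100 Hz gives frames of 417 bytes,
       or 418 when the padding bit is set.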

       To avoid redundant information we only briefly describe the format here.
       If you feel that there are a lot of details missing, then don't worry --
       we will cover all that in the implementation part.

       \subsubsection{Header}
       \label{sec:header}
            % TODO: figure of the header layout here
            \begin{itemize}
                \item[12 bits] All ones, the sync word.
                \item[3 bits] 101 for MPEG-1 Layer III.
                \item[1 bit] 0 if a 16-bit CRC follows the header.
                \item[4 bits] Bitrate index.
                \item[2 bits] Sample rate index.
                \item[1 bit] 1 if the frame is padded.
                \item[1 bit] Private bit -- not used.
                \item[2 bits] Mode -- Mono/Stereo/DualChannel/JointStereo.
                \item[2 bits] Mode extension -- only used with JointStereo.
                \item[2 bits] Copyright data.
                \item[2 bits] Emphasis.
            \end{itemize}
            
            There is a lot of data here that does not matter for the decoding
            process, and some that does not matter because of limitations in
            our decoder. We restricted our decoder to files not using
            JointStereo, so the Mode Extension bits are not used. All bits
            are, however, checked for correctness. The most important fields
            for decoding are the bitrate index (BT) and the sample rate index
            (SR).

            The decoding process can be divided into two parts: unpacking and
            decoding. For the unpacking, the CRC info and padding data are also
            important.

            One might think that the header would contain all data needed for
            decoding, but that is not the case. In fact, the only data in the
            header that is useful for decoding is SR and BT. Some header data
            is of course useful for unpacking, but unpacking is a rather
            small part of the whole decoding process.
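
            To make the bit layout above concrete, here is a sketch of how
            the two important fields could be extracted from a header packed
            into a 32-bit word, sync word in the most significant bits. The
            function names are ours; the actual decoder instead reads the
            fields sequentially in its BitGet monad.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word32)

-- With the layout above, the bitrate index occupies bits 15..12 of the
-- 32-bit header, and the sample rate index bits 11..10.
bitrateIndex, sampleRateIndex :: Word32 -> Word32
bitrateIndex h    = (h `shiftR` 12) .&. 0xF
sampleRateIndex h = (h `shiftR` 10) .&. 0x3
```

            For the common header 0xFFFB9064 this gives bitrate index 9 and
            sample rate index 0, which for MPEG-1 Layer III correspond to
            128 kbit/s and 44100 Hz.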

       \subsubsection{Side Info}
            The side info is the part that contains most information on how to
            unpack the compressed audio and some information on how to decode
            the data. However, there is some information usable for decoding
            that is \textit{not} included in the side info, for some unknown
            reason. Some things to note in this data are: 
            \begin{itemize}
                \item main\_data\_begin -- a pointer to the audio data.
                \item scfsi bands -- scale factor information for each data
                      part.
                \item part2\_3\_length -- the amount of audio data for each
                      data part.
            \end{itemize}
            The rest of this section contains only decoding-specific data, which
            we will cover more in the implementation part.
       
       \subsubsection{Main Data}
            The main data is further divided into two granules, which may be
            divided further, depending on the number of channels (1 or 2). For
            each of these data parts, the main data contains scale information
            for decoding, and huffman data.
        
        \subsubsection{ID3}
            ID3 tags may be present within an mp3 file. Technically, they may
            appear anywhere in the file, but by convention they are always at
            the beginning. Our decoder ignores ID3 tags; however, a library
            for parsing these tags was written for future use.
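
            Although we do not show the ID3 code here, one detail is worth
            illustrating: ID3v2 stores the tag size as four \textit{syncsafe}
            bytes, of which only the low seven bits are significant, so that
            no false sync word can appear inside the size field. A minimal
            sketch (the helper name is ours):

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Word (Word8)

-- Combine four syncsafe bytes into a tag size: seven significant bits
-- per byte, high bit always zero.
syncsafeSize :: [Word8] -> Int
syncsafeSize = foldl (\acc b -> acc `shiftL` 7 .|. fromIntegral b) 0
```

            For example, the bytes 0x00 0x00 0x02 0x01 encode a tag size of
            $2 \cdot 128 + 1 = 257$.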

\section{Implementation}
    We wanted different parts of the process to be separate in the code, so we
    divided the decoder into two main modules: Unpack.hs and Decode.hs. 

    \subsection{Unpacking}

    \begin{lstlisting}
unpackMp3 :: L.ByteString -> [SideInfo BS.BitString]
unpackMp3 = runBitGet first . BS.convert
    where first = do
            skipId3
            init <- lookAhead (runMaybeT readHeader)
            unpackFrames init
    \end{lstlisting}
    What is going on here? The function expects a lazy ByteString as input
    (the contents of the mp3 file) and produces a list of SideInfo values
    parametrized by BitString. Clearly, there is a lot here that needs to be
    explained!
    
    First, what is the SideInfo type?
    \begin{lstlisting}
data SideInfo a
    = Single { sampleRate  :: !Int
             , dataPointer :: !Int -- 9 bits
             , scales      :: ![Bool]
             , gran        :: !(Granule a, Granule a)
             }
    | Dual   { sampleRate  :: !Int
             , dataPointer :: !Int -- 9 bits
             , scales' :: ![Bool]
             , gran0  :: !(Granule a, Granule a)
             , gran1  :: !(Granule a, Granule a)
             } deriving Show
    \end{lstlisting}
    So, it holds some values (sampleRate, dataPointer), some scale
    information, and two or four Granules. There is a type for headers as
    well, but it is only used as an intermediate step when unpacking, so it
    is of rather little interest.

    What is a Granule then? This is where the good stuff is!
    \begin{lstlisting}
data Granule a = Granule {
    bigValues         :: !Int
  , globalGain        :: !Double
  , scaleFacCompress  :: !Int
  , windowSwitching   :: !Bool
    -- windows
  , blockType       :: !Int
  , mixedBlock      :: !Bool
  , tableSelect_1   :: !Int
  , tableSelect_2   :: !Int

    -- window == 0
  , tableSelect_3   :: !Int

    -- window == 1
  , subBlockGain1   :: !Double
  , subBlockGain2   :: !Double
  , subBlockGain3   :: !Double

  , preFlag           :: !Bool
  , scaleFacScale     :: !Bool
  , count1TableSelect :: !Bool

  -- calculated from previous values
  , region0Start     :: !Int
  , region1Start     :: !Int
  , region2Start     :: !Int

  , mp3Data          :: a
} deriving Show
    \end{lstlisting}
    So, it contains a lot of information on how to decode, and then it
    contains an \texttt{a}. Recall that the result from \texttt{unpackMp3}
    was parametrized by BitString, so the \texttt{mp3Data} in the Granule is
    a BitString in this case. BitString is a library we wrote during the
    project: it works like ByteString but supports bit-level operations,
    hence the name. We will cover BitString in section \ref{sec:bitstring}.
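
    To give a feel for what bit-level operations mean, the idea behind
    BitString can be modeled with a plain list of booleans. This is only a
    toy model with names of our choosing -- the real library operates on
    ByteString chunks for performance.

```haskell
-- A toy model of the BitString idea: take/drop work at bit
-- granularity, unlike ByteString's byte granularity.
type BitSeq = [Bool]

takeBits, dropBits :: Int -> BitSeq -> BitSeq
takeBits = take
dropBits = drop

-- Interpret a bit sequence as a big-endian unsigned integer.
toInt :: BitSeq -> Int
toInt = foldl (\acc b -> acc * 2 + fromEnum b) 0
```

    For example, \texttt{toInt [True, False, True]} is 5.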

    We saw in \texttt{unpackMp3} that it reads one header and then calls
    \texttt{unpackFrames} with that header as an argument.
    \texttt{unpackFrames} simply loops, accumulating frames until there are
    no more frames to read. The interesting function here is
    \texttt{readHeader}, which looks as follows:
    \begin{lstlisting}
readHeader :: MaybeT BitGet (MP3Header Int)
readHeader = do
    h <- lift syncWord
    if not h then fail "" else do
        prot  <- lift $ liftM not getBit
        brate <- getBitRate
        freq  <- getFreq
        padd  <- lift getBit
        lift $ skip 1 -- private bit
        mode  <- getMode
        mext' <- getModeExt
        lift $ skip 4 -- copyright, original & emphasis
        when prot $ lift $ skip 16
        sinfo <- lift $ readSideInfo mode freq
        let size = ((144 * 1000 * brate) `div` freq +
                    if padd then 1 else 0) * 8
            f' = if prot then 2 else 0
            ff = case mode of
                Mono -> 17
                _    -> 32
            hsize = (f' + ff + 4) * 8
            mext = case mode of
                JointStereo -> mext'
                _ -> (False, False)
        return $ MP3Header mode mext size hsize sinfo
    \end{lstlisting}
    Comparing this to the description of an mp3 header in section
    \ref{sec:header}, we see that the only addition is the size variable,
    which holds the length in bits to the next header. This function,
    however, does not read any frame data, and thus does not produce a
    logical frame. To do that, we first need to read the side info and then
    look ahead to the next header in order to determine how much to read.
    The function that reads the data can be found in the source code of
    Unpack.hs; we do not include it here. We do, however, include
    \texttt{readSideInfo}, since it is by far more interesting, even though
    it is long.
    \begin{lstlisting}
readSideInfo :: MP3Mode -> Int -> BitGet (SideInfo Int)
readSideInfo mode freq = do
    dataptr <- getInt 9
    skipPrivate
    scaleFactors <- getScaleFactors
    case mode of
        Mono -> do 
            g0 <- getGranule
            g1 <- getGranule
            return $ Single freq (dataptr * 8) 
                     scaleFactors (g0, g1)
        _    -> do 
            g0 <- getGranule
            g1 <- getGranule
            g2 <- getGranule
            g3 <- getGranule
            return $ Dual freq (dataptr * 8) 
                     scaleFactors (g0, g2) (g1, g3)
  where
    skipPrivate = case mode of
        Mono -> skip 5
        _    -> skip 3
    getScaleFactors = case mode of
        Mono -> replicateM 4 getBit
        _    -> replicateM 8 getBit
    getGranule = do
        part23len        <- getInt 12
        bigValues        <- getInt 9
        globalGain       <- getInt 8
        scaleFacCompress <- getInt 4
        w  <- getBit -- windowSwitching
        -- blockType
        bt <- getInt $ if w then 2 else 0
        -- mixedBlock
        mb <- if w then getBit else return False
        tableSelect0  <- getInt 5
        tableSelect1  <- getInt 5
        tableSelect2  <- if w then return 0 else getInt 5
        subBlockGain0 <- if w then liftM fi (getInt 3) 
                              else return 0
        subBlockGain1 <- if w then liftM fi (getInt 3) 
                              else return 0
        subBlockGain2 <- if w then liftM fi (getInt 3) 
                              else return 0
        region0Count  <- if w then return (reg0 mb bt)
                              else getInt 4
        region1Count  <- if w then return 0 else getInt 3
        let r0count = if w then (if bt == 2 then 8 else 7)
                           else region0Count
            r1count = if w then 20 - r0count
                           else region1Count
            sbTable = tableScaleBandBoundary freq
            r1bound = sbTable $ r0count + 1
            r2bound = sbTable $ r0count + 1 + r1count + 1
            bv2     = bigValues * 2
            reg0len = if bt == 2 then min bv2 36
                                 else min bv2 r1bound
            reg1len = if bt == 2 
                      then min (bv2 - reg0len) 540 
                      else min (bv2 - reg0len) 
                               (r2bound - reg0len) 
            reg2len = if bt == 2 
                      then 0   
                      else bv2 - (reg0len + reg1len)
        preFlag           <- getBit
        scaleFacScale     <- getBit
        count1TableSelect <- getBit
        return $ Granule bigValues 
                 (mp3FloatRep1 globalGain)
                 scaleFacCompress w bt mb 
                 tableSelect0 tableSelect1 tableSelect2
                 (mp3FloatRep2 subBlockGain0)
                 (mp3FloatRep2 subBlockGain1)
                 (mp3FloatRep2 subBlockGain2)
                 preFlag scaleFacScale count1TableSelect
                 reg0len reg1len reg2len part23len
    reg0 False 2 = 8
    reg0 _ _     = 7

    \end{lstlisting}
    A lot of what is done here is \textit{not} mentioned in the standard,
    but is something one has to magically know (or see in somebody else's
    code, like we did). However, this is what a granule looks like, except
    for the audio data, which is read in separately as stated above. Most of
    the fields are of little interest from a programming point of view,
    although all of them are used when decoding. The variables named
    \texttt{reg\{0,1,2\}len} are a bit more interesting, since they control
    the number of times the huffman decoder has to be invoked. They are used
    together with \texttt{tableSelect\{0,1,2\}}, which tells the decoder
    which huffman table was used when compressing the audio data.

    \subsection{Decoding}
    \subsubsection{Huffman}
    We have tried two approaches to huffman decoding: binary trees and
    lookup tables (arrays). The first approach was the more intuitive one,
    in which we read one bit at a time and walked the tree until we had read
    a whole code word and reached a leaf. This was rather elegant, and the
    code for doing so was no more than $\sim$10 lines. On the downside, this
    approach was very slow, and at that time the huffman decoder took almost
    50\% of the total running time. We did some thinking, and looking at
    other people's ideas we found that one could build lookup tables of size
    $2^n$, where $n$ is the length of the longest code word. This was not
    practical, since the longest code word in one of the huffman tables is
    19 bits, so we ended up with a 13MB source code file which was
    impossible to compile. After some further thinking we came up with
    another idea: fix some code word length for a table (say 8); for code
    words that are longer, a new table is returned, and code words that are
    shorter are padded with all possible bit combinations to make them the
    right length. It might be easier to understand with an example!

    \begin{tabular}{| l | l |}
        \hline
        Code Word & Value \\ \hline \hline
        1   & 5 \\ 
        000 & 7 \\
        001 & 3 \\
        010 & 2 \\
        0111  & 4 \\ 
        01101 & 1 \\ 
        01100 & 8 \\ \hline
    \end{tabular} \\
    In this case one might choose a code word length of three, which we will
    use in this example. The lookup table generated from it looks like this:
    
    \begin{tabular}{| l | l |}
        \hline
        Code Word & Value \\ \hline \hline
        100 & 5 \\
        101 & 5 \\
        110 & 5 \\
        111 & 5 \\
        000 & 7 \\
        001 & 3 \\
        010 & 2 \\ \hline
        011 & \begin{tabular}{l | l}
                10 & 4 \\
                11 & 4 \\
                01 & 1 \\
                00 & 8 \\
              \end{tabular} \\ \hline
    \end{tabular} \\
    What happened here is that the code word 1 was expanded to fill all code
    words of length three starting with a one. The words already of length
    three were left as is, but the longer ones shared a common prefix of
    length three, 011. So if one reads 011, a new table is returned, in
    which all code words are expanded to the longest one (two bits in this
    case). Already here we see that this is much smaller than having $2^5$
    entries (almost a three-fold reduction in size). With this approach we
    could read more than one bit at a time: we read the common code word
    length number of bits in one go. This saved time both in reading and by
    replacing the traversal of a data structure with an array lookup.
    Together with some other optimizations, we lowered the share of time
    taken by the huffman decoder to $\sim$8\%.
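
    The table construction and lookup described above can be sketched as
    follows. This is an illustrative reimplementation using association
    lists for clarity, not the decoder's actual huffman module, which uses
    arrays.

```haskell
import Data.List (nub, partition)

-- A decode table maps a fixed-length chunk of bits either directly to
-- a value (together with the code word's true length), or to a
-- sub-table for code words longer than the chunk.
data Entry a = Value a Int | Sub (Table a)
type Table a = [([Bool], Entry a)]

-- Build a table with chunk length n: short code words are padded with
-- all possible suffixes, longer ones are grouped under their n-bit
-- prefix into a sub-table.
buildTable :: Int -> [([Bool], a)] -> Table a
buildTable n codes = shortEntries ++ subEntries
  where
    (short, long) = partition ((<= n) . length . fst) codes
    shortEntries  = [ (cw ++ pad, Value v (length cw))
                    | (cw, v) <- short, pad <- pads (n - length cw) ]
    subEntries    = [ (p, Sub (buildTable m rest))
                    | p <- nub (map (take n . fst) long)
                    , let rest = [ (drop n cw, v)
                                 | (cw, v) <- long, take n cw == p ]
                          m    = maximum (map (length . fst) rest) ]
    pads 0 = [[]]
    pads k = [ b : bs | b <- [False, True], bs <- pads (k - 1) ]

-- Decode one symbol: read a chunk, look it up, and either stop
-- (consuming only the code word's true length) or descend into the
-- sub-table.
decodeOne :: Int -> Table a -> [Bool] -> (a, [Bool])
decodeOne n table bits =
    case lookup (take n bits) table of
        Just (Value v len) -> (v, drop len bits)
        Just (Sub sub)     ->
            let m = length (fst (head sub))
            in  decodeOne m sub (drop n bits)
        Nothing            -> error "invalid huffman code"
```

    With the example table above and chunk length three, decoding the bit
    sequence 1\,000\,01101 yields the values 5, 7 and 1.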

    \subsubsection{Audio Decoding}
        All audio is represented as floating-point numbers. Now, representing
        floats in a binary file format is not very space efficient, so the
        huffman decoder decodes integers instead of floats. These integers
        then have to be \textit{requantized}. \\
        
        At this point we were starting to run low on time. A lot of time had
        gone into parsing the format and huffman decoding, largely due to the
        rather unspecific specification in ISO 11172-3. Therefore, some of
        the parts below will not be described in much detail, basically
        because we did not have time to investigate how they work. Also, we
        will not show much of this code, since we did not write it. \\

        So, the \textit{requantization} is basically a transformation of all
        integers decoded by the huffman decoder into floats. How this is done
        depends on the block type and on window switching: depending on
        those, values obtained from different tables are multiplied by the
        integer from huffman decoding, together with various other more or
        less special values, again depending on block type and window
        switching. \\
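
        The core of this step can be sketched as follows. The $x^{4/3}$
        nonlinearity and the global gain formula $2^{(g - 210)/4}$ are taken
        from ISO 11172-3; the sketch omits the per-band scale-factor and
        pretab adjustments, and the function names are ours.

```haskell
-- Core of requantization: the huffman-decoded integer is raised to the
-- power 4/3, keeping its sign, and scaled by a gain.
requantize :: Double -> Int -> Double
requantize gain x =
    fromIntegral (signum x) * fromIntegral (abs x) ** (4 / 3) * gain

-- global_gain maps to a linear gain as 2^((g - 210) / 4).
globalGainToFloat :: Int -> Double
globalGainToFloat g = 2 ** (fromIntegral (g - 210) / 4)
```

        For instance, a decoded value of 8 with unit gain requantizes to
        $8^{4/3} = 16$. \\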

        The next processing step is the reordering. During encoding, some
        parts of the data may be reordered in order to increase compression;
        the reordering step reverses this so that the actual audio is
        correctly ordered. Whenever the granule only has long blocks, this
        step is skipped. In the code, this step also pads the list of samples
        so that it is always of length 576, for simplicity in the further
        steps. \\
        
        The remaining steps are alias reduction (AliasAddition), the IMDCT,
        frequency inversion (FrequencyInvert), and the synthesis filter
        bank. \\

        To further compress the audio, the encoder passes the data through a
        \textit{Modified Discrete Cosine Transform} (MDCT), which the decoder
        reverses with the inverse transform (IMDCT). As stated in ISO
        11172-3, the inverse transformation for each sample is done as
        follows:
        \[
            x_i = \sum_{k=0}^{\frac{n}{2} - 1} X_k \cos\left(\frac{\pi}{2n}
                  \left(2i + 1 + \frac{n}{2}\right)(2k + 1)\right)
            \qquad \textrm{for } i = 0, \ldots, n - 1
        \]
        In the code that we plugged into, the IMDCT was written as a piece of
        C code, as was the synthesis filter bank. We thought it would be much
        nicer to have that code written in Haskell, so we rewrote both files
        into Haskell versions. For the IMDCT this went extremely smoothly:
        \begin{lstlisting}
-- Straightforward translation from the C code.
imdct :: Int -> [Double] -> [Double]
imdct 18 xs  = imdct18 xs
imdct pts xs = map 
    (\n -> sum $ zipWith (subone n) xs [0..pts-1]) 
    [0..2*pts-1]
  where
    subone :: Int -> Double -> Int -> Double
    subone n y k = y * (cos $ pipts 
                            * (fi n + 0.5 + nhalf) 
                            * (fi k + 0.5)
                       )
    pipts        = pi / (fi pts)
    nhalf        = (fi pts) / 2.0

-- Straightforward translation from the C code, elegant!
imdct18 :: [Double] -> [Double]
imdct18 xs = map (\s -> sumZip (*) xs s) lookupIMDCT
  where
    -- 36x18 matrix
    lookupIMDCT :: [[Double]]
    lookupIMDCT = [[ cos $ (pi / 18.0) * n * k
                   | k <- [0.5, 1.5 .. 17.5]] 
                   | n <- [9.5, 10.5 .. 44.5]]
    
    -- Instead of specialize
    sumZip _ [] _ = 0
    sumZip _ _ [] = 0
    sumZip f (a:as) (b:bs) 
        = let acc = (f a b) 
          in acc `seq` acc + sumZip f as bs
        \end{lstlisting}
        As you can see, there is a special version for when there are only 18
        samples to transform. We have not investigated how that code was
        originally derived, but simply translated it to Haskell, which in our
        opinion went very well. For performance reasons we wrote
        \texttt{sumZip}, which would not otherwise be needed. The result was
        quite slow compared to the C version, but that was not a big concern
        when we first wrote it; the goal was to have a Haskell version first
        and, if there were time, to optimise.

\section{Results}
    Beauty, Slowness\ldots
    \subsection{bitstring}
    \label{sec:bitstring}
        Probably the best library I have ever seen
    \subsection{huffman}
    \label{sec:huffman}
        Probably the most unnecessary library we have ever written. Some
        alternative approaches could be discussed here; Ulf and Koen had
        ideas that we discussed but did not pursue.

\section{Future Work}
    Rewrite the back-end to use vector and optimise further. Add support for
    the modes that are currently not supported. Other implementations of the
    huffman library may also be considered.

\begin{thebibliography}{99}
    \bibitem{bjorn}
        Björn Edström, \textit{Let's build an MP3-decoder!},
        http://blog.bjrn.se/2008/10/lets-build-mp3-decoder.html
    \bibitem{wikimp3}
        Wikipedia on MP3: http://en.wikipedia.org/wiki/Mp3
    \bibitem{wikimpeg1}
        Wikipedia on MPEG-1: http://en.wikipedia.org/wiki/MPEG-1
    \bibitem{standard}
        ISO-11172-3 Information technology -- Coding of moving pictures and
        associated audio for digital storage media at up to about 1,5 Mbit/s --
        Part 3: Audio
\end{thebibliography}

\end{document}
