LCS timings

Ian Lynagh igloo at earth.li
Sun Jan 18 06:53:47 EST 2009


Hi all,

Over xmas I did a quick Myers implementation, and also wrote some
performance testing stuff to compare:
* myers
* hunt (Hunt-Szymanski, as in the lcs package)
* mc (the same as myers, except it only works out the length of the LCS,
      not what's in it. myers doesn't find the actual LCS particularly
      efficiently)
* darcs

This work isn't finished, but given I don't expect to work on it again
for a while, and seeing as someone was asking about this, here's the
current status and code.

First, the output:

Running timings
     myers      hunt        mc     darcs
         1         1         1        39  empty
         1         1         1        39  singleton
         8       719         9        98  All A
         9        35         9       117  Both A1...
      6544        26      1562       134  All A and all B
      6483        31      1638      2859  A1... and B1...
      6859        30      1620      3485  A1... and ...A1
      8638      1134      2253       327  A,B.. and A,C..
      8530        85      2281      6545  A1,B1.. and A1,C1..

The numbers are times in milliseconds (lowest of 3 runs). There's some
noise in some of these numbers (e.g. in another run hunts peaks were 660
and 784), but they're good enough.

The tests are just patterns of 4000 or 8000 lines, e.g. the most
complicated "A1,B1.. and A1,C1.." test is the LCS of
    ["A1","B1","A2","B2","A3","B3", ..., "B3999","A4000","B4000"] and
    ["A1","C1","A2","C2","A3","C3", ..., "C3999","A4000","C4000"]

More tests of interesting cases welcome. One I'd really like is the
real-world case that motivated darcs to switch algorithm - I think it
was a configure script. I had a quick look online, but couldn't find it.

The code is:
* All.hs            Defines the tests, and runs them all 3 times each way
* DoTiming.hs       Times how long the tests take (clock time). Checks
                    that they give a right answer (doesn't bother in the
                    case of darcs, as it doesn't return an LCS directly,
                    and ni the case of mc it only checks that the length
                    is right)
* TimeMe.hs         Actually does the LCS in the appropriate way
* Myers.hs          The myers implementation
* HuntSzymanski.hs  The hunt implementation
* MyersCheat.hs     The mc implementation

And run with:
    $GHC --make -O -Wall -Werror TimeMe
    $GHC --make -O -Wall -Werror DoTiming
    $GHC --make -O -Wall -Werror All
    echo Running timings
    ./All

Some things to note:

* the darcs test actually runs the darcs program on a repo, so it has
  higher startup overhead, as evidenced by the "empty" and "singleton"
  tests. Plus it was compiled with a different compiler, different flags
  etc.

* I believe darcs is hashing all the lines in each file and, if a hash
  appears in one file but not the other, removing the lines with that
  hash. This turns all of the tests except for "A1... and ...A1" into
  comparing two identical sequences, which is trivial for Myers (which
  darcs uses). The large runtime in other cases is presumably the time
  taken to hash everything, and to find out which hashes are only in one
  file.

* The lines are all very short, so some of the benefits of darcs's
  comparing hashes, rather than strings, when lcs'ing may be being
  hidden.

TODO:

* Make a HashFirst lcs-transformer, so myers, hunt and mc can be tested
  with and without darcs's pre-LCS hashing behaviour.
* Write a decent lcs-finder for the Myers implementation.
* Review/profile the hunt and myers code, to see if anything silly is
  happening.


Thanks
Ian

-------------- next part --------------
A non-text attachment was scrubbed...
Name: All.hs
Type: text/x-haskell
Size: 5778 bytes
Desc: not available
Url : http://projects.haskell.org/pipermail/camp/attachments/20090118/77c5b0c5/attachment-0006.hs 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DoTiming.hs
Type: text/x-haskell
Size: 2200 bytes
Desc: not available
Url : http://projects.haskell.org/pipermail/camp/attachments/20090118/77c5b0c5/attachment-0007.hs 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: TimeMe.hs
Type: text/x-haskell
Size: 1436 bytes
Desc: not available
Url : http://projects.haskell.org/pipermail/camp/attachments/20090118/77c5b0c5/attachment-0008.hs 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Myers.hs
Type: text/x-haskell
Size: 4170 bytes
Desc: not available
Url : http://projects.haskell.org/pipermail/camp/attachments/20090118/77c5b0c5/attachment-0009.hs 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: HuntSzymanski.hs
Type: text/x-haskell
Size: 8195 bytes
Desc: not available
Url : http://projects.haskell.org/pipermail/camp/attachments/20090118/77c5b0c5/attachment-0010.hs 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: MyersCheat.hs
Type: text/x-haskell
Size: 4186 bytes
Desc: not available
Url : http://projects.haskell.org/pipermail/camp/attachments/20090118/77c5b0c5/attachment-0011.hs 


More information about the Camp mailing list