fullstop 0.1 - ridiculously simple sentence segmentation in Haskell

Eric Kow eric.kow at gmail.com
Wed Mar 3 10:58:17 EST 2010


[Resending due to technical problems; sorry for duplicates]

Dear Haskell NLP people,

I'd like to announce a new sentence segmentation library I've uploaded to
Hackage: fullstop.

In lieu of a description, I present to you a set of test cases that
currently pass:

> testSuite =
>  testGroup "NLP.FullStop"
>   [ testGroup "basic sanity checking"
>       [ testProperty "concat (segment s) == id s, modulo whitespace" prop_segment_concat
>       ]
>   , testGroup "segmentation"
>      [ testCaseSegments "simple"  ["Foo.", "Bar."]   "Foo. Bar."
>      , testCaseSegments "condense"  ["What?!", "Yeah"]   "What?! Yeah"
>      , testCaseSegments "URLs"    ["Check out http://www.example.com.", "OK?"]
>                                    "Check out http://www.example.com. OK?"
>      , testCaseNoSplit "titles"    "Mr. Doe, Mrs. Durand and Dr. Singh"
>      , testCaseNoSplit "initials"  "E. Y. Kow"
>      , testCaseNoSplit "numbers"   "version 2.3.99.2" ] ]
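
One way to read the first property above, as a rough sketch only.  The real
prop_segment_concat is not shown in this message, so the name
propSegmentConcat, its signature, and the choice to pass the segmenter in as
an argument are all assumptions made for illustration:

```haskell
import Data.Char (isSpace)

-- Hypothetical stand-in for prop_segment_concat (the real one is not
-- shown in this message): gluing the segments back together should give
-- the input back, once whitespace is ignored on both sides.
propSegmentConcat :: (String -> [String]) -> String -> Bool
propSegmentConcat seg s = squash (concat (seg s)) == squash s
  where
    -- drop every whitespace character before comparing
    squash = filter (not . isSpace)
```

Any function that only moves whitespace around satisfies it, for instance
propSegmentConcat words "Foo. Bar."; a segmenter that loses or invents
characters does not.  Under QuickCheck one would presumably check
propSegmentConcat segment.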

The library is extremely simple and stupid.  I'm hoping that somebody here
will be sufficiently offended by it to upload something better in its place.

Here's the whole segmenter.  As you can see, it works by aggressively
ripping the text apart and then gluing back together the pieces that
aren't actually separate sentences:

> import Data.Char (isDigit, isSpace)
> import Data.List (groupBy, isSuffixOf)
> import Data.List.Split
>
> segment = map (dropWhile isSpace) . squish . breakup
>
> breakup = split
>           . condense       -- "huh?!"
>           . dropFinalBlank -- strings that end with terminator
>           . keepDelimsR    -- we want to preserve terminators
>           $ oneOf stopPunctuation
> 
> stopPunctuation = [ '.', '?', '!' ]
>
> squish = squishBy (\_ y -> not (startsWithSpace y))
>        . squishBy (\x _ -> looksLikeAnInitial x)
>        . squishBy (\x _ -> any (`isSuffixOf` x) titles)
>        . squishBy (\x y -> endsWithDigit x  && startsWithDigit y)
>  where
>   looksLikeAnInitial [_,'.'] = True
>   looksLikeAnInitial _ = False
>   --
>   startsW f [] = False
>   startsW f (x:_) = f x
>   --
>   startsWithDigit = startsW isDigit
>   startsWithSpace = startsW isSpace
>   --
>   endsWithDigit xs =
>     case reverse xs of
>      ('.':x:_) -> isDigit x
>      _ -> False
> 
> squishBy f = map concat . groupBy f
> 
> titles = [ "Mr.", "Mrs.", "Dr." ]
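
To see the gluing step in isolation, here is a minimal sketch that uses only
base.  It skips breakup (so it does not need the split package) and starts
from pieces written out by hand, the way breakup should leave them.
glueDigits, glueNoSpace, versionPieces and urlPieces are hypothetical names
for illustration, not part of the library:

```haskell
import Data.Char (isDigit, isSpace)
import Data.List (groupBy)

-- Same definition as in the segmenter: concatenate every run of pieces
-- that groupBy places in the same group.
squishBy :: (String -> String -> Bool) -> [String] -> [String]
squishBy f = map concat . groupBy f

-- Hypothetical name: glue "2." onto "3." so version numbers survive.
glueDigits :: [String] -> [String]
glueDigits = squishBy (\x y -> endsWithDigit x && startsWithDigit y)
  where
    startsWithDigit = any isDigit . take 1
    endsWithDigit xs = case reverse xs of
      ('.':d:_) -> isDigit d
      _         -> False

-- Hypothetical name: glue a piece onto the current group when it does
-- not start with whitespace (this is what keeps URLs in one piece).
glueNoSpace :: [String] -> [String]
glueNoSpace = squishBy (\_ y -> not (any isSpace (take 1 y)))

-- Hand-written pieces for "version 2.3.99.2".
versionPieces :: [String]
versionPieces = ["version 2.", "3.", "99.", "2"]

-- Hand-written pieces for the URL test case.
urlPieces :: [String]
urlPieces = ["Check out http://www.", "example.", "com.", " OK?"]
```

Here glueDigits versionPieces gives ["version 2.3.99.2"], and
glueNoSpace urlPieces gives ["Check out http://www.example.com.", " OK?"];
segment then strips the leading space.  Note that groupBy tests each piece
against the first piece of the current group rather than its immediate
neighbour, so x in these predicates is always the head of the group.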

Enjoy!

PS. The secondary purpose of this message is to remind everybody that the
    list exists and ought to be put to use ;-)
    
    We now have 15 Haskell NLP packages on Hackage.  I'm looking forward
    to somebody combining them in clever ways to make something new and fun!

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9