fullstop 0.1 - ridiculously simple sentence segmentation in Haskell

Eric Kow eric.kow at gmail.com
Wed Mar 3 15:59:24 EST 2010


Dear Haskell NLP people,

I'd like to announce a new sentence segmentation library I've uploaded to
Hackage: fullstop.

In lieu of a description, I present to you a set of test cases that
currently pass:

> testSuite =
>  testGroup "NLP.FullStop"
>   [ testGroup "basic sanity checking"
>       [ testProperty "concat (segment s) == id s, modulo whitespace" prop_segment_concat
>       ]
>   , testGroup "segmentation"
>      [ testCaseSegments "simple"  ["Foo.", "Bar."]   "Foo. Bar."
>      , testCaseSegments "condense"  ["What?!", "Yeah"]   "What?! Yeah"
>      , testCaseSegments "URLs"    ["Check out http://www.example.com.", "OK?"]
>                                    "Check out http://www.example.com. OK?"
>      , testCaseNoSplit "titles"    "Mr. Doe, Mrs. Durand and Dr. Singh"
>      , testCaseNoSplit "initials"  "E. Y. Kow"
>      , testCaseNoSplit "numbers"   "version 2.3.99.2" ] ]
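
The helpers testCaseSegments, testCaseNoSplit and prop_segment_concat are not
shown in this message.  As a rough guess at what they check, here is a
base-only sketch.  The names segmentsTo, noSplit and concatModuloSpace are
made up for illustration, and each takes the segmenter under test as an
argument, so nothing here depends on the library itself.

```haskell
import Data.Char (isSpace)

-- Hypothetical stand-ins for the helpers used in the suite above.
-- Each takes the segmenter under test as its first argument.
type Segmenter = String -> [String]

-- The input should be cut into exactly the expected sentences.
segmentsTo :: Segmenter -> [String] -> String -> Bool
segmentsTo seg expected input = seg input == expected

-- The input should come back as a single, unbroken sentence.
noSplit :: Segmenter -> String -> Bool
noSplit seg input = seg input == [input]

-- Gluing the segments back together should lose nothing but whitespace.
concatModuloSpace :: Segmenter -> String -> Bool
concatModuloSpace seg s = strip (concat (seg s)) == strip s
  where strip = filter (not . isSpace)
```

With the library loaded, noSplit segment "E. Y. Kow" should hold if the
"initials" case above passes.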

The library is extremely simple and stupid.  I'm hoping that somebody here
will be sufficiently offended by it to upload something better in its place.

Here's the whole segmenter:

> import Data.Char (isDigit, isSpace)
> import Data.List (groupBy, isSuffixOf)
> import Data.List.Split
>
> segment = map (dropWhile isSpace) . squish . breakup
> 
> breakup = split
>           . condense       -- "huh?!"
>           . dropFinalBlank -- strings that end with terminator
>           . keepDelimsR    -- we want to preserve terminators
>           $ oneOf stopPunctuation
> 
> stopPunctuation = [ '.', '?', '!' ]
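
To see what breakup hands to the next stage without installing the split
package, the following base-only function approximates it: cut after each run
of terminators and keep the terminators attached to the text on their left.
breakupSketch and isStop are made-up names, and the real split combinators
may behave differently on edge cases such as the empty string.

```haskell
-- Base-only approximation of 'breakup'; 'breakupSketch' and 'isStop'
-- are hypothetical names, not part of fullstop or split.
stopPunctuation :: String
stopPunctuation = ".?!"

isStop :: Char -> Bool
isStop = (`elem` stopPunctuation)

breakupSketch :: String -> [String]
breakupSketch [] = []
breakupSketch s  = (body ++ stops) : breakupSketch rest
  where
    (body, afterBody) = break isStop s         -- text up to the first terminator
    (stops, rest)     = span isStop afterBody  -- a run of terminators, e.g. "?!"
```

In GHCi, breakupSketch "What?! Yeah" gives ["What?!"," Yeah"], and
breakupSketch "version 2.3.99.2" gives ["version 2.","3.","99.","2"].  The
leading spaces and the over-eager splits are what the later stages clean up.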

> squish = squishBy (\_ y -> not (startsWithSpace y))
>        . squishBy (\x _ -> looksLikeAnInitial x)
>        . squishBy (\x _ -> any (`isSuffixOf` x) titles)
>        . squishBy (\x y -> endsWithDigit x  && startsWithDigit y)
>  where
>   looksLikeAnInitial [_,'.'] = True
>   looksLikeAnInitial _ = False
>   --
>   startsW f [] = False
>   startsW f (x:_) = f x
>   --
>   startsWithDigit = startsW isDigit
>   startsWithSpace = startsW isSpace
>   --
>   endsWithDigit xs =
>     case reverse xs of
>      ('.':x:_) -> isDigit x
>      _ -> False
> 
> squishBy f = map concat . groupBy f
> 
> titles :: [String]
> titles = [ "Mr.", "Mrs.", "Dr." ]
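
The squish stage is easiest to follow on concrete pieces.  The sketch below
reuses squishBy as given and wraps two of the four rules in helpers with
made-up names (squishDigits, squishNoSpace).  It assumes groupBy is the one
from Data.List, which tests each candidate against the first element of the
current group, and the input lists in the note that follows are hand-written
guesses at what breakup emits.

```haskell
import Data.Char (isDigit, isSpace)
import Data.List (groupBy)

-- Same definition as in the message, with an explicit type.
squishBy :: (String -> String -> Bool) -> [String] -> [String]
squishBy f = map concat . groupBy f

-- Hypothetical wrapper: rejoin "2." ++ "3." style fragments of a number.
squishDigits :: [String] -> [String]
squishDigits = squishBy (\x y -> endsWithDigit x && startsWithDigit y)
  where
    startsWithDigit (c:_) = isDigit c
    startsWithDigit []    = False
    endsWithDigit xs = case reverse xs of
      ('.':c:_) -> isDigit c
      _         -> False

-- Hypothetical wrapper: rejoin fragments that do not start with a space,
-- such as the pieces of a URL.
squishNoSpace :: [String] -> [String]
squishNoSpace = squishBy (\_ y -> not (startsWithSpace y))
  where
    startsWithSpace (c:_) = isSpace c
    startsWithSpace []    = False
```

In GHCi, squishDigits ["version 2.","3.","99.","2"] gives
["version 2.3.99.2"], and squishNoSpace applied to
["Check out http://www.","example.","com."," OK?"] gives
["Check out http://www.example.com."," OK?"].  The leading space on " OK?"
is then removed by the map (dropWhile isSpace) in segment.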

Enjoy!

PS. This message has a secondary purpose: to remind everybody that this
    mailing list exists and should be put to use ;-) We now have 15 Haskell NLP
    packages on Hackage.  I'm looking forward to somebody combining them
    in clever ways to make something new and fun!

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9