NLP: the missing framework

Aleksandar Dimitrov aleks.dimitrov at gmail.com
Sat Jan 19 16:53:53 GMT 2013


Hello,

(Fair warning, ranty wall o' text ahead.)

> > Just thought you might be interested in Edward Yang's call to arms if you haven't seen it already:
> > 
> > http://blog.ezyang.com/2013/01/nlp-the-missing-framework/
> > 
> > How can we push things a little bit more in the right direction in the Haskell NLP world? What is the right direction?
> 
> Summarized, I think there are currently two major problems:
> 
> * Interoperability between existing components. From different expectations
> about tokenization to different syntactic annotations.

I've been thinking long and hard about the framework question in NLP, and this
really lies at the bottom of the issue: NLP is a very academic subject, and when
it comes to software, my observation is that people in academia suffer badly
from not-invented-here syndrome; they home-bake solutions even where it's
unnecessary, which makes any sort of concerted effort very difficult.

In addition, out of the dozens of papers I've read in the last month, only a
small fraction even made the *software* they used to come up with their results
publicly available. This feels like a disgrace to me; isn't reproducibility the
entire idea behind academic and scientific research?
But it seems few people are interested in *real* reproducibility, i.e. here's
the data, there's the program (with source!), and you can run it like so in
order to come up with these results.

> * High-quality, annotated data is often not available under a permissive license.

The problem with high-quality data is its high-quality source, which is often
*not* under a permissive license. Web and newspaper corpora are unacceptable,
because they can *never* be redistributed freely. I think that annotating large
amounts of CC-licensed data (Wikipedia, Project Gutenberg) instead might be the
way to go.

And no, bandwidth costs are not an obstacle to free availability; that's what
we have torrents for.

> Some projects, such as OpenNLP and NLTK, aim to provide what the blog post
> asks for: ready-to-use, pre-trained NLP components. However, the blog post
> doesn't really lay out why these frameworks are not acceptable.

They might not be acceptable because the web app or other thing you're writing
isn't in Python or Java. They might not be acceptable because the quality of
their tool chains isn't good enough for you. Etc.

To me, they are not acceptable because (last time I looked) they were still
producing in-line destructive annotations of textual data.

We're in the 21st century. Linguistic analysis produces *metadata*, and we
should *know* by now that you don't intersperse metadata with the data itself.
Just don't do it. This/DT format/NN is/VBZ not/RB acceptable/JJ! It makes NLP
tools bulky and cumbersome to use and integrate.
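
To make this concrete, here is a minimal sketch of what stand-off annotation
could look like in Haskell. Every name in it (Span, Annotation, covered) is made
up for illustration; the point is only that the source text stays untouched and
each annotation layer refers to it by character offsets:

    {-# LANGUAGE OverloadedStrings #-}

    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO

    -- Half-open character span [start, end) into the untouched source text.
    data Span = Span { spanStart :: Int, spanEnd :: Int }
      deriving (Show, Eq)

    -- One stand-off annotation: a label attached to a span, e.g. a POS tag.
    data Annotation = Annotation { annSpan :: Span, annLabel :: T.Text }
      deriving (Show, Eq)

    -- Recover the surface string an annotation points at.
    covered :: T.Text -> Annotation -> T.Text
    covered txt (Annotation (Span s e) _) = T.take (e - s) (T.drop s txt)

    example :: (T.Text, [Annotation])
    example =
      ( "This format is not acceptable!"
      , [ Annotation (Span 0 4)   "DT"
        , Annotation (Span 5 11)  "NN"
        , Annotation (Span 12 14) "VBZ"
        , Annotation (Span 15 18) "RB"
        , Annotation (Span 19 29) "JJ"
        ] )

    main :: IO ()
    main =
      let (txt, anns) = example
      in mapM_ (\a -> TIO.putStrLn (covered txt a <> "/" <> annLabel a)) anns

Layers for tokens, POS tags, parses and so on can then be added, dropped or
exchanged without ever touching the underlying text.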

So yes, I think the guys behind UIMA had a wonderful idea. However, UIMA is
kind of bulky and, as you said, difficult to install and maintain, and it all
but forces you to use Java, which not everybody wants to do. In addition, I
find UIMA's bolt-on type system cumbersome, and the API is… weird, mostly
because it is really difficult to write a good abstract general-purpose API in
Java in the first place (or maybe that's just me).

The Haskell ecosystem lacks such a solution completely. I don't know whether a
C solution with Haskell bindings would be better. I like your idea of C libs,
because they are portable: not platform-portable (that's trivial nowadays), but
*language*-portable. But I don't think it would catch on. Most FFIs suck. There
are some good ones (Lua comes to mind, and the Haskell one is also good), but
few people can be bothered to use them. Using their favorite language's FFI is
most definitely not in the standard skill set of an academic NLP researcher, so
they just won't use your C libs. And almost *nobody* is doing NLP in plain C.
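
For what it's worth, the binding itself is short. Here is a sketch of calling a
hypothetical C tokenizer from Haskell via the FFI; the C function name and its
contract are made up, this is only meant to show the shape of such a binding:

    {-# LANGUAGE ForeignFunctionInterface #-}

    import Foreign.C.String (CString, peekCString, withCString)

    -- Assumed (hypothetical) C function: char *nlp_tokenize(const char *input);
    -- it returns a newline-separated list of tokens (memory ownership elided).
    foreign import ccall unsafe "nlp_tokenize"
      c_tokenize :: CString -> IO CString

    tokenize :: String -> IO [String]
    tokenize input =
      withCString input $ \cInput -> do
        cOut <- c_tokenize cInput
        lines <$> peekCString cOut

A dozen lines, and still more than most researchers will ever write around
somebody else's C code.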

Haskell itself has its own problems. IO is one: you have to use a high-level IO
abstraction like iteratees or proxies in order to process NLP-sized amounts of
data in Haskell, which is another barrier to entry. I'm debating writing some
sort of general-purpose metadata annotation and processing framework in
Haskell, but I'm not quite sure how to do it, or whether there would even be
demand for it.
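
To sketch what I have in mind for the streaming part, here is roughly what
constant-memory, line-by-line corpus processing looks like with a recent
version of the conduit library; annotateLine is just a stand-in for whatever
per-line annotation step you would actually run:

    import Conduit
    import qualified Data.Text as T

    -- Stand-in for a real per-line annotation step (tokenizer, tagger, ...).
    annotateLine :: T.Text -> T.Text
    annotateLine = T.toUpper

    main :: IO ()
    main = runConduitRes
         $ sourceFile "corpus.txt"
        .| decodeUtf8C
        .| linesUnboundedC
        .| mapC annotateLine
        .| unlinesC
        .| encodeUtf8C
        .| sinkFile "corpus.annotated.txt"

The file never has to fit into memory, but you do have to learn the abstraction
first, which is exactly the barrier to entry I mean.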

Thanks for reading,
Aleks