Working with syntax trees

Fri Oct 28 15:51:48 BST 2011

On Fri, Oct 28, 2011 at 16:36, dokondr <dokondr at gmail.com> wrote:
> Please advise on Haskell libraries and data types to do the following work:
> I need to compare both structure and node contents of two graphs, find
> similar sub-graphs, and need some metric to measure distance between two
> graphs. These graphs are produced by POS (part of speech) sentence tagging.
> POS tagging is done by Standford statistical parser:
> http://nlp.stanford.edu/software/lex-parser.shtml

I think you mean syntactic parsing, not POS tagging.

>
> For example in the graph G1:
> (ROOT
>   (S
>     (NP (DT The) (NN voice) (NN quality))
>     (VP (VBD was)
>       (ADJP (JJ clear) (RB too)))
>     (. .)))
>
> G1 has an NP node (lets call this sub-graph SG1)  which has three child leaf
> nodes: (DT The) (NN voice) (NN quality), where
> DT, NN - nodes names, and
> "The", "voice", "quality" - corresponding leaf values of these nodes.
>
> I need to compare this graph with graph G2:
> (ROOT
>   (S
>     (SBAR (IN Although)
>       (S
>         (NP (DT the) (NN battery) (NN life))
>         (VP (VBD was) (RB not)
>           (ADJP (JJ long)))))
>     (, ,)
>     (NP (DT that))
>     (VP (VBZ is)
>       (VP (VBN ok)
>         (PP (IN for)
>           (NP (PRP me)))))
>     (. .)))
>
> G2 also has the NP node (let's call it sub-graph SG2)  which also has three
> child leaf nodes: (DT the) (NN battery) (NN life))
> I need to find that G1 and G2 has sub-graphs SG1 and SG2 with the similar
> structure, but with different values of the leaf nodes.
> I also need to devise some general metric that will allow me to estimate
> distance between any two graphs. This distance should account both for
> structural and leaf-node values similarity.
>
> It would be easier to measure distance between vectors then graphs. So I am
> thinking how to convert directed graph (that results from POS tagging) into
> vector. Any ideas, links here?

Your graphs seem to be Penn-treebank-style trees. Eric has a parser
for this format on hackage:
http://hackage.haskell.org/package/penn-treebank

As to comparing trees and measuring distance between them, it's hard
to give specific advice without knowing what you're trying to achieve.
But it looks like some version of a tree kernel would do what you
need. Roughly speaking, tree kernels count the number of common
subtrees between two trees. For the definition of the original tree
kernel see Collins and Duffy 2002, Convolution kernels for natural
language, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.8830&rep=rep1&type=pdf

Hope this helps,
--
Grzegorz