[Haddock] Help with parsing Haskell modules for documentation

Tue Apr 15 00:47:11 BST 2014

On 14/04/14 19:53, Michael Pankov wrote:
> Hello everyone,
> 
> I'm working on an experimental tool that integrates with Git and tracks
> updates to documentation as the source code is changing. It's early in
> development and I'm not ready to show anything yet, but would like to ask
> for some help instead.
> 
> On the most basic level, I intend to notify the programmer in case they
> change the source code and do not change the documentation comment of
> top-level functions. I do understand that this will create a lot of false
> positives, and it is quite limited, but that's the first step I want to take.
> 
> Then, I'm going to try to detect changes to the argument lists of
> functions, as in the source and as documented, and notify about that.
> 
> Parsing the module itself already proved to be difficult to do in a
> sensible or moderately complete way. I tried to use the
> Language.Haskell.Exts parser. But there are cases when you have multiple
> functions with the same type signature, don't have any type signature at
> all, etc.

We use the GHC API, although as far as I know Language.Haskell.Exts
(haskell-src-exts, HSE) is able to extract Haddock comments as well. I
don't know how well it handles other cases (no signature, for example).
We do not have to worry about any of that as we use GHC itself. While
there are many things wrong with Haddock being so attached to GHC, in
return we get the ability to do things like ask for the type signatures
of everything and use them when generating documentation.

I don't have much HSE experience, but to me it seems that what HSE does
and what Haddock needs aren't exactly lined up. All we care about on our
end is that we can extract a lot of information about identifiers and
the documentation attached to them. HSE seems like it's intended for
source manipulation rather than information extraction, but again, I'm
not experienced with it so I can't say for sure.

I do think that we could use HSE to achieve some of what we're doing
now: I'm told that there is a version of Haddock out there that uses HSE
instead of the GHC API directly, although it's an internal project at
some company, so I have not seen it myself.

> I started to look into Haddock's source code to see how it handles this
> stuff, but it's pretty hard for me to even find the place. To me, it seems
> like there should be a map of entities to their comments.

You're correct. We ask GHC to do all the heavy lifting with regards to
renaming, type-checking and attaching comments.

> Maybe someone could point me to the right source files and functions?

Hm, it's rather spread out, so it's difficult to point to one exact
location. Roughly, it works like this: we parse the flags passed to
Haddock, set any GHC flags accordingly and ask GHC to rename and
type-check things for us. We then get a TypecheckedModule (this is a GHC
API type) out of it, which we process further. Amongst many other things,
TypecheckedModule contains a list of all declarations &c. All of these
have a potential Haddock string attached to them. What we do is simply
take these declarations, parse the comments and create various maps from
Name (GHC type) to ‘Doc a’ (Haddock type). We store this and more
information in a file for future invocations (this is what the .haddock
files are).
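
As a rough illustration of that last step, here is a toy sketch. It is
not Haddock's code: DeclName, Doc, Decl, parseDoc and mkDocMap are all
hypothetical stand-ins (a plain String instead of GHC's Name, a trimmed
string instead of a parsed ‘Doc a’), and the only library assumed is
Data.Map from containers.

```haskell
import           Data.Char       (isSpace)
import           Data.List       (dropWhileEnd)
import qualified Data.Map.Strict as Map

-- Hypothetical stand-ins: Haddock keys its maps on GHC's Name and
-- stores a parsed 'Doc a'; here both are simplified to strings.
type DeclName = String

newtype Doc = Doc String
  deriving (Eq, Show)

-- A declaration paired with the raw comment that may be attached to it.
data Decl = Decl
  { declName   :: DeclName
  , declRawDoc :: Maybe String
  }

-- Stand-in for a real comment parser: it only trims surrounding whitespace.
parseDoc :: String -> Doc
parseDoc = Doc . dropWhileEnd isSpace . dropWhile isSpace

-- Index the parsed comments by declaration name. Declarations without
-- a comment are skipped (the pattern match in the comprehension fails).
mkDocMap :: [Decl] -> Map.Map DeclName Doc
mkDocMap decls =
  Map.fromList [ (name, parseDoc raw) | Decl name (Just raw) <- decls ]
```

For example, mkDocMap [Decl "foo" (Just "  Adds one. "), Decl "bar" Nothing]
gives a map with a single entry for "foo". A tool that tracks
documentation changes could build such a map for two revisions of a
module and compare the entries.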

I suppose you should be looking at how we work with the GHC API output
to produce these interface files. You should look at close to everything
under the Interface directory, as well as how we invoke the functions
inside it. The createInterface function in Create.hs might be a fair
starting point, even though it's not exactly the smallest function.

A small example of GHC API usage is at [1]; perhaps it will help you get
started, perhaps it won't. It does show how to go from a filename to a
TypecheckedModule though.

> I also think that having Haddock API would be great and I noticed it's in
> quite incomplete state now. 

Yes, there are plans for 2.15.x to improve the state of
Haddock-as-a-library. Hopefully by GHC 7.10 things will be much nicer.

> Using Haddock's API is not my primary
> interest, however. I could at least try looking at how Haddock handles
> the ambiguities.

I'm unsure what ambiguities you mean. Any source code gets parsed by GHC
itself, so if your code is not ambiguous then the information we get
back isn't either. Going the other way, String -> actual identifier, we
first ask GHC to parse the identifier (to make sure it's valid), then
ask it for the things it knows about in the current environment with
that name, and then make a best guess as to which one is meant. See [2]
for an example where GHC folk changed something and our guess was no
longer correct. Also see the bugfix commits for the mentioned tickets to
see the code we use to decide this.
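
The String -> identifier direction can be modelled with a toy sketch.
Everything in it is an assumption made for illustration: Namespace,
Candidate, Env, isValidIdent and resolve are hypothetical names, the
validity check only covers plain alphanumeric identifiers (no operators
or qualified names), and the tie-breaking rule (prefer a type over a
value) is just one possible guess, not necessarily the rule Haddock
applies.

```haskell
import           Data.Char  (isAlpha, isAlphaNum)
import qualified Data.Map   as Map
import           Data.Maybe (listToMaybe)

-- Hypothetical model of "things known in the current environment".
data Namespace = ValueNS | TypeNS
  deriving (Eq, Show)

data Candidate = Candidate
  { candNamespace :: Namespace
  , candModule    :: String
  } deriving (Eq, Show)

type Env = Map.Map String [Candidate]

-- Step 1: check that the string looks like an identifier at all.
-- Simplified: a letter or underscore, then letters, digits, '_' or '\''.
isValidIdent :: String -> Bool
isValidIdent []       = False
isValidIdent (c : cs) =
  (isAlpha c || c == '_') && all (\x -> isAlphaNum x || x `elem` "_'") cs

-- Steps 2 and 3: look up every candidate with that name, then guess.
-- The guess here (an assumption): take the first type if there is one,
-- otherwise the first candidate of any kind.
resolve :: Env -> String -> Maybe Candidate
resolve env str
  | not (isValidIdent str) = Nothing
  | otherwise =
      let cands = Map.findWithDefault [] str env
      in case filter ((== TypeNS) . candNamespace) cands of
           (t : _) -> Just t
           []      -> listToMaybe cands
```

A name that exists both as a type and as a data constructor is where
such a guess can go wrong: the lookup returns two candidates and only
the heuristic decides between them.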

> Thanks,

Sorry for not being much help. I think your project has the potential to
be quite useful.

[1]: https://ghc.haskell.org/trac/ghc/ticket/8945
[2]: http://stackoverflow.com/questions/17912567/haddock-link-to-functions-in-non-imported-modules

-- 
Mateusz K.