[Haddock] Help with parsing Haskell modules for documentation

Thu Apr 17 03:16:06 BST 2014

On 16/04/14 19:32, Michael Pankov wrote:
> On Tue, Apr 15, 2014 at 3:47 AM, Mateusz Kowalczyk
> <fuuzetsu at fuuzetsu.co.uk> wrote:
> 
>> On 14/04/14 19:53, Michael Pankov wrote:
>>> Hello everyone,
>>>
>>> I'm working on an experimental tool that integrates with Git and tracks
>>> updates to documentation as the source code is changing. It's early in
>>> development and I'm not ready to show anything yet, but would like to ask
>>> for some help instead.
>>>
>>> On the most basic level, I intend to notify the programmer in case they
>>> change the source code and do not change the documentation comment of
>>> top-level functions. I do understand that this will create a lot of false
>>> positives, and it is quite limited, but that's the first step I want to
>>> take.
>>>
>>> Then, I'm going to try to detect mismatches between the argument lists of
>>> functions in the source and as documented, and notify about that.
>>>
>>> Parsing the module itself already proved to be difficult to do in a
>>> sensible or moderately complete way. I tried to use the Language.Haskell.Exts
>>> parser. But there are cases when you have multiple functions with the same
>>> type signature, don't have any type signature at all, etc.
>>
>> We use the GHC API, although as far as I know Language.Haskell.Exts is able
>> to extract Haddock comments as well. I don't know how well it handles
>> other cases (no signature for example).
> 
> 
> Well, Language.Haskell.Exts is able to parse the module with comments.
> There's parseFileWithComments (
> http://hackage.haskell.org/package/haskell-src-exts-1.15.0/docs/Language-Haskell-Exts.html).
> It returns a module AST and a list of comments.
> 
> But the problem is this: what comment exactly should be considered the
> documentation comment? I mean, there can be bunches of comments with
> newlines between, there also can be multiple functions with same type
> signature (foo, bar :: Int -> Int). There may also be other corner cases.
> 
> It seems Haddock just treats the preceding comment as documentation for
> the declaration that follows. And well, that is probably sensible, I just
> have all these unknowns buzzing in my head and nearly feel overwhelmed by
> the parts I may miss.

When GHC sees comments and is running with a Haddock flag, it will look
for the |, ^, #, * and $ symbols in them and stitch the comments
together. We don't do any of that ourselves. I don't know if HSE
distinguishes between regular comments and Haddock comments; you might
have to do the stitching yourself here.
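
If you do end up stitching by hand, a minimal sketch could look like the
following. It assumes you already have line comments with the leading "--"
stripped and sorted by line number; LineComment, isHaddockStart, blocks and
haddockBlocks are made-up names for illustration, not HSE or GHC API, and
the real GHC lexer rules are more involved than this.

```haskell
import Data.Char (isSpace)

-- Hypothetical representation of a single "--" comment: the line it sits
-- on and its text with the leading "--" removed.
data LineComment = LineComment
  { lcLine :: Int
  , lcText :: String
  } deriving (Eq, Show)

-- A comment opens a Haddock block if its first non-space character is one
-- of the marker symbols listed above.
isHaddockStart :: String -> Bool
isHaddockStart body = case dropWhile isSpace body of
  (c:_) -> c `elem` "|^#*$"
  []    -> False

-- Group comments that sit on consecutive lines into blocks.
-- Assumes the input is sorted by line number.
blocks :: [LineComment] -> [[LineComment]]
blocks = foldr step []
  where
    step c ((d:ds):rest)
      | lcLine d == lcLine c + 1 = (c:d:ds) : rest
    step c rest = [c] : rest

-- Keep only the blocks whose first comment starts with a Haddock marker
-- and join each block into a single string.
haddockBlocks :: [LineComment] -> [String]
haddockBlocks cs =
  [ unlines (map lcText block)
  | block@(first:_) <- blocks cs
  , isHaddockStart (lcText first)
  ]
```

Plain comments that are not adjacent to a marker comment are dropped here,
which mirrors what happens to non-Haddock comments on our side.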

For multiple functions per signature, I believe GHC just attaches the
same comment to each of the functions.
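
To illustrate that behaviour (as I believe it to be, not something I have
checked in GHC's source), here is a small sketch that fans one comment out
to every name in a signature. SigDecl and docMap are hypothetical names
invented for the example; GHC's own declaration types look different.

```haskell
import qualified Data.Map as Map

-- Hypothetical, simplified type signature declaration: one or more names,
-- the type as plain text, and an optional documentation comment.
data SigDecl = SigDecl
  { sigNames :: [String]
  , sigType  :: String
  , sigDoc   :: Maybe String
  }

-- Build a name-to-documentation map, giving every name in a signature
-- such as "foo, bar :: Int -> Int" the same comment. Undocumented
-- signatures contribute no entries.
docMap :: [SigDecl] -> Map.Map String String
docMap decls = Map.fromList
  [ (name, doc)
  | SigDecl names _ (Just doc) <- decls
  , name <- names
  ]
```

Looking up either foo or bar in the resulting map gives back the same
comment text.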

All this is relatively easy to test. I encourage you to put together some
GHC API examples and print out the values you get back from GHC. In
particular, GHC returns declarations with a possible HaddockDoc field (not
the actual type name, I forget what it is now) and we simply check that.
Things that GHC doesn't deem to be Haddock comments are just thrown away.

> 
>> We do not have to worry about
>> any of that stuff as we use GHC itself. While there are many things
>> wrong with Haddock being so attached to GHC, in return we get the
>> ability to do things like ask for type signatures of everything and use
>> that when generating documentation.
>>
>> I don't have much HSE experience but to me it seems that what HSE does
>> and what Haddock needs aren't exactly lined up. All we care about on our
>> end is that we can extract a lot of information about identifiers and
>> documentation that is attached to them. HSE seems like it's intended
>> for source manipulation rather than information extraction but again I'm
>> not experienced with it so I can't say for sure.
>>
> 
> Yes, I think you're right. As I wrote above, HSE creates entire AST and a
> separate list of comments, which is not exactly convenient for a project
> like I intend to develop.

Ah, when I set off to write doccheck (see the dead project on Hackage),
I encountered this problem and I am pretty sure that I was informed that
HSE does not provide functionality to splice the comments back into the
AST or to ask what comment is attached to what. It just seems to me that
HSE does not have such functionality and it's arguable whether it
should. From what I can tell, you might have to extend HSE/write a
wrapper around it or switch to using GHC API.
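
If you go the wrapper route, the span-based matching you describe could be
sketched roughly as below. This assumes you have already reduced the AST
and the comment list to simple line spans; Span, SrcComment, Decl and
attach are hypothetical names for illustration and not HSE's types. The
rule used (comment ends on the line directly before the declaration
starts) is only one possible heuristic.

```haskell
-- Hypothetical line span: first and last line of a source element.
data Span = Span
  { startLine :: Int
  , endLine   :: Int
  } deriving (Eq, Show)

-- Hypothetical comment and declaration records, already extracted from
-- whatever parser is in use.
data SrcComment = SrcComment
  { commentSpan :: Span
  , commentText :: String
  }

data Decl = Decl
  { declName :: String
  , declSpan :: Span
  }

-- Pair every declaration with the first comment that ends on the line
-- immediately before the declaration begins, if there is one.
attach :: [SrcComment] -> [Decl] -> [(String, Maybe String)]
attach comments decls = [ (declName d, commentBefore d) | d <- decls ]
  where
    commentBefore d =
      case [ commentText c
           | c <- comments
           , endLine (commentSpan c) + 1 == startLine (declSpan d)
           ] of
        (t:_) -> Just t
        []    -> Nothing
```

A comment separated from the declaration by a blank line is not attached
under this rule, so you would have to decide how strict you want to be.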

> There's also an annotated AST in HSE (
> http://hackage.haskell.org/package/haskell-src-exts-1.15.0/docs/Language-Haskell-Exts-Annotated.html).
> It stores SrcSpanInfo in the node by default, and it's not quite
> transparent to me how to store anything else (in my case, the corresponding
> comment would be useful).
> 
> 
>>
>> I do think that we could use HSE to achieve some of what we're doing
>> now: I'm actually told that there is a version of Haddock out there that
>> uses HSE instead of GHC API directly although it's an internal project
>> in some company so I did not actually witness it myself.
>>
>>> I started to look into Haddock's source code to see how it handles this
>>> stuff, but it's pretty hard for me to even find the place. To me, it seems
>>> like there should be a map of entities to their comments.
>>
>> You're correct. We ask GHC to do all the heavy lifting with regards to
>> renaming, type-checking and attaching comments.
>>
>>> Maybe someone could point me to the right source files and functions?
>>
>> Hm, it's rather spread out so it's difficult to point to the exact
>> location. More or less how it works is that we parse the flags passed to
>> Haddock, set any GHC flags according to that and ask GHC to rename and
>> type-check things for us. We then get TypecheckedModule (this is a GHC
>> API type) out of it which we further process. Amongst many things,
>> TypecheckedModule contains a list of all declarations &c. All these have a
>> potential Haddock string attached to them.
> 
> 
> Do I understand correctly that GHC matches the Haddock documentation to the
> names by itself?..
> 
> Because in case of HSE you have to bind the comments to names afterwards.
> And the only sensible way to do that seems to be to search for comments
> whose source spans end just before the source span of the entity we're
> interested in (say, function). But in HSE function itself isn't
> represented. There's type binding, there's equation, etc., and when I
> looked at documentation I got the impression that there are several
> possible ways a function can be represented in HSE source tree.
> 
> Is it the same with GHC?

I don't know exactly how GHC joins up comments with the appropriate
declarations, but I can tell you that we don't do it ourselves. We simply
get declarations out of GHC which may or may not have a Haddock
comment attached to them. We don't concern ourselves with any ‘matching
up’.

> 
>> What we do is simply take
>> these declarations, parse a comment and create various maps from Name
>> (GHC type) to ‘Doc a’ (Haddock type). We store this and more information
>> in a file for future invocations (this is what the .haddock files are).
>>
>> I suppose you should be looking at how we work with the GHC API output to
>> produce these interface files. You should be looking at close to
>> everything under the Interface directory as well as how we invoke the
>> functions inside of it. The createInterface function in Create.hs might
>> be a fair starting point even though it's not exactly the smallest
>> function.
>>
>> A small example of GHC API usage is at [1], perhaps it will help you to get
>> started, perhaps it won't. It does show how to go from a filename to a
>> TypecheckedModule though.
>>
> 
> I probably will take a look. I'm still hesitant to rely on GHC, though.

Yes, I fully understand not wanting to rely on GHC; it can be quite
constraining/difficult. In this scenario, however, GHC does the job and
HSE not so much. It's up to you to decide whether to use GHC, extend
HSE to do what you want, or maybe even do something else.

> 
>>
>>> I also think that having a Haddock API would be great and I noticed it's in
>>> quite an incomplete state now.
>>
>> Yes, there are plans for 2.15.x to improve the state of
>> Haddock-as-a-library. Hopefully by GHC 7.10 things will be much nicer.
>>
>>> Using Haddock's API is not my primary
>>> interest, however. I could at least try looking at Haddock's way of
>>> handling the ambiguities.
>>
>> I'm unsure what ambiguities you mean.
> 
> 
> Well, the ones I stated above: ambiguities of the HSE AST. Maybe I'm
> missing something. And maybe it's the other way with GHC.
> 
> Seems it would be great to have a lightweight parser which only gets names,
> types, and comments in a nice map… But surely I won't be able to pull that
> off. :)

Well, it would not be difficult to get out such a map from the values
provided by GHC but to do this without using GHC or some bastardised
subset of its parser might turn out rather difficult.

All in all I think using GHC API would be the easiest to get going but I
think that using HSE is the right way to do it in the end. I look
forward to what you can come up with.

>> Any source-code gets parsed by GHC
>> itself so if your code itself is not ambiguous then the information we
>> get back isn't either. Going the other way, String -> actual identifier,
>> we first ask GHC to parse the identifier (makes sure it's valid) and
>> then we ask it to give us things it knows about in the current
>> environment with that name and then we make a best guess which one is
>> meant. See [2] for an example when GHC folk changed something up and our
>> guess was no longer correct. Also see bugfix commits for the mentioned
>> tickets to actually see the code we use to decide this.
>>
>>> Thanks,
>>
>> Sorry for not being much help. I think your project has the potential to
>> be quite useful.
>>
> 
> Thanks for info and links, will take a look.
> 
> 
>>
>> [1]: https://ghc.haskell.org/trac/ghc/ticket/8945
>> [2]: http://stackoverflow.com/questions/17912567/haddock-link-to-functions-in-non-imported-modules
>>
>> --
>> Mateusz K.
>>

-- 
Mateusz K.