[[project @ 2001-08-08 09:48:58 by chak] chak**20010808094859 As discussed at HIM in Cambridge, the GHC Commentary is now located in the main repository. The idea is that any developer who wants to share his wisdom with the world-at-large can alter the document. Obviously, it would be nice to keep the structure and style of the document as I have written it so far (admittedly not terribly much). In particular, please add links to the files in CVS where appropriate. Moreover, I have now included a version number in the title, which should be pumped whenever there are substantial changes; check the file index.html. I am planing to set up a cron job that updates the version of the Commentary on my Web page with the CVS version every night (this is also the version linked from GHC's documentation page). ] { adddir ./ghc/docs/comm addfile ./ghc/docs/comm/feedback.html adddir ./ghc/docs/comm/genesis addfile ./ghc/docs/comm/genesis/genesis.html addfile ./ghc/docs/comm/genesis/makefiles.html addfile ./ghc/docs/comm/index.html addfile ./ghc/docs/comm/others.html adddir ./ghc/docs/comm/rts-libs addfile ./ghc/docs/comm/rts-libs/prelfound.html addfile ./ghc/docs/comm/rts-libs/prelude.html addfile ./ghc/docs/comm/rts-libs/primitives.html addfile ./ghc/docs/comm/rts-libs/stgc.html adddir ./ghc/docs/comm/the-beast addfile ./ghc/docs/comm/the-beast/basicTypes.html addfile ./ghc/docs/comm/the-beast/driver.html addfile ./ghc/docs/comm/the-beast/mangler.html addfile ./ghc/docs/comm/the-beast/simplifier.html addfile ./ghc/docs/comm/the-beast/typecheck.html hunk ./ghc/docs/comm/feedback.html 1 + + +
+ ++ I welcome any feedback on the + material and in particular would appreciated comments on which parts of + the document are incomprehensible or miss explanation -- e.g., due to + the use of GHC speak that is explained nowhere (words like infotable or + so). Moreover, I would be interested to know which areas of GHC you + would like to see covered here. +
+ For the moment is probably best if feedback is directed to +
+
+ chak@cse.unsw.edu.au
+
+ + However, if there is sufficient interest, we might consider setting up a + mailing list. + +
+ +Last modified: Wed Aug 8 00:11:42 EST 2001 + + + + hunk ./ghc/docs/comm/genesis/genesis.html 1 + + +
+ +
+ Building GHC happens in two stages: First you have to prepare the tree
+ with make boot
; and second, you build the compiler and
+ associated libraries with make all
. The boot
+ stage builds some tools used during the main build process, generates
+ parsers and other pre-computed source, and finally computes dependency
+ information. There is considerable detail on the build process in GHC's
+ Building Guide.
+
+
+ If you are hacking the compiler or like to play with unstable
+ development versions, chances are that the compiler someday just crashes
+ on you. Then, it is a good idea to load the core
into
+ gdb
as usual, but unfortunately there is usually not too
+ much useful information.
+
+ The next step, then, is somewhat tedious. You should build a compiler
+ producing programs with a runtime system that has debugging turned on
+ and use that to build the crashing compiler. There are many sanity
+ checks in the RTS, which may detect inconsistency before they lead to a
+ crash and you may include more debugging information, which helps
+ gdb.
For a RTS with debugging turned on, add the following
+ to build.mk
(see also the comment in
+ config.mk.in
that you find when searching for
+ GhcRtsHcOpts
):
+
++GhcRtsHcOpts+=-optc-DDEBUG +GhcRtsCcOpts+=-optc-g +EXTRA_LD_OPTS=-lbfd -liberty
+ Then go into fptools/ghc/rts
and make clean boot &&
+ make all
. With the resulting runtime system, you have to re-link
+ the compiler. Go into fptools/ghc/compiler
, delete the
+ file hsc
(up to version 4.08) or
+ ghc-<version>
, and execute make all
.
+
+ The EXTRA_LD_OPTS
are necessary as some of the debugging
+ code uses the BFD library, which in turn requires liberty
.
+ I would also recommend (in 4.11 and from 5.0 upwards) adding these linker
+ options to the files package.conf
and
+ package.conf.inplace
in the directory
+ fptools/ghc/driver/
to the extra_ld_opts
entry
+ of the package RTS
. Otherwise, you have to supply them
+ whenever you compile and link a program with a compiler that uses the
+ debugging RTS for the programs it produces.
+
+ To run GHC up to version 4.08 in gdb
, first invoke the
+ compiler as usual, but pass it the option -v
. This will
+ show you the exact invocation of the compiler proper hsc
.
+ Run hsc
with these options in gdb
. The
+ development version 4.11 and stable releases from 5.0 on do no longer
+ use the Perl driver; so, you can run them directly with gdb.
+
+ Debugging a compiler during building from HC files.
+ If you are boot strapping the compiler on new platform from HC files and
+ it crashes somewhere during the build (e.g., when compiling the
+ libraries), do as explained above, but you may have to re-configure the
+ build system with --enable-hc-boot
before re-making the
+ code in fptools/ghc/driver/
.
+ If you do this with a compiler up to version 4.08, run the build process
+ with make EXTRA_HC_OPTS=-v
to get the exact arguments with
+ which you have to invoke hsc
in gdb
.
+
+
+ +Last modified: Wed Aug 8 19:18:54 EST 2001 + + + + hunk ./ghc/docs/comm/genesis/makefiles.html 1 + + +
+ ++ The size and structure of GHC's makefiles makes it quite easy to scream + out loud - in pain - during the process of tracking down problems in the + make system or when attempting to alter it. GHC's Building + Guide has valuable information on the + makefile architecture. + +
+ The fptools/
toplevel and the various project directories
+ contain not only a Makefile
each, but there are
+ subdirectories of name mk/
at various levels that contain
+ rules, targets, and so on specific to a project - or, in the case of the
+ toplevel, the default rules for the whole system. Each mk/
+ directory contains a file boilerplate.mk
that ties the
+ various other makefiles together. Files called target.mk
,
+ paths.mk
, and suffix.mk
contain make targets,
+ definitions of variables containing paths, and suffix rules,
+ respectively.
+
+ One particularly nasty trick used in this hierarchy of makesfiles is the
+ way in which the variable $(TOP)
is used. AFAIK,
+ $(TOP)
always points to a directory containing an
+ mk/
subdirectory; however, it not necessarily points to the
+ toplevel fptools/
directory. For example, within the GHC
+ subtree, $(TOP)
points to fptools/ghc/
.
+ However, some of the makefiles in fptools/ghc/mk/
will then
+ temporarily redefine $(TOP)
to point a level
+ higher (i.e., to fptools/
) while they are including the
+ toplevel boilerplate. After that $(TOP)
is redefined to
+ whatever value it had before including makfiles from higher up in the
+ hierarchy.
+
+
+ +Last modified: Wed Aug 8 19:19:54 EST 2001 + + + + hunk ./ghc/docs/comm/index.html 1 + + +
+ +
+
+ Manuel M. T. Chakravarty
+
+
+ This document started as a collection of notes describing what I learnt when poking around in + the GHC sources. During the + Haskell Implementers Workshop in January 2001 it was decided to + put the commentary into GHC's CVS repository to allow the whole + developer community to add their wizardly insight to the document. +
+ The document is still in its infancy - help it grow! + +
+
+ ++
+ ++
+ ++
+ +Last modified: Wed Aug 8 00:11:49 EST 2001 + + + + hunk ./ghc/docs/comm/others.html 1 + + +
+ ++ Believe it or not, but there are other people besides you who are + masochistic enough to study the innards of the beast. Some of the have + been kind (or cruel?) enough to share their insights with us. Here is a + probably incomplete list: +
+
+ 1Usually reliable sources have it that + the poor soul in question is no one less than GUM hardcore hacker Hans-Wolfgang Loidl. + +
+ +Last modified: Wed Aug 8 00:47:05 EST 2001 + + + + hunk ./ghc/docs/comm/rts-libs/prelfound.html 1 + + +
+ ++ The standard Haskell Prelude as well as GHC's Prelude extensions are + constructed from GHC's primitives in a + couple of layers. + +
PrelBase.lhs
+ Some most elementary Prelude definitions are collected in PrelBase.lhs
.
+ In particular, it defines the boxed versions of Haskell primitive types
+ - for example, Int
is defined as
+
++data Int = I# Int#+
+ Saying that a boxed integer Int
is formed by applying the
+ data constructor I#
to an unboxed integer of type
+ Int#
. Unboxed types are hardcoded in the compiler and
+ exported together with the primitive
+ operations understood by GHC.
+
+ PrelBase.lhs
similarly defines basic types, such as,
+ boolean values
+
++data Bool = False | True deriving (Eq, Ord)+
+ the unit type +
++data () = ()+
+ and lists +
++data [] a = [] | a : [a]+
+ It also contains instance delarations for these types. In addition,
+ PrelBase.lhs
contains some tricky
+ machinery for efficient list handling.
+
+
+ +Last modified: Wed Aug 8 19:30:18 EST 2001 + + + + hunk ./ghc/docs/comm/rts-libs/prelude.html 1 + + +
+ ++ GHC's uses a many optimsations and GHC specific techniques (unboxed + values, RULES pragmas, and so on) to make the heavily used Prelude code + as fast as possible. + +
+ There is a lot of magic in PrelBase.lhs
-
+ among other things, the RULES
+ pragmas implementing the fold/build
+ optimisation. The code for map
is
+ a good example for how it all works. In the prelude code for version
+ 4.08.1 it reads as follows:
+
++map :: (a -> b) -> [a] -> [b] +map = mapList + +-- Note eta expanded +mapFB :: (elt -> lst -> lst) -> (a -> elt) -> a -> lst -> lst +mapFB c f x ys = c (f x) ys + +mapList :: (a -> b) -> [a] -> [b] +mapList _ [] = [] +mapList f (x:xs) = f x : mapList f xs + +{-# RULES +"map" forall f xs. map f xs = build (\c n -> foldr (mapFB c f) n xs) +"mapFB" forall c f g. mapFB (mapFB c f) g = mapFB c (f.g) +"mapList" forall f. foldr (mapFB (:) f) [] = mapList f + #-}+
+ This code is structured as it is, because the "map" rule first
+ breaks the map open, which exposes it to the various
+ foldr/build rules, and if no foldr/build rule matches, the "mapList"
+ rule closes it again in a later phase of optimisation - after
+ build was inlined. As a consequence, the whole thing depends a bit on
+ the timing of the various optimsations (the map might be closed again
+ before any of the foldr/build rules fires). To make the timing
+ deterministic, build
gets a {-# INLINE 2 build
+ #-}
pragma, which delays build
's inlining, and thus,
+ the closing of the map.
+
+
+ +Last modified: Wed Aug 8 19:31:18 EST 2001 + + + + hunk ./ghc/docs/comm/rts-libs/primitives.html 1 + + +
+ ++ Most user-level Haskell types and functions provided by GHC (in + particular those from the Prelude and GHC's Prelude extensions) are + internally constructed from even more elementary types and functions. + Most notably, GHC understands a notion of unboxed types, which + are the Haskell representation of primitive bit-level integer, float, + etc. types (as opposed to their boxed, heap allocated counterparts) - + cf. "Unboxed + Values as First Class Citizens." + +
+ The hardwired types of GHC are brought into scope by the module
+ PrelGHC
. This modules only exists in the form of a
+ handwritten interface file PrelGHC.hi-boot
,
+ which lists the type and function names, as well as instance
+ declarations. The actually types of these names as well as their
+ implementation is hardwired into GHC. Note that the names in this file
+ are z-encoded, and in particular, identifiers ending on zh
+ denote user-level identifiers ending in a hash mark (#
),
+ which is used to flag unboxed values or functions operating on unboxed
+ values. For example, we have Char#
, ord#
, and
+ so on.
+
+
+ As of (about) the development version 4.11, the types and various
+ properties of primitive operations are defined in the file primops.txt
+ (Personally, I don't think that the .txt
suffix is really
+ appropriate, as the file is used for automatic code generation).
+
+ The utility genprimopcode
+ generates a series of Haskell files from primops.txt
, which
+ encode the types and various properties of the primitive operations as
+ compiler internal data structures. These Haskell files are not complete
+ modules, but program fragments, which are included into compiler modules
+ during the GHC build process. The generated include files can be found
+ in the directory fptools/ghc/compiler/
and carry names
+ matching the pattern primop-*.hs-incl
. They are generate
+ during the execution of the boot
target in the
+ fptools/ghc/
directory. This scheme significantly
+ simplifies the maintenance of primitive operations.
+
+
+ +Last modified: Wed Aug 8 19:29:12 EST 2001 + + + + hunk ./ghc/docs/comm/rts-libs/stgc.html 1 + + +
+ ++ The C code generated by GHC doesn't use higher-level features of C to be + able to control as precisely as possible what code is generated. + Moreover, it uses special features of gcc (such as, first class labels) + to produce more efficient code. +
+ STG C makes ample use of C's macro language to define idioms, which also
+ reduces the size of the generated C code (thus, reducing I/O times).
+ These macros are defined in the C headers located in GHC's includes
+ directory.
+
+
TailCalls.h
+ TailCalls.h
+ defines how tail calls are implemented - and in particular - optimised
+ in GHC generated code. The default case, for an architecture for which
+ GHC is not optimised, is to use the mini interpreter described in the STG paper.
+
+ For supported architectures, various tricks are used to generate
+ assembler implementing proper tail calls. On i386, gcc's first class
+ labels are used to directly jump to a function pointer. Furthermore,
+ markers of the form --- BEGIN ---
and --- END
+ ---
are added to the assembly right after the function prologue
+ and before the epilogue. These markers are used by the Evil Mangler.
+
+
+ +Last modified: Wed Aug 8 19:28:29 EST 2001 + + + + hunk ./ghc/docs/comm/the-beast/basicTypes.html 1 + + +
+ +
+ The directory fptools/ghc/compiler/basicTypes/
+ contains modules that define some of the essential types definition for
+ the compiler - such as, identifiers, variables, modules, and unique
+ names.
+
+
Id
s
+ An Id
(defined in Id.lhs
+ essentially records information about value and data constructor
+ identifiers -- to be precise, in the case of data constructors, two
+ Id
s are used to represent the worker and wrapper functions
+ for the data constructor, respectively. The information maintained in
+ the Id
abstraction includes among other items strictness,
+ occurrence, specialisation, and unfolding information.
+
+ Due to the way Id
s are used for data constructors,
+ all Id
s are represented as variables, which contain a
+ varInfo
field of abstract type IdInfo.IdInfo
.
+ This is where the information about Id
s is really stored.
+ The following is a (currently, partial) list of the various items in an
+ IdInfo
:
+
+
OccInfo
data type is defined in the module BasicTypes.lhs
.
+ Apart from the trivial NoOccInfo
, it distinguishes
+ between variables that do not occur at all (IAmDead
),
+ occur just once (OneOcc
), or a loop breakers
+ (IAmALoopBreaker
).
+ + +Last modified: Wed Aug 8 19:23:01 EST 2001 + + + + hunk ./ghc/docs/comm/the-beast/driver.html 1 + + +
+ ++ The Glorious Driver (GD) is the part of GHC that orchestrates the + interaction of all the other pieces that make up GHC. It supersedes the + Evil Driver (ED), which was a Perl script that served the same + purpose and was in use until version 4.08.1 of GHC. Simon Marlow + eventually slayed the ED and instated the GD. +
+ The GD has been substantially extended for GHCi, i.e., the interactive + variant of GHC that integrates the compiler with a (meta-circular) + interpreter since version 5.00. + +
+ +Last modified: Wed Aug 8 19:22:14 EST 2001 + + + + hunk ./ghc/docs/comm/the-beast/mangler.html 1 + + +
+ +
+ The Evil Mangler (EM) is a Perl script invoked by the Glorious Driver after the C compiler (gcc) has
+ translated the GHC-produced C code into assembly. Consequently, it is
+ only of interest if -fvia-C
is in effect (either explicitly
+ or implicitly).
+
+
+ The EM reads the assembly produced by gcc and re-arranges code blocks as + well as nukes instructions that it considers non-essential. It + derives it evilness from its utterly ad hoc, machine, compiler, and + whatnot dependent design and implementation. More precisely, the EM + performs the following tasks: +
+ The EM is located in the Perl script ghc-asm.lprl
.
+ The script reads the .s
file and chops it up into
+ chunks (that's how they are actually called in the script) that
+ roughly correspond to basic blocks. Each chunk is annotated with an
+ educated guess about what kind of code it contains (e.g., infotable,
+ fast entry point, slow entry point, etc.). The annotations also contain
+ the symbol introducing the chunk of assembly and whether that chunk has
+ already been processed or not.
+
+ The parsing of the input into chunks as well as recognising assembly + instructions that are to be removed or altered is based on a large + number of Perl regular expressions sprinkled over the whole code. These + expressions are rather fragile as they heavily rely on the structure of + the generated code - in fact, they even rely on the right amount of wide + space and thus on the formatting of the assembly. +
+ Afterwards, the chunks are reordered, some of them purged, and some + stripped of some useless instructions. Moreover, some instructions are + manipulated (eg, loads of fast entry points followed by indirect jumps + are replaced by direct jumps to the fast entry point). +
+ The EM knows which part of the code belongs to function prologues and
+ epilogues as STG C adds tags of the
+ form --- BEGIN ---
and --- END ---
the
+ assembler just before and after the code proper of a function starts.
+ It adds these tags using gcc's __asm__
feature.
+
+ Update: Gcc 2.96 upwards performs more aggressive basic
+ block re-ordering and dead code elimination. This seems to make the
+ whole --- END ---
tag business redundant -- in fact, if
+ proper code is generated, no --- END ---
tags survive gcc
+ optimiser.
+
+
+ +Last modified: Wed Aug 8 19:27:22 EST 2001 + + + + hunk ./ghc/docs/comm/the-beast/simplifier.html 1 + + +
+ +
+ Most of the optimising program transformations applied by GHC are
+ performed on an intermediate language called Core, which
+ essentially is a compiler-friendly formulation of rank-2 polymorphic
+ lambda terms defined in the module CoreSyn.lhs
.
+ The transformation engine optimising Core programs is called the
+ Simplifier and composed from a couple of modules located in the
+ directory fptools/ghc/compiler/simplCore/
.
+ The main engine of the simplifier is contained in Simplify.lhs
.
+ and its driver is the routine core2core
in SimplCore.lhs
.
+
+ The program that the simplifier has produced after applying its various
+ optimisations can be obtained by passing the option
+ -ddump-simpl
to GHC. Moreover, the various intermediate
+ stages of the optimisation process is printed when passing
+ -dverbose-core2core
.
+
+
+ The simplification process has to take special care when handling
+ recursive binding groups; otherwise, the compiler might loop.
+ Therefore, the routine reOrderRec
in OccurAnal.lhs
+ computes a set of loop breakers - a set of definitions that
+ together cut any possible loop in the binding group. It marks the
+ identifiers bound by these definitions as loop breakers by enriching
+ their occurence information. Loop
+ breakers will never be inlined by the simplifier; thus,
+ guaranteeing termination of the simplification procedure. (This is not
+ entirely accurate -- see rewrite rules below.)
+
+ The processes finding loop breakers works as follows: First, the
+ strongly connected components (SCC) of the graph representing all
+ function dependencies is computed. Then, each SCC is inspected in turn.
+ If it contains only a single binding (self-recursive function), this is
+ the loop breaker. In case of multiple recursive bindings, the function
+ attempts to select bindings where the decision not to inline them does
+ cause the least harm - in the sense of inhibiting optimisations in the
+ code. This is achieved by considering each binding in turn and awarding
+ a score between 0 and 4, where a lower score means that the
+ function is less useful for inlining - and thus, a better loop breaker.
+ The evaluation of bingings is performed by the function
+ score
locally defined in OccurAnal
.
+
+ Note that, because core programs represent function definitions as
+ one binding choosing between the possibly many equations in the
+ source program with a case
construct, a loop breaker cannot
+ inline any of its possibly many alternatives (not even the non-recursive
+ alternatives).
+
+
+ The application of rewrite rules is controlled in the module Simplify.lhs
+ by the function completeCall
. This function first checks
+ whether it should inline the function applied at the currently inspected
+ call site, then simplifies the arguments, and finally, checks whether
+ any rewrite rule can be applied (and also whether there is a matching
+ specialised version of the applied function). The actual check for rule
+ application is performed by the function Rules.lookupRule
.
+
+ It should be note that the application of rewrite rules is not subject + to the loop breaker check - i.e., rules of loop breakers will be applied + regardless of whether this may cause the simplifier to diverge. + +
+ +Last modified: Wed Aug 8 19:25:33 EST 2001 + + + + hunk ./ghc/docs/comm/the-beast/typecheck.html 1 + + +
+ +
+ Probably the most important phase in the frontend is the type checker,
+ which is located at fptools/ghc/compiler/typecheck/
.
+
+
+ During type checking, GHC maintains a type environment whose
+ details are fixed in TcEnv.lhs
.
+ Among other things, the environment contains all imported and local
+ instances as well as a list of global entities (imported and
+ local types and classes together with imported identifiers) and
+ local entities (locally defined identifiers). This environment
+ is threaded through the type checking monad.
+
+
+ Expressions are type checked by TcExpr.lhs
.
+
+ Usage occurences of identifiers are processed by the function
+ tcId
whose main purpose is to instantiate
+ overloaded identifiers. It essentially calls
+ TcInst.instOverloadedFun
once for each universally
+ quantified set of type constraints. It should be noted that overloaded
+ identifiers are replaced by new names that are first defined in the LIE
+ (Local Instance Environment?) and later promoted into top-level
+ bindings.
+
+
+ GHC implements overloading using so-called dictionaries. A + dictionary is a tuple of functions -- one function for each method in + the class of which the dictionary implements an instance. During type + checking, GHC replaces each type constraint of a function with one + additional argument. At runtime, the extended function gets passed a + matching class dictionary by way of these additional arguments. + Whenever the function needs to call a method of such a class, it simply + extracts it from the dictionary. +
+ This sounds simple enough; however, the actual implementation is a bit
+ more tricky as it wants to keep track of all the instances at which
+ overloaded functions are used in a module. This information is useful
+ to optimise the code. The implementation is the module Inst.lhs
.
+
+ The function instOverloadedFun
is invoked for each
+ overloaded usage occurence of an identifier, where overloaded means that
+ the type of the idendifier contains a non-trivial type constraint. It
+ proceeds in two steps: (1) Allocation of a method instance
+ (newMethodWithGivenTy
) and (2) instantiation of functional
+ dependencies. The former implies allocating a new unique identifier,
+ which replaces the original (overloaded) identifier at the currently
+ type-checked usage occurrence.
+
+ The new identifier (after being threaded through the LIE) eventually
+ will be bound by a top-level binding whose rhs contains a partial
+ application of the original overloaded identifier. This papp applies
+ the overloaded function to the dictionaries needed for the current
+ instance. In GHC lingo, this is called a method. Before
+ becoming a top-level binding, the method is first represented as a value
+ of type Inst.Inst
, which makes it easy to fold multiple
+ instances of the same identifier at the same types into one global
+ definition. (And probably other things, too, which I haven't
+ investigated yet.)
+
+
+ Note: As of 13 January 2001 (wrt. to the code in the
+ CVS HEAD), the above mechanism interferes badly with RULES pragmas
+ defined over overloaded functions. During instantiation, a new name is
+ created for an overloaded function partially applied to the dictionaries
+ needed in a usage position of that function. As the rewrite rule,
+ however, mentions the original overloaded name, it won't fire anymore
+ -- unless later phases remove the intermediate definition again. The
+ latest CVS version of GHC has an option
+ -fno-method-sharing
, which avoids sharing instantiation
+ stubs. This is usually/often/sometimes sufficient to make the rules
+ fire again.
+
+
+ +Last modified: Wed Aug 8 19:24:09 EST 2001 + + + + }