language-c and Unicode

Benedikt Huber benedikt.huber at gmail.com
Mon Jan 13 13:09:36 GMT 2014


On 13.01.2014, at 06:55, Ian Ross wrote:

> Hi Benedikt,
> 
> The changes you made do fix the problem I have, so if you could do a minor revision of language-c that would be great!
> 
> I'm not sure why reading a Unicode string literal didn't work for you: as far as I can tell, the InputStream code in language-c should work with any validly encoded UTF-8 input.  The only problem I could find was these character classes in the lexer that didn't cover the whole range of valid Unicode codepoints.  (The particular problem for C2HS was in the infname class used in CPP line directives because locale-dependent text was being produced in these by the C pre-processor.)
[cc to mailing list]
Hi Ian,
I think the patch gets rid of the parse error, and works fine for filenames.
I also think it is ok to require that the input is UTF-8 encoded (so far, only ASCII was supported).
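
To make the effect of the patch concrete, here is a minimal sketch that models the old and new $infname classes (quoted further down in this thread) as plain Haskell predicates. It is not the lexer itself: oldInFname, newInFname and acceptedBy are hypothetical names used only for illustration, and "Bär.h" is a made-up file name.

```haskell
-- Sketch only: the Alex character classes modelled as Haskell predicates.
-- None of these names exist in language-c; they are assumptions for illustration.

-- Old class:  \ -\127 # [ \\ \" ]
-- (space through code 127, minus backslash and double quote)
oldInFname :: Char -> Bool
oldInFname c = c >= ' ' && c <= '\127' && c `notElem` "\\\""

-- New class:  . # [ \\ \" ]
-- (Alex's dot is any character except newline)
newInFname :: Char -> Bool
newInFname c = c /= '\n' && c `notElem` "\\\""

-- A file name is accepted when every character belongs to the class.
acceptedBy :: (Char -> Bool) -> String -> Bool
acceptedBy = all
```

Loaded in GHCi, acceptedBy oldInFname "Bär.h" is False while acceptedBy newInFname "Bär.h" is True, which matches the parse error going away for non-ASCII names in CPP line directives.
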
However, C string literals are not yet handled correctly.
In a round-trip test,
  char *animal = "Bär"; // i.e., B\xe4r
becomes
  char *animal = "B\303\244r";
I do not have a fix for this problem yet.
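
For reference, \303\244 is exactly the two-byte UTF-8 encoding of ä (U+00E4) written in octal. The sketch below reproduces that output by escaping a string byte by byte. It only illustrates where the numbers come from, under the assumption that the escaping is applied per byte rather than per character; utf8Bytes, escapeByte and bytewiseEscape are hypothetical helpers, not functions from language-c.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Numeric (showOct)

-- UTF-8 encode one code point into its byte values
-- (hand-rolled so the sketch depends only on base).
utf8Bytes :: Char -> [Int]
utf8Bytes c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont 0]
  | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont 6, cont 0]
  | otherwise   = [0xF0 .|. (n `shiftR` 18), cont 12, cont 6, cont 0]
  where
    n = ord c
    -- continuation byte: 10xxxxxx carrying six bits taken from offset s
    cont s = 0x80 .|. ((n `shiftR` s) .&. 0x3F)

-- Keep printable ASCII (except backslash and double quote) as is;
-- write every other byte as a three-digit octal escape.
escapeByte :: Int -> String
escapeByte b
  | b >= 0x20 && b < 0x7F && b `notElem` map ord "\\\"" = [toEnum b]
  | otherwise = '\\' : replicate (3 - length digits) '0' ++ digits
  where
    digits = showOct b ""

-- Escape a whole string one UTF-8 byte at a time.
bytewiseEscape :: String -> String
bytewiseEscape = concatMap (concatMap escapeByte . utf8Bytes)
```

With these definitions, putStrLn (bytewiseEscape "Bär") prints B\303\244r. A faithful round trip would instead have to decode the bytes back into the single character before pretty-printing the literal.
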

cheers, benedikt

> 
> Cheers,
> 
> Ian.
> 
> 
> 
> On 11 January 2014 20:35, Benedikt Huber <benedikt.huber at gmail.com> wrote:
> On 10.01.2014, at 23:17, Ian Ross wrote:
> > Hi Benedikt,
> >
> > I hope you're still the right person to contact about this: your name is on the Hackage page for language-c as the maintainer.  If you're not the person to talk to, I'd be grateful if you could let me know who is.
> >
> > Anyway, I've been figuring out why C2HS doesn't work in some locales (in particular zh_CN.utf8) and I've tracked the problem down to some definitions in the lexer of language-c.  What happens is that C2HS uses CPP to generate some code that ends up having locale-dependent text in it, which is then parsed using language-c.  The locale-dependent text is valid UTF-8 so can be handled by Alex OK, but the lexer definition in language-c is too narrow.  The relevant code is lines 84-86 of Language/C/Parser/Lexer.x.  Currently they say:
> >
> > $instr    = \0-\255 # [ \\ \" \n \r ]       -- valid character in a string literal
> > $anyButNL = \0-\255 # \n
> > $infname  = \ -\127 # [ \\ \" ]             -- valid character in a filename
> >
> > but to deal with all valid UTF-8 files, they should say:
> >
> > $instr    = . # [ \\ \" \n \r ]       -- valid character in a string literal
> > $anyButNL = . # \n
> > $infname  = . # [ \\ \" ]             -- valid character in a filename
> >
> > (the dot is Alex's notation for any Unicode character except newline, so it covers every code point that valid UTF-8 input can contain).
> >
> > Can you think of any problems that would be introduced by this change?  If not, do you think you could make the relevant changes to language-c at some point?
> Hi Ian,
> You are right, at the moment Unicode source code is not supported by Language.C (and neither are \uXXXX escape sequences).
> I do not think the changes that you suggest would introduce any problem.  However, I'm not sure whether this is enough to solve the problem - did you try it out?
> 
> I applied these changes, tried to parse a UTF-8 string literal (with a non-ASCII character in it) and still got a lexical error. I think Language.C.Data.InputStream needs to be fixed (at least), but it does not look like a trivial issue.
> 
> So, I pushed the change you requested to the darcs repo, but I'm not sure whether it solves your problem...
> If it does, I'm happy to publish a minor revision on Hackage.
> 
> cheers, benedikt
> 
> >
> > Thanks!
> >
> > Ian.
> >
> > --
> > Ian Ross   Tel: +43(0)6804451378   ian at skybluetrades.net   www.skybluetrades.net
> 
> 
> 
> 
> -- 
> Ian Ross   Tel: +43(0)6804451378   ian at skybluetrades.net   www.skybluetrades.net



