Turkish I issue #110

ertant · 2018-04-30T13:26:19Z

Hi,

I think there is a Turkish-I problem in the Pegasus/Compiler/CodeGenerator/Grammar.weave file, at line 239.

When a case-insensitive character class such as "[a-z]i" is used, the string comparison is done with CurrentCultureIgnoreCase, and the parser cannot match anything when the character "I" appears in the input. It should be InvariantCultureIgnoreCase to solve this problem. I have worked around it with "[a-zA-Z]i" for now. It may be best to let callers pass a StringComparison value to the parser to cover other scenarios.

You can find details at https://blog.codinghorror.com/whats-wrong-with-turkey/

otac0n · 2018-05-27T19:21:13Z

So, there is actually an inconsistency in Pegasus' handling of the ignore case flag.

For strings, this logic is used:

if (ignoreCase ? substr.Equals(literal, StringComparison.OrdinalIgnoreCase) : substr == literal)

Whereas for character ranges, this logic is used:

match = (char.IsUpper(o) || char.IsLower(o)) && cs.Equals(o.ToString(), StringComparison.CurrentCultureIgnoreCase);

It sounds like we need to make this consistent and configurable, yes? Do you feel that it is likely that any given parser would need to specify Current Culture, Invariant Culture, and Ordinal in different spots in their parser?

It is important to note that changing these defaults will require a major version bump, as it is a breaking change.
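
To make the inconsistency concrete, here is a minimal sketch (not Pegasus code; the class and method names are made up for illustration) that compares "I" with "i" under the three StringComparison modes while the current culture is switched. It assumes a runtime with Turkish culture data available. With tr-TR as the current culture, the CurrentCultureIgnoreCase result is expected to differ from the other two, which is the behavior reported above.

```csharp
using System;
using System.Globalization;

public static class TurkishIDemo
{
    // Compares "I" with "i" under the three comparison modes discussed in this
    // thread, with the current culture temporarily set to `cultureName`.
    public static (bool Ordinal, bool Invariant, bool Current) CompareUnder(string cultureName)
    {
        var saved = CultureInfo.CurrentCulture;
        try
        {
            CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo(cultureName);
            return (
                "I".Equals("i", StringComparison.OrdinalIgnoreCase),
                "I".Equals("i", StringComparison.InvariantCultureIgnoreCase),
                "I".Equals("i", StringComparison.CurrentCultureIgnoreCase));
        }
        finally
        {
            // Restore the caller's culture so the demo has no side effects.
            CultureInfo.CurrentCulture = saved;
        }
    }

    public static void Main()
    {
        foreach (var name in new[] { "en-US", "tr-TR" })
        {
            var r = CompareUnder(name);
            Console.WriteLine($"{name}: ordinal={r.Ordinal}, invariant={r.Invariant}, current={r.Current}");
        }
    }
}
```

The ordinal and invariant results do not depend on the current culture, so only the third column should change between the two runs.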

ertant · 2018-05-27T23:02:28Z

Yes, I think the handling of strings and character ranges is a bit inconsistent.

Also, in the same file at line 228, there is a character range comparison like this:

match = c >= characterRanges[i] && c <= characterRanges[i + 1];

This also needs to be culture-aware. I think it is currently comparing by the current culture implicitly.
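
A note on the range check quoted above: in C#, the >= and <= operators on char compare UTF-16 code unit values, so that comparison is ordinal and does not consult the current culture. The sketch below (the helper name InRange is hypothetical and is not taken from Grammar.weave) shows why letters such as ı and Ş fall outside [a-z] and [A-Z] under such a check.

```csharp
using System;

public static class CharRangeDemo
{
    // Mirrors the shape of the quoted range check: an inclusive comparison
    // of char values, which C# evaluates numerically.
    public static bool InRange(char c, char low, char high) => c >= low && c <= high;

    public static void Main()
    {
        Console.WriteLine(InRange('i', 'a', 'z'));      // True: U+0069 lies between U+0061 and U+007A
        Console.WriteLine(InRange('\u0131', 'a', 'z')); // False: dotless ı is U+0131, above 'z'
        Console.WriteLine(InRange('\u015E', 'A', 'Z')); // False: Ş is U+015E, above 'Z'
    }
}
```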

To answer your question, it may be best to give an example. I'm using Pegasus to parse SQL-like expressions.

For an aggregation formula, I'm using:

fn_countdist = "CountDistinct"i

This should be invariant, because typing the uppercase "COUNTDISTINCT" fails: "I" != "i" in the current culture.

For table or column identifiers, I used:

name <string> = ([a-z]i+[a-z_0-9]i*)

This should be culture-aware, because column names may include non-English characters, for example "AlıcıŞube".

Maybe it's best to pass an optional culture parameter, defaulting to the current culture, and to redefine the "i" suffix as:

"regex"i as invariant
"regex"ic as culture aware

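One possible shape for such a configurable comparison, sketched under assumptions: MatchLiteral is a hypothetical helper and not part of Pegasus, and the mapping of the "i" and "ic" suffixes to StringComparison values is only the proposal above restated in code.

```csharp
using System;

public static class ConfigurableMatchSketch
{
    // Hypothetical helper, not a Pegasus API: tests whether `literal` occurs in
    // `subject` at `index`, using whichever comparison the grammar asked for.
    public static bool MatchLiteral(string subject, int index, string literal, StringComparison comparison)
    {
        if (index < 0 || index + literal.Length > subject.Length)
        {
            return false;
        }

        return string.Compare(subject, index, literal, 0, literal.Length, comparison) == 0;
    }

    public static void Main()
    {
        const string input = "SELECT COUNTDISTINCT(x)";

        // Assumed mapping: "..."i  -> InvariantCultureIgnoreCase
        //                  "..."ic -> CurrentCultureIgnoreCase
        Console.WriteLine(MatchLiteral(input, 7, "CountDistinct", StringComparison.InvariantCultureIgnoreCase));
        Console.WriteLine(MatchLiteral(input, 7, "CountDistinct", StringComparison.CurrentCultureIgnoreCase));
    }
}
```

The second result depends on the current culture of the machine running it, which is exactly the distinction the two suffixes would expose.
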
otac0n · 2019-04-06T02:31:13Z

A workaround for Pegasus 4.1 is to use a more specific lexical structure as C# does, e.g. using char.IsLetter:

name = (letter letterOrDigit*);
letter = c:. &{ char.IsLetter(c[0]) };
letterOrDigit = c:. &{ char.IsLetterOrDigit(c[0]) };
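
For reference, the predicates used in that workaround can be tried outside a grammar. This is a rough sketch, with IsName as a hypothetical helper that applies the same two checks to a whole string. One difference worth noting: char.IsLetterOrDigit does not accept the underscore, which the original [a-z_0-9] class did allow.

```csharp
using System;
using System.Linq;

public static class NameRuleDemo
{
    // Hypothetical helper: first character must be a letter, the rest must be
    // letters or digits, as in the `letter letterOrDigit*` rule above.
    public static bool IsName(string s) =>
        s.Length > 0 && char.IsLetter(s[0]) && s.Skip(1).All(char.IsLetterOrDigit);

    public static void Main()
    {
        Console.WriteLine(IsName("AlıcıŞube")); // True: ı and Ş are Unicode letters
        Console.WriteLine(IsName("order_id"));  // False: '_' is neither a letter nor a digit
        Console.WriteLine(IsName("9lives"));    // False: must start with a letter
    }
}
```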

For 5.0, does it make sense to use OrdinalIgnoreCase for all usages of ""i or []i?

otac0n added breaking enhancement labels May 27, 2018

otac0n added this to the 5.0 milestone Mar 23, 2019

otac0n added the has-workaround label Jun 11, 2019
