Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Turkish I issue #110

Open
ertant opened this issue Apr 30, 2018 · 3 comments
Open

Turkish I issue #110

ertant opened this issue Apr 30, 2018 · 3 comments

Comments

@ertant
Copy link

ertant commented Apr 30, 2018

Hi,

I think there is a Turkish-I problem in Pegasus/Compiler/CodeGenerator/Grammar.weave file with 239 line.

While using "[a-z]i" regex expression string comparison is done with CurrentCultureIgnoreCase and parser cannot match any thing if "I" character is used. Its should be InvariantCultureIgnoreCase to solve this problem. I have solve this with "[a-zA-Z]i" as workaround for now. Maybe its best to pass string comparison enumeration to parser for other scenarios.

You can find details from https://blog.codinghorror.com/whats-wrong-with-turkey/

@otac0n
Copy link
Owner

otac0n commented May 27, 2018

So, there is actually an inconsistency in Pegasus' handling of the ignore case flag.

For strings, this logic is used:

if (ignoreCase ? substr.Equals(literal, StringComparison.OrdinalIgnoreCase) : substr == literal)

Whereas for character ranges, this logic is used:

match = (char.IsUpper(o) || char.IsLower(o)) && cs.Equals(o.ToString(), StringComparison.CurrentCultureIgnoreCase);

It sounds like we need to make this consistent and configurable, yes? Do you feel that it is likely that any given parser would need to specify Current Culture, Invariant Culture, and Ordinal in different spots in their parser?

It is important to note that changing these defaults will require a major version bump, as it is a breaking change.

@ertant
Copy link
Author

ertant commented May 27, 2018

Yes I think there is a bit inconsistent for strings and char ranges.

Also in same file in line 228 there is a character range comparison like;

match = c >= characterRanges[i] && c <= characterRanges[i + 1];

this is also needed to be culture aware. I think it's currently comparing by current culture implicitly.

To answer your question maybe it's best to give example. I'm using pegasus to parse SQL like expressions.

For aggregation formula i'm using;

fn_countdist = "CountDistinct"i

This is should be invariant because when type uppercase "COUNTDISTINCT" it's failing because "I" != "i" in current culture.

For table or column identifiers i used;

name <string> = ([a-z]i+[a-z_0-9]i*)

This is should be culture aware because columns may include non-english chars. For example: "AlıcıŞube"

Maybe it's best to pass an optional a culture parameter with current culture default and redefine regex "i" as ;

"regex"i as invariant
"regex"ic as culture aware

@otac0n otac0n added this to the 5.0 milestone Mar 23, 2019
@otac0n
Copy link
Owner

otac0n commented Apr 6, 2019

A workaround for Pegasus 4.1 is to use a more specific lexical structure as C# does, e.g. using char.IsLetter:

name = (letter letterOrDigit*);
letter = c:. &{ char.IsLetter(c[0]) };
letterOrDigit = c:. &{ char.IsLetterOrDigit(c[0]) };

For 5.0, does it make sense to use OrdinalIgnoreCase for all usages of ""i or []i?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants