Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve operator detection #150

Closed
wkeese opened this issue Oct 5, 2023 · 12 comments
Closed

improve operator detection #150

wkeese opened this issue Oct 5, 2023 · 12 comments
Labels
no-stale Prevent making as stale released

Comments

@wkeese
Copy link
Contributor

wkeese commented Oct 5, 2023

There are multiple multi-character operators in SQL such as >=, <=, <>, and maybe others. Currently, each of them except for != is split into multiple segments, for example:

{ name: 'special', content: '>' },
{ name: 'special', content: '=' },

This precludes some (likely uncommon) HTML formatting, such as drawing a box around each operator.

Ideally, these multi-character operators would be classified as a single segment, i.e.

{ name: 'special', content: '>=' },

For example (if building on top of my previous PR #148):

it('operators', () => {
  expect(getSegments('AGE >= 45'))
    .toStrictEqual([
      { name: 'identifier', content: 'AGE' },
      { name: 'whitespace', content: ' ' }
      { name: 'special', content: '>=' },
      { name: 'whitespace', content: ' ' },
      { name: 'number', content: '45' },
    ])
})

The current regexp for "special" segments is:

/(?<special>!=|[=%*/\-+,;:<>])/

Besides the problem/inefficiency I listed above, another problem with that regexp is that it misses some single-character operators, for example & and |. See https://www.w3schools.com/sql/sql_operators.asp for a list.

For those two reasons, I wonder if instead of the current regexp, we should just look for all sequences of punctuation, excluding the quotation marks that start strings and identifiers:

/(?<special>[^\w\s"'`]+)/

Admittedly as I wrote in #149 it's sometimes hard to differentiate between subtraction and negative numbers, and the regexp above would exacerbate that problem in corner cases like AGE*-5. It would work fine with spaces though, i.e. AGE * -5.

@scriptsbot

This comment was marked as off-topic.

@wkeese

This comment was marked as off-topic.

@scriptsbot scriptsbot removed the Stale label Oct 21, 2023
@scriptcoded scriptcoded added the no-stale Prevent making as stale label Oct 25, 2023
@parallels999
Copy link

parallels999 commented Jan 23, 2024

Hi, i have the same problem

another problem with that regexp is that it misses some single-character operators, for example & and |. See https://www.w3schools.com/sql/sql_operators.asp for a list.

Quick solution, add &|^ to regex for "special" segments

/(?<special>!=|[=%*/\-+,;:<>&\|\^])/

@scriptcoded
Copy link
Owner

Hey, sorry for not replying here. Simply put I haven't had that much time the last few weeks due to various reasons. But next week is vacation, maybe I'll have some time then 😄 No promises though

@PaolaRuby
Copy link

Quick solution, add &|^ to regex for "special" segments

/(?<special>!=|[=%*/\-+,;:<>&\|\^])/

That works for me, but could you do a PR?

@wkeese
Copy link
Contributor Author

wkeese commented Mar 12, 2024

Wait, that doesn't address the main problem I listed in this issue. Why not just go with what I originally suggested?

/(?<special>[^\w\s"'`]+)/

@scriptcoded
Copy link
Owner

As @wkeese mentioned in the description a fix for this issue might (or more likely will) affect #149. I'm very keen on resolving the number detection issue, and I think that coming up with a good solution to that issue first would be the better approach here. Then we could look at how to adapt the matching logic introduced there to work with multi character operators.

To add my five cents to the conversation regarding different regexes, without having actually tested these myself I believe implementing the more generic regex suggested by @wkeese would generally be a better approach since it takes away a bit of the burden maintaining that character list. That being there might be a risk it's a bit too greedy, so the tests will have to tell if it will actually work.

@wkeese
Copy link
Contributor Author

wkeese commented Mar 30, 2024

I believe implementing the more generic regex suggested by @wkeese would generally be a better approach since it takes away a bit of the burden maintaining that character list. That being there might be a risk it's a bit too greedy, so the tests will have to tell if it will actually work.

You made a lot of good points.

ISTM the root issue, probably impractical to solve, is that + and - can be unary or binary operators. So regardless of whether we have an explicit list of operators, this can be tokenized two ways:

x<>-5

And so can this:

x-5

But those are edge cases due to the lack of spaces.

I do think it makes sense for number matching to take priority as that will do the right thing for common expressions like

x > -5

@wkeese
Copy link
Contributor Author

wkeese commented Apr 2, 2024

PS: I added a PR #149 for the numbers. I points to the beta branch since presumably the master branch is now just for bug fixes.

wkeese added a commit to wkeese/sql-highlight that referenced this issue Apr 6, 2024
This fixes the "special" segments to include some missing operators from
https://www.w3schools.com/sql/sql_operators.asp.

It also keeps all the symbols previously specified as "special" even though some
of them aren't listed on https://www.w3schools.com/sql/sql_operators.asp,
specifically: != , ; : .

Fixes scriptcoded#150
wkeese added a commit to wkeese/sql-highlight that referenced this issue Apr 6, 2024
This fixes the "special" segments to include some missing operators from
https://www.w3schools.com/sql/sql_operators.asp and other sites.

I took the conservative approach of listing all the operators, as opposed to
the general regexp /(?<special>[^\w\s"'`]+)/, because the conservative approach
works better in certain cases such as "x>-5".  The downside is that you need
to list all the operators, so it's a bit fragile, especially since the exact
operators vary by SQL version (MySQL, TransactSQL, Postgres, etc.).

I kept all the symbols previously specified as "special" even though some
of them aren't operators , specifically: , ; : .

Fixes scriptcoded#150
wkeese added a commit to wkeese/sql-highlight that referenced this issue Apr 6, 2024
This fixes the "special" segments to include some missing operators from
https://www.w3schools.com/sql/sql_operators.asp and other sites.

I took the conservative approach of listing all the operators, as opposed to
the general regexp /(?<special>[^\w\s"'`]+)/, because the conservative approach
works better in certain cases such as "x>-5".  The downside is that you need
to list all the operators, so it's a bit fragile, especially since the exact
operators vary by SQL version (MySQL, TransactSQL, Postgres, etc.).

I kept all the symbols previously specified as "special" even though some
of them aren't operators, specifically: , ; : .

Fixes scriptcoded#150
@wkeese
Copy link
Contributor Author

wkeese commented Apr 13, 2024

I added #193, now we match all operators. Actually, it's using "special" as the bucket for anything that doesn't match the other regexps (number, identifier, string, etc.). But I added lots of tests and I've convinced myself that it isn't too greedy. In particular, that minus, plus, or dot before a digit is treated as part of the number.

scriptcoded pushed a commit that referenced this issue Jul 2, 2024
This fixes the "special" segments to include some missing operators from
https://www.w3schools.com/sql/sql_operators.asp and other sites.

I took the conservative approach of listing all the operators, as opposed to
the general regexp /(?<special>[^\w\s"'`]+)/, because the conservative approach
works better in certain cases such as "x>-5".  The downside is that you need
to list all the operators, so it's a bit fragile, especially since the exact
operators vary by SQL version (MySQL, TransactSQL, Postgres, etc.).

I kept all the symbols previously specified as "special" even though some
of them aren't operators, specifically: , ; : .

Fixes #150
scriptcoded pushed a commit to wkeese/sql-highlight that referenced this issue Jul 2, 2024
This fixes the "special" segments to include some missing operators from
https://www.w3schools.com/sql/sql_operators.asp and other sites.

I took the conservative approach of listing all the operators, as opposed to
the general regexp /(?<special>[^\w\s"'`]+)/, because the conservative approach
works better in certain cases such as "x>-5".  The downside is that you need
to list all the operators, so it's a bit fragile, especially since the exact
operators vary by SQL version (MySQL, TransactSQL, Postgres, etc.).

I kept all the symbols previously specified as "special" even though some
of them aren't operators, specifically: , ; : .

Fixes scriptcoded#150
scriptcoded pushed a commit that referenced this issue Jul 2, 2024
This fixes the "special" segments to include some missing operators from
https://www.w3schools.com/sql/sql_operators.asp and other sites.

I took the conservative approach of listing all the operators, as opposed to
the general regexp /(?<special>[^\w\s"'`]+)/, because the conservative approach
works better in certain cases such as "x>-5".  The downside is that you need
to list all the operators, so it's a bit fragile, especially since the exact
operators vary by SQL version (MySQL, TransactSQL, Postgres, etc.).

I kept all the symbols previously specified as "special" even though some
of them aren't operators, specifically: , ; : .

Fixes #150
@scriptcoded
Copy link
Owner

The fix for this in #193 has been merged to the beta branch and will be released very soon. I'll close this a bit prematurely, but feel free to reopen if you feel it's not fixed.

scriptcoded pushed a commit that referenced this issue Jul 2, 2024
This fixes the "special" segments to include some missing operators from
https://www.w3schools.com/sql/sql_operators.asp and other sites.

I took the conservative approach of listing all the operators, as opposed to
the general regexp /(?<special>[^\w\s"'`]+)/, because the conservative approach
works better in certain cases such as "x>-5".  The downside is that you need
to list all the operators, so it's a bit fragile, especially since the exact
operators vary by SQL version (MySQL, TransactSQL, Postgres, etc.).

I kept all the symbols previously specified as "special" even though some
of them aren't operators, specifically: , ; : .

Fixes #150
scriptsbot added a commit that referenced this issue Jul 2, 2024
# [5.0.0](v4.4.2...v5.0.0) (2024-07-02)

* chore!: add support for Node 22 ([9478bf1](9478bf1))

### Bug Fixes

* improve number detection ([02d459a](02d459a)), closes [#149](#149)
* improve operator detection ([183a4fb](183a4fb)), closes [#150](#150)
* typo in unknown segments ([70af287](70af287)), closes [#148](#148) [#178](#178) [#148](#148)

### Features

* add way to style identifiers ([25677d4](25677d4)), closes [#147](#147)
* release 5.1.0 ([cb0c0f1](cb0c0f1))

### BREAKING CHANGES

* The `default` segment has been split into `identifier` and `whitespace`
segments.  There's also a new `unknown` segment that will only show up for malformed
SQL such as an unclosed string.

However, the highlight() function works largely the same as before, both normal mode and HTML mode,
except for the bug fix to stop classifying identifiers as strings.  In other words, SQL like

select * from EMP where NAME="John Smith"

will get highlighted the same as before, i.e. no syntax highlighting for EMP or NAME.
* drop support for Node 14.
scriptsbot added a commit that referenced this issue Jul 2, 2024
# [5.0.0](v4.4.2...v5.0.0) (2024-07-02)

* chore!: add support for Node 22 ([9478bf1](9478bf1))

### Bug Fixes

* improve number detection ([02d459a](02d459a)), closes [#149](#149)
* improve operator detection ([183a4fb](183a4fb)), closes [#150](#150)
* typo in unknown segments ([70af287](70af287)), closes [#148](#148) [#178](#178) [#148](#148)

### Features

* add way to style identifiers ([25677d4](25677d4)), closes [#147](#147)

### BREAKING CHANGES

* The `default` segment has been split into `identifier` and `whitespace`
segments.  There's also a new `unknown` segment that will only show up for malformed
SQL such as an unclosed string.

However, the highlight() function works largely the same as before, both normal mode and HTML mode,
except for the bug fix to stop classifying identifiers as strings.  In other words, SQL like

select * from EMP where NAME="John Smith"

will get highlighted the same as before, i.e. no syntax highlighting for EMP or NAME.
* drop support for Node 14.
scriptsbot added a commit that referenced this issue Jul 2, 2024
# [6.0.0](v5.0.0...v6.0.0) (2024-07-02)

### Bug Fixes

* improve number detection ([02d459a](02d459a)), closes [#149](#149)
* improve operator detection ([183a4fb](183a4fb)), closes [#150](#150)
* typo in unknown segments ([70af287](70af287)), closes [#148](#148) [#178](#178) [#148](#148)

### Features

* add way to style identifiers ([25677d4](25677d4)), closes [#147](#147)
* release 5.1.0 ([3a58def](3a58def))

### BREAKING CHANGES

* The `default` segment has been split into `identifier` and `whitespace`
segments.  There's also a new `unknown` segment that will only show up for malformed
SQL such as an unclosed string.

However, the highlight() function works largely the same as before, both normal mode and HTML mode,
except for the bug fix to stop classifying identifiers as strings.  In other words, SQL like

select * from EMP where NAME="John Smith"

will get highlighted the same as before, i.e. no syntax highlighting for EMP or NAME.
@scriptsbot
Copy link
Collaborator

🎉 This issue has been resolved in version 6.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-stale Prevent making as stale released
Projects
None yet
Development

No branches or pull requests

5 participants