Set collation of some string fields to 'C' #959

Merged
merged 2 commits into from
Jul 12, 2024


@psrok1 psrok1 commented Jul 12, 2024

Your checklist for this pull request

  • I've read the contributing guideline.
  • I've tested my changes by building and running the project, and testing changed functionality (if applicable)
  • I've added automated tests for my change (if applicable, optional)
  • I've updated documentation to reflect my change (if applicable)

What is the current behaviour?

I found in the PostgreSQL documentation that, by default, a B-tree index is not used for left-anchored LIKE queries on columns that don't have the 'C' collation set:

The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant and is anchored to the beginning of the string — for example, col LIKE 'foo%' or col ~ '^foo', but not col LIKE '%bar'. However, if your database does not use the C locale you will need to create the index with a special operator class to support indexing of pattern-matching queries; see [Section 11.10](https://www.postgresql.org/docs/current/indexes-opclass.html) below. It is also possible to use B-tree indexes for ILIKE and ~*, but only if the pattern starts with non-alphabetic characters, i.e., characters that are not affected by upper/lower case conversion.
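To illustrate why byte-wise ordering matters here, below is a minimal Python sketch (not PostgreSQL's actual implementation). It assumes that a left-anchored pattern such as `'foo%'` can be answered as the range `'foo' <= col < 'fop'` when keys are ordered strictly by byte values, as they are under the "C" collation. The helper names and the sample tag values are made up for illustration.

```python
from bisect import bisect_left
from typing import List, Optional


def prefix_upper_bound(prefix: bytes) -> Optional[bytes]:
    """Return the smallest byte string greater than every string that
    starts with `prefix`, or None if no finite bound exists."""
    # Trailing 0xff bytes cannot be incremented, so drop them first.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return None
    # Increment the last remaining byte: b"foo" -> b"fop".
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def prefix_range_scan(sorted_keys: List[bytes], prefix: bytes) -> List[bytes]:
    """Emulate a B-tree range scan for `col LIKE 'prefix%'` over keys
    that are sorted by byte values (the ordering "C" collation gives)."""
    lo = bisect_left(sorted_keys, prefix)
    upper = prefix_upper_bound(prefix)
    hi = len(sorted_keys) if upper is None else bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]


# Hypothetical tag values; Python compares bytes objects byte by byte.
tags = ["ripped:emotet", "feed:urlhaus", "ripped:agenttesla",
        "runnable:win32:exe", "yara:upx"]
keys = sorted(t.encode("utf-8") for t in tags)
matches = prefix_range_scan(keys, b"ripped:")
print(matches)
```

With a locale-aware collation the comparison rules are more involved (punctuation and case can be weighted differently), so strings sharing a prefix are not guaranteed to sit between two such simple bounds. That is why the planner needs either the C collation or a special operator class.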

The default collation for the database is unset, so it is taken from the operating system defaults (in our production environment that is en_US.UTF8). At the same time, we don't sort anything by these strings, and language-specific ordering isn't needed anyway, because the string data stored by MWDB is not natural language.

Collation "C" is also considered faster. As the documentation puts it: "The C and POSIX collations both specify “traditional C” behavior, in which only the ASCII letters “A” through “Z” are treated as letters, and sorting is done strictly by character code byte values."

What is the new behaviour?

This PR explicitly sets the collation of some string columns to "C", especially those columns that may be searched via left-anchored queries like tag:"ripped:*".

The migration may take a while because B-tree indexes may be rebuilt to use the new ordering.
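For readers who want to see what such a change looks like at the SQL level, here is a rough sketch that only builds the statement text. It is not the migration from this PR: the table name, column name and column type below are placeholders, and the helpers are hypothetical. The statement shape follows PostgreSQL's `ALTER TABLE ... ALTER COLUMN ... TYPE ... COLLATE ...` syntax.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def set_c_collation_sql(table: str, column: str,
                        column_type: str = "VARCHAR") -> str:
    """Build an ALTER TABLE statement that re-declares `column` with the
    "C" collation. `column_type` must match the column's existing type."""
    return (
        f"ALTER TABLE {quote_ident(table)} "
        f"ALTER COLUMN {quote_ident(column)} "
        f'TYPE {column_type} COLLATE "C"'
    )


# Placeholder names, chosen only for the example.
statement = set_c_collation_sql("tag", "tag")
print(statement)
```

Changing a column's collation can invalidate the sort order of indexes built on it, which is consistent with the note above that the migration may spend time rebuilding them.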

Co-authored-by: msm-cert <156842376+msm-cert@users.noreply.github.com>
@psrok1 psrok1 merged commit 416ab4b into master Jul 12, 2024
12 checks passed
@psrok1 psrok1 deleted the fix/set-collation-of-string-fields branch July 12, 2024 10:34
@psrok1 psrok1 mentioned this pull request Oct 7, 2024
2 participants