Name

groonga-normalizer-mysql

Description

groonga-normalizer-mysql is a Groonga plugin. It provides MySQL-compatible normalizers and custom normalizers for Groonga.

Here are MySQL compatible normalizers:

  • NormalizerMySQLGeneralCI for utf8mb4_general_ci
  • NormalizerMySQLUnicodeCI for utf8mb4_unicode_ci
  • NormalizerMySQLUnicode520CI for utf8mb4_unicode_520_ci
  • NormalizerMySQLUnicode900 for utf8mb4_0900_ai_ci, utf8mb4_0900_as_ci, utf8mb4_0900_as_cs, utf8mb4_ja_0900_as_cs and utf8mb4_ja_0900_as_cs_ks.
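The mapping above can be kept as a small lookup table when the normalizer has to be chosen from a MySQL collation name. The following is a minimal Python sketch; the names `NORMALIZER_BY_COLLATION` and `normalizer_for` are hypothetical and are not part of this plugin or of Groonga.

```python
# Collation -> normalizer, copied from the list above.
NORMALIZER_BY_COLLATION = {
    "utf8mb4_general_ci": "NormalizerMySQLGeneralCI",
    "utf8mb4_unicode_ci": "NormalizerMySQLUnicodeCI",
    "utf8mb4_unicode_520_ci": "NormalizerMySQLUnicode520CI",
    "utf8mb4_0900_ai_ci": "NormalizerMySQLUnicode900",
    "utf8mb4_0900_as_ci": "NormalizerMySQLUnicode900",
    "utf8mb4_0900_as_cs": "NormalizerMySQLUnicode900",
    "utf8mb4_ja_0900_as_cs": "NormalizerMySQLUnicode900",
    "utf8mb4_ja_0900_as_cs_ks": "NormalizerMySQLUnicode900",
}


def normalizer_for(collation):
    """Return the normalizer name listed for a MySQL collation."""
    try:
        return NORMALIZER_BY_COLLATION[collation]
    except KeyError:
        raise ValueError(f"no MySQL compatible normalizer listed for {collation!r}") from None


print(normalizer_for("utf8mb4_unicode_ci"))
```

Note that the table only picks the normalizer name. It does not cover any per-collation configuration that NormalizerMySQLUnicode900 may need.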

Here are custom normalizers:

  • NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark
    • It's based on NormalizerMySQLUnicodeCI
  • NormalizerMySQLUnicode520CIExceptKanaCIKanaWithVoicedSoundMark
    • It's based on NormalizerMySQLUnicode520CI

Their names are long but self-descriptive. They are variants of NormalizerMySQLUnicodeCI and NormalizerMySQLUnicode520CI that behave differently in the ways listed below. The list describes NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark, but the same holds for NormalizerMySQLUnicode520CIExceptKanaCIKanaWithVoicedSoundMark.

  • NormalizerMySQLUnicodeCI normalizes all small Hiragana such as ぁ and っ to regular Hiragana such as あ and つ. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize ぁ to あ nor っ to つ: ぁ and あ are different characters, and so are っ and つ. This behavior is described by ExceptKanaCI in the long name. The following behaviors are described by ExceptKanaWithVoicedSoundMark in the long name.
  • NormalizerMySQLUnicodeCI normalizes all Hiragana with voiced sound mark such as が to Hiragana without voiced sound mark such as か. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize が to か. が and か are different characters.
  • NormalizerMySQLUnicodeCI normalizes all Hiragana with semi-voiced sound mark such as ぱ to Hiragana without semi-voiced sound mark such as は. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize ぱ to は. ぱ and は are different characters.
  • NormalizerMySQLUnicodeCI normalizes all Katakana with voiced sound mark such as ガ to Katakana without voiced sound mark such as カ. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize ガ to カ. ガ and カ are different characters.
  • NormalizerMySQLUnicodeCI normalizes all Katakana with semi-voiced sound mark such as パ to Katakana without semi-voiced sound mark such as ハ. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize パ to ハ. パ and ハ are different characters.
  • NormalizerMySQLUnicodeCI normalizes all halfwidth Katakana with voiced sound mark such as ｶﾞ to halfwidth Katakana without voiced sound mark such as ｶ. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark normalizes all halfwidth Katakana with voiced sound mark such as ｶﾞ to fullwidth Katakana with voiced sound mark such as ガ.

NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark and NormalizerMySQLUnicode520CIExceptKanaCIKanaWithVoicedSoundMark are not compatible with MySQL, but they are useful for Japanese text. For example, ふらつく and ブラック have different meanings. NormalizerMySQLUnicodeCI identifies ふらつく with ブラック, but NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't.
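The difference can be illustrated with a small Python model. This is only a sketch under stated assumptions: it does not use the plugin's normalization tables, the function names are hypothetical, and the small Kana table covers just a few characters. It approximates the kana rules with Unicode normalization from the standard `unicodedata` module, and it models only which strings end up identical, not the exact characters the real normalizers output. Whether the real variant normalizers still identify Katakana with Hiragana is not modeled here.

```python
import unicodedata

# Hypothetical, deliberately tiny table: small Kana -> regular Kana.
SMALL_KANA = {"ぁ": "あ", "っ": "つ", "ッ": "ツ"}

# Combining voiced (U+3099) and semi-voiced (U+309A) sound marks.
SOUND_MARKS = {"\u3099", "\u309A"}


def strip_sound_marks(text):
    """Drop voiced and semi-voiced sound marks, e.g. が -> か and パ -> ハ."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch not in SOUND_MARKS)


def katakana_to_hiragana(text):
    """Map fullwidth Katakana (U+30A1..U+30F6) to the matching Hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30A1" <= ch <= "\u30F6" else ch for ch in text
    )


def fold_like_unicode_ci(text):
    """Rough model of the kana handling of NormalizerMySQLUnicodeCI."""
    text = unicodedata.normalize("NFKC", text)  # halfwidth -> fullwidth
    text = strip_sound_marks(text)
    text = "".join(SMALL_KANA.get(ch, ch) for ch in text)
    return katakana_to_hiragana(text)


def fold_keeping_kana_marks(text):
    """Rough model of the ExceptKanaCIKanaWithVoicedSoundMark variant.

    Only halfwidth Katakana is converted to fullwidth; small Kana and
    sound marks are kept, so が and か stay different.
    """
    return unicodedata.normalize("NFKC", text)


for word in ("ふらつく", "ブラック"):
    print(word, fold_like_unicode_ci(word), fold_keeping_kana_marks(word))
```

In this model the first function maps both words to the same string, while the second keeps them distinct, which mirrors the ふらつく and ブラック example above.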

Install

Debian GNU/Linux

Add the APT line for the Groonga deb package repository, then install the groonga-normalizer-mysql package:

% sudo apt-get -y install groonga-normalizer-mysql

Ubuntu

Add the APT line for the Groonga deb package repository, then install the groonga-normalizer-mysql package:

% sudo apt-get -y install groonga-normalizer-mysql

AlmaLinux 8

Install the groonga-release package, which sets up the Groonga package repository:

% sudo dnf install -y https://packages.groonga.org/almalinux/8/groonga-release-latest.noarch.rpm

Then install the groonga-normalizer-mysql package:

% sudo dnf install -y --enablerepo=epel groonga-normalizer-mysql

AlmaLinux 9

Install the groonga-release package, which sets up the Groonga package repository:

% sudo dnf install -y https://packages.groonga.org/almalinux/9/groonga-release-latest.noarch.rpm

Then install the groonga-normalizer-mysql package:

% sudo dnf install -y --enablerepo=epel groonga-normalizer-mysql

macOS - Homebrew

Install the groonga package, which includes groonga-normalizer-mysql:

% brew install groonga

Windows

You need to build from source. The build instructions follow.

Build system

Install the following build tools:

Build Groonga

Download the latest Groonga source from packages.groonga.org. The source archive is named groonga-X.Y.Z.zip.

Extract the source and move to the source folder:

> cd ...\groonga-X.Y.Z
groonga-X.Y.Z>

Run CMake. The following command line configures Groonga to be installed into the C:\groonga folder:

groonga-X.Y.Z> cmake . -G "Visual Studio 14 Win64" -DCMAKE_INSTALL_PREFIX=C:\groonga

Build:

groonga-X.Y.Z> cmake --build . --config Release

Install:

groonga-X.Y.Z> cmake --build . --config Release --target Install

Build groonga-normalizer-mysql

Download the latest groonga-normalizer-mysql source from packages.groonga.org. The source archive is named groonga-normalizer-mysql-X.Y.Z.zip.

Extract the source and move to the source folder:

> cd ...\groonga-normalizer-mysql-X.Y.Z
groonga-normalizer-mysql-X.Y.Z>

IMPORTANT: Set the PKG_CONFIG_PATH environment variable so that CMake can find the Groonga you installed above:

groonga-normalizer-mysql-X.Y.Z> set PKG_CONFIG_PATH=C:\groonga\local\lib\pkgconfig

Run CMake. The following command line configures groonga-normalizer-mysql to be installed into the C:\groonga folder:

groonga-normalizer-mysql-X.Y.Z> cmake . -G "Visual Studio 14 Win64" -DCMAKE_INSTALL_PREFIX=C:\groonga

Build:

groonga-normalizer-mysql-X.Y.Z> cmake --build . --config Release

Install:

groonga-normalizer-mysql-X.Y.Z> cmake --build . --config Release --target Install

Usage

First, register the normalizers/mysql plugin:

groonga> register normalizers/mysql

Then you can use the normalizers provided by the plugin, such as NormalizerMySQLGeneralCI and NormalizerMySQLUnicodeCI:

groonga> table_create Lexicon TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram --normalizer NormalizerMySQLGeneralCI
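The same two commands can also be sent to a Groonga HTTP server. The following Python sketch only builds the request URLs. It assumes a server listening on `localhost:10041`, and the helper name `command_url` is hypothetical. `ShortText` is an assumed key type for the lexicon; adjust it to your schema.

```python
from urllib.parse import urlencode

# Assumption: a Groonga HTTP server is running at this address.
BASE_URL = "http://localhost:10041"


def command_url(command, **params):
    """Build a /d/<command> URL for Groonga's HTTP interface."""
    query = urlencode(params)
    url = f"{BASE_URL}/d/{command}"
    return f"{url}?{query}" if query else url


urls = [
    # Equivalent of: register normalizers/mysql
    command_url("register", path="normalizers/mysql"),
    # Equivalent of the table_create command line above.
    command_url(
        "table_create",
        name="Lexicon",
        flags="TABLE_PAT_KEY",
        key_type="ShortText",  # assumed key type
        default_tokenizer="TokenBigram",
        normalizer="NormalizerMySQLGeneralCI",
    ),
]

for url in urls:
    print(url)
```

Each URL can then be requested with any HTTP client, for example `urllib.request.urlopen(url)`, once the server is running.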

Dependencies

  • Groonga >= 8.0.4

Mailing list

Thanks

  • Alexander Barkov <bar@udm.net>: The author of MYSQL_SOURCE/strings/ctype-utf8.c.
  • ...

Authors

License

LGPLv2 only. See doc/text/lgpl-2.0.txt for details.

This program uses normalization tables defined in the MySQL source code, so it is a derived work of MYSQL_SOURCE/strings/ctype-utf8.c, MYSQL_SOURCE/strings/uca900_data.h and MYSQL_SOURCE/strings/uca900_ja_data.h. This program is under the same license as those files, which are licensed under LGPLv2 only.