How to get accented chars to group with their unaccented versions? #21

tallforasmurf · 2015-03-20T18:37:21Z

I am writing a PyQt app and am unhappy with the performance of Qt's table sorting. However, it does do "locale-aware" sorting in what I believe to be the correct way. Given this word list:

words = ['apple', 'åpple', 'Apple', 'Äpple', 'Epple', 'Èpple', 'épple', 'epple']

and not ignoring case, Qt sorts in the order: apple, Apple, åpple, Äpple, epple, Epple, épple, Èpple

That is, all forms of A are grouped, then all forms of E. When I do the same sort in native Python using natsort:

import locale
import natsort
# Switch away from the default 'C' locale if it is still in effect.
if locale.setlocale(locale.LC_ALL) == 'C':
    locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
key_func_L = natsort.natsort_keygen(alg=natsort.ns.LOCALE)
print(sorted(words, key=key_func_L))

The resulting order is 'Apple', 'Epple', 'apple', 'epple', 'Äpple', 'Èpple', 'åpple', 'épple'

That is, all accented forms sort higher than un-accented forms. I am not so concerned that in the one, lowercase is first and the other, uppercase is first. I am concerned that in a long table, words starting with é may be hundreds of rows removed from words starting with e.

I am working in Python 3.4, PyQt5.4, Mac OS 10.10. Changing the locale to fr_FR and de_DE didn't make any difference.
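A side note on the natsort result above: it is the same as plain Unicode code-point order, which suggests the locale collation is having no effect. The sketch below uses only the standard library to show this. The NFC normalization step is an addition here, to make sure each accented letter is a single code point.

```python
import unicodedata

raw_words = ['apple', 'åpple', 'Apple', 'Äpple', 'Epple', 'Èpple', 'épple', 'epple']
# Normalize to NFC so every accented letter is one precomposed code point.
words = [unicodedata.normalize('NFC', w) for w in raw_words]

# No key function: strings compare code point by code point.
codepoint_order = sorted(words)
print(' '.join(codepoint_order))

# Code points of the first letters, in sorted order:
# A=65, E=69, a=97, e=101, Ä=196, È=200, å=229, é=233
print([ord(w[0]) for w in codepoint_order])
```

Uppercase ASCII comes first, then lowercase ASCII, then the accented letters, which is why 'é' ends up so far from 'e'.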


tallforasmurf · 2015-03-20T18:41:28Z

Note if I use alg = (natsort.ns.LOCALE | natsort.ns.GROUPLETTERS ) then uppercase groups with lowercase, but the accented versions still sort last, that is, 'a' is far away from 'å'.

SethMMorton · 2015-03-20T18:44:43Z

Do you have PyICU installed? I have found that Python's built-in locale library (which does the work of understanding locale-dependent sorting) does not work properly on some systems (specifically on Mac, which is what you are on). If you have PyICU installed, natsort will use that under the hood, and it gives better results. Can you try that?

tallforasmurf · 2015-03-20T20:59:07Z

I saw the note about PyICU in the docs, where it is specifically recommended for OS X. Before I install that rather large package: (a) what sequence would you expect the above code to print if everything is working as you expect (e.g. on your own test system)? And (b) would you expect changing the locale from en_US to fr_FR or de_DE to make a difference?

SethMMorton · 2015-03-21T01:30:42Z

a. I would expect the sequence that Qt printed to be the correct sequence.
b. In my tests it makes no difference which locale was used.

I can confirm that using Mac OS X's locale library (Python uses the system's C locale library), I get the (incorrect) results that you see. Below is the test file I used.

# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import locale
from natsort import natsort_keygen, ns

words = ['apple', 'åpple', 'Apple', 'Äpple', 'Epple', 'Èpple', 'épple', 'epple']
locale.setlocale(locale.LC_ALL, str('de_DE.UTF-8'))
key_func_L = natsort_keygen(alg=ns.LOCALE)
print(' '.join(sorted(words, key=key_func_L)))

When I disable PyICU, I get:

Apple Epple apple epple Äpple Èpple åpple épple

When I turn on PyICU, I get:

apple Apple åpple Äpple epple Epple épple Èpple

This is identical to what Qt is reporting.
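To see the ICU ordering without going through natsort, something like the following sketch should work, assuming PyICU is installed and importable as `icu`. It calls PyICU's `Collator` directly. The `try`/`except` guard is only there so the snippet still runs on a system without PyICU. How natsort itself calls into PyICU is not shown here.

```python
words = ['apple', 'åpple', 'Apple', 'Äpple', 'Epple', 'Èpple', 'épple', 'epple']

# PyICU is optional; fall back gracefully if it is not installed.
try:
    import icu
except ImportError:
    icu = None

if icu is not None:
    # Build a collator for the locale and sort on its binary sort keys.
    collator = icu.Collator.createInstance(icu.Locale('de_DE'))
    icu_sorted = sorted(words, key=collator.getSortKey)
    print(' '.join(icu_sorted))
else:
    icu_sorted = None
    print('PyICU not installed; skipping ICU sort')
```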

SethMMorton · 2015-03-21T01:34:33Z

Unfortunately, this is not something I can fix... it is a bug in the BSD locale implementation. There is a recent Python bug report on this... check it out: http://bugs.python.org/issue23195 (also check this out: http://stackoverflow.com/questions/3412933/python-not-sorting-unicode-properly-strcoll-doesnt-help). I'll definitely keep an eye on the bug report, but notice one of the solutions suggested is to install PyICU. Incidentally, it seems like the only affected locales are en_US, fr_FR and de_DE, which are the three you tried.

I'll make sure to update the docs in the next release to indicate that PyICU should only be needed on Mac OS X and BSD.
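If installing PyICU is not an option, a rough stand-in is to sort on an accent-stripped key. This is not locale collation, and it is not what natsort does with `ns.LOCALE`. It is a plain standard-library sketch, and the helper names `strip_accents` and `grouping_key` are made up for illustration. It only groups letters that decompose into a base letter plus combining marks, so characters such as ø or ß are left as they are.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks (accents) removed."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def grouping_key(text):
    """Hypothetical sort key: base letters first, original text as tie-breaker."""
    composed = unicodedata.normalize('NFC', text)
    return (strip_accents(composed).casefold(), composed)

words = ['apple', 'åpple', 'Apple', 'Äpple', 'Epple', 'Èpple', 'épple', 'epple']
grouped = sorted(words, key=grouping_key)
print(' '.join(grouped))
```

With this key, all forms of A sort before all forms of E. The order inside each group is by code point, so it will not match the order Qt and ICU produce.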

SethMMorton · 2015-03-21T01:38:01Z

BTW, if you use Homebrew (and I recommend it!), you can easily install ICU and PyICU with the following commands:

brew install icu4c
CFLAGS=-I/usr/local/opt/icu4c/include LDFLAGS=-L/usr/local/opt/icu4c/lib pip install pyicu

Homebrew does not link icu4c to the system to avoid conflicts, so you need to tell Python where to find it when installing PyICU.

tallforasmurf · 2015-03-22T18:37:59Z

Yes, good. I had to add exports; pip didn't pick up the flags otherwise. Putting this in for reference for anybody else:

brew install icu4c
export CFLAGS=-I/usr/local/opt/icu4c/include
export LDFLAGS=-L/usr/local/opt/icu4c/lib
pip install pyicu

After which natsort did behave as you say.
Thank you for your prompt & detailed help.

SethMMorton added the question label Mar 21, 2015

tallforasmurf closed this as completed Mar 22, 2015

SethMMorton mentioned this issue Mar 7, 2016

natsort with ns.LOCALE error: ValueError: character U+110000 is not in range [U+0000; U+10ffff] #34

Closed
