How to get accented chars to group with their unaccented versions? #21

Closed
tallforasmurf opened this issue Mar 20, 2015 · 7 comments
@tallforasmurf

I am writing a PyQt app and am unhappy with the performance of its table sorting. However, it does do "locale-aware" sorting in what I believe to be the correct way. Given this word list:

words = ['apple', 'åpple', 'Apple', 'Äpple', 'Epple', 'Èpple', 'épple', 'epple']

and not ignoring case, Qt sorts in the order: apple, Apple, åpple, Äpple, epple, Epple, épple, Èpple

That is, all forms of A are grouped, then all forms of E. When I do the same sort in native Python using natsort:

import locale
import natsort

# Switch from the default 'C' locale to a UTF-8 one before building the key.
if locale.setlocale(locale.LC_ALL) == 'C':
    locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
key_func_L = natsort.natsort_keygen(alg=natsort.ns.LOCALE)
print(sorted(words, key=key_func_L))

The resulting order is 'Apple', 'Epple', 'apple', 'epple', 'Äpple', 'Èpple', 'åpple', 'épple'

That is, all accented forms sort higher than un-accented forms. I am not so concerned that in the one, lowercase is first and the other, uppercase is first. I am concerned that in a long table, words starting with é may be hundreds of rows removed from words starting with e.

I am working in Python 3.4, PyQt5.4, Mac OS 10.10. Changing the locale to fr_FR and de_DE didn't make any difference.
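
For reference, the order natsort produced above is the same order that plain code-point comparison gives, which can be checked without touching the locale at all. A minimal sketch using only the standard library (the NFC normalization step is an added precaution, assuming the accented letters should each be a single precomposed code point):

```python
import unicodedata

words = ['apple', 'åpple', 'Apple', 'Äpple', 'Epple', 'Èpple', 'épple', 'epple']
# Normalize to precomposed (NFC) form so each accented letter is one code point.
words = [unicodedata.normalize('NFC', w) for w in words]

# With no key function, Python compares strings code point by code point.
codepoint_order = sorted(words)
print(' '.join(codepoint_order))

# Show the code point of each first letter: ASCII capitals, then ASCII
# lowercase, then the accented letters, which all sit above U+007F.
for w in codepoint_order:
    print(w, hex(ord(w[0])))
```

If the locale-aware sort returns exactly this order, the collation step is probably not taking the locale into account.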

@tallforasmurf
Author

Note that if I use alg=(natsort.ns.LOCALE | natsort.ns.GROUPLETTERS), uppercase groups with lowercase, but the accented versions still sort last; that is, 'a' is far away from 'å'.
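
For anyone who cannot rely on the system locale, a rough standard-library approximation of this grouping is to sort on the accent-stripped, case-folded spelling first and to use accents and case only as tie-breakers. The sketch below is not how natsort or Qt work internally; `accent_group_key` is a hypothetical helper, and its tie-breaking is much cruder than real Unicode collation.

```python
import unicodedata

def accent_group_key(word):
    """Hypothetical key: base letters first, then accent flag, then case."""
    composed = unicodedata.normalize('NFC', word)
    decomposed = unicodedata.normalize('NFD', word)
    # Drop combining marks (accents) to recover the base letters.
    base = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (
        base.casefold(),      # primary: unaccented, case-insensitive spelling
        base != decomposed,   # secondary: unaccented words before accented ones
        composed.swapcase(),  # tie-break: swapping case puts lowercase first
    )

words = ['apple', 'åpple', 'Apple', 'Äpple', 'Epple', 'Èpple', 'épple', 'epple']
grouped = sorted(words, key=accent_group_key)
print(' '.join(grouped))
```

For this particular word list the result is apple, Apple, åpple, Äpple, epple, Epple, épple, Èpple, which matches the order Qt gave. That agreement should not be expected to hold for arbitrary input.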

@SethMMorton
Owner

Do you have PyICU installed? I have found that Python's built-in locale library (which does the work of understanding locale-dependent sorting) does not work properly on some systems (specifically on Mac, which is what you are on). If you have PyICU installed, natsort will use that under the hood, and it gives better results. Can you try that?
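
A quick way to check whether PyICU is importable in the current environment is sketched below. It relies on PyICU's import name being `icu`; `pyicu_available` is a hypothetical helper name. It only reports whether the module can be found and says nothing about which backend natsort actually selects.

```python
import importlib.util

def pyicu_available():
    """Return True if the PyICU extension module (import name 'icu') can be found."""
    return importlib.util.find_spec('icu') is not None

print('PyICU importable:', pyicu_available())
```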

@tallforasmurf
Author

I saw the note about PyICU in the docs, which specifically recommends it for OS X. Before I install that rather large package: (a) what sequence would you expect the above code to print if everything is working as you expect (e.g. on your own test system)? And (b) would you expect changing the locale from en_US to fr_FR or de_DE to make a difference?

@SethMMorton
Owner

a. I would expect the sequence that Qt printed to be the correct sequence.
b. In my tests it makes no difference which locale was used.

I can confirm that with Mac OS X's locale library (Python uses the system's C locale library), I get the same (incorrect) results that you see. Below is the test file I used.

# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import locale
from natsort import natsort_keygen, ns

words = ['apple', 'åpple', 'Apple', 'Äpple', 'Epple', 'Èpple', 'épple', 'epple']
locale.setlocale(locale.LC_ALL, str('de_DE.UTF-8'))
key_func_L = natsort_keygen(alg=ns.LOCALE)
print(' '.join(sorted(words, key=key_func_L)))

When I disable PyICU, I get:

Apple Epple apple epple Äpple Èpple åpple épple

When I turn on PyICU, I get:

apple Apple åpple Äpple epple Epple épple Èpple

This is identical to what Qt is reporting.
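
To see what ICU itself returns, independent of natsort, PyICU's collator can be called directly. This is a sketch assuming PyICU is installed and importable as `icu`; `icu_sorted` is a hypothetical helper that returns None when PyICU is missing, and the exact order may depend on the ICU version and locale.

```python
try:
    import icu  # provided by the PyICU package
except ImportError:
    icu = None

words = ['apple', 'åpple', 'Apple', 'Äpple', 'Epple', 'Èpple', 'épple', 'epple']

def icu_sorted(items, locale_name='de_DE'):
    """Sort with an ICU collator; return None when PyICU is not installed."""
    if icu is None:
        return None
    collator = icu.Collator.createInstance(icu.Locale(locale_name))
    # getSortKey turns each string into bytes that compare in collation order.
    return sorted(items, key=collator.getSortKey)

print(icu_sorted(words))
```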

@SethMMorton
Owner

Unfortunately, this is not something I can fix... it is a bug in the BSD locale implementation. There is a recent Python bug report on this... check it out: http://bugs.python.org/issue23195 (also check this out: http://stackoverflow.com/questions/3412933/python-not-sorting-unicode-properly-strcoll-doesnt-help). I'll definitely keep an eye on the bug report, but notice one of the solutions suggested is to install PyICU. Incidentally, it seems like the only affected locales are en_US, fr_FR and de_DE, which are the three you tried.

I'll make sure to update the docs in the next release to indicate that PyICU should only be needed on Mac OS X and BSD.
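
To check whether a given system is affected, one rough test is to ask the standard `locale` module whether 'é' collates before 'f'. The sketch below assumes the `en_US.UTF-8` locale name; `strxfrm_groups_accents` is a hypothetical helper that returns None if that locale is not installed.

```python
import locale

def strxfrm_groups_accents(locale_name='en_US.UTF-8'):
    """Return True if 'é' collates before 'f' under locale_name,
    False if it does not, or None if the locale is unavailable."""
    saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # Code-point order puts 'é' (U+00E9) after 'f' (U+0066);
        # locale-aware collation should put it before.
        return locale.strxfrm('\u00e9') < locale.strxfrm('f')
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)

print(strxfrm_groups_accents())
```

A result of False suggests the platform's collation is falling back to code-point order, in which case PyICU is likely the more reliable route.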

@SethMMorton
Owner

BTW, if you use Homebrew (and I recommend it!), you can easily install ICU and PyICU with the following commands:

brew install icu4c
CFLAGS=-I/usr/local/opt/icu4c/include LDFLAGS=-L/usr/local/opt/icu4c/lib pip install pyicu

Homebrew does not link icu4c into the system, to avoid conflicts, so you need to tell Python where to find it when installing PyICU.

@tallforasmurf
Author

Yes, good. I had to add exports; pip didn't pick up the flags otherwise. Putting this here for reference for anybody else:

brew install icu4c
CFLAGS=-I/usr/local/opt/icu4c/include
export CFLAGS
LDFLAGS=-L/usr/local/opt/icu4c/lib
export LDFLAGS
pip install pyicu

After which natsort did behave as you say.
Thank you for your prompt & detailed help.
