UTF-8 support #151

mattwynne · 2013-05-15T20:35:47Z

Migrating over some of Cucumber's old features into Aruba, I've noticed a problem where we compare strings with UTF-8 characters in them.

This feature reproduces the problem (on my machine at least).

mattwynne · 2013-05-15T20:45:33Z

It's bizarre: these tests are passing on travis but failing on my machine. Here's what I get:

➤ cucumber features/utf-8.feature
Using the default profile...
Feature: UTF-8

  Scenario: Write then assert on the content of a file with UTF-8 characters in # features/utf-8.feature:3
    Given a file named "turning-japanese" with:                                 # lib/aruba/cucumber.rb:19
      """
      フィーチャ

      """
    And I run `cat turning-japanese`                                            # lib/aruba/cucumber.rb:60
    Then the output should contain exactly:                                     # lib/aruba/cucumber.rb:129
      """
      フィーチャ

      """
      incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
      ./lib/aruba/api.rb:186:in `assert_exact_output'
      ./lib/aruba/cucumber.rb:130:in `/^the output(?: from "(.*?)")? should contain exactly:$/'
      features/utf-8.feature:10:in `Then the output should contain exactly:'

Failing Scenarios:
cucumber features/utf-8.feature:3 # Scenario: Write then assert on the content of a file with UTF-8 characters in

1 scenario (1 failed)
3 steps (1 failed, 2 passed)
0m0.128s

plexus · 2013-05-15T21:04:15Z

what does this say?

ruby -e 'p Encoding.default_internal ; p Encoding.default_external'
echo $LANG

chrismdp · 2013-05-15T21:14:09Z

Does # encoding: utf-8 at the top of the feature make a difference?

plexus · 2013-05-15T21:14:19Z

I'm getting an error as well when running like this

LANG=C cucumber features/utf-8.feature

but it goes green when invoked this way

LANG=en_US.UTF-8 cucumber features/utf-8.feature

JonRowe · 2013-05-15T23:51:35Z

@mattwynne +1 for @chrismdp's comment, I believe the default file encoding is different on linux than OS X.

mattwynne · 2013-05-16T06:55:57Z

On 15 May 2013, at 22:14, Chris Parsons notifications@github.com wrote:

Does # encoding: utf-8 at the top of the feature make a difference

@chrismdp thanks for the suggestion but no, that doesn't make any difference. I'd be surprised if it did, really, because that header's only for Gherkin to use, and we're not invoking Cucumber at all here, just reading and writing files using Aruba and @jarib's childprocess.

jarl-dk · 2013-05-16T06:56:19Z

# encoding: utf-8 does not change a thing (It's not a ruby file, so it's not read by ruby interpreter)

mattwynne · 2013-05-16T06:58:10Z

On 15 May 2013, at 22:04, Arne Brasseur notifications@github.com wrote:

what does this say?

ruby -e 'p Encoding.default_internal ; p Encoding.default_external'
echo $LANG
ruby -e 'p Encoding.default_internal ; p Encoding.default_external'
nil
#<Encoding:ASCII-8BIT>
~/projects/aruba(support-utf-8)
➤ echo $LANG
en_US.UTF-8

I'm using OS X 10.8.3 and RVM, in case that's relevant.

mattwynne · 2013-05-16T06:59:04Z

On 15 May 2013, at 22:14, Arne Brasseur notifications@github.com wrote:

I'm getting an error as well when running like this

LANG=C cucumber features/utf-8.feature
but it goes green when invoked this way

LANG=en_US.UTF-8 cucumber features/utf-8.feature
Hmmm, those both fail for me, but it does feel like we're on to something. I'm using Ruby 2.0.0, and OS X. What about you?

jarl-dk · 2013-05-16T07:00:32Z

I have added another UTF-16 test. I suppose the goal is to make them work oblivious of the LANG setting, right?

jarl-dk · 2013-05-16T07:05:44Z

To be honest. I think it is just as much an issue in cucumber as it is in aruba (at least for the UTF-16 case). Is there any requirement (from cucumber usage) that feature files must be UTF-8 encoded?

mattwynne · 2013-05-16T07:20:11Z

On 16 May 2013, at 08:00, Jarl Friis notifications@github.com wrote:

I have added another UTF-16 test. I suppose the goal is to make them work oblivious of the LANG setting, right?

Yes, ideally. Or at least raise an error that gives people some guidance about what they need to do to get consistent behaviour from their tests across machines. Setting LANG isn't working for me.

mattwynne · 2013-05-16T07:23:24Z

On 16 May 2013, at 08:05, Jarl Friis notifications@github.com wrote:

To be honest. I think it is just as much an issue in cucumber as it is in aruba (at least for the UTF-16 case). Is there any requirement (from cucumber usage) that feature files must be UTF-8 encoded?

This issue has arisen because Cucumber's pre-aruba tests were forcing the encoding[1] of STDOUT to UTF-8. When we migrate the same test to use Aruba, it fails (on some machines).

Why do you say it's an issue in Cucumber?

[1] https://github.com/cucumber/cucumber/blob/master/legacy_features/support/env.rb#L104

jarl-dk · 2013-05-16T07:25:45Z

@mattwynne : Based on the stack trace if you run LANG=C bundle exec cucumber features/utf-16.feature. So it was just a guess.

jarl-dk · 2013-05-16T07:30:07Z

I have just pushed a patch (beb5fe9) on this branch. I am not sure this is the right solution, but it gives a clue of the utf-8 problem at least. However it doesn't seem to solve the UTF-16 problem.

The patch will make LANG=C bundle exec cucumber features/utf-8.feature succeed.

mattwynne · 2013-05-16T07:34:41Z

On 16 May 2013, at 08:30, Jarl Friis notifications@github.com wrote:

I have just pushed a patch (beb5fe9) on this branch. I am not sure this is the right solution, but it gives a clue of the utf-8 problem at least. However it doesn't seem to solve the UTF-16 problem.

The patch will make LANG=C bundle exec cucumber features/utf-8.feature succeed.

Works on my machine!

Ace. Can we ship it?

jarl-dk · 2013-05-16T08:00:29Z

Hang on regarding shipping...

plexus · 2013-05-16T08:21:16Z

    ➤ ruby -e 'p Encoding.default_internal ; p Encoding.default_external'
    nil
    #<Encoding:ASCII-8BIT>
    ~/projects/aruba(support-utf-8)
    ➤ echo $LANG
    en_US.UTF-8

That Encoding:ASCII-8BIT is the culprit, because of that Ruby will read
file input as ASCII-8BIT unless otherwise specified, for instance with File.open("test.txt", "r:UTF-8"). I would have thought that it uses
LANG to determine the default though, this must have changed with Ruby 2.0.
You can also set it when invoking Ruby with --encoding.

The real solution would be for Cucumber to support a magic comment like # encoding: UTF-8. I've seen some tickets that hint at that but if it's
there it isn't working. The sensible thing would be to use UTF-8 when no
encoding is specified.

mattwynne · 2013-05-16T09:00:49Z

@arnebrasseur the behaviour of my machine seems inconsistent with the Ruby docs:

http://ruby-doc.org/core-2.0/Encoding.html#label-External+encoding

➤ rvm use 2.0.0
Using /Users/matt/.rvm/gems/ruby-2.0.0-p0
~/projects/cucumber(master)
➤ irb
>> ENV['LANG']
=> "UTF-8"
>> Encoding.default_external
=> #<Encoding:ASCII-8BIT>

WTF. Maybe my Ruby installation is just broken. It seems like Ruby is ignoring the value of the LANG env var.

mattwynne · 2013-05-16T09:01:17Z

I get the same behaviour from 1.9.3 too.

mattwynne · 2013-05-16T09:07:23Z

@arnebrasseur Cucumber does have that magic comment, but I don't think Aruba pays attention to it. I think it's simply there to ensure the Gherkin parser uses the correct encoding when parsing the feature file.

So are you suggesting that, instead of @jarl-dk's fix, we should force the encoding of STDOUT/STDERR to use the encoding specified in the feature?

jarl-dk · 2013-05-16T09:24:34Z

@arnebrasseur : your comments are good. I guess that what you are commenting on is that the feature file is always read as UTF-8, and that is why the UTF-16 case is a cucumber issue, am I right?

However my fix is forcing the output from a command into the same encoding as the expected (which, due to cucumber, is always UTF-8). The output from a command will have the encoding corresponding to the LANG setting.

I don't mind shipping my patch, but I will consider it a "workaround" that only solves the problem for feature files encoded in UTF-8, and therefore I suggest that the test for UTF-16 is taken out (for now) and maybe should be taken to the cucumber project...

@arnebrasseur : I think your proposal to cucumber to honor magic comments are not too bad, already cucumber honors magic comments like # language: da (for features in danish or other non-english language)

jarl-dk · 2013-05-16T09:26:00Z

@mattwynne : what if you set LANG to en_US.UTF-8 and not only UTF-8?

mattwynne · 2013-05-16T09:28:29Z

On 16 May 2013, at 10:26, Jarl Friis notifications@github.com wrote:

@mattwynne : what if you set LANG to en_US.UTF-8 and not only UTF-8?\

It makes no difference. I just tried it on a chruby installed Ruby on the same machine and it worked fine, so I think this is just a problem with my crufty old RVM Ruby installation.

Nevertheless, it could happen if someone's LANG env variable is set wrong, am I correct? Could we reproduce it for the UTF-8 case on Travis by setting that variable in the cat command? I'll try it.

plexus · 2013-05-16T09:32:17Z

I'm not actually very familiar with Cucumber or Aruba, I just got intrigued by your tweet since I've spent quite some time digging into encoding issues. So for a specific solution you guys can use your best judgement.

In general though it's a good idea to use UTF-8 unless there's explicit reason to do otherwise. Especially when the default encoding is ASCII-8BIT, since it's a subset of UTF-8 but still ruby refuses to convert one to the other when needed, so utf8_string + ascii_string blows up, same for gsub, string interpolation, etc.

This is a deliberate choice of the Ruby devs, even when there's a lossless conversion to a compatible encoding, Ruby doesn't do that for you. This seems to be mostly due to the Japanese being sceptical of Unicode, because of something the Unicode consortium did called Han unification. But for the rest of the world it's quite annoying.

jarl-dk · 2013-05-16T09:32:24Z

@mattwynne : Are you sure cucumber honors magic comment like #encoding: utf-16? If so I guess it is broken.

If I add that to the UTF-16 test, the stack trace indicates that cucumber pukes. Seems like aruba step definitions are never reached.

mattwynne · 2013-05-16T09:36:42Z

On 16 May 2013, at 10:32, Jarl Friis notifications@github.com wrote:

@mattwynne : Are you sure cucumber honors magic comment like #encoding: utf-16? If so I guess it is broken.

I think the gherkin parser honours them, but not cucumber. I'm not really sure though, I'd have to dig through the tests.
If I add that to the UTF-16 test, the stack trace indicates that cucumber pukes. Seems like aruba step definitions are never reached.

Yeah. Let's concentrate on the UTF-8 thing for now. I think your fix is good enough.

jarl-dk · 2013-05-16T09:43:14Z

I agree, lets focus on UTF-8 for this issue and create another for UTF-16 (which may depend on cucumber issue).
For the UTF-8 case,

jarl-dk · 2013-05-16T09:46:23Z

Shall we squash the addition and removal of utf-16 test when merging to master so they cancel out? I vote for that.

mattwynne · 2013-05-16T09:48:38Z

Yep, sounds good to me.

On 16 May 2013, at 10:46, Jarl Friis notifications@github.com wrote:

Shall we squash the addition and removal of utf-16 test when merging to master so they cancel out? I vote for that.

—
Reply to this email directly or view it on GitHub.

jarl-dk · 2013-05-16T09:50:57Z

Note, my fix works only for ruby-1.9 :-(

mattwynne · 2013-05-16T09:55:22Z

Do we care about 1.8.7 anymore?

On 16 May 2013, at 10:50, Jarl Friis notifications@github.com wrote:

Note, my fix works only for ruby-1.9 :-(

—
Reply to this email directly or view it on GitHub.

jarl-dk · 2013-05-16T10:50:03Z

So removing the UTF-16 commits #152 is good to go, right?

mattwynne · 2013-05-16T10:56:15Z

Thanks @jarl-dk!

@chrismdp

This reverts commit 5d315b6. @chrismdp there's something funny going on with Aruba and UTF-8. I presume these features are passing on your machine, and I can see they're passing on Travis, but they fail on my machine. See cucumber/aruba#151 for details. I'm reverting this commit while we sort this out.

Add a failing feature for UTF-8 support

c164a46

Added test for a second non-ASCII encoding

1b30648

Initial patch for UTF-8 test

beb5fe9

More encoding conversion

3768072

Removing UTF-16 test as it is unrelated to utf-8 problem

56c9e2e

Taking care of ruby 1.8 where everything is binary

5431bf1

jarl-dk mentioned this pull request May 16, 2013

Support for utf 8 #152

Merged

mattwynne closed this May 16, 2013

mattwynne deleted the support-utf-8 branch May 16, 2013 10:56

chrismdp mentioned this pull request May 16, 2013

Moved language help to new features (another attempt) cucumber/cucumber-ruby#469

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UTF-8 support #151

UTF-8 support #151

mattwynne commented May 15, 2013

mattwynne commented May 15, 2013

plexus commented May 15, 2013

chrismdp commented May 15, 2013

plexus commented May 15, 2013

JonRowe commented May 15, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

plexus commented May 16, 2013

mattwynne commented May 16, 2013

mattwynne commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

plexus commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

UTF-8 support #151

UTF-8 support #151

Conversation

mattwynne commented May 15, 2013

mattwynne commented May 15, 2013

plexus commented May 15, 2013

chrismdp commented May 15, 2013

plexus commented May 15, 2013

JonRowe commented May 15, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

plexus commented May 16, 2013

mattwynne commented May 16, 2013

mattwynne commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

plexus commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013

jarl-dk commented May 16, 2013

mattwynne commented May 16, 2013