Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 support #151

Closed
wants to merge 6 commits into from
Closed

UTF-8 support #151

wants to merge 6 commits into from

Conversation

mattwynne
Copy link
Member

Migrating over some of Cucumber's old features into Aruba, I've noticed a problem where we compare strings with UTF-8 characters in them.

This feature reproduces the problem (on my machine at least).

@mattwynne
Copy link
Member Author

It's bizarre: these tests are passing on travis but failing on my machine. Here's what I get:

➤ cucumber features/utf-8.feature
Using the default profile...
Feature: UTF-8

  Scenario: Write then assert on the content of a file with UTF-8 characters in # features/utf-8.feature:3
    Given a file named "turning-japanese" with:                                 # lib/aruba/cucumber.rb:19
      """
      フィーチャ

      """
    And I run `cat turning-japanese`                                            # lib/aruba/cucumber.rb:60
    Then the output should contain exactly:                                     # lib/aruba/cucumber.rb:129
      """
      フィーチャ

      """
      incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
      ./lib/aruba/api.rb:186:in `assert_exact_output'
      ./lib/aruba/cucumber.rb:130:in `/^the output(?: from "(.*?)")? should contain exactly:$/'
      features/utf-8.feature:10:in `Then the output should contain exactly:'

Failing Scenarios:
cucumber features/utf-8.feature:3 # Scenario: Write then assert on the content of a file with UTF-8 characters in

1 scenario (1 failed)
3 steps (1 failed, 2 passed)
0m0.128s

@plexus
Copy link

plexus commented May 15, 2013

what does this say?

ruby -e 'p Encoding.default_internal ; p Encoding.default_external'
echo $LANG

@chrismdp
Copy link

Does # encoding: utf-8 at the top of the feature make a difference?

@plexus
Copy link

plexus commented May 15, 2013

I'm getting an error as well when running like this

LANG=C cucumber features/utf-8.feature

but it goes green when invoked this way

LANG=en_US.UTF-8 cucumber features/utf-8.feature

@JonRowe
Copy link
Contributor

JonRowe commented May 15, 2013

@mattwynne +1 for @chrismdp's comment, I believe the default file encoding is different on linux than OS X.

@mattwynne
Copy link
Member Author

On 15 May 2013, at 22:14, Chris Parsons notifications@github.com wrote:

Does # encoding: utf-8 at the top of the feature make a difference

@chrismdp thanks for the suggestion but no, that doesn't make any difference. I'd be surprised if it did, really, because that header's only for Gherkin to use, and we're not invoking Cucumber at all here, just reading and writing files using Aruba and @jarib's childprocess.

@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

# encoding: utf-8 does not change a thing (It's not a ruby file, so it's not read by ruby interpreter)

@mattwynne
Copy link
Member Author

On 15 May 2013, at 22:04, Arne Brasseur notifications@github.com wrote:

what does this say?

ruby -e 'p Encoding.default_internal ; p Encoding.default_external'
echo $LANG
ruby -e 'p Encoding.default_internal ; p Encoding.default_external'
nil
#<Encoding:ASCII-8BIT>
~/projects/aruba(support-utf-8)
➤ echo $LANG
en_US.UTF-8

I'm using OS X 10.8.3 and RVM, in case that's relevant.

@mattwynne
Copy link
Member Author

On 15 May 2013, at 22:14, Arne Brasseur notifications@github.com wrote:

I'm getting an error as well when running like this

LANG=C cucumber features/utf-8.feature
but it goes green when invoked this way

LANG=en_US.UTF-8 cucumber features/utf-8.feature
Hmmm, those both fail for me, but it does feel like we're on to something. I'm using Ruby 2.0.0, and OS X. What about you?

@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

I have added another UTF-16 test. I suppose the goal is to make them work oblivious of the LANG setting, right?

@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

To be honest. I think it is just as much an issue in cucumber as it is in aruba (at least for the UTF-16 case). Is there any requirement (from cucumber usage) that feature files must be UTF-8 encoded?

@mattwynne
Copy link
Member Author

On 16 May 2013, at 08:00, Jarl Friis notifications@github.com wrote:

I have added another UTF-16 test. I suppose the goal is to make them work oblivious of the LANG setting, right?

Yes, ideally. Or at least raise an error that gives people some guidance about what they need to do to get consistent behaviour from their tests across machines. Setting LANG isn't working for me.

@mattwynne
Copy link
Member Author

On 16 May 2013, at 08:05, Jarl Friis notifications@github.com wrote:

To be honest. I think it is just as much an issue in cucumber as it is in aruba (at least for the UTF-16 case). Is there any requirement (from cucumber usage) that feature files must be UTF-8 encoded?

This issue has arisen because Cucumber's pre-aruba tests were forcing the encoding[1] of STDOUT to UTF-8. When we migrate the same test to use Aruba, it fails (on some machines).

Why do you say it's an issue in Cucumber?

[1] https://github.com/cucumber/cucumber/blob/master/legacy_features/support/env.rb#L104

@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

@mattwynne : Based on the stack trace if you run LANG=C bundle exec cucumber features/utf-16.feature. So it was just a guess.

@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

I have just pushed a patch (beb5fe9) on this branch. I am not sure this is the right solution, but it gives a clue of the utf-8 problem at least. However it doesn't seem to solve the UTF-16 problem.

The patch will make LANG=C bundle exec cucumber features/utf-8.feature succeed.

@mattwynne
Copy link
Member Author

On 16 May 2013, at 08:30, Jarl Friis notifications@github.com wrote:

I have just pushed a patch (beb5fe9) on this branch. I am not sure this is the right solution, but it gives a clue of the utf-8 problem at least. However it doesn't seem to solve the UTF-16 problem.

The patch will make LANG=C bundle exec cucumber features/utf-8.feature succeed.

Works on my machine!

Ace. Can we ship it?

@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

Hang on regarding shipping...

@plexus
Copy link

plexus commented May 16, 2013

    ➤ ruby -e 'p Encoding.default_internal ; p Encoding.default_external'
    nil
    #<Encoding:ASCII-8BIT>
    ~/projects/aruba(support-utf-8)
    ➤ echo $LANG
    en_US.UTF-8

That Encoding:ASCII-8BIT is the culprit, because of that Ruby will read
file input as ASCII-8BIT unless otherwise specified, for instance with File.open("test.txt", "r:UTF-8"). I would have thought that it uses
LANG to determine the default though, this must have changed with Ruby 2.0.
You can also set it when invoking Ruby with --encoding.

The real solution would be for Cucumber to support a magic comment like # encoding: UTF-8. I've seen some tickets that hint at that but if it's
there it isn't working. The sensible thing would be to use UTF-8 when no
encoding is specified.

@mattwynne
Copy link
Member Author

@arnebrasseur the behaviour of my machine seems inconsistent with the Ruby docs:

http://ruby-doc.org/core-2.0/Encoding.html#label-External+encoding

➤ rvm use 2.0.0
Using /Users/matt/.rvm/gems/ruby-2.0.0-p0
~/projects/cucumber(master)
➤ irb
>> ENV['LANG']
=> "UTF-8"
>> Encoding.default_external
=> #<Encoding:ASCII-8BIT>

WTF. Maybe my Ruby installation is just broken. It seems like Ruby is ignoring the value of the LANG env var.

@mattwynne
Copy link
Member Author

I get the same behaviour from 1.9.3 too.

@mattwynne
Copy link
Member Author

@arnebrasseur Cucumber does have that magic comment, but I don't think Aruba pays attention to it. I think it's simply there to ensure the Gherkin parser uses the correct encoding when parsing the feature file.

So are you suggesting that, instead of @jarl-dk's fix, we should force the encoding of STDOUT/STDERR to use the encoding specified in the feature?

@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

@arnebrasseur : your comments are good. I guess that what you are commenting on is that the feature file is always read as UTF-8, and that is why the UTF-16 case is a cucumber issue, am I right?

However my fix is forcing the output from a command into the same encoding as the expected (which, due to cucumber, is always UTF-8). The output from a command will have the encoding corresponding to the LANG setting.

I don't mind shipping my patch, but I will consider it a "workaround" that only solves the problem for feature files encoded in UTF-8, and therefore I suggest that the test for UTF-16 is taken out (for now) and maybe should be taken to the cucumber project...

@arnebrasseur : I think your proposal to cucumber to honor magic comments are not too bad, already cucumber honors magic comments like # language: da (for features in danish or other non-english language)

@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

@mattwynne : what if you set LANG to en_US.UTF-8 and not only UTF-8?

@mattwynne
Copy link
Member Author

On 16 May 2013, at 10:26, Jarl Friis notifications@github.com wrote:

@mattwynne : what if you set LANG to en_US.UTF-8 and not only UTF-8?\

It makes no difference. I just tried it on a chruby installed Ruby on the same machine and it worked fine, so I think this is just a problem with my crufty old RVM Ruby installation.

Nevertheless, it could happen if someone's LANG env variable is set wrong, am I correct? Could we reproduce it for the UTF-8 case on Travis by setting that variable in the cat command? I'll try it.

@plexus
Copy link

plexus commented May 16, 2013

I'm not actually very familiar with Cucumber or Aruba, I just got intrigued by your tweet since I've spent quite some time digging into encoding issues. So for a specific solution you guys can use your best judgement.

In general though it's a good idea to use UTF-8 unless there's explicit reason to do otherwise. Especially when the default encoding is ASCII-8BIT, since it's a subset of UTF-8 but still ruby refuses to convert one to the other when needed, so utf8_string + ascii_string blows up, same for gsub, string interpolation, etc.

This is a deliberate choice of the Ruby devs, even when there's a lossless conversion to a compatible encoding, Ruby doesn't do that for you. This seems to be mostly due to the Japanese being sceptical of Unicode, because of something the Unicode consortium did called Han unification. But for the rest of the world it's quite annoying.

@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

@mattwynne : Are you sure cucumber honors magic comment like #encoding: utf-16? If so I guess it is broken.

If I add that to the UTF-16 test, the stack trace indicates that cucumber pukes. Seems like aruba step definitions are never reached.

@mattwynne
Copy link
Member Author

On 16 May 2013, at 10:32, Jarl Friis notifications@github.com wrote:

@mattwynne : Are you sure cucumber honors magic comment like #encoding: utf-16? If so I guess it is broken.

I think the gherkin parser honours them, but not cucumber. I'm not really sure though, I'd have to dig through the tests.
If I add that to the UTF-16 test, the stack trace indicates that cucumber pukes. Seems like aruba step definitions are never reached.

Yeah. Let's concentrate on the UTF-8 thing for now. I think your fix is good enough.

@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

I agree, lets focus on UTF-8 for this issue and create another for UTF-16 (which may depend on cucumber issue).
For the UTF-8 case,

@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

Shall we squash the addition and removal of utf-16 test when merging to master so they cancel out? I vote for that.

@mattwynne
Copy link
Member Author

Yep, sounds good to me.

On 16 May 2013, at 10:46, Jarl Friis notifications@github.com wrote:

Shall we squash the addition and removal of utf-16 test when merging to master so they cancel out? I vote for that.


Reply to this email directly or view it on GitHub.

@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

Note, my fix works only for ruby-1.9 :-(

@mattwynne
Copy link
Member Author

Do we care about 1.8.7 anymore?

On 16 May 2013, at 10:50, Jarl Friis notifications@github.com wrote:

Note, my fix works only for ruby-1.9 :-(


Reply to this email directly or view it on GitHub.

@jarl-dk jarl-dk mentioned this pull request May 16, 2013
@jarl-dk
Copy link
Member

jarl-dk commented May 16, 2013

So removing the UTF-16 commits #152 is good to go, right?

@mattwynne mattwynne closed this May 16, 2013
@mattwynne mattwynne deleted the support-utf-8 branch May 16, 2013 10:56
@mattwynne
Copy link
Member Author

Thanks @jarl-dk!

chrismdp pushed a commit to cucumber/cucumber-ruby that referenced this pull request May 16, 2013
This reverts commit 5d315b6.

@chrismdp there's something funny going on with Aruba and UTF-8. I
presume these features are passing on your machine, and I can see
they're passing on Travis, but they fail on my machine.

See cucumber/aruba#151 for details.

I'm reverting this commit while we sort this out.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants