Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature request: Warn when reading BOM text with headers option #301

Closed
JunichiIto opened this issue May 4, 2024 · 16 comments
Closed

feature request: Warn when reading BOM text with headers option #301

JunichiIto opened this issue May 4, 2024 · 16 comments

Comments

@JunichiIto
Copy link
Contributor

When BOM exists in CSV text, you cannot read first column by field name:

require 'csv'

text = "\u{feff}id,name\n1,Alice"
csv = CSV.parse(text, headers: true)
csv[0]['id']   #=> nil
csv[0]['name'] #=> "Alice"

I understand there are some workarounds for this problem:

  • Remove BOM from text before parse
  • Include BOM to get value (eg. csv[0]["\u{feff}id"])
  • Specify encoding like bom|utf-8 for CSV.foreach, CSV.read, CSV.open etc.

But the biggest problem is it's too hard to notice the existence of BOM. Actually, when I came across this issue, I had no idea why I failed to get the id value. I guess most people never wants to keep BOM in field name.

So I'd be happy if CSV.parse or some other reading methods would warn when input text contains BOM and headers: true is specified, like this:

text = "\u{feff}id,name\n1,Alice"
csv = CSV.parse(text, headers: true)
#=> warning: first header name starts with U+FEFF (zero width no-break space, BOM)
csv = CSV.read('with-bom.csv', headers: true)
#=> warning: first header name starts with U+FEFF (zero width no-break space, BOM)

# warning is not required when encoding like "bom|utf-8" is specified
csv = CSV.read('with-bom.csv', headers: true, encoding: 'bom|utf-8')
#=> (no warnings)

By the way, why do we put BOM so often? This is because Excel cannot open UTF-8 CSV correctly without BOM!
Save CSV file in UTF-8 with BOM. Fix Korean text from being corrupted… | by Hyunbin | Medium

Related issue: #43

@kou
Copy link
Member

kou commented May 4, 2024

I can understand the motivation but I'm not sure that this is a proper approach.

How about enabling BOM detection in CSV.open by default instead?

@JunichiIto
Copy link
Contributor Author

How about enabling BOM detection in CSV.open by default instead?

Do you mean BOM is removed automatically without specifying encoding?

# like this
csv = CSV.open('with-bom.csv', headers: true)
csv[0]['id'] #=> 1

It would be much better than warning!

@kou
Copy link
Member

kou commented May 5, 2024

Yes.

How about this?

diff --git a/lib/csv.rb b/lib/csv.rb
index b016b8f..b5a4fe6 100644
--- a/lib/csv.rb
+++ b/lib/csv.rb
@@ -1581,7 +1581,14 @@ class CSV
     def open(filename, mode="r", **options)
       # wrap a File opened with the remaining +args+ with no newline
       # decorator
-      file_opts = options.dup
+      file_opts = {}
+      have_encoding_options = (options.key?(:encoding) or
+                               options.key?(:external_encoding) or
+                               mode.include?(":"))
+      if not have_encoding_options and Encoding.default_external == Encoding::UTF_8
+        file_opts[:encoding] = "bom|utf-8"
+      end
+      file_opts.merge!(options)
       unless file_opts.key?(:newline)
         file_opts[:universal_newline] ||= false
       end

@JunichiIto
Copy link
Contributor Author

Great! That works fine. Is it possible to apply to CSV.parse too?

require './lib/csv'

text = "\u{feff}id,name\n1,Alice"
File.write('tmp/with-bom.csv', text)

csv = CSV.open('tmp/with-bom.csv', headers: true).read
p csv[0]['id']
p csv[0]['name']

csv = CSV.read('tmp/with-bom.csv', headers: true)
p csv[0]['id']
p csv[0]['name']

CSV.foreach('tmp/with-bom.csv', headers: true) do |row|
  p csv[0]['id']
  p csv[0]['name']
end

csv = CSV.parse(File.read('tmp/with-bom.csv'), headers: true)
p csv[0]['id']
p csv[0]['name']
"1"
"Alice"
"1"
"Alice"
"1"
"Alice"
nil
"Alice"

@kou
Copy link
Member

kou commented May 8, 2024

Is it possible to apply to CSV.parse too?

No. It's an user's responsibility that BOM is handled before calling CSV.parse because it receives an opened IO or String.

Why do you want to use CSV.parse(File.read) not CSV.open (or something)?

@JunichiIto
Copy link
Contributor Author

I used CSV.parse for testabilty. For example, like this:

require 'csv'

class MyClass
  def do_something
    CSV.parse(my_csv_text, headers: true).map do |row|
      row['name']
    end
  end

  # extract to a method to make it easier to stub
  def my_csv_text
    File.read('my_file.csv', encoding: 'bom|utf-8')
  end
end

RSpec.describe MyClass do
  let(:my_csv_text) do
    <<~CSV
      name,age
      Alice,20
      Bob,30
    CSV
  end
  it 'does something' do
    my_class = MyClass.new
    allow(my_class).to receive(:my_csv_text).and_return(my_csv_text)
    expect(my_class.do_something).to eq(['Alice', 'Bob'])
  end
end

@kou
Copy link
Member

kou commented May 12, 2024

The example works well because it uses encoding: 'bom|utf-8' for File.read, right?

I this case, both of the current CSV.parse and your implementation don't have any problem, right?

@JunichiIto
Copy link
Contributor Author

Yes, but my first implementation was like this:

require 'csv'

class MyClass
  def do_something
    CSV.parse(my_csv_text, headers: true).map do |row|
      row['name']
    end
  end

  # extract to a method to make it easier to stub
  def my_csv_text
    # my first implementation (it didn't work)
    File.read('my_file.csv')
  end
end

RSpec.describe MyClass do
  let(:my_csv_text) do
    <<~CSV
      name,age
      Alice,20
      Bob,30
    CSV
  end
  it 'does something' do
    my_class = MyClass.new
    allow(my_class).to receive(:my_csv_text).and_return(my_csv_text)
    expect(my_class.do_something).to eq(['Alice', 'Bob'])
  end
end

So I was wondering why I couldn't get values by column name.

@kou
Copy link
Member

kou commented May 14, 2024

I think that it's your program's problem.
If your program opens a file, your program is responsible for processing BOM.

(In this case, I think that it's better that you don't use a stub for better testing.)

@JunichiIto
Copy link
Contributor Author

OK, I understand your idea.

But as a library user, it's hard to notice the behavior difference between parse and open. Developers might be confused when row['name'] will not return value only when using parse. So at least the specification should be well documented.

@kou
Copy link
Member

kou commented May 15, 2024

Do you have a suggested documentation change?

@JunichiIto
Copy link
Contributor Author

How about this?

Please make sure if your text contains BOM or not. CSV.parse will not remove BOM automatically. You might want to remove BOM before calling CSV.parse:

# remove BOM on calling File.open
csv_table = File.open(path, encoding: 'bom|utf-8') do |file|
  CSV.parse(file, headers: true)
end

@kou
Copy link
Member

kou commented May 16, 2024

It looks good to me. Could you open a PR that adds it to the CSV.parse document?

csv/lib/csv.rb

Lines 1620 to 1731 in 73b877d

#
# :call-seq:
# parse(string) -> array_of_arrays
# parse(io) -> array_of_arrays
# parse(string, headers: ..., **options) -> csv_table
# parse(io, headers: ..., **options) -> csv_table
# parse(string, **options) {|row| ... }
# parse(io, **options) {|row| ... }
#
# Parses +string+ or +io+ using the specified +options+.
#
# - Argument +string+ should be a \String object;
# it will be put into a new StringIO object positioned at the beginning.
# :include: ../doc/csv/arguments/io.rdoc
# - Argument +options+: see {Options for Parsing}[#class-CSV-label-Options+for+Parsing]
#
# ====== Without Option +headers+
#
# Without {option +headers+}[#class-CSV-label-Option+headers] case.
#
# These examples assume prior execution of:
# string = "foo,0\nbar,1\nbaz,2\n"
# path = 't.csv'
# File.write(path, string)
#
# ---
#
# With no block given, returns an \Array of Arrays formed from the source.
#
# Parse a \String:
# a_of_a = CSV.parse(string)
# a_of_a # => [["foo", "0"], ["bar", "1"], ["baz", "2"]]
#
# Parse an open \File:
# a_of_a = File.open(path) do |file|
# CSV.parse(file)
# end
# a_of_a # => [["foo", "0"], ["bar", "1"], ["baz", "2"]]
#
# ---
#
# With a block given, calls the block with each parsed row:
#
# Parse a \String:
# CSV.parse(string) {|row| p row }
#
# Output:
# ["foo", "0"]
# ["bar", "1"]
# ["baz", "2"]
#
# Parse an open \File:
# File.open(path) do |file|
# CSV.parse(file) {|row| p row }
# end
#
# Output:
# ["foo", "0"]
# ["bar", "1"]
# ["baz", "2"]
#
# ====== With Option +headers+
#
# With {option +headers+}[#class-CSV-label-Option+headers] case.
#
# These examples assume prior execution of:
# string = "Name,Count\nfoo,0\nbar,1\nbaz,2\n"
# path = 't.csv'
# File.write(path, string)
#
# ---
#
# With no block given, returns a CSV::Table object formed from the source.
#
# Parse a \String:
# csv_table = CSV.parse(string, headers: ['Name', 'Count'])
# csv_table # => #<CSV::Table mode:col_or_row row_count:5>
#
# Parse an open \File:
# csv_table = File.open(path) do |file|
# CSV.parse(file, headers: ['Name', 'Count'])
# end
# csv_table # => #<CSV::Table mode:col_or_row row_count:4>
#
# ---
#
# With a block given, calls the block with each parsed row,
# which has been formed into a CSV::Row object:
#
# Parse a \String:
# CSV.parse(string, headers: ['Name', 'Count']) {|row| p row }
#
# Output:
# # <CSV::Row "Name":"foo" "Count":"0">
# # <CSV::Row "Name":"bar" "Count":"1">
# # <CSV::Row "Name":"baz" "Count":"2">
#
# Parse an open \File:
# File.open(path) do |file|
# CSV.parse(file, headers: ['Name', 'Count']) {|row| p row }
# end
#
# Output:
# # <CSV::Row "Name":"foo" "Count":"0">
# # <CSV::Row "Name":"bar" "Count":"1">
# # <CSV::Row "Name":"baz" "Count":"2">
#
# ---
#
# Raises an exception if the argument is not a \String object or \IO object:
# # Raises NoMethodError (undefined method `close' for :foo:Symbol)
# CSV.parse(:foo)

@JunichiIto
Copy link
Contributor Author

Thank you for your review. I created PR here: #305

@kou
Copy link
Member

kou commented May 17, 2024

I've applied #301 (comment) .

@JunichiIto
Copy link
Contributor Author

Thank you! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants