Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to Julia 0.6-1.0, drop support for 0.4-0.5 #24

Merged
merged 4 commits into from
Aug 20, 2018
Merged

Conversation

galenlynch
Copy link
Contributor

Sorry about the changes to white space, I didn't realize that some parts of the
code used tabs until it was too late.

@galenlynch
Copy link
Contributor Author

Failures on 1.0 are from Zip.jl

@fhs
Copy link
Owner

fhs commented Aug 18, 2018

Sorry about the tabs. I've changed them to spaces. Can you rebase? Thanks.

Sorry about the changes to white space, I didn't realize that some parts of the
code used tabs until it was too late.
@galenlynch
Copy link
Contributor Author

done

Copy link
Owner

@fhs fhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes!

src/NPZ.jl Outdated
@@ -58,36 +57,37 @@ end
# be fixed by rehashing the Dict when the module is
# loaded.

readle(ios::IO, ::Type{UInt16}) = htol(read(ios, UInt16))
readle(ios::IO, ::Type{T}) where T = ltoh(read(ios, T)) #ltoh is inverse of htol
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a space after # like all other comments

src/NPZ.jl Outdated
writele(ios::IO, x::UInt16) = writele(ios, reinterpret(UInt8, [htol(x)]))
# Endianness only pertains to multi-byte things
writele(ios::IO, x::AbstractVector{UInt8}) = writecheck(ios, x)
# NPY headers only have ascii strings (bytes) so endianness doesn't matter
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you should be assuming we're writing NPY header. In the future, we might be writing string data. Either change it back take a String input instead of AbstractString so that the input is UTF-8 encoded or handle byte order for any type of string (e.g. UTF-16).

In fact, I don't understand why all the String needs to be converted to AbstractString in the functions below. Since the header is ASCII, String works for us. There is no need to be more general, which has potential to slow down parsing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm my comment isn't quite accurate, it should work on any utf-8 string, which is the only string encoding that I've come across in Julia since Julia 0.5. My intent with using AbstractString type bound is to catch the many AbstractStrings in Julia that use utf-8. In Julia 0.7 more functions produce SubString etc, which in fact leads to breakage in NPZ.jl (see my comment in #23).

Excluding UTF-16, and other string types that I haven't seen in julia, there are still plenty of AbstractStrings that would work for these functions, and these functions need to handle at least SubString. The change regarding production of non-String AbstractStrings by base functions that bites here is JuliaLang/julia#22496, where all strip functions return SubStrings. As you can see below, all AbstractString subtypes in base are utf-8 encoded, but manipulate the order, extent, or content of an underlying String.

julia> subtypes(AbstractString)
4-element Array{Any,1}:
 String            
 SubString         
 SubstitutionString
 Test.GenericString

In fact, utf-8 is always a sequence of single bytes, even when codepoints are not themselves single bytes, so utf-8 strings are indifferent to endianness [#1]. Ascii is a subset of utf-8, and I think can be safely treated as always being utf-8.

However, you are right that someone may want to write other utf-encodings using strings generated from some package or other source. I'll change it so the codeunits are inspected before being written. This should allow it to work for all utf-8 AbstractStrings

src/NPZ.jl Outdated
function parsechar(s::String, c::Char)
writele(ios::IO, x::UInt16) = writecheck(ios, htol(x))

function parsechar(s::AbstractString, c::Char)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function was already a bit wrong in that it assumes c is an ASCII character. If it's not, it'll crash:

julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"

julia> typeof(s)
String

julia> s[1]
'∀': Unicode U+2200 (category Sm: Symbol, math)

julia> s[2:end]
ERROR: UnicodeError: invalid character index 2 (0x88 is a continuation byte)
Stacktrace:
 [1] macro expansion at ./REPL.jl:97 [inlined]
 [2] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73

Now that it's AbstractString, it may be UTF-16 and become always wrong.

Copy link
Contributor Author

@galenlynch galenlynch Aug 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it was already broken for more than just the character argument. String can be (and is) utf8, and you just demonstrated that it breaks on utf8.

To demonstrate...

julia> foo(s::String) = s
foo (generic function with 1 method)
julia> bar(s::String) = s[2]
bar (generic function with 1 method)
julia> s = "♨★  𝕦ᑎ𝐢ς𝐎𝐝є!!  ☯✌"
"♨★  𝕦ᑎ𝐢ς𝐎𝐝є!!  ☯✌"
julia> foo(s)
"♨★  𝕦ᑎ𝐢ς𝐎𝐝є!!  ☯✌"
julia> bar(s)
ERROR: StringIndexError("♨★  𝕦ᑎ𝐢ς𝐎𝐝є!!  ☯✌", 2)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] getindex_continued(::String, ::Int64, ::UInt32) at ./strings/string.jl:216
 [3] getindex at ./strings/string.jl:209 [inlined]
 [4] bar(::String) at ./REPL[31]:1
 [5] top-level scope at none:0

@galenlynch
Copy link
Contributor Author

galenlynch commented Aug 19, 2018

I fixed the unicode problems.

@fhs
Copy link
Owner

fhs commented Aug 20, 2018

I can't say I'm happy with the new string indexing -- it's too verbose. But it's acceptable.

@fhs fhs merged commit 0bd3fbf into fhs:master Aug 20, 2018
@fhs
Copy link
Owner

fhs commented Aug 20, 2018

Thanks for all the fixes!

@galenlynch galenlynch deleted the jl-0.7 branch August 20, 2018 00:26
@galenlynch
Copy link
Contributor Author

Seems like the price you have to pay if you want to play with unicode. Thanks again for maintaining such a useful package!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants