Restrict indexing into strings to a special ByteIndex or StringIndex type #9297
Comments
Remove indexing totally, only allow accessing strings via iterators? Access is O(1), and it knows the string it applies to. |
@elextr Not remove: as the title says, restrict it to |
@nalimilan I was continuing on from your last comment about having a reference to the string in the index. That makes it possible to use it as an iterator, so there is no chance of applying it to the wrong string, and it can be passed to functions as a single parameter. And moving away from the indexing syntax removes the implication that it is guaranteed to be O(1). |
Ah, OK. I'm not sure this is very practical since indexing is quite a common pattern, and anyway you will need a function to extract a character or a substring at a given position. Though that would allow for more explicit names like |
Yes, |
I've considered that kind of |
@StefanKarpinski In my mind the reference to the string object was only meant as an optimization, so if it hurts it could be avoided. The minimal version of |
I've argued before that indexing should be removed completely in the future, so I'll argue that again. (1) We know there's no clearly right definition of indexing. This proposal is meant to force people to remove ambiguity about what they hope indexing will mean. So it's a huge step forward. But there's still point (2). (2) I believe a lot of the people who will find indexing into strings intuitive will be distraught to learn that you can't mutate strings via |
Actually I'd be rather in favor of removing string indexing. Making strings look like arrays sounds too confusing to me, because of the immutability, but also because of the absence of equivalence between bytes and codepoints. But we would need to identify common use cases and how they would work with a new API. For example, the incorrect |
FWIW, I've just bumped into the Rust issue where string indexing was removed: rust-lang/rust#12710 See also http://doc.rust-lang.org/guide-strings.html But they still allow taking byte slices via an explicit call, raising a run-time error on things like Rust also provides iterators for bytes, codepoints and graphemes, which must be chosen explicitly. This also sounds like a good idea to me, removing all issues from naive uses of Unicode strings. And it's possible thanks to @stevengj's PR #9261. Go doesn't sound like a good model to follow IMHO, not least because they allow arbitrary bytes (in my words, "random garbage") in their |
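To make the "choose the iterator explicitly" idea concrete, here is a minimal sketch in present-day Julia. It assumes Julia 1.x and its `Unicode` standard library, which postdate this discussion; the sample string is only an illustration.

```julia
using Unicode  # standard library; provides `graphemes`

# "noël" spelled with a combining diaeresis (U+0308) after the 'e'
s = "noe\u0308l"

bytes      = collect(codeunits(s))   # UTF-8 code units
codepoints = collect(s)              # one Char per code point
clusters   = collect(graphemes(s))   # user-perceived characters

# The three views disagree on how long the string is.
println(length(bytes), " ", length(codepoints), " ", length(clusters))
```

Each view gives a different length for the same string, which is why a single implicit notion of "the n-th element" is easy to get wrong.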
I don't think disallowing indexing into strings is helpful or sensible. You need a way of talking about positions in strings. Otherwise how do you talk about things like "give me the next character after this position in this string" or "give me the substring between this position and this position"? You can't use characters to talk about positions because characters can occur multiple times and even if they didn't this would imply doing an O(n) search to find a position, which is clearly a non-starter. What might be sensible is disallowing arithmetic with indices into strings. Whenever someone writes

    immutable StringIndex{S<:String}
        string::S
        index::Int
    end
    Base.getindex(s::String, si::StringIndex) =
        s === si.string ? s[si.index] : s[chr2ind(s, ind2chr(si.string, si.index))]
    function +(si::StringIndex, j::Integer)
        j < 0 && return si - (-j)
        i = si.index
        while j > 0
            i = nextind(si.string, i)
            j -= 1
        end
        return StringIndex(si.string, i)
    end
    function -(si::StringIndex, j::Integer)
        j < 0 && return si + (-j)
        i = si.index
        while j > 0
            i = prevind(si.string, i)
            j -= 1
        end
        return StringIndex(si.string, i)
    end

That's pretty complicated code given that by far the most common use case is |
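For readers on current Julia, the same idea can be sketched roughly as follows. This is an assumption-laden translation, not code from the thread: it assumes Julia 1.x, where `immutable` became `struct`, and where `ind2chr(s, i)` and `chr2ind(s, n)` were replaced by `length(s, 1, i)` and `nextind(s, 0, n)`. The helper `firstposition` is hypothetical and exists only for the usage lines.

```julia
# Hypothetical opaque index type for illustration; nothing like it exists in Base.
struct StringIndex{S<:AbstractString}
    string::S
    index::Int   # byte offset at which a character starts
end

# Hypothetical helper: index of the first character of `s`.
firstposition(s::AbstractString) = StringIndex(s, firstindex(s))

function Base.getindex(s::AbstractString, si::StringIndex)
    s === si.string && return s[si.index]
    # Different string: count characters up to the index, then find the
    # same character number in `s`. This path is O(n).
    n = length(si.string, 1, si.index)
    return s[nextind(s, 0, n)]
end

# Move forward (or backward for negative `j`) by `j` characters, O(|j|).
function Base.:+(si::StringIndex, j::Integer)
    i = j >= 0 ? nextind(si.string, si.index, j) : prevind(si.string, si.index, -j)
    return StringIndex(si.string, i)
end
Base.:-(si::StringIndex, j::Integer) = si + (-j)

s = "αβγ"
i = firstposition(s)
println(s[i + 1])        # second character of s
println("xyz"[i + 2])    # third character of a different string
```

The arithmetic stays O(|j|) per step, which is the cost the comment above is worried about when such an index is used in a loop.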
I've tried a bunch of tricks to try to make this |
I think there are two issues:
Regarding 1), I don't have a strong opinion though I'm not a fan of indexing strings. (Actually I'm more concerned about iteration having different legitimate meanings.) Regarding 2), I agree with @StefanKarpinski that inefficient arithmetic would be a problem, as it would let people use |
@StefanKarpinski: I don't believe the concept of "position" has any coherent meaning for strings. After all, our current indexing rules refer to indexing into bytes, whereas Python 3's indexing rules refer to indexing into codepoints. In light of multiple ways to break strings apart into pieces, |
@johnmyleswhite No, the byte offset is a good index of a position within a string: it's like a pointer to a memory area. But it's an implementation detail which should never leak to the user. You should be able to ask for the grapheme at position X (in bytes), given that this position is obtained as an opaque |
The consequences of not having any notion of position for strings (even an opaque one that just happens to correspond to byte offsets) would be massive string API bloat and complication. Consider the very simple question: given some string, does "foo" occur earlier than "bar" as a substring, assuming both occur? With a notion of position that is at least ordered, we can answer this question very simply:

    search(s, "foo")[1] < search(s, "bar")[1]

Without an ordered notion of position, this becomes much harder. What does

    contains(search(s, "foo").before, "bar")

Of course, that only works for "foo" and "bar" because they share no letters and cannot overlap. What if the strings are "abba" and "baab" instead? I'm not even going to try to express the logic for figuring out which one occurs first in a string without a notion of position. |
I guess one way of doing this would be to compare the lengths of the before parts:

    length(search(s, "foo").before) < length(search(s, "bar").before)

Of course, that adds an extra pair of O(n) operations to the solution, which is lousy. |
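The ordered-position comparison can be written in current Julia along these lines. This is a sketch assuming Julia 1.x, where `findfirst(needle, s)` plays the role that `search` has in the comments above; `first_offset` and `occurs_before` are hypothetical helper names.

```julia
# Byte offset of the first occurrence of `needle` in `s`, or `nothing` if absent.
function first_offset(s::AbstractString, needle::AbstractString)
    r = findfirst(needle, s)
    return r === nothing ? nothing : first(r)
end

# Assumes both substrings occur in `s`, as in the question posed above.
occurs_before(s, a, b) = first_offset(s, a) < first_offset(s, b)

println(occurs_before("a bar then a foo", "bar", "foo"))
println(occurs_before("éabbaab", "abba", "baab"))   # overlapping matches
```

Byte offsets are enough here because they are ordered consistently with character order, even when the string contains multi-byte characters.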
One idea I've toyed with in the past is maintaining a list of positions Indexing into code points or graphemes then becomes well defined. For The downside, of course, is the overhead of an extra array, which would be
|
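One possible reading of the position-list idea, sketched in Julia 1.x: store the byte offset at which each code point starts, so that the n-th code point becomes an O(1) lookup. The names `starts` and `nth_char` are hypothetical, and the comment above does not say how the list would actually be maintained.

```julia
s = "naïve café"

# Byte offset at which each code point begins (one Int per character).
starts = collect(eachindex(s))

# O(1) access to the n-th code point through the table.
nth_char(s, starts, n) = s[starts[n]]

println(starts)
println(nth_char(s, starts, 4))
```

The extra array costs one `Int` per character here, which illustrates the overhead mentioned as the downside.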
@kmsquire, I think the usual feeling (when discussions of variable-length encodings come up) is that schemes like yours are unnecessary, because randomly accessing the n-th codepoint or grapheme in a string is actually quite rare. Nearly all string processing is either sequential or sequential search followed by random access at the indices returned by search (or sequential accesses starting at these indices). That is, you only ever need "random" access at indices that are valid by construction (because they are merely cached). @StefanKarpinski, I like the idea of an |
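The "search, then access at the returned index" pattern might look like this in current Julia. It is a sketch assuming Julia 1.x functions (`findfirst`, `prevind`, `nextind`); the key/value string is made up for illustration.

```julia
s = "clé=valeur"

# Sequential search returns an index that is valid by construction.
i = findfirst('=', s)

# Step to the neighbouring characters with prevind/nextind, not i - 1 or i + 1.
key   = s[1:prevind(s, i)]
value = s[nextind(s, i):end]

println(key, " => ", value)
```

No arbitrary integer is ever used as an index: every position comes from the search or from stepping relative to it.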
It feels weirdly half-baked to prevent some kinds of cross-indexing errors but not others. In the above |
It doesn't seem weird to me. The main point of a |
@StefanKarpinski an iterator is a position in a string, and, with the usual implementation of pointer to string and offset, it can be ordered simply by overloading < to consider only the offset. Arithmetic will be possible with overloading of + and - and will then DTRT with respect to incrementing by byte, codepoint or grapheme and the extraction functions |
@StefanKarpinski yes we can add codegen cases for |
Unfortunately, it seems like neither of those things suffice. The necessary steps seem to be inlining of the getindex call and elimination of the construction of the |
@elextr, technically an iterable type in Julia is not a position in the string. The iterable combined with the iteration state (which is created by |
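The distinction between the iterable and its state can be seen directly in current Julia. This sketch assumes the Julia 1.x `iterate` protocol, which replaced the `start`/`next`/`done` functions in use when this thread was written. It also relies on the fact that, for `String`, the state happens to be the byte index of the next character.

```julia
s = "añb"

# Each call returns the current character and the state to resume from.
(c1, st1) = iterate(s)
(c2, st2) = iterate(s, st1)
(c3, st3) = iterate(s, st2)

println((c1, st1), " ", (c2, st2), " ", (c3, st3))
println(iterate(s, st3) === nothing)   # iteration is finished
```

The string itself carries no position. The position lives entirely in the state value that is threaded through the calls.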
@JeffBezanson yes, I was thinking like Python/C++ iterators where the state is part of the iterator object. (my excuse is that searching the docs for "iterator" does not take you to the section http://docs.julialang.org/en/latest/stdlib/base/#iteration :) |
@StefanKarpinski What's the use case for indexing a string with an index obtained on another one? Even if done correctly (by counting code points or graphemes), I guess it wouldn't make much sense unless the two strings start with exactly the same content. And if that's the case, people would do better to index directly at a given byte offset instead of using the code you suggested, which is O(n) for different strings. In short, I think it would make for a more reasonable design if cross-indexing of strings was not supported by |
Slightly contrived example: you have a data parsing framework that lets you specify fields using something like this |
@nalimilan: "No, the byte offset is a good index of a position within a string" - Not even this is safe - [in my idea of a string implementation.. That is, if position means everything after that byte follows in the string.] I support this issue. Would like to see indexing gone in 0.4 (or deprecated). |
This might be a good candidate for 0.5, where lots of changes to indexing semantics are planned. |
Dropping in 0.5 might be good enough, but do we then need to deprecate it in 0.4? Not sure how that is done: is just documenting it enough, or does there need to be a warning? Anyway, I think emitting warnings is a minor code change and will not break any code. [Dropping the indexing functionality is also simple, except that it breaks code. Not sure whether much code breaks, however.] |
Just to be clear, I'm not necessarily advocating for the removal of string indexing - just pointing out that changes to string indexing behavior might as well happen at the same time as changes to array indexing behavior. |
Two comments:
|
It would indeed be interesting to try again with #15259 and see whether performance has improved. Regarding your second point, it seems that the issue of invalid sequences is mostly orthogonal to The area where |
This is a proposal that's been made in Rust and apparently is still under discussion (see rust-lang/rust#10044 (comment)), but I thought it could be interesting for Julia.
The idea is that since indexing strings with a number like s[3] only makes sense when 3 actually corresponds to the boundary of a Unicode code point, it represents a trap for developers who only test it on ASCII, making bugs appear only in production when used with non-ASCII text. Typical cases are the naive:
the tempting:
(Julia equivalent of http://www.reddit.com/r/rust/comments/1zlq21/should_rust_be_more_careful_with_unicode/cfush88)
or the slightly more involved:
Instead of letting people do incorrect-but-easy things like this, it could be useful to restrict string indexing to a special type, say ByteIndex, instead of a plain integer. match would provide offset as that type too. It would prevent both naive indexing using integers as well as doing incorrect arithmetic on indexes you get from functions, encouraging people to always use dedicated functions.
It might also make sense to allow arithmetic operations on this type, so that idx + 1 means "the code point after the one at position idx", which would be O(n) but starting from the index, and usually you don't take very large offsets. I'm not saying this is necessarily a good idea, though, because ByteIndex implies a reasoning in bytes, and then arithmetic operations would switch to a reasoning in code points. It could be named StringIndex instead, and made opaque so that people never see the integer index which is in bytes.
Finally, it might be possible to perform some optimizations by removing checks that the index corresponds to the start of a code point, if the index held a reference to the string it was built from, so that it can be checked that it matches the indexed string. Not sure it would be significant, though.
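The trap described above is easy to reproduce. The following sketch assumes Julia 1.x behavior, where indexing inside a multi-byte character throws a `StringIndexError`. It is an illustration written for this page, not one of the examples from the original post.

```julia
s = "héllo"                # 'é' occupies two bytes in UTF-8

println(s[2])              # 'é' starts at byte 2
println(isvalid(s, 3))     # byte 3 is inside 'é', so it is not a valid index

# Indexing at byte 3 raises an error instead of returning a character.
err = try
    s[3]
catch e
    e
end
println(err isa StringIndexError)

# Dedicated functions step over whole characters.
j = nextind(s, 2)
println(s[j])
println(first(s, 2))
```

With an ASCII-only test string such as "hello", every integer from 1 to 5 is a valid index, so the error never shows up until non-ASCII input arrives.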