Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

as_array() on varchar-based array_agg() column - and semicolon separator #590

Closed
mmd-osm opened this issue Jul 31, 2022 · 18 comments · Fixed by #609
Closed

as_array() on varchar-based array_agg() column - and semicolon separator #590

mmd-osm opened this issue Jul 31, 2022 · 18 comments · Fixed by #609

Comments

@mmd-osm
Copy link

mmd-osm commented Jul 31, 2022

We've experienced some issue when using as_array() on a varchar-based array_agg() column:

  auto values = psql_array_to_vector(row["tag_v"].as_array());
(psql_array_to_vector is a simple wrapper method calling the libpqxx provided array parser)
std::vector<std::string> psql_array_to_vector(pqxx::array_parser&& parser) {
  std::vector<std::string> result;

  auto obj = parser.get_next();
  while (obj.first != pqxx::array_parser::done)
  {
    if (obj.first == pqxx::array_parser::string_value) {
      result.push_back(obj.second);
    }
    obj = parser.get_next();
  }
  return result;
}

Source:

For our payload {use_sidepath,secondary,3,1,yes,50,"Rijksweg Noord",asphalt,left|through;right}, we expected the libpqxx array parser to extract 9 values, rather than 10.

10 values:
    use_sidepath
    secondary
    3
    1
    yes
    50
    Rijksweg Noord
    asphalt
    left|through
    right

As shown in the payload above, the last element in the array ( asphalt,left|through;right ) is both unquoted, and contains a semicolon character.

libpqxx seems to treat this semicolon as a separator character, although the underlying datatype is a varchar. We didn't find any way to influence which characters should be treated as a separator character. As Postgresql wouldn't escape a string containing semicolons, it would be good not to consider semicolon as a separator here.

zerebubuth/openstreetmap-cgimap#276 (comment) has a bit more discussion

Relevant query
SELECT w.id, w.visible, 
         to_char(w.timestamp,'YYYY-MM-DD\T\HH24:MI:SS\Z\') AS timestamp, 
         w.changeset_id, w.version, t.keys as tag_k, t.values as tag_v, 
         wn.node_ids as node_ids 
       FROM current_ways w 
         LEFT JOIN LATERAL 
           (SELECT array_agg(k) as keys, array_agg(v) as values 
            FROM current_way_tags WHERE w.id=way_id) t ON true 
         LEFT JOIN LATERAL 
           (SELECT array_agg(node_id) as node_ids 
            FROM 
              (SELECT node_id FROM current_way_nodes WHERE w.id=way_id 
               ORDER BY sequence_id) x) wn ON true 
       WHERE w.id = 4000392205
       ORDER BY w.id

(based on https://github.com/zerebubuth/openstreetmap-cgimap/blob/master/test/structure.sql)

Downstream issue: zerebubuth/openstreetmap-cgimap#276

Thanks for looking in this!

@tomhughes, @pnorman: please chime in, in case I forgot something.

@tomhughes
Copy link

tomhughes commented Jul 31, 2022

I think that basically covers it - as the array parser has no idea of the type of the value it's operating on and can't access the type info to get the separator I think the separator needs to be an argument to the parser so it can be configured.

If it defaulted to comma then existing code would still work in almost all cases and the odd cases where semicolon was needed would need to change.

That would also allow custom types that use other separators to use it by configuring the separator.

@jtv
Copy link
Owner

jtv commented Jul 31, 2022

You're right, that's always been a problem waiting to happen... I think at the time I thought that the backend would quote such strings. I also just discovered another thing I thought about array quoting was wrong: I thought there was such a thing as single-quoted array elements. See #587.

So we do need to pass a separator, which can default to a comma. But it gets more difficult when an array combines different types, with different separators. I think the only way that can happen though is in a multi-dimensional array where the ultimate elements use semicolon as the separator. Do you agree?

If so, we may want a separate provision for multi-dimensional arrays. TBH array handling in libpqxx is still in its infancy. For the existing as_array() I think it would make sense to pass the separator as a template argument. That fits best with the direction I'm taking with text encodings.

@github-actions
Copy link

There has been no activity on this ticket. Consider closing it.

@pnorman
Copy link

pnorman commented Sep 30, 2022

I don't think this should be closed, as it's still a bug that prevents using text[] arrays where there is no constraint excluding semicolons.

I think the only way that can happen though is in a multi-dimensional array where the ultimate elements use semicolon as the separator.

Checking pg_type, the only type that use something other than , as a delimiter is the box type. Can you give an example of another situation where a ; would be the delimiter?

postgres=# SELECT typname, typdelim FROM pg_type WHERE typdelim <> ',';
 typname | typdelim
---------+----------
 box     | ;
 _box    | ;

Of course, other types could added with different delimiters. Notably, PostGIS has types with : as a separator.

It's always possible for a user to create custom types with other delimiters, but this is unlikely.

CREATE TYPE silly;
create function sillyin(cstring) returns silly language internal IMMUTABLE PARALLEL SAFE STRICT as $function$textin$function$;
create function sillyout(silly) returns cstring language internal IMMUTABLE PARALLEL SAFE STRICT AS $function$textout$function$;
CREATE TYPE silly (INPUT = sillyin, OUTPUT = sillyout, LIKE = text, DELIMITER = 'a');
select ARRAY['b','c']::silly[];

gives {bac} as the text representation for an array with two elements.

@jtv
Copy link
Owner

jtv commented Sep 30, 2022 via email

@pnorman
Copy link

pnorman commented Oct 1, 2022

The delimiter character is always the one defined before the type, including multidimensional arrays. e.g. SELECT ARRAY[['meeting', 'lunch'], ['training', 'presentation']]::silly[][] gives {{meetingalunch}a{"training"a"presentation"}}. As far as I can tell, an array can only ever have exactly one delimiter.

Making it a template argument prevents one from querying for the separator character from the DB, but this is probably fine.

@github-actions github-actions bot closed this as completed Oct 8, 2022
@jtv
Copy link
Owner

jtv commented Oct 8, 2022

Github bot, what are you doing!? I'm still working on this one!

@jtv jtv reopened this Oct 8, 2022
@mmd-osm
Copy link
Author

mmd-osm commented Oct 8, 2022

Maybe try removing the no-issue-activity label?

@jtv
Copy link
Owner

jtv commented Oct 8, 2022

Ah, I hadn't noticed the label, thanks. The button to remove it showed up below the conversation (where there's a second "Labels" section. Not great UI work.)

@jtv
Copy link
Owner

jtv commented Oct 24, 2022

I've come to the conclusion that I can't fix this properly in array_parser without (1) breaking compatibility and (2) complicating an already awkward API that little bit more.

So I'm changing array_parser so that it will only work with comma separators, and designing a new API for parsing arrays.

jtv added a commit that referenced this issue Oct 24, 2022
The `array_parser` was seriously broken (#590): SQL arrays may contain elements that have a semicolon in them... _and the back-end won't put them in quotes._  The parser would always see that as a field separator.  Same thing for commas in e.g. the SQL "box" type, which uses the semicolon as its separator but uses commas inside an object.

So I'm limiting `array_parser` for use with comma-separated types only, and planning a better, friendlier, faster, more flexible API for parsing arrays.

At the same time, I did manage to specialise `array_parser` internally to different encodings, which should make it considerably faster.  These changes will also benefit the future array parsing API.
@github-actions
Copy link

There has been no activity on this ticket. Consider closing it.

@pnorman
Copy link

pnorman commented Dec 24, 2022

Please remove the no activity label so it doesn't autoclose

@jtv
Copy link
Owner

jtv commented Dec 25, 2022

Thanks for the reminder. For the record, there has been activity on the ticket but it's been moving slowly - #609 is a new, very different array parser that converts an SQL array into something like a C++ container.

@pnorman
Copy link

pnorman commented Jan 2, 2023

Can this be re-opened as its not yet fixed?

@jtv
Copy link
Owner

jtv commented Jan 2, 2023

Yes, and I'll merge the fix right away. Which I think will automatically close it again. :-)

@jtv jtv reopened this Jan 2, 2023
@jtv
Copy link
Owner

jtv commented Jan 2, 2023

The fix, by the way, is a new pqxx::array class. You construct it with a string representing an SQL array, and a connection (just so it knows which encoding to use), and it holds the values, converted to a type of your choice.

I still need to sort out iteration though. It's got indexing, but iteration gets a little complicated for multi-dimensional arrays.

@jtv jtv closed this as completed in #609 Jan 2, 2023
jtv added a commit that referenced this issue Jan 2, 2023
New array parser: `pqxx::array`.  Parses an SQL array (in string form) and keeps the values internally as an array of converted values.

This should address #590.  Still need to sort out iteration though.
jtv added a commit that referenced this issue Jan 2, 2023
Further to the work on #590.

Implementing simple array iteration. And when I say "simple," I really mean the simplest I could do: storage-order iteration, going through all dimensions.

The alternative would be to define something like a multidimensional subrange, and iterate over those. In the two-dimensional case, that would be like iterating a `result` to get `row` elements, and then iterating each `row` to get the `field` elements.  But would it really be all that useful?  It's a lot of code overhead and cognitive load, both on the libpqxx side and on the application side.  I don't want all that complexity again — streaming queries have gotten us away from that for query results and it's been wonderful.  Plus, in practice, all you may want is to iterate over a one-dimensional array.  Or you may just want to visit all elements and not care at all about the order.

It may become easier once we have `std::mdspan`.  Depending on the conventions that form, we may have a problem on our hands with `size()`: is that going to be the size including all elements, or is it just the size along the outer of the dimensions?  The latter seems almost arbitrary.  We'll have to cross that bridge when we come to it.
@alexolog
Copy link

So we're not supposed to be using field::to_array anymore?

@jtv
Copy link
Owner

jtv commented May 27, 2024

@alexolog I think that's right. The whole thing has flushed out of my mental cache by now but... you can use field::as_array() but it's just not a very good API.

So perhaps we should just deprecate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants