Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming runtime-sized row with stream_from #357

Closed
georgthegreat opened this issue Aug 6, 2020 · 30 comments
Closed

Streaming runtime-sized row with stream_from #357

georgthegreat opened this issue Aug 6, 2020 · 30 comments

Comments

@georgthegreat
Copy link
Contributor

georgthegreat commented Aug 6, 2020

I am trying to switch my application from pqxx6 to pqxx7.
For the scope of the issue we can assume that my application is uploading CSV file to PostgreSQL and vice versa. I would like to treat empty CSV values as SQL NULLs.

While pqxx::stream_to can stream std::vector<std::string> right into it, this API does not have any options for writing NULL values (same problem can appear with integers, yet PostgreSQL looks being capable to properly unquote them).

Migration to stream_from is even more complicated, as it does not support runtime-sized rows (only compile-time std::tuple-like classes are supported).

Are my assumptions correct? Which migration options do I have?

@jtv
Copy link
Owner

jtv commented Aug 6, 2020

There's no option to treat empty strings as nulls, but you can stream null values.

One way to get null string values in a stream is to use std::optional<string>. If you leave the std::optional empty, it's a null. If you put an empty string into the std::optional, it's an empty string. Similar for integers, and other types that do not have a "natural null value."

(A C-style char const * would also work. A null pointer becomes a null value. But I don't think that generalises to other types.)

About the second part: yes, if you want to stream values into an object using stream_from, you need to know at compile time how many columns there are in the data. If you really need to be able to stream an unknown number of fields, we can look into making that happen. But another way to do what you need might be to stream your stream_from into a stream_to. In that case, the number of columns just doesn't matter. But the precondition is that you can express any data conversions that you're doing along the way, in the SQL rather than in C++.

@georgthegreat
Copy link
Contributor Author

I am writing a simple function which uploads generic CSV file into a generic table (thus table size is not known at compile time, though the table schema is).

I would like to migrate without refactoring the whole application.

In order to solve the escaping issue, an ability to write std::vector<std::variant<nullptr, string, int, ...add you own types here>> would be fine.

@jtv
Copy link
Owner

jtv commented Aug 7, 2020

Shouldn't be too hard to add a member function in stream_from which lets you read a data into an output iterator. You'd still have to size the container appropriately, of course.

@georgthegreat
Copy link
Contributor Author

I assume the data will be coming in escaped form, as it is returned by COPY FROM?

@jtv
Copy link
Owner

jtv commented Aug 7, 2020

The stream would un-escape the strings. So parsing them wouldn't be very hard.

jtv added a commit that referenced this issue Aug 10, 2020
Lets you read rows of data from a `stream_from` without knowing at
compile time how many columns the stream has.

See #357.
@jtv
Copy link
Owner

jtv commented Aug 10, 2020

@georgthegreat could you have a look at the latest master and see whether stream_from::read_row() does what you need?

@georgthegreat
Copy link
Contributor Author

I will try to take a look and return in a couple of days.

@georgthegreat
Copy link
Contributor Author

It looks like stream_to ctor with pqxx::from_table does not support schema-qualified names:

Upon changing

pqxx::tablereader reader{txn, schemaName + "." + tableName, std::string(NULL_VALUE)};

to

pqxx::stream_from reader{txn, pqxx::from_table, schemaName + "." + tableName};

I am getting the following exception

ERROR: relation "test_schema.test_table" does not exist

I have added some debug printing to check if the table exists.

auto result = txn.exec("SELECT COUNT(*) FROM " + schemaName + "." + tableName);
std::cerr << "TOTAL ROW COUNT IS " << result[0][0].as<int64_t>() << std::endl;

says

TOTAL ROW COUNT IS 2

@georgthegreat
Copy link
Contributor Author

georgthegreat commented Aug 10, 2020

Schema qualified names should be supported by COPY TO command, so the issue should be on the client side.

@georgthegreat
Copy link
Contributor Author

The following ctor works fine:

pqxx::stream_from reader{
        txn,
        pqxx::from_query,
        "SELECT * FROM " + schemaName + "." + tableName
};

Now my code fails upon attempt to convert empty string to int64_t. How should I handle NULL values with stream_from::read_row() API?

@jtv
Copy link
Owner

jtv commented Aug 10, 2020

The documentation for read_row() covers nulls, but perhaps not clearly enough: nulls show up as zview values whose data pointer is null.

I think the problem with the schema name happens because the constructor quotes and escapes your table name, and the schema and table parts each need their own pair of quotes. To handle this properly I'm going to need a new constructor with a "schema" parameter. This is a royal pain given how many overloads there already are. It all gets easier with C++20, but even then, the old constructors will still be in the way. I hope the language will give us keyword arguments soon!

What I can do for now is to document that pitfall.

@georgthegreat
Copy link
Contributor Author

georgthegreat commented Aug 10, 2020

The documentation for read_row() covers nulls, but perhaps not clearly enough: nulls show up as zview values whose data pointer is null.

C++ explicitly forbids creation of std::string_view with data == nullptr and size != 0. See cppreference.com

The behavior is undefined if [s, s+count) is not a valid range (even though the constructor may not access any of the elements of this range). After construction, data() is equal to s, and size() is equal to count.

So, this way of handling NULL values does not allow to distinct between NULL value of text column and empty string. Such limitation is OK in C++ (where null string and empty string are considered to be the same), but it will not work in SQL.

@georgthegreat
Copy link
Contributor Author

Also, something is wrong with the classic pattern:

while (reader) {
   row = reader.read_row()
}

While my test data count only 2 rows, this loop is executed at least three times according to debug printing. It looks like the reader changes its state upon invocation of read_row() method (thus, an extra check after read_row() fixes the thing).

So, finally I was able to fix all the issues. Howerer, I can not call new API very convenient, as it adds at least two sources of errors.

@georgthegreat
Copy link
Contributor Author

georgthegreat commented Aug 10, 2020

Changing read_row API to return std::optional<std::vector<zview>> would solve the issue mentioned above. The stream reading pattern would be:

while (auto maybeRow = stream.read_row()) {
    //inside the loop maybeRow is never null
    for (auto field: *maybeRow) {
        //do something with field;
    }
}

@jtv
Copy link
Owner

jtv commented Aug 10, 2020

C++ explicitly forbids creation of std::string_view with data == nullptr and size != 0. See cppreference.com

When the data pointer is null, the size is 0. I figured that was obvious enough to be left implied, exactly for this reason, but I can make it more explicit if you think it's not clear.

So, this way of handling NULL values does not allow to distinct between NULL value of text column and empty string. Such limitation is OK in C++ (where null string and empty string are considered to be the same), but it will not work in SQL.

I don't see what you mean here... A null string and an empty string are not considered the same thing, whether in C++ or in SQL.

@georgthegreat
Copy link
Contributor Author

How do you implement empty non-null text field in zview?
For me it could be either zview(nullptr, 0) or zview(some_pointer_pointing_to_zero_byte, 0) (thus the default constructed zview would correspond to empty string, matching the behaviour of std::string).

@jtv
Copy link
Owner

jtv commented Aug 11, 2020

An empty field has a pointer pointing to a zero byte, yes.

A default-constructed zview has a null data pointerj. So it would correspond to a null string, not an empty one.

@jtv
Copy link
Owner

jtv commented Aug 11, 2020

The std::optional idea is an interesting one. Slight performance cost, but probably not a lot. These loops tend to get a little ugly because you need to try the action first, then check whether it succeeded, and only then make use of the results. It's an awkward fit for the standard "condition at the head" loop style.

Here's a different way we could slice it:

while (stream.parse_line())
    for (auto field : stream.get_fields())
        process(field);

@georgthegreat
Copy link
Contributor Author

georgthegreat commented Aug 11, 2020

The different way would cause problems if parse_line() and get_fields() methods will be invoked independently.
Reducing internal state of the class looks like a better idea, but I would leave to the project owner to decide.

Any ideas about solving the same problem for pqxx::stream_to?

@georgthegreat
Copy link
Contributor Author

An empty field has a pointer pointing to a zero byte, yes.

A default-constructed zview has a null data pointerj. So it would correspond to a null string, not an empty one.

This is indeed contradictory. As I see, empty() method was not overriden, so both NULL-representing zview and empty zview would report as being empty().

@KayEss
Copy link
Contributor

KayEss commented Aug 11, 2020

while (auto maybeRow = stream.read_row()) {
    //inside the loop maybeRow is never null
    for (auto field: *maybeRow) {
        //do something with field;
    }
}

This will be a common pattern with coroutines in C++20 as it'll be needed for asynchronous generators. It'll just be very slightly different:

while (auto maybeRow = co_await stream.read_how()) { /* ... */ }

@georgthegreat
Copy link
Contributor Author

@jtv, could we finish the work on this issue? At the time I am stuck in the middle of upgrade.

I would like to conclude the long list of the comments above and collapse it into the following:

  1. Adapt pqxx:zview by explicitly adding null-specific predicates into the pubic interface:
bool is_null() const {
  return data() == nullptr;
}

bool empty() const {
  return (data() != nullptr && size() == 0)
}

The latter overrides the default std::string_view method.

  1. Change stream_from interface to return std::optional (empty optional would be returned upon reading all the data from the stream).
  2. Add similar interface to stream_to.

I can submit the PR for 1., but I will need you help with 2 and 3.

@jtv
Copy link
Owner

jtv commented Aug 15, 2020 via email

@georgthegreat
Copy link
Contributor Author

georgthegreat commented Aug 15, 2020

As for stream_to, it is completely unclear how to write NULL value into the stream.

As far, as I see from the code, there is no text type capable of holding null values.

internal/conversions.hxx:425:template<std::size_t N> struct nullness<char[N]> : no_null<char[N]>
internal/conversions.hxx:458:template<> struct nullness<std::string> : no_null<std::string>
internal/conversions.hxx:498:template<> struct nullness<std::string_view> : no_null<std::string_view>
internal/conversions.hxx:525:template<> struct nullness<zview> : no_null<zview>
internal/conversions.hxx:542:template<> struct nullness<std::stringstream> : no_null<std::stringstream>

(with zview it looks like a bug again, as the major part of above discussion was about the ability of zview to hold null values).

char* / const char* can hold such values, but is hardly usable.

UPD: was able to make it work with std::optional<std::string>, but an infinite recursion bug occurred (see PR below)

@jtv
Copy link
Owner

jtv commented Aug 15, 2020

One way to produce a null int, say, is to use std::optional<int>. See the "streams" documentation, under "Null values" (near the top): https://libpqxx.readthedocs.io/en/stable/a01344.html

jtv added a commit that referenced this issue Aug 15, 2020
You can now use `read_row()` in loops like:

    while (auto fields = stream.read_row())
        process(*fields);

See #357.
@jtv
Copy link
Owner

jtv commented Aug 15, 2020

Okay, I've changed read_row() to return a pointer to the std::vector<pqxx::zview>. Could you give this version a try?

I didn't want to return std::optional<std::vector<pqxx::zview>> because it means copying the vector. This is an API that people use for high-performance code, after all.

@georgthegreat
Copy link
Contributor Author

New interface works fine, thanks.

I also would like to propose this additional small change: #365

I also have an idea of speeding stream_to up by allowing the user to write individual fields, but this is the subject of
future issue / PR.

@georgthegreat
Copy link
Contributor Author

JFYI, I have executed some benchmarks to compare old streams agains new streams:

Was (pqxx6 / tablereader):

--- Table reader without parsing.
Loaded 100000 in 0.228427.
--- Table reader.
Loaded 100000 in 0.203817.
--- PQXX standard parsing.
Loaded 100000 in 0.312323s (execute time 0.122477s).

Now (pqxx7 / stream_from):

--- Table reader without parsing.
Loaded 100000 in 0.124407.
--- Table reader.
Loaded 100000 in 0.129337.
--- PQXX standard parsing.
Loaded 100000 in 0.327017s (execute time 0.112914s).

So, new streams is about 30% faster. These numbers should not be considered as a thorough benchmark though.

@jtv
Copy link
Owner

jtv commented Aug 17, 2020

Hey, thanks for the benchmark numbers. It may be anecdotal, but it's music to my ears.

Looks to me as if the table reader is significantly faster and the parsing is sligggghtly slower — but the difference there is small enough to be random measurement jitter, or the cost of additional overflow checks and such.

If you're happy with the new API, I guess we can close this ticket. If any more ideas for improvement come up, feel free of course to open a new ticket.

@georgthegreat
Copy link
Contributor Author

Thanks!

We shall continue the discussion in further PRs / issues then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants