cursor thread safety #1438

jerch · 2022-03-21T07:57:12Z

jerch
Mar 21, 2022

I am trying to get faster COPY FROM throughput with less wasted memory with psycopg2. The problem I basically face is a "tandem mode", where first python fills a BytesIO buffer, and in the second step copy_from uses this as input file. (Note that I don't face the issue with psycopg3, as it exposes a writable file object itself, but I cannot use that yet, since django is still on psycopg2).

So what I came up with is this thread construct, which writes bigger chunks once in a while into a pipe (plz ignore the table details, it is just to compare raw throughput with psycopg3):

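import os
import threading

def threaded(cur, fr):
    # runs in the worker thread: COPY reads from the pipe's read end until EOF
    cur.copy_from(fr, 'temp_table', size=131072, columns=('f1','f2','f3','f4','f5','f6','f7','f8','f9','f10'))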

def copy_insert2_new(cur, data):
    cur.execute('CREATE TEMPORARY TABLE temp_table (pk serial, f1 int,f2 int,f3 int,f4 int,f5 int,f6 int,f7 int,f8 int,f9 int,f10 int)')

    r, w = os.pipe()
    fr = os.fdopen(r, "rb")
    fw = os.fdopen(w, "wb")

    t = threading.Thread(target=threaded, args=[cur, fr])
    t.start()

    counter = 0
    lines = []
    for o in data:
        line = f'{o.f1}\t{o.f2}\t{o.f3}\t{o.f4}\t{o.f5}\t{o.f6}\t{o.f7}\t{o.f8}\t{o.f9}\t{o.f10}\n'.encode('utf-8')
        lines.append(line)
        counter += len(line)
        if counter > 131072:
            fw.write(b''.join(lines))
            lines.clear()
            counter = 0
    if lines:
        fw.write(b''.join(lines))
    fw.close()
    t.join()
    fr.close()

    cur.execute('DROP TABLE temp_table')

In my tests this works as intended and runtime drops from ~3s to ~1.8s for 1M records fully saturating the postgres process (before it was at ~50%).

Question:
Is it safe to pass the cursor object into a thread like this while I make sure not to use it in the original thread? Or should this be done only with the connection object (as suggested by the docs)?

Edit: With proper pipe buffer alignment to 65536 bytes (default on linux) I get it down to ~1.75s (now python is at 110% CPU time).
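The pipe-plus-thread pattern above can be tried without a database. The following is a minimal sketch, assuming a stand-in consumer (`consume`, a hypothetical helper) that drains the read end the way `copy_from` would; `pump` and its `chunk_size` parameter are illustrative names as well, not part of psycopg2.

```python
import os
import threading

def consume(fr, sink):
    # Stand-in for cur.copy_from(fr, ...): read until the writer closes its end.
    while True:
        chunk = fr.read(65536)
        if not chunk:
            break
        sink.append(chunk)

def pump(lines, chunk_size=65536):
    """Write encoded lines to a pipe in batches while a thread drains it."""
    r, w = os.pipe()
    fr = os.fdopen(r, "rb")
    fw = os.fdopen(w, "wb")
    sink = []
    t = threading.Thread(target=consume, args=(fr, sink))
    t.start()
    try:
        pending, size = [], 0
        for line in lines:
            pending.append(line)
            size += len(line)
            if size >= chunk_size:
                # one write per batch keeps the number of syscalls low
                fw.write(b"".join(pending))
                pending.clear()
                size = 0
        if pending:
            fw.write(b"".join(pending))
    finally:
        fw.close()  # closing the write end is what signals EOF to the reader
        t.join()
        fr.close()
    return b"".join(sink)
```

Closing the write end before `join()` matters: without it the reader never sees EOF and the join blocks forever.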

dvarrazzo · 2022-03-21T10:19:58Z

dvarrazzo
Mar 21, 2022
Maintainer

If you don't touch the cursor in the original thread you "SHOULD" be ok. But what stops you from creating a separate cursor to pass to the thread?

4 replies

jerch Mar 21, 2022
Author

The problem is that the connection object lives higher up in the caller chain. It's all single-threaded down to that point, so this looked most straightforward to me. But I guess I can safely create 2 cursors and pass them along, one dedicated only to the thread. Thanks 👍

dvarrazzo Mar 21, 2022
Maintainer

You can use cursor.connection.cursor() to create a new cursor from an existing one :)
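A short sketch of that suggestion, assuming only the psycopg2 calls named in this thread (`cursor.connection`, `connection.cursor()`, `cursor.copy_from()`). The helper name `start_copy_thread` is hypothetical. The psycopg2 docs describe connections as shareable between threads and cursors as not, which is why the worker gets its own cursor here.

```python
import threading

def start_copy_thread(cur, fileobj, table, columns):
    """Run copy_from in a worker thread on a cursor of its own."""
    # New cursor on the same connection; only the worker thread touches it.
    worker_cur = cur.connection.cursor()
    t = threading.Thread(
        target=worker_cur.copy_from,
        args=(fileobj, table),
        kwargs={"columns": columns},
    )
    t.start()
    return t, worker_cur
```

The caller keeps using `cur` as before and calls `t.join()` once the file object has been fully written and closed.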

jerch Mar 21, 2022
Author

Oh thats handy, thx for the headsup.

jerch Mar 21, 2022
Author

Made it working with optional threading, when the payload exceeds the first 64kB. Below that the direct BytesIO approach is way faster than waiting for thread start/join.

So thx again for your help.

jerch · 2022-03-21T14:55:06Z

jerch
Mar 21, 2022
Author

For completeness, this is the fastest I was able to find for psycopg2:

import os
import threading
from io import BytesIO

def threaded_copy(cur, fr, tname, columns):
    cur.copy_from(fr, tname, size=65536, columns=columns)

def copy_insert2(cur, data):
    # TODO: to be set from args
    tname = 'temp_table'
    columns = ('f1','f2','f3','f4','f5','f6','f7','f8','f9','f10')
    cur.execute(f'CREATE TEMPORARY TABLE {tname} (pk serial, f1 int,f2 int,f3 int,f4 int,f5 int,f6 int,f7 int,f8 int,f9 int,f10 int)')

    use_thread = False
    payload = bytearray()
    for o in data:
        # TODO: make line formatter customizable
        payload += f'{o.f1}\t{o.f2}\t{o.f3}\t{o.f4}\t{o.f5}\t{o.f6}\t{o.f7}\t{o.f8}\t{o.f9}\t{o.f10}\n'.encode('utf-8')
        if len(payload) > 65535:
            # if we exceed 64k, switch to threaded chunkwise processing
            if not use_thread:
                r, w = os.pipe()
                fr = os.fdopen(r, 'rb')
                fw = os.fdopen(w, 'wb')
                t = threading.Thread(target=threaded_copy, args=[cur.connection.cursor(), fr, tname, columns])
                t.start()
                use_thread = True
            length = len(payload)
            m = memoryview(payload)
            pos = 0
            while length - pos > 65535:
                # write all full 64k chunks (in case some line payload went overboard)
                fw.write(m[pos:pos+65536])
                pos += 65536
            # carry remaining data forward
            payload = bytearray(m[pos:])
    if use_thread:
        if payload:
            fw.write(payload)
        fw.close()
        t.join()
        fr.close()
    elif payload:
        f = BytesIO(payload)
        cur.copy_from(f, tname, size=65536, columns=columns)
        f.close()

    cur.execute(f'DROP TABLE {tname}')

and runtime drops to 1.65s for 1M records with 4 digit numbers (~50MB payload). This is only slightly above python's line formatting speed, which clearly is the limiting factor (line formatting alone takes ~1.50s). The optional threading keeps things snappy for lowish payload, and saves memory and runtime for big payloads (~1.8 times faster).
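The chunking step in the middle of `copy_insert2` can be looked at in isolation. This is a sketch under the same assumptions as the code above (64 KiB pipe chunks, a `bytearray` payload); `split_full_chunks` is a hypothetical helper name, not something from psycopg2.

```python
CHUNK = 65536  # default pipe buffer size on Linux, as mentioned above

def split_full_chunks(payload, chunk=CHUNK):
    """Return (full_chunks, remainder) for a bytes-like payload.

    The full chunks are memoryview slices, so they are not copied;
    only the short tail is copied into a fresh bytearray.
    """
    view = memoryview(payload)
    full = len(view) // chunk * chunk
    chunks = [view[pos:pos + chunk] for pos in range(0, full, chunk)]
    remainder = bytearray(view[full:])
    return chunks, remainder
```

Each returned chunk can be handed straight to `fw.write()`, and the remainder becomes the new `payload` that later lines are appended to.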

@dvarrazzo
I tested a quite similar approach with psycopg3 - but it seems the copy cursor file object cannot handle memoryview or bytearray? Difference between bytes and those was 5-8% faster processing for psycopg2, maybe support is worth to be added in psycopg3 as well?

9 replies

dvarrazzo Mar 21, 2022
Maintainer

Oh well, guess I should try the C version first before stating things about psycopg3. (wasn't even aware that I am on the python version 😸).

That might be somewhat faster, yes. The C version of write_row() is much faster, but write() still saves some memcopy when it can.

don't know if copy.write is guaranteed to have finished the underlying thread logic, before it returns

No, it doesn't. copy.write() normally appends the buffer in a queue and leaves it to a worker thread, which will write it down the network. This can parallelize two IO-bound operations (if, let's say, you read from a file and write to the server via network using write()) or separate the CPU-bound load from the IO-bound load (if you are using write_row() instead, so you format the buffer in the main thread and hose the server in the worker thread).

So no, if you use write(bytearray) you shouldn't mutate bytearray right after the write() call. The thread is guaranteed to be closed only on copy block exit.
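One way to respect that constraint while still reusing a single buffer is to hand `write()` an immutable snapshot. A minimal sketch, assuming only that `copy` has a `write()` method as described above; `write_buffered` and `threshold` are hypothetical names.

```python
def write_buffered(copy, rows, threshold=65536):
    """Accumulate encoded rows and pass immutable snapshots to copy.write()."""
    buf = bytearray()
    for row in rows:
        buf += row
        if len(buf) >= threshold:
            # bytes(buf) copies the data, so clearing buf afterwards cannot
            # affect whatever the worker thread still has queued
            copy.write(bytes(buf))
            buf.clear()
    if buf:
        copy.write(bytes(buf))
```

The extra copy costs some time, which is the trade-off against mutating a buffer the worker may not have sent yet.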

jerch Mar 22, 2022
Author

Ah yepp, with the psycopg3 C version .write is as fast as my psycopg2 thread shim above (finished in ~1.63s), even with the needed explicit bytes conversion (the C version also complains about memoryview and bytearray), while the python version is at ~2.55s. The game changer is write_row though - it finished in C in ~3.39s vs. ~21.4s in python. That's awesome, because it already contains the proper field encoders.

As you already might have guessed - I am trying to get a COPY FROM based replacement for django's bulk_update done. The current impl suffers really bad under high payload (in terms of cols * rows) with exploding runtime - the way how the values are placed in CASE statements smells like O(k^n) for k columns of n rows, if the DBMS does not transform the value cascade in some sort of a hash table. I already managed to get a partial UPDATE FROM VALUES replacement done (showing 8-10 times perf increase even for moderate payload), which is supported by all recent versions of sqlite, mysql and so on. But for postgres there is more in the books with COPY FROM, which is even 4 times faster than the UPDATE FROM VALUES variant for big payload (for >1M records).

Why I am telling you all this? Well, for one such a COPY replacement is all about speed - django model objects come in, and should update the tables as fast as possible. This is why I am trying to rule out any perf smells upfront, like the low level optimizations done here. But ofc it is also about correctness - field values may not get misinterpreted by postgres or, even worse, open the door for any sort of injections. And thats the point where I am scratching my head - the field encoders are really expensive when done in python (I voted for the TEXT format), thats why I am really happy to see an interface like write_row in psycopg3. But I cannot use that in psycopg2 yet. So I came up with another approach, which is only slightly worse than write_row (finishes the same task in ~5s): prealloc field encoders to the type of model fields, e.g. an IntegerField would only allow to pass int | None types, text-like fields needs the control char escapes and so on. At this point I wonder, if such a "column prealloc'ed encoder mode" would be even faster for write_row, as it could save the python type --> TEXT encoder matching.

If you have ideas regarding my bulk_update issue with psycopg2 - would love to hear them. About psycopg3 - maybe thats better moved over to the psycopg repo?

Edit: FYI - I am somewhat tracking my conceptual progress in a gist before doing the full repo/package thingy (plz ignore the code example itself, thats already outdated).

dvarrazzo Mar 22, 2022
Maintainer

Hello @jerch thank you for the report.

No, I didn't guess it was for Django 😄 that's good to know. In Psycopg the type dispatching is in dictionaries and it's a pretty hot loop which has seen a good amount of optimization during development.

In Psycopg 3 there is no problem of inconsistency between copy and query, because adapting the types from Python to Postgres uses the same adapters and code path (if they are incorrect, they both are the same way 😄) and both use the same customization mechanism to allow the user to change adaptation rules.

Getting COPY right is a consequence of one of the most important (and non-backward-compatible) changes that were made in Psycopg 3; there is no chance that will be ported to psycopg2. However, I wrote a prototype for a Psycopg 3 driver for Django and someone is working to get it into the project (psycopg/psycopg#156). Maybe your findings would make porting even more interesting.

Your preallocation of the columns works in some specific cases, but it is not generic enough: psycopg/psycopg#112 for instance was caused by a similar assumption of homogeneity of the columns. In Django it may work if you use the models information. Note that there is already support for column types preallocation in Psycopg 3, at least in COPY, via copy.set_types() and I was considering to add it for normal query too (psycopg/psycopg#163) but I'm not sure how beneficial it would be. Note that a certain amount of type dispatching is still necessary: a Python int might be dumped to Postgres int2, int4, int8, numeric and if you choose a type too big Postgres will fail an insert because casting down is not implicit as casting up is. copy.set_types() was introduced pretty much to disable this mechanism, because, in binary copy, implicit casts are disabled: in that case, specifying that the column is a Python int is not enough: you must specify which Postgres flavour of int it is, so, if you want to use binary copy, the information in the Django models must be 100% accurate, more than what is normally required.

Preallocating columns, in psycopg2, doesn't save you from writing new adapters: psycopg2 adapters adapt to pretty much a snippet of SQL, such as 'foo' or '2000-01-01'::date, which are not values that can be used to compose COPY input data. So, implementing the optimization you mention in psycopg2 amounts pretty much to rewriting a sizable chunk of Psycopg 3. I think that porting Django to Psycopg 3 is a much better plan.

I have created psycopg/psycopg#254 to extend the copy.write() interface to other bytes-like types, if you would like to play with it.

jerch Mar 22, 2022
Author

Ah you already implemented it - copy.set_types sounds like what I meant with the preallocation.

Is the int2 vs. int4 distinction really relevant for COPY in TEXT format? Wouldn't postgres try to apply the defined column type automatically, thus implicitly do a value::intXY cast (and fail appropriately, if things dont fit)?

About the adapter issue - my current playground implementation registers strict type checking encoders against OIDs, basically like this:

@register_oids([(20, 1016), (21, 1005), (23, 1007)])
def Int(v):
    if isinstance(v, int):
        return v
    raise TypeError('expected int type')

@register_oids_nullable([(20, 1016), (21, 1005), (23, 1007)])
def IntOrNone(v):
    if v is None:
        return '\\N'
    if isinstance(v, int):
        return v
    raise TypeError('expected int or None type')

Things are overridable at runtime or even custom postgres types could be attached. The array path needs another handling for nullish value, whether it is at top level or within the array (that was the reason for my other question about Json).

The standard OID mapping currently trusts on proper exceptions from postgres, if the column type cast does not work (e.g. the int value is too big for smallint). But there is nothing stopping anyone from registering explicit bigint | integer | smallint encoders with proper range checks.

So far I found this setup quite useful, but yeah, it prolly resembles much of psycopg3 (for the OID part I mostly relied on psycopg2 source 😸). And with the speediness of write_row my efforts are really questionable.

I have created psycopg/psycopg#254 to extend the copy.write() interface to other bytes-like types, if you would like to play with it.

Sure, will test that out.

jerch Mar 22, 2022
Author

The downside of that approach - it currently relies on proper __str__ conversion from python types, that are known to fit the TEXT format, simply for speed reasons. Thats really not yet how it should be done, as anyone could have subclassed and overloaded __str__ to return something nasty. 🙈

dvarrazzo · 2022-03-22T13:39:35Z

dvarrazzo
Mar 22, 2022
Maintainer

Is the int2 vs. int4 distinction really relevant for COPY in TEXT format? Wouldn't postgres try to apply the defined column type automatically, thus implicitly do a value::intXY cast (and fail appropriately, if things dont fit)?

In normal mode (normal querying and COPY in TEXT mode), Postgres automatically casts to a larger type (int2 -> int4) but doesn't cast down (int4 -> int2). That's why the upgrade() mechanism in the IntDumper looks for the smaller type to really perform the dump.
In binary COPY mode there is no cast performed by Postgres, and the format must be binary-compatible with the typreceive function, which will fail otherwise. That's the reason for set_types()
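The "look for the smaller type" idea can be illustrated with plain range checks. This is only a sketch of the principle using the standard Postgres integer ranges; it is not psycopg's actual `IntDumper` code, and `pg_int_type` is a hypothetical name.

```python
def pg_int_type(value):
    """Pick the smallest Postgres integer type that can hold a Python int."""
    if type(value) is not int:  # deliberately rejects bool and int subclasses
        raise TypeError("expected int")
    if -2**15 <= value < 2**15:
        return "int2"
    if -2**31 <= value < 2**31:
        return "int4"
    if -2**63 <= value < 2**63:
        return "int8"
    return "numeric"  # arbitrary precision, for anything beyond bigint
```

Choosing the smallest type works because Postgres casts up implicitly: an `int2` value can go into an `int4` or `int8` column, while the reverse needs an explicit cast.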

Things are overridable at runtime or even custom postgres types could be attached

Using a decorator you are creating only a mechanism configured at import time. Custom types should be managed at connection scope because extension types get different OIDs in different databases. That's why loaders/dumpers registration take a context as parameter in Psycopg 3. In psycopg2 the parameter is only available on loaders, not on dumpers, which sort of worked anyway because psycopg2 dumpers don't have to (and cannot) specify OIDs.

The array path needs another handling for nullish value, whether it is at top level or within the array (that was the reason for my other question about Json).

Nulls in arrays and Json are different things. Arrays in psycopg2 are another of those things that will not work in COPY, as they use the ARRAY[] construct, which is only available at SQL level, whereas Psycopg 3 does it right and creates a text (or binary) representation of an array which might contain null values (such as {foo,NULL,bar}, augmented with the text[] OID as metadata, instead of ARRAY['foo',NULL,'bar'], if I remember the syntax off the top of my head).
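For illustration, here is a sketch of building such a text array literal by hand for string elements. It assumes Postgres's documented array input syntax, where a null element is the unquoted word NULL and quoted elements escape backslashes and double quotes with a backslash. `pg_array_literal` is a hypothetical helper, and the result would still need COPY-level escaping before going into a TEXT-format COPY line.

```python
def pg_array_literal(items):
    """Build a Postgres array literal from nested lists of str or None."""
    parts = []
    for item in items:
        if item is None:
            parts.append("NULL")  # unquoted: a real null element
        elif isinstance(item, (list, tuple)):
            parts.append(pg_array_literal(item))  # nested dimension
        elif isinstance(item, str):
            text = item.replace("\\", "\\\\").replace('"', '\\"')
            parts.append(f'"{text}"')  # always quote, so "NULL" stays a string
        else:
            raise TypeError("expected str, None or a nested list")
    return "{" + ",".join(parts) + "}"
```

Quoting every element is the cautious choice: it keeps the string "NULL", empty strings and elements containing commas or braces unambiguous.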

But there is nothing stopping anyone from registering explicit bigint | integer | smallint encoders with proper range checks.

That's the Dumper.upgrade() mechanism "left as exercise for the user"...

The downside of that approach - it currently relies on proper __str__ conversion from python types, that are known to fit the TEXT format, simply for speed reasons.

There is no guarantee whatsoever that str(obj) creates a valid Postgres representation of the type, apart from the security concerns.

3 replies

jerch Mar 22, 2022
Author

Not sure if I get this one right:

... but doesn't cast down (int4 -> int2) ...

To me this clearly works as intended:

postgres=# create table test (k smallint);
CREATE TABLE
postgres=# copy test from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1
>> 123
>> 123456
>> ERROR:  value "123456" is out of range for type smallint
CONTEXT:  COPY test, line 3, column k: "123456"

similar with psycopg2:

>>> c.copy_from(BytesIO(b'1\n123\123456\n'), 'test')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type smallint: "123S456"
CONTEXT:  COPY test, line 2, column k: "123S456"

(though idk whats that 'S' is about in the exception message)

Using a decorator you are creating only a mechanism configured at import time. ...

No, I am not using an explicit decorator at all, this was just for brevity here in the comments. The decorator function gets called during first attempt, where the OIDs are pulled from postgres' type names (e.g. this still would work, if postgres ever changes the int OID). The mapping itself is stored in a connection hashmap.

About arrays:
Yes I cannot use any of the psycopg2 adapters for that reason, as COPY does not support anything like ARRAY[...] or even '\xABCDEF'::bytea. In that sense, the copy syntax is much different of what the psycopg adapters return. For arrays I basically have to construct the text repr myself with descent into {...}. Json is somewhat special because of the double meaning of None, whether sql or data level is meant (but simply worked around that with custom encoders).

There is no guarantee whatsoever that str(obj) creates a valid Postgres representation of the type, apart from the security concerns.

No there is no guarantee, but convention. The text representation of many python types is exactly the same what postgres would expect. Furthermore postgres is somewhat relaxed here - it even would accept python's -inf as -Infinity, and so on. Seems the postgres devs already anticipated those "minor deviations" in text representations. So while I dont quite follow your argument here, I would still prefer a proper conversion, because of possible security implications.

dvarrazzo Mar 22, 2022
Maintainer

To me this clearly works as intended:

the issue with OIDs is related to context where OIDs must be specified. This is not the case for psycopg2 and TEXT copy. Let's leave that out.

(though idk whats that 'S' is about in the exception message)

You are missing an \n in b'1\n123\123456\n'. \123 is octal for S.
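The escape can be checked directly in Python. In a bytes literal, a backslash followed by up to three octal digits is an octal escape, so `\123` is byte 83, which is `S`:

```python
# The literal from the failing call: "\123" is parsed as one octal escape.
broken = b'1\n123\123456\n'
# With the missing "n" restored, the third row is a separate line.
fixed = b'1\n123\n123456\n'

print(broken.splitlines())  # [b'1', b'123S456']
print(fixed.splitlines())   # [b'1', b'123', b'123456']
```

With the corrected literal, the third row reaches Postgres as 123456 and would be expected to fail with the same out of range error as in the psql session.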

No there is no guarantee, but convention.

Postgres input and output formats are well documented and usually input has a certain flexibility, sure. However, you have already seen that lists, bytes, json str don't produce valid Postgres lists, bytea, json. So convention only works in a limited number of cases.

The risk of someone changing the implementation of __str__ of their object doesn't strike me as a security risk: If someone can change a program's code, most likely they are hacking themselves. The risk comes from taking untrusted input and making SQL out of it. If you are only dealing with copy you might not get exposed to it.

jerch Mar 22, 2022
Author

You are missing an \n in b'1\n123\123456\n'. \123 is octal for S.

Woops, how did i miss that? 😅

About COPY security:
Yes the __str__ thingy is way overstretching, as it basically questions any type behavior at encoding barriers. Idk if COPY FROM is vulnerable to any malformed data, I would expect it to be much less problematic than any SQL trickery due to a very reduced syntax (did not find any reports in this regard). I'd guess though, that the BINARY format might have some shenanigans hidden, as it is very raw and even postgres version dependent. While the TEXT format looks very straight forward to me (thats why I didnt bother with BINARY in the first place).

Just had a look at the dumpers in psycopg3's types modules - well, that's the same as I am trying with my encoders. I think there is no need to repeat all of that nasty work for a soon to be deprecated psycopg2. While django would benefit a lot from it, imho it is better to get the transition to psycopg3 done first.

jerch · 2022-03-23T12:53:43Z

jerch
Mar 23, 2022
Author

Some final conclusions from my speed tests:

The C version of psycopg3's write_row is the clear winner. It does full field text escaping, and is still the fastest one. Encoders written in python can only compete if they do "sloppy encoding" by treating known to work types as "TEXT format safe" with no further escaping, and let the string formatting happen from internal __str__ calls (which drops to C level speed for many builtin types). As soon as explicit escaping of unsafe chars is needed, the runtime multiplies. But explicit escaping should be done to avoid faked col/row offsets, so sloppy is not really an option, even for "known to work types".
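To make the "explicit escaping" point concrete, here is a minimal sketch of a strict TEXT-format line encoder for int, str and None values. It assumes the default COPY delimiter (tab) and null marker (`\N`); `copy_field` and `copy_line` are hypothetical names, and this is neither psycopg code nor the encoder set described in this thread.

```python
# Characters that would otherwise end a field or a row, plus the escape char.
_ESCAPES = str.maketrans({
    "\\": "\\\\",
    "\t": "\\t",
    "\n": "\\n",
    "\r": "\\r",
})

def copy_field(value):
    """Encode one value for COPY ... FROM STDIN in TEXT format."""
    if value is None:
        return "\\N"
    if type(value) is int:  # exact type: no subclass with a custom __str__
        return str(value)
    if type(value) is str:
        return value.translate(_ESCAPES)
    raise TypeError(f"no encoder for {type(value).__name__}")

def copy_line(values):
    """Encode one row as a tab-separated, newline-terminated bytes line."""
    return ("\t".join(copy_field(v) for v in values) + "\n").encode("utf-8")
```

Because tabs and newlines inside a string are escaped, a value such as `"x\ty"` stays in its own column instead of shifting the ones after it.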

Secondly write_row conveniently hides the nasty low level details, which makes using it a breeze - for pumping django model object field values into postgres, the needed code reduces to a simple for loop with an attrgetter (besides the pre/post ORM preparations). And for more directed stuff there is set_types and BINARY support. It's a bit like Christmas with psycopg3's enhanced copy support. 😸
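A rough sketch of that loop, assuming a psycopg 3 cursor (`cur.copy()`, `copy.write_row()` and `copy.set_types()` are the calls named in this thread). The helper name `copy_objects` and its parameters are hypothetical, and the table and column names are interpolated as-is, so they must come from trusted code.

```python
from operator import attrgetter

def copy_objects(cur, table, fields, objects, types=None):
    """Stream one row per object into `table` with COPY ... FROM STDIN."""
    get_row = attrgetter(*fields)
    columns = ", ".join(fields)
    with cur.copy(f"COPY {table} ({columns}) FROM STDIN") as copy:
        if types is not None:
            # optional: fix the Postgres types up front, e.g. ["int4", "int4"]
            copy.set_types(types)
        for obj in objects:
            row = get_row(obj)
            # attrgetter returns a bare value for a single field, so wrap it
            copy.write_row(row if len(fields) > 1 else (row,))
```

With Django model instances, `fields` would be the attribute names of the columns to write; anything around it (temp table setup, the final UPDATE) is outside this sketch.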

@dvarrazzo
While playing with my encoder idea under psycopg2 I found a few perf and memory smells, esp. around escaping and bytea fields. If you want, I can check if the psycopg3 impl can be optimized in some regards (esp. the python version).

1 reply

dvarrazzo Mar 23, 2022
Maintainer

Hi @jerch thank you for the heads up! It would be cool to see these features used by Django.

If you have any observation to make, please feel free to open a ticket in the psycopg 3 issue tracker. I appreciate your analysis.

Cheers!
