
Very bad performance using insert many #346

Closed
akashgurava opened this issue Aug 21, 2018 · 2 comments

Comments

@akashgurava

  • asyncpg version: 0.17.0
  • PostgreSQL version: 10
  • Do you use a PostgreSQL SaaS? If so, which? Can you reproduce the issue with a local PostgreSQL install?: local
  • Python version: 3.7
  • Platform: Windows
  • Do you use pgbouncer?: No
  • Did you install asyncpg with pip?: Yes
  • If you built asyncpg locally, which version of Cython did you use?: N/A (installed with pip)
  • Can the issue be reproduced under both asyncio and uvloop?: did not test uvloop

I am building a database for storing financial tick data, so I was testing insert performance across different databases and libraries. For Postgres I used psycopg2 in sync mode, aiopg, and asyncpg, inserting 120,000 rows of symbol and OHLC data. I was getting very bad insert performance with asyncpg's executemany().
The sample data is created as follows:

import asyncio
import string
import random
from time import time

import numpy as np
import pandas as pd

import asyncpg
import psycopg2
import aiopg

# Number of securities to insert
SECURITIES = 2000


def string_gen(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))


def generate_random_data():
    """Generate random data for inserting to DB"""
    index = pd.date_range(start='2018-01-01 00:00:01', end='2018-01-01 00:01:00', freq='s')
    
    dflist = []
    for _ in range(SECURITIES):
        data = np.random.rand(len(index), 4)
        data = pd.DataFrame(data, index=index, columns=['open', 'high', 'low', 'close'])
        data['symbol'] = string_gen()
        dflist.append(data)
    data = pd.concat(dflist)
    data.index.name = 'time'
    data = data.reset_index()
    data = data[['time', 'symbol', 'open', 'high', 'low', 'close']]

    return [tuple(x) for x in data.values]
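For reference, here is a dependency-free sketch that produces equivalent (time, symbol, open, high, low, close) tuples without pandas or numpy; `generate_rows` is a hypothetical helper, not part of the original script:

```python
import random
import string
from datetime import datetime, timedelta

def string_gen(size=6, chars=string.ascii_uppercase + string.digits):
    # Random 6-character symbol, as in the script above.
    return ''.join(random.choice(chars) for _ in range(size))

def generate_rows(n_securities=2000):
    # 60 one-second timestamps, matching date_range('2018-01-01 00:00:01',
    # '2018-01-01 00:01:00', freq='s') in the pandas version.
    start = datetime(2018, 1, 1, 0, 0, 1)
    times = [start + timedelta(seconds=i) for i in range(60)]
    rows = []
    for _ in range(n_securities):
        sym = string_gen()
        for t in times:
            o, h, l, c = (random.random() for _ in range(4))
            rows.append((t, sym, o, h, l, c))
    return rows
```

With the default of 2000 securities this yields the same 120,000 rows as the pandas version.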

For psycopg2, the insert took 5 to 6 seconds:

string = '(' + ('%s,' * len(data[0]))[:-1] + ')'  # per-row template: '(%s,%s,%s,%s,%s,%s)'
args_str = b','.join(cur.mogrify(string, row) for row in data)
args_str = args_str.decode('utf-8')  # convert byte string to UTF-8
cur.execute("INSERT INTO ohlc (time, symbol, open, high, low, close) VALUES " + args_str)
conn.commit()

For asyncpg, executemany() took 30 seconds, so I tried building a single INSERT statement with psycopg2's mogrify() and executing it with asyncpg. That still took 7 to 8 seconds:

p_conn = psycopg2.connect(user='postgres', password='postgres')
cur = p_conn.cursor()
string = '(' + ('%s,' * len(data[0]))[:-1] + ')'
args_str = b','.join(cur.mogrify(string, row) for row in data)
cur.close()
args_str = args_str.decode('utf-8')
insert_str = "INSERT INTO ohlc (time, symbol, open, high, low, close) VALUES " + args_str
await conn.execute(insert_str)
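asyncpg also supports numbered placeholders ($1, $2, ...), so a single multi-row INSERT can be built without psycopg2's mogrify() and executed with the values passed as parameters instead of interpolated into the SQL. A sketch, where `build_insert` is a hypothetical helper:

```python
def build_insert(table, columns, nrows):
    # Build one multi-row INSERT using asyncpg-style $n placeholders:
    # INSERT INTO t (a, b) VALUES ($1, $2), ($3, $4), ...
    ncols = len(columns)
    cols = ', '.join(columns)
    groups = []
    for r in range(nrows):
        ph = ', '.join('$' + str(r * ncols + c + 1) for c in range(ncols))
        groups.append('(' + ph + ')')
    return 'INSERT INTO {} ({}) VALUES {}'.format(table, cols, ', '.join(groups))

# Usage (assuming `conn` is an asyncpg connection and `data` the rows above):
#   sql = build_insert('ohlc',
#                      ['time', 'symbol', 'open', 'high', 'low', 'close'],
#                      len(data))
#   flat = [v for row in data for v in row]
#   await conn.execute(sql, *flat)
```

Note that the Postgres wire protocol caps a single statement at 32767 bind parameters, so a batch this large would need to be chunked.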

Similarly, aiopg took 6 to 7 seconds.
I guess I am doing something wrong; since the database is the same for all three libraries, we can rule out the database itself as the cause.

@elprans
Member

elprans commented Aug 26, 2018

You are comparing apples to oranges. connection.executemany() essentially runs the INSERT query once per record (120,000 times in your case), whereas your psycopg2 test mogrifies the arguments and runs the query only once. In your case I recommend using copy_records_to_table() instead:

        await conn.copy_records_to_table(
            'ohlc', records=data,
            columns=['time', 'symbol', 'open', 'high', 'low', 'close'])

For me this took 6 seconds, versus the 9 seconds I got with psycopg2.
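One detail worth noting: copy_records_to_table() expects each record's values in the same order as the `columns` argument. If the source rows carry their fields in a different order, a small reordering helper (hypothetical, not part of asyncpg) keeps them aligned:

```python
def to_records(rows, fields, columns):
    # Map each row, whose values follow `fields` order, into a tuple
    # ordered to match the `columns` list passed to copy_records_to_table().
    idx = [fields.index(c) for c in columns]
    return [tuple(row[i] for i in idx) for row in rows]
```

For example, rows stored as (symbol, time, ...) can be reordered to the (time, symbol, ...) layout of the ohlc table before the copy call.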

elprans added a commit that referenced this issue Aug 26, 2018
Connection.executemany() is not the fastest choice for bulk inserts;
copy_records_to_table() is a better choice, so make it easier to find by
adding a note to the executemany() documentation.  See #346 for an example
of the performance confusion this causes.
@elprans
Member

elprans commented Aug 26, 2018

Also, #295, once merged, will improve the performance of executemany() considerably in the general case.

@elprans elprans closed this as completed Aug 26, 2018