Performance seems way below expectations #251

Closed

Themanwithoutaplan opened this issue Jun 26, 2015 · 35 comments

@Themanwithoutaplan

I have a relatively simple import from CSV to Postgres that seems to be running quite slowly. The import is about 500,000 rows with around 70 columns, mainly integers, and four indices. It currently takes around 5 hours. My hard disk will happily do over 10 MB/s. Running pgbench, at which I'm not an expert, suggests TPS between 70 and 200. Even with the lower number I'd expect the import to take about half an hour.

When running the import it does indeed seem as if a lot of time is spent in pgloader rather than in Postgres.

What should I be looking at to improve performance?

@dimitri
Owner

dimitri commented Jun 26, 2015

How many rows were rejected during the load? If none, then try using COPY directly and report its timing for comparison purposes. If more than zero, then try to COPY the loaded data out into a clean CSV file, COPY IN again from that clean file, and report the timing.

I'm interested in making pgloader as fast as possible, of course, but your case will need quite a bit more information before anything useful can be attempted...
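The round trip described above can be sketched in psql; the table and file names here are hypothetical, not taken from the report:

```sql
-- In psql, \timing reports how long each statement takes.
\timing on

-- Export the already loaded data to a clean CSV file (hypothetical path).
\copy pages TO '/tmp/pages_clean.csv' WITH (FORMAT csv)

-- Empty the table, then load the clean file back, timing the pure COPY path
-- so it can be compared against the pgloader run.
TRUNCATE pages;
\copy pages FROM '/tmp/pages_clean.csv' WITH (FORMAT csv)
```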

@Themanwithoutaplan
Author

No rows were rejected. Running COPY seems to give similar results, with disk speed rarely getting above 100 kB/s, so I guess the problem is related to the nature of the data and/or the server configuration. Any idea what I should be looking at?

@Themanwithoutaplan
Author

From further experimentation it looks like it's the indices that are throttling performance. After dropping them, the load is done in less than a minute.

@Themanwithoutaplan
Author

Looks like a text index is the real bottleneck here. My other indices and single trigger hardly seem to matter.

For the docs it might be worth noting that if performance differs widely from what pgbench suggests, then indices could be the bottleneck.

table name           read     imported   errors   time
fetch                0        0          0        0.013s
before load          1        1          0        0.045s
pages                482966   482966     0        1m37.037s
after load           1        1          0        2m16.884s
Total import time    482966   482966     0        3m53.979s

@dimitri
Owner

dimitri commented Jun 27, 2015

Oh yeah, never bulk load data with indexes present: remove them before loading and add them again at the end, which is what pgloader already does for the database-like sources when targeting an empty table.

I should maybe add an option, like the existing disable triggers one, that would drop the indexes and then create them again at the end of the run.
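The manual version of that drop/create dance might look like this; the index names, column names, and file path are made up for illustration:

```sql
-- Drop secondary indexes before the bulk load (hypothetical names).
DROP INDEX IF EXISTS pages_date_idx;
DROP INDEX IF EXISTS pages_text_idx;

-- Bulk load while the table has no indexes to maintain per row.
\copy pages FROM '/tmp/pages.csv' WITH (FORMAT csv)

-- Recreate the indexes once at the end, instead of updating them row by row.
CREATE INDEX pages_date_idx ON pages ("labelDate");
CREATE INDEX pages_text_idx ON pages ("urlShort");
```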

@Themanwithoutaplan
Author

I'm not sure all triggers should be disabled – I happen to use one in this particular case to convert stupid dates-as-strings into real dates – and the real hit comes only from the text index, which I need in order to help normalise a stupidly denormalised source in the after-load clause.

But consistent performance across modes would definitely make sense.
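A minimal sketch of the kind of trigger mentioned above, converting a text date into a real date column; the column names and date format are assumptions, not the actual schema:

```sql
-- Hypothetical columns: raw_date text (as imported), label_date date (target).
CREATE OR REPLACE FUNCTION fix_string_dates() RETURNS trigger AS $$
BEGIN
  -- Convert a string such as 'Jul 15 2015' into a proper date value.
  NEW.label_date := to_date(NEW.raw_date, 'Mon DD YYYY');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER fix_dates
  BEFORE INSERT ON pages
  FOR EACH ROW EXECUTE PROCEDURE fix_string_dates();
```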

@dimitri
Owner

dimitri commented Jun 27, 2015

You could normalize your input right within pgloader; several examples of date munging are already given as transformation functions. See #245 (comment) for a full example of that.

So maybe just a warning about indexes and triggers being present on the target table, with their potential impact on loading performance, would be in order...

@Themanwithoutaplan
Author

Good to know that this can be done in the loading script. The normalisation, however, can't: it would involve upserting some data from the source, getting the relevant foreign key and substituting it… Going from 5 hours to a minute, plus a few minutes to recreate the index, is the big win. I've tried, and failed, to get the source normalised.

dimitri added a commit that referenced this issue Jul 16, 2015
Pre-existing indexes will reduce data loading performance and it's
generally better to DROP the index prior to the load and CREATE them
again once the load is done. See #251 for an example of that.

In that patch we just add a WARNING against the situation, the next
patch will also add support for a new WITH clause option allowing to
have pgloader take care of the DROP/CREATE dance around the data
loading.
@dimitri
Owner

dimitri commented Jul 16, 2015

You may try the drop indexes option to have pgloader operate the whole dance automatically now.

@Themanwithoutaplan
Author

Thanks, but it looks like it may need some work:
2015-07-17T12:18:20.398000+02:00 ERROR Database error 2BP01: cannot drop index pages_date_url because constraint pages_date_url on table pages requires it
HINT: You can drop constraint pages_date_url on table pages instead.
QUERY: DROP INDEX pages_date_url;
 275MiB 0:22:51 [ 206kiB/s] [ <=> ]
2015-07-17T13:02:06.615000+02:00 ERROR Database error 42P07: relation "pages_date_url" already exists
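As the HINT in the first error says, an index that backs a constraint cannot be dropped directly; the constraint has to go first, and dropping it removes the backing index too. A sketch of the difference, using the index name from the log (the column list in the last statement is an assumption):

```sql
-- Fails with 2BP01: the index is owned by a UNIQUE constraint.
DROP INDEX pages_date_url;

-- Works: dropping the constraint also drops its backing index.
ALTER TABLE pages DROP CONSTRAINT pages_date_url;

-- Recreate the constraint (and thus the index) after the load.
ALTER TABLE pages ADD CONSTRAINT pages_date_url UNIQUE ("urlShort", "labelDate");
```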

@dimitri
Owner

dimitri commented Jul 17, 2015

Thanks for the feedback; I only cared that way about primary key indexes and didn't handle the general constraint case (UNIQUE, EXCLUDE). Will add that as soon as possible, sorry about that.

@dimitri dimitri reopened this Jul 17, 2015
@dimitri
Owner

dimitri commented Jul 17, 2015

Should be good now.

@Themanwithoutaplan
Author

Seems to be working much better, thank you. If only building on OS X without Homebrew were easier! Even just support for make install would be great.

@dimitri
Owner

dimitri commented Jul 17, 2015

If you're in a position to tell me what's missing and how to make it simpler, please open an issue about that and we'll see if we can improve the situation here.

@Themanwithoutaplan
Author

Reopening #161 would probably be best.

@dimitri
Owner

dimitri commented Jul 18, 2015

So you're saying it's a problem with finding the shared objects (.so) files?

@Themanwithoutaplan
Author

Are the indices being created twice? FWIW, manual index creation time varies significantly between the two indices here. The index on the date field is quick to recreate and doesn't impose much of an overhead if kept during the import. The big bottleneck here is the text index.

table name               read     imported   errors   time
fetch                    0        0          0        0.012s
drop indexes             1        1          0        0.073s
pages                    483669   483669     0        1m50.449s
Index Build Completion   0        0          0        17m8.347s
Create Indexes           1        1          0        17m7.864s
Primary Keys             0        0          0        0.000s
after load               1        1          0        1m23.428s
Total import time        483669   483669     0        37m30.173s

@dimitri
Owner

dimitri commented Jul 18, 2015

Seems like I've been a tad too lazy and the index build time isn't properly accounted for as a parallel background task. It's counted twice but should appear only once; will fix later.

@Themanwithoutaplan
Author

I don't think you can be called lazy! It's just not really your itch to scratch. Your work on this is much appreciated. Being able to work with this data in Postgres is so much nicer than MySQL.

dimitri added a commit that referenced this issue Jul 18, 2015
The new option 'drop indexes' reuses the existing code to build all the
indexes in parallel but failed to properly account for that fact in the
summary report with timings.

While fixing this, also fix the SQL used to re-establish the indexes and
associated constraints to allow for parallel execution, the ALTER TABLE
statements would block in ACCESS EXCLUSIVE MODE otherwise and make our
efforts vain.
@dimitri
Owner

dimitri commented Jul 18, 2015

Looks better now. It was kind of a rabbit hole really, because respecting the pg_dump way of doing things was too naive to allow for the kind of parallelism that was expected. Add some MySQL compatibility issues, and the quick hack now takes a couple of hours. Ah well, it should all be ok now!

Thanks for your continued reports; that helps make better software.

@Themanwithoutaplan
Author

I'm still seeing similar times for Index Build Completion and Create Indexes. Is this to be expected?

@dimitri
Owner

dimitri commented Jul 20, 2015

Well, yes. pgloader starts as many CREATE INDEX processes as you have indexes to build against a single table, in parallel, and then waits for all the threads to be done. The Create Indexes section counts the total time it took to create the indexes, while the Index Build Completion section counts how much time we still had to wait once everything else was already finished.

This double accounting of sorts is more relevant in the loading-from-a-database scenario, where often enough we don't actually have to wait much for the indexes, because most of them have already been created in parallel while the other tables were loading.

Maybe I should review whether to use the same time categories in the report for single-table loading here; it looks like I shared too much code...

@Themanwithoutaplan
Author

Looks like there are still some gremlins in this. Running again with a new dataset, it looks like the script is confused by constraints which it created last time.

2015-07-24T17:04:53.874000+02:00 ERROR Database error 42704: constraint "pages_urlshort_labeldate_key1" of relation "pages" does not exist
QUERY: ALTER TABLE pages DROP CONSTRAINT pages_urlShort_labelDate_key1;
2015-07-24T17:04:53.874000+02:00 ERROR Database error 42704: constraint "pages_urlshort_labeldate_key" of relation "pages" does not exist
QUERY: ALTER TABLE pages DROP CONSTRAINT pages_urlShort_labelDate_key;
 276MiB 1:27:54 [53,7kiB/s] [          <=>                                                                                           ]
table name               read     imported   errors   time
fetch                    0        0          0
drop indexes             2        0          2        0.176s
pages                    483551   483551     0        3h19m19.264s
Index Build Completion   0        0          0        39m38.657s
Create Indexes           2        2          0        56m38.589s
Primary Keys             0        0          0        0.000s
after load               1        1          0        6m54.066s
Total import time        483551   483551     0        5h2m30.766s

@dimitri
Owner

dimitri commented Jul 25, 2015

Can you please run the following query; it might be that some indexes have been only partly deleted, and that we should then not worry about them here...

select indrelid::regclass, indisvalid, indcheckxmin, indisready, indislive,
       pg_get_indexdef(indexrelid)
  from pg_index
 where indrelid = 'pages'::regclass;

The other situation where I would expect your error messages is a concurrency issue, where two pgloader processes are working against the table in parallel, so that one of them has just deleted the indexes and constraints in the time between the other process listing the indexes and trying to delete them...

Is a concurrency issue possible in your use case?

@Themanwithoutaplan
Author

indrelid indisvalid indcheckxmin indisready indislive pg_get_indexdef
pages t f t t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key" ON pages USING btree ("urlShort", "labelDate")
pages t f t t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key1" ON pages USING btree ("urlShort", "labelDate")
pages t f t t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key2" ON pages USING btree ("urlShort", "labelDate")
pages t f t t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key3" ON pages USING btree ("urlShort", "labelDate")

Not sure what you mean by not worrying about them. Because they're not being managed properly, the load time goes up from a minute to over 3 hours, and the indexes then take longer to recreate.

@dimitri
Owner

dimitri commented Jul 25, 2015

Here's the query that pgloader uses to list constraints and indexes that need to be handled, can you run it for me and paste its output here?

select i.relname,
       indrelid::regclass,
       indrelid,
       indisprimary,
       indisunique,
       pg_get_indexdef(indexrelid),
       c.conname,
       pg_get_constraintdef(c.oid)
  from pg_index x
       join pg_class i ON i.oid = x.indexrelid
       left join pg_constraint c ON c.conindid = i.oid
 where indrelid = 'pages'::regclass;

I thought before that maybe the constraint definitions that pgloader wanted to take care of were actually invalid or stray definitions, hence the errors, but it seems that's not it.

Are you running several pgloader commands at once?

@Themanwithoutaplan
Author

No, only running a single command. Here's the result.

relname indrelid indrelid indisprimary indisunique pg_get_indexdef conname pg_get_constraintdef
pages_urlShort_labelDate_key1 pages 16945 f t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key1" ON pages USING btree ("urlShort", "labelDate") pages_urlShort_labelDate_key1 UNIQUE ("urlShort", "labelDate")
pages_urlShort_labelDate_key pages 16945 f t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key" ON pages USING btree ("urlShort", "labelDate") pages_urlShort_labelDate_key UNIQUE ("urlShort", "labelDate")
pages_urlShort_labelDate_key2 pages 16945 f t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key2" ON pages USING btree ("urlShort", "labelDate") pages_urlShort_labelDate_key2 UNIQUE ("urlShort", "labelDate")
pages_urlShort_labelDate_key3 pages 16945 f t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key3" ON pages USING btree ("urlShort", "labelDate") pages_urlShort_labelDate_key3 UNIQUE ("urlShort", "labelDate")

@dimitri
Owner

dimitri commented Jul 25, 2015

I still don't understand the error messages about the constraint that doesn't exist, because the constraint is listed here. Now, why do you have the same index 4 times? Two load attempts with the same error on DROP, I presume?

@Themanwithoutaplan
Author

pgloader is doing all the work, so presumably it's getting something wrong when it tries to drop them and therefore goes on to create duplicates.

@dimitri
Owner

dimitri commented Jul 25, 2015

Can you give me a reproducible test-case so that I can then fix this bug? An example is https://github.com/dimitri/pgloader/blob/master/test/csv-districts.load which still needs a data file, or if you can prepare one all-included take https://github.com/dimitri/pgloader/blob/master/test/csv-before-after.load as a base example.

@dimitri dimitri reopened this Jul 25, 2015
@Themanwithoutaplan
Author

You can use the import script at https://bitbucket.org/charlie_x/python-httparchive/src/7f7d8a3cae1652a789096d3432e2eacbba65e05e/db/httparchive.load?at=default

The relevant Postgres schema is at https://bitbucket.org/charlie_x/python-httparchive/src/7f7d8a3cae1652a789096d3432e2eacbba65e05e/db/pages.sql?at=default

Data can be imported from http://httparchive.org/downloads.php (any CSV dump for pages after 2014-06-01).

Import with: gzip -d -c httparchive_Jul_15_2015_pages.csv.gz | pgloader db/httparchive.load

@dimitri
Owner

dimitri commented Jul 26, 2015

Thanks for the complete use case; I could fix the issue at hand, namely pgloader trying to second-guess the spelling of the constraints and indexes (downcasing and normalizing them as if they came from a MySQL or SQLite database). The latest patch stops this madness by having the drop indexes code path force-quote the names just as we got them.

Should be good now, inasmuch as I could reproduce your problem and then fix it!
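The quoting issue can be seen with plain SQL: PostgreSQL folds unquoted identifiers to lower case, so a mixed-case name must be double-quoted to be found again. A sketch, using the index definition from the query output above:

```sql
-- The index exists under a mixed-case, quoted name:
CREATE UNIQUE INDEX "pages_urlShort_labelDate_key"
    ON pages USING btree ("urlShort", "labelDate");

-- Unquoted, the name folds to pages_urlshort_labeldate_key,
-- which does not exist, hence error 42704:
ALTER TABLE pages DROP CONSTRAINT pages_urlShort_labelDate_key;

-- Quoted, the name is taken verbatim, which is what the patch now does:
DROP INDEX "pages_urlShort_labelDate_key";
```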

@Themanwithoutaplan
Author

Second-guessing is almost always the wrong way to go, but sometimes there's no choice. FWIW you might appreciate some historical context behind this import: I tried and failed to get the original MySQL schema improved, which would have obviated the need to work around this bottleneck index: https://code.google.com/p/httparchive/issues/detail?id=65

Any progress on my other ticket about building on OS X with MacPorts?

@dimitri
Owner

dimitri commented Jul 26, 2015

Thanks for the interesting context! It's also nice to see those .load files in another Open Source project ;-)

About #261, let's say that all this shared object dependency hell is somewhat over my head. I also have #159 and #160 on my plate; more generally see Build System for a listing.

I need to find a proper way to make pgloader easier to install for everyone. What normally happens is that packagers show up and do the work for each distro, as I did for Debian. It has yet to happen for other OSes, apparently.

@Themanwithoutaplan
Author

Well, in case I didn't make it clear enough: I failed in my attempt to get the schema cleaned up so I forked the site from PHP to Pyramid. Then I kept hitting MySQL's limitations so started to port to Postgres for my own reporting. I haven't got all the way to properly cloning the crawler and stats part…

What is interesting, however, is how fast the MySQL import is, even with indexes on. Of course, this is done at the cost of a table lock, and schema changes are very painful: the table has to be dumped, altered and re-imported, which often leads to the disk running full. You can see how this encourages the persistence of bad design decisions: schema changes are expensive, so you won't be punished for not normalising the data.

Wish I could be more help with the build instructions but I'm afraid it's something I've got little experience with myself.
