
Create realistic datasets of 10GB, 100GB, and 1TB in size #20558

Merged: 4 commits from evan/faker-csv-stream into evan/faker-hydra on Dec 22, 2022

Conversation

@evantahler (Contributor) commented on Dec 16, 2022

Closes #20556


We need big data to test with (specifically, @akashkulk does)!

Idea 1: Sync data using Faker

[Screenshot 2022-12-15 at 3:50 PM]

We made 3 source-faker connections that would produce the desired amounts of data:

SELECT
    pg_size_pretty(pg_total_relation_size('"public"."users_10M"'))       -- 2909MB
  , pg_size_pretty(pg_total_relation_size('"public"."purchases_10M"'));  -- 2237MB

-- (2909 + 2237) / 1024 = 5.02GB per 10M users
So, extrapolating from ~5.02GB per 10 million users, we are going to need:

  • 2 billion faker users for 1TB: 10,000,000 × (1024 / 5.02) = 2,039,840,637
  • 200 million faker users for 100GB: 10,000,000 × (100 / 5.02) = 199,203,187
  • 20 million faker users for 10GB: 10,000,000 × (10 / 5.02) = 19,920,318
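The same extrapolation as a throwaway query (purely illustrative, not part of the PR; the 5.02GB-per-10M-users figure comes from the measurement above, and floor() matches the truncated numbers quoted):

-- projected user counts per target size
SELECT target_gb,
       floor(10000000 * (target_gb / 5.02))::bigint AS users_needed
FROM (VALUES (10.0), (100.0), (1024.0)) AS t(target_gb);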

While the first, small sync completed (20M faker users, 10GB of data), the remaining two keep hitting platform errors and failing after a few days. Given the current instability of the platform, this approach is unlikely to work. Also of note, each attempt leaves lingering airbyte_tmp tables around, steadily increasing the server's storage cost...

Idea 2: Generate CSV files

Borrowing from the Performance Research, the next plan is to produce a number of CSV files containing the proper data.

To do this, we will:

  1. Modify the faker source to emit data over stdout in CSV format
  2. Pipe that data to CSV files on disk in 10GB chunks; we can manually manipulate faker's /secrets/state.json to move the cursor forward for each chunk
  3. Upload that data to Google Cloud Storage
  4. Load that data into Postgres via the "import from cloud storage" option (see the sketch below)

A bonus of this approach is that we will be able to persist these CSV files for repeated use on other databases

Ideas about how to take a stream of JSON and turn it into CSV -> https://stackoverflow.com/questions/32960857/how-to-convert-arbitrary-simple-json-to-csv-using-jq
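For step 4, a minimal sketch of the load into Postgres, assuming the CSV files are readable from the database host (the file path and column order here are illustrative, not the exact commands used):

-- illustrative only: bulk-load one CSV chunk into the users table
COPY "10m_users"."users" (age, name, email, title, gender, height, weight, language,
    telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree)
FROM '/path/to/users_chunk_0.csv'
WITH (FORMAT csv, HEADER true);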

@octavia-squidington-iv added the area/connectors, area/documentation, and connectors/source/faker labels on Dec 16, 2022
@evantahler (Contributor, Author) commented:

Fun side effect: using ALL of the RAM for the terminal...
[Screenshot 2022-12-16 at 8:19 AM]

@evantahler (Contributor, Author) commented on Dec 16, 2022

Strategy update:

  • Generating the first 100GB of data is going well (1/3 of the way done)
  • Uploading this data will take a few hours, but it will likely be done today
  • Rather than generate totally unique data for the remaining ~900GB, we can duplicate the existing data (just adjusting the primary keys). So user 1 and user 203984065 will be identical save for the id. Their purchase history will also match exactly.

[Screenshot 2022-12-16 at 12:50 PM]

@evantahler (Contributor, Author) commented on Dec 16, 2022

SQL notes:

-- assuming the 10M dataset is complete and you want to fill the 20M dataset...

/** PRODUCTS **/
-- the products table is always an exact copy of our 100 products
INSERT INTO "20m_users"."products" (SELECT * FROM "10m_users"."products");

/** USERS **/
-- copying all data from the 10M users twice, sans primary key, works because auto-increment will move the `id` column along for us
-- note that you must omit the id (i.e. let it default to the sequence) for all inserts so the primary key sequence increments (you can't copy the IDs, even for the first batch)
INSERT INTO "20m_users"."users" (age, name, email, title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree) (SELECT age, name, email, title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree FROM "10m_users"."users" ORDER BY id ASC LIMIT 10000000);
-- we modify "email" to keep that column unique in subsequent runs
INSERT INTO "20m_users"."users" (age, name, email, title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree) (SELECT age, name, CONCAT(split_part(email, '@', 1), '+1@', split_part(email, '@', 2)), title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree FROM "10m_users"."users" ORDER BY id ASC LIMIT 10000000);

/** PURCHASES **/
-- first, copy over the normal data for purchases
INSERT INTO "20m_users"."purchases" (user_id, product_id, returned_at, purchased_at, added_to_cart_at) (SELECT user_id, product_id, returned_at, purchased_at, added_to_cart_at FROM "10m_users"."purchases");
-- then, do it again with the new ids offset by the duplication count used to insert users above
INSERT INTO "20m_users"."purchases" (user_id, product_id, returned_at, purchased_at, added_to_cart_at) (SELECT (user_id + 10000000), product_id, returned_at, purchased_at, added_to_cart_at FROM "10m_users"."purchases");

Postgres has loops?!

do $$
declare
    counter INTEGER := 0;
    total   INTEGER := 0; -- the tables are truncated below, so start the running count at 0
    goal    INTEGER := 200000000;
begin

    -- reset the tables
    TRUNCATE "200m_users"."users" RESTART IDENTITY CASCADE;
    TRUNCATE "200m_users"."purchases" RESTART IDENTITY CASCADE;
    TRUNCATE "200m_users"."products" RESTART IDENTITY CASCADE;
    raise notice 'TRUNCATE complete';

    -- static copy
    INSERT INTO "200m_users"."products" (SELECT * FROM "10m_users"."products");
    raise notice 'Static copy complete';

    -- dynamic copy: duplicate the 10M dataset (with unique emails and offset user_ids) until we hit the goal
    while total < goal loop
        counter := counter + 1;
        INSERT INTO "200m_users"."users" (age, name, email, title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree) (SELECT age, name, CONCAT(split_part(email, '@', 1), '+', counter, '@', split_part(email, '@', 2)), title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree FROM "10m_users"."users" ORDER BY id);
        INSERT INTO "200m_users"."purchases" (user_id, product_id, returned_at, purchased_at, added_to_cart_at) (SELECT (user_id + (counter * 10000000)), product_id, returned_at, purchased_at, added_to_cart_at FROM "10m_users"."purchases");
        raise notice 'counter: %', counter;
        total := (SELECT count(*) from "200m_users"."users");
        raise notice 'total users: %', total;
    end loop;

    raise notice 'Done!';
end; $$;
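A quick sanity check once the loop finishes (illustrative; the on-disk size is only approximate):

SELECT count(*) FROM "200m_users"."users";                              -- expect >= 200,000,000
SELECT pg_size_pretty(pg_total_relation_size('"200m_users"."users"'));  -- rough on-disk size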

@evantahler changed the title from "Evan/faker csv stream" to "Create realistic datasets of 10GB, 100GB, and 1TB in size" on Dec 16, 2022
@evantahler (Contributor, Author) commented on Dec 16, 2022

Backup of the final table schemas:

CREATE TABLE "1m_users"."users" (
    "id" BIGSERIAL NOT NULL,
    "age" int8,
    "name" text,
    "email" varchar,
    "title" varchar,
    "gender" varchar,
    "height" float8,
    "weight" int4,
    "language" varchar,
    "telephone" varchar,
    "blood_type" varchar,
    "created_at" timestamptz,
    "occupation" varchar,
    "updated_at" timestamptz,
    "nationality" varchar,
    "academic_degree" varchar,
    PRIMARY KEY ("id")
);



CREATE TABLE "1m_users"."purchases" (
    "id" BIGSERIAL NOT NULL,
    "user_id" int8,
    "product_id" int8,
    "returned_at" timestamptz,
    "purchased_at" timestamptz,
    "added_to_cart_at" timestamptz,
    PRIMARY KEY ("id")
);


CREATE TABLE "1m_users"."products" (
    "id" BIGSERIAL NOT NULL,
    "make" text,
    "year" text,
    "model" text,
    "price" float8,
    "created_at" timestamptz,
    PRIMARY KEY ("id")
);

CREATE INDEX "purchases_user_id_fk" ON "1m_users"."purchases" USING BTREE ("user_id");
CREATE INDEX "idx_email" ON "1m_users"."users" USING BTREE ("email");
CREATE INDEX "idx_created_at" ON "1m_users"."users" USING BTREE ("created_at");
CREATE INDEX "idx_updated_at" ON "1m_users"."users" USING BTREE ("updated_at");

@evantahler (Contributor, Author) commented on Dec 16, 2022

Data backed up to GCS at airbyte-performance-testing-public/sample-data

@evantahler evantahler changed the base branch from master to evan/faker-hydra December 19, 2022 23:40
@evantahler evantahler marked this pull request as ready for review December 22, 2022 22:54
@evantahler evantahler merged commit f178b14 into evan/faker-hydra Dec 22, 2022
@evantahler evantahler deleted the evan/faker-csv-stream branch December 22, 2022 22:54
octavia-approvington pushed a commit that referenced this pull request Jan 3, 2023
* [faker] decouple stream state

* add PR #

* commit Stream instantiate changes

* fixup expected record

* skip backward test for this version too

* Apply suggestions from code review

Co-authored-by: Augustin <augustin@airbyte.io>

* lint

* Create realistic datasets of 10GB, 100GB, and 1TB in size (#20558)

* Faker CSV Streaming utilities

* readme

* don't do a final pipe to jq or you will run out of RAM

* doc

* Faker gets 250% faster (#20741)

* Faker is 250% faster

* threads in spec + lint

* pass tests

* revert changes to record helper

* cleanup

* update expected_records

* bump default records-per-slice to 1k

* enforce unique email addresses

* cleanup

* more comments

* `parallelism` and pass tests

* update expected records

* cleanup notes

* update readme

* update expected records

* auto-bump connector version

Co-authored-by: Augustin <augustin@airbyte.io>
Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>