
Create realistic datasets of 10GB, 100GB, and 1TB in size #20558

Merged: 4 commits from evan/faker-csv-stream into evan/faker-hydra on Dec 22, 2022

Conversation

@evantahler (Contributor) commented on Dec 16, 2022

Closes #20556


We need big data to test with (specifically, @akashkulk does)!

Idea 1: Sync data using Faker

[Screenshot 2022-12-15 at 3:50 PM]

We made 3 source-faker connections that would produce the desired amounts of data:

SELECT
    pg_size_pretty(pg_total_relation_size('"public"."users_10M"'))       -- 2909MB
  , pg_size_pretty(pg_total_relation_size('"public"."purchases_10M"'));  -- 2237MB

-- (2909 + 2237) / 1024 = 5.02GB per 10M users
So, extrapolating from ~5.02GB per 10 million users, we are going to need:

  • 2 billion faker users for 1TB: 10,000,000 × (1024 / 5.02) = 2,039,840,637
  • 200 million faker users for 100GB: 10,000,000 × (100 / 5.02) = 199,203,187
  • 20 million faker users for 10GB: 10,000,000 × (10 / 5.02) = 19,920,318
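The same extrapolation as a throwaway query (purely illustrative, not part of the PR; the 5.02GB-per-10M-users figure comes from the measurement above, and floor() matches the truncated numbers quoted):

-- projected user counts per target size
SELECT target_gb,
       floor(10000000 * (target_gb / 5.02))::bigint AS users_needed
FROM (VALUES (10.0), (100.0), (1024.0)) AS t(target_gb);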

While the first, small sync completed (20M faker users, 10GB of data), the remaining two keep hitting platform errors and failing after a few days. Given the current instability of the platform, this approach is unlikely to work. Also of note, each attempt leaves lingering airbyte_tmp tables around, steadily increasing the server's storage cost...

Idea 2: Generate CSV files

Borrowing from the Performance Research, the next plan is to produce a number of CSV files containing the proper data.

To do this, we will:

  1. Modify the faker source to emit data over stdout in CSV format
  2. Pipe that data to CSV files on disk in 10GB chunks; we can manually manipulate faker's /secrets/state.json to move the cursor forward for each chunk
  3. Upload that data to Google Cloud Storage
  4. Load that data into Postgres via the "import from cloud storage" option (see the sketch below)

A bonus of this approach is that we will be able to persist these CSV files for repeated use on other databases

Ideas about how to take a stream of JSON and turn it into CSV -> https://stackoverflow.com/questions/32960857/how-to-convert-arbitrary-simple-json-to-csv-using-jq
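For step 4, a minimal sketch of the load into Postgres, assuming the CSV files are readable from the database host (the file path and column order here are illustrative, not the exact commands used):

-- illustrative only: bulk-load one CSV chunk into the users table
COPY "10m_users"."users" (age, name, email, title, gender, height, weight, language,
    telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree)
FROM '/path/to/users_chunk_0.csv'
WITH (FORMAT csv, HEADER true);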

@octavia-squidington-iv added the area/connectors, area/documentation, and connectors/source/faker labels on Dec 16, 2022
@evantahler (Contributor, Author) commented:

Fun side effect: using ALL of the RAM for the terminal...
[Screenshot 2022-12-16 at 8:19 AM]

@evantahler (Contributor, Author) commented on Dec 16, 2022

Strategy update:

  • Generating the first 100GB of data is going well (1/3 of the way done)
  • Uploading this data will take a few hours, but it will likely be done today
  • Rather than generate totally unique data for the remaining ~900GB, we can duplicate the existing data (just adjusting the primary keys). So user 1 and user 203984065 will be identical save for the id. Their purchase history will also match exactly.

[Screenshot 2022-12-16 at 12:50 PM]

@evantahler (Contributor, Author) commented on Dec 16, 2022

SQL notes:

-- assuming the 10M dataset is complete and you want to fill the 20M dataset...

/** PRODUCTS **/
-- the products table is always an exact copy of our 100 products
INSERT INTO "20m_users"."products" (SELECT * FROM "10m_users"."products");

/** USERS **/
-- copying all data from the 10M users twice, sans primary key, works because auto-increment will move the `id` column along for us
-- note that you must omit the id (i.e. let it default to the sequence) for all inserts so the primary key sequence increments (you can't copy the IDs, even for the first batch)
INSERT INTO "20m_users"."users" (age, name, email, title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree) (SELECT age, name, email, title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree FROM "10m_users"."users" ORDER BY id ASC LIMIT 10000000);
-- we modify "email" to keep that column unique in subsequent runs
INSERT INTO "20m_users"."users" (age, name, email, title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree) (SELECT age, name, CONCAT(split_part(email, '@', 1), '+1@', split_part(email, '@', 2)), title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree FROM "10m_users"."users" ORDER BY id ASC LIMIT 10000000);

/** PURCHASES **/
-- first, copy over the normal data for purchases
INSERT INTO "20m_users"."purchases" (user_id, product_id, returned_at, purchased_at, added_to_cart_at) (SELECT user_id, product_id, returned_at, purchased_at, added_to_cart_at FROM "10m_users"."purchases");
-- then, do it again with the new ids offset by the duplication count used to insert users above
INSERT INTO "20m_users"."purchases" (user_id, product_id, returned_at, purchased_at, added_to_cart_at) (SELECT (user_id + 10000000), product_id, returned_at, purchased_at, added_to_cart_at FROM "10m_users"."purchases");

Postgres has loops?!

do $$
declare
    counter INTEGER := 0;
    total   INTEGER := 0; -- the tables are truncated below, so start the running count at 0
    goal    INTEGER := 200000000;
begin

    -- reset the tables
    TRUNCATE "200m_users"."users" RESTART IDENTITY CASCADE;
    TRUNCATE "200m_users"."purchases" RESTART IDENTITY CASCADE;
    TRUNCATE "200m_users"."products" RESTART IDENTITY CASCADE;
    raise notice 'TRUNCATE complete';

    -- static copy
    INSERT INTO "200m_users"."products" (SELECT * FROM "10m_users"."products");
    raise notice 'Static copy complete';

    -- dynamic copy: duplicate the 10M dataset (with unique emails and offset user_ids) until we hit the goal
    while total < goal loop
        counter := counter + 1;
        INSERT INTO "200m_users"."users" (age, name, email, title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree) (SELECT age, name, CONCAT(split_part(email, '@', 1), '+', counter, '@', split_part(email, '@', 2)), title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality, academic_degree FROM "10m_users"."users" ORDER BY id);
        INSERT INTO "200m_users"."purchases" (user_id, product_id, returned_at, purchased_at, added_to_cart_at) (SELECT (user_id + (counter * 10000000)), product_id, returned_at, purchased_at, added_to_cart_at FROM "10m_users"."purchases");
        raise notice 'counter: %', counter;
        total := (SELECT count(*) from "200m_users"."users");
        raise notice 'total users: %', total;
    end loop;

    raise notice 'Done!';
end; $$;
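A quick sanity check once the loop finishes (illustrative; the on-disk size is only approximate):

SELECT count(*) FROM "200m_users"."users";                              -- expect >= 200,000,000
SELECT pg_size_pretty(pg_total_relation_size('"200m_users"."users"'));  -- rough on-disk size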

@evantahler changed the title from "Evan/faker csv stream" to "Create realistic datasets of 10GB, 100GB, and 1TB in size" on Dec 16, 2022
@evantahler (Contributor, Author) commented on Dec 16, 2022

Backup of the final table schemas:

CREATE TABLE "1m_users"."users" (
    "id" BIGSERIAL NOT NULL,
    "age" int8,
    "name" text,
    "email" varchar,
    "title" varchar,
    "gender" varchar,
    "height" float8,
    "weight" int4,
    "language" varchar,
    "telephone" varchar,
    "blood_type" varchar,
    "created_at" timestamptz,
    "occupation" varchar,
    "updated_at" timestamptz,
    "nationality" varchar,
    "academic_degree" varchar,
    PRIMARY KEY ("id")
);



CREATE TABLE "1m_users"."purchases" (
    "id" BIGSERIAL NOT NULL,
    "user_id" int8,
    "product_id" int8,
    "returned_at" timestamptz,
    "purchased_at" timestamptz,
    "added_to_cart_at" timestamptz,
    PRIMARY KEY ("id")
);


CREATE TABLE "1m_users"."products" (
    "id" BIGSERIAL NOT NULL,
    "make" text,
    "year" text,
    "model" text,
    "price" float8,
    "created_at" timestamptz,
    PRIMARY KEY ("id")
);

CREATE INDEX "purchases_user_id_fk" ON "1m_users"."purchases" USING BTREE ("user_id");
CREATE INDEX "idx_email" ON "1m_users"."users" USING BTREE ("email");
CREATE INDEX "idx_created_at" ON "1m_users"."users" USING BTREE ("created_at");
CREATE INDEX "idx_updated_at" ON "1m_users"."users" USING BTREE ("updated_at");

@evantahler (Contributor, Author) commented on Dec 16, 2022

Data backed up to GCS at airbyte-performance-testing-public/sample-data

@evantahler evantahler changed the base branch from master to evan/faker-hydra December 19, 2022 23:40
@evantahler evantahler marked this pull request as ready for review December 22, 2022 22:54
@evantahler evantahler merged commit f178b14 into evan/faker-hydra Dec 22, 2022
@evantahler evantahler deleted the evan/faker-csv-stream branch December 22, 2022 22:54
octavia-approvington pushed a commit that referenced this pull request Jan 3, 2023
* [faker] decouple stream state

* add PR #

* commit Stream instantiate changes

* fixup expected record

* skip backward test for this version too

* Apply suggestions from code review

Co-authored-by: Augustin <augustin@airbyte.io>

* lint

* Create realistic datasets of 10GB, 100GB, and 1TB in size (#20558)

* Faker CSV Streaming utilities

* readme

* don't do a final pipe to jq or you will run out of RAM

* doc

* Faker gets 250% faster (#20741)

* Faker is 250% faster

* threads in spec + lint

* pass tests

* revert changes to record helper

* cleanup

* update expected_records

* bump default records-per-slice to 1k

* enforce unique email addresses

* cleanup

* more comments

* `parallelism` and pass tests

* update expected records

* cleanup notes

* update readme

* update expected records

* auto-bump connector version

Co-authored-by: Augustin <augustin@airbyte.io>
Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>