Create realistic datasets of 10GB, 100GB, and 1TB in size #20558
Merged
Conversation
octavia-squidington-iv added the area/connectors (Connector related issues), area/documentation (Improvements or additions to documentation), and connectors/source/faker labels on Dec 16, 2022
Strategy update:

SQL notes:

-- assuming the 10M dataset is complete and you want to fill the 20M dataset...
/** PRODUCTS **/
-- the products table is always an exact copy of our 100 products
INSERT INTO "20m_users"."products" (SELECT * FROM "10m_users"."products");
/** USERS **/
-- copying all data from the 10M users twice, sans primary key, will work because auto-increment will move the `id` column along for us (id = null)
-- note that you need to use id=null for all inserts so the primary key sequence increments (you can't copy the IDs for the first batch)
INSERT INTO "20m_users"."users" (age, name, email, title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality ,academic_degree) (SELECT age, name, email, title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality ,academic_degree FROM "10m_users"."users" ORDER BY ID ASC LIMIT 10000000);
-- we modify "email" to keep that column unique in subsequent runs
INSERT INTO "20m_users"."users" (age, name, email, title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality ,academic_degree) (SELECT age, name, CONCAT(split_part(email, '@', 1), '+1@', split_part(email, '@', 2)), title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality ,academic_degree FROM "10m_users"."users" ORDER BY ID ASC LIMIT 10000000);
/** PURCHASES **/
-- first, copy over the normal data for purchases
INSERT INTO "20m_users"."purchases" (user_id, product_id, returned_at, purchased_at, added_to_cart_at) (SELECT user_id, product_id, returned_at, purchased_at, added_to_cart_at FROM "10m_users"."purchases");
-- then, do it again with the new ids offset by the duplication count used to insert users above
INSERT INTO "20m_users"."purchases" (user_id, product_id, returned_at, purchased_at, added_to_cart_at) (SELECT (user_id + 10000000), product_id, returned_at, purchased_at, added_to_cart_at FROM "10m_users"."purchases"); Postgres has loops?! do $$
declare
counter INTEGER := 0;
total INTEGER := (SELECT count(*) from "200m_users"."users");
goal INTEGER := 200000000;
begin
-- reset the tables
TRUNCATE "200m_users"."users" RESTART IDENTITY CASCADE;
TRUNCATE "200m_users"."purchases" RESTART IDENTITY CASCADE;
TRUNCATE "200m_users"."products" RESTART IDENTITY CASCADE;
raise notice 'TRUNCATE complete';
-- static copy
INSERT INTO "200m_users"."products" (SELECT * FROM "10m_users"."products");
raise notice 'Static copy complete';
-- dynamic copy
while total < goal loop
counter := counter + 1;
INSERT INTO "200m_users"."users" (age, name, email, title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality ,academic_degree) (SELECT age, name, CONCAT(split_part(email, '@', 1), '+', counter, '@', split_part(email, '@', 2)), title, gender, height, weight, language, telephone, blood_type, created_at, occupation, updated_at, nationality ,academic_degree FROM "10m_users"."users" ORDER BY ID);
INSERT INTO "200m_users"."purchases" (user_id, product_id, returned_at, purchased_at, added_to_cart_at) (SELECT (user_id + (counter * 10000000)), product_id, returned_at, purchased_at, added_to_cart_at FROM "10m_users"."purchases");
raise notice 'counter: %', counter;
total := (SELECT count(*) from "200m_users"."users");
raise notice 'total users: %', total;
end loop;
raise notice 'Done!';
end; $$;
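
One sanity check worth running after the loop (my addition, not part of the original script): the '+counter' email suffix is what keeps the copied rows unique, so any rows returned here mean the duplication math went wrong.

-- Suggested verification (not in the original script): emails should remain
-- unique across all duplicated batches thanks to the '+counter' suffix.
SELECT email, count(*) AS n
FROM "200m_users"."users"
GROUP BY email
HAVING count(*) > 1
LIMIT 10; -- expect zero rows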
evantahler changed the title from "Evan/faker csv stream" to "Create realistic datasets of 10GB, 100GB, and 1TB in size" on Dec 16, 2022
Backup of the final table schemas:

CREATE TABLE "1m_users"."users" (
"id" BIGSERIAL NOT NULL,
"age" int8,
"name" text,
"email" varchar,
"title" varchar,
"gender" varchar,
"height" float8,
"weight" int4,
"language" varchar,
"telephone" varchar,
"blood_type" varchar,
"created_at" timestamptz,
"occupation" varchar,
"updated_at" timestamptz,
"nationality" varchar,
"academic_degree" varchar,
PRIMARY KEY ("id")
);
CREATE TABLE "1m_users"."purchases" (
"id" BIGSERIAL NOT NULL,
"user_id" int8,
"product_id" int8,
"returned_at" timestamptz,
"purchased_at" timestamptz,
"added_to_cart_at" timestamptz,
PRIMARY KEY ("id")
);
CREATE TABLE "1m_users"."products" (
"id" BIGSERIAL NOT NULL,
"make" text,
"year" text,
"model" text,
"price" float8,
"created_at" timestamptz,
PRIMARY KEY ("id")
);
CREATE INDEX "purchases_user_id_fk" ON "1m_users"."purchases" USING BTREE ("user_id");
CREATE INDEX "idx_email" ON "1m_users"."users" USING BTREE ("email");
CREATE INDEX "idx_created_at" ON "1m_users"."users" USING BTREE ("created_at");
CREATE INDEX "idx_updated_at" ON "1m_users"."users" USING BTREE ("updated_at"); |
Data backed up to GCS.
octavia-approvington pushed a commit that referenced this pull request on Jan 3, 2023:
* [faker] decouple stream state
* add PR #
* commit Stream instantiate changes
* fixup expected record
* skip backward test for this version too
* Apply suggestions from code review
* lint
* Create realistic datasets of 10GB, 100GB, and 1TB in size (#20558)
* Faker CSV Streaming utilities
* readme
* don't do a final pipe to jq or you will run out of ram
* doc
* Faker gets 250% faster (#20741)
* Faker is 250% faster
* threads in spec + lint
* pass tests
* revert changes to record helper
* cleanup
* update expected_records
* bump default records-per-slice to 1k
* enforce unique email addresses
* cleanup
* more comments
* `parallelism` and pass tests
* update expected records
* cleanup notes
* update readme
* update expected records
* auto-bump connector version

Co-authored-by: Augustin <augustin@airbyte.io>
Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
Closes #20556
We need big data to test with (specifically, @akashkulk does)!
Idea 1: Sync data using Faker
We made 3 source-faker connections which would produce the desired amount of data. While the first small sync completed (20M faker users, 10GB of data), the remaining 2 kept hitting platform errors and failing after a few days. Due to the current instability of the platform, this is unlikely to work. Also of note, each attempt leaves lingering airbyte_tmp tables around, ever increasing the cost of the server's storage...

Idea 2: Generate CSV files
Borrowing from the Performance Research, the next plan is to produce a number of CSV files which will contain the proper data.
To do this, we will use /secrets/state.json to move our cursor forward for each chunk. A bonus of this approach is that we will be able to persist these CSV files for repeated use on other databases.
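
For the "repeated use on other databases" part, one plausible route (a sketch of mine; the file path and header layout are assumptions, not something decided in this PR) is Postgres's COPY:

-- Hypothetical bulk load of a persisted CSV into a fresh schema.
-- '/data/users.csv' is an illustrative path, not a real artifact of this PR.
COPY "10m_users"."users" (age, name, email, title, gender, height, weight,
  language, telephone, blood_type, created_at, occupation, updated_at,
  nationality, academic_degree)
FROM '/data/users.csv' WITH (FORMAT csv, HEADER true);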
Ideas about how to take a stream of JSON and turn it into CSV -> https://stackoverflow.com/questions/32960857/how-to-convert-arbitrary-simple-json-to-csv-using-jq