Skip to content

Latest commit

 

History

History
518 lines (384 loc) · 20 KB

ProblemStatement.md

File metadata and controls

518 lines (384 loc) · 20 KB

Mongo-oplog-to-sql

Problem Statement

Write a program which can parse MongoDB operations log (Oplog) and generate equivalent SQL statements.

We have a scenario where an organisation was using MongoDB at the start but now needs to move to a RDBMS database. This data transition can be made easy if we can find a way to convert the JSON documents in MongoDB collections to equivalent rows in relational DB tables. That's the purpose of this program.

The MongoDB server generates the Oplog which is an ordered collection of all the write operations (insert, update, delete) to the MongoDB. Your job is to parse these oplogs and generate equivalent SQL statements.

There’s already an open source tool, stampede, that converts MongoDB oplogs to sql; we are simply attempting to develop an implementation in Go.

A sample MongoDB oplog looks like:

{
  "op" : "i",
  "ns" : "test.student",
  "o" : {
    "_id" : "635b79e231d82a8ab1de863b",
    "name" : "Selena Miller",
    "roll_no" : 51,
    "is_graduated" : false,
    "date_of_birth" : "2000-01-30"
  }
}

The main fields in the oplog are:

  • op: This indicates the type of operation. It can be i (insert), u (update), d (delete), c (command), n (no operation). For this implementation, we’ll only care about insert, update and delete operations.
  • ns: This indicates the namespace. Namespace consists of database and collection name separated by a . In above case, database name is test and collection name is student.
  • o: This indicates the new data for insert or update operation. In above case, a student document is inserted in the collection.

The oplog contains some other fields like version, timestamp, etc. but for our consideration, we can ignore those.

We have divided the problem statement into multiple stories. You’re supposed to implement the stories.

Note: All these stories are dependent on each other. i.e. stories (and their test cases) need to be executed sequentially for it to work.

Story 1

Parse the insert oplog JSON and convert that into equivalent SQL insert statement. When inserting a record, check if the table exists. If not, create it first.

Here’s the mapping of MongoDB concepts to their equivalent relational database concepts

  • Database in MongoDB maps to schema in relational database
  • Collection in MongoDB maps to table in relational database
  • A single JSON document in MongoDB maps typically to a row in relational database.

Sample Input:

{
  "op" : "i",
  "ns" : "test.student",
  "o" : {
    "_id" : "635b79e231d82a8ab1de863b",
    "name" : "Selena Miller",
    "roll_no" : 51,
    "is_graduated" : false,
    "date_of_birth" : "2000-01-30"
  }
}

Expected Output:

INSERT INTO test.student 
   (_id, date_of_birth, is_graduated, name, roll_no) 
VALUES 
   ('635b79e231d82a8ab1de863b', '2000-01-30', false, 'Selena Miller', 51);

You can assume that the above oplog is generated by following MongoDB command that inserts some data in student table.

use test;

db.student.insertOne(
	{
	name: "Selena Miller", 
	roll_no:51, 
	is_graduated:false, 
	date_of_birth: "2000-01-30"
	}
);

Assumptions:

  • The op key indicates the operation. i stands for insert operation.
  • The oplog contains some other fields like version, timestamp, etc. but for our consideration, we can ignore those. Hence, the above oplog contains only the fields which are relevant for our SQL conversion use case.
  • The ns key indicates a combination of db name and collection name. In above example, db name is test and collection name is student. Db name and collection name will be separated by a .
  • For simplicity, for now, assume that there are no nested objects in the Mongo collection.

Expectations:

  • Your code should be generic enough to extract the db and collection name from ns field. It should also extract following types from JSON - string, boolean and number. In above example, name is a string variable, roll_no is a number variable and is_graduated is boolean. For now, you can treat date_of_birth as string (and not a date field type)

Story 2

Parse the update oplog JSON and convert that into equivalent SQL update statement.

Sample Input for Setting:

{
   "op":"u",
   "ns":"test.student",
   "o":{
      "$v":2,
      "diff":{
         "u":{
            "is_graduated":true
         }
      }
   },
   "o2": {
      "_id":"635b79e231d82a8ab1de863b"
   }
}

Expected Output:

UPDATE test.student SET is_graduated = true WHERE _id = '635b79e231d82a8ab1de863b';

Sample Input for Un-setting:

{
   "op":"u",
   "ns":"test.student",
   "o":{
      "$v":2,
      "diff":{
         "d":{
            "roll_no":false
         }
      }
   },
   "o2":{
      "_id":"635b79e231d82a8ab1de863b"
   }
}

Expected Output:

UPDATE test.student SET roll_no = NULL WHERE _id = '635b79e231d82a8ab1de863b';

Assumptions:

  • The u in op key stands for update operation.
  • The o field contains the update operation details. In this case, it includes the following subfields:
    • The $v field specifies the protocol version used for the update operation. In this case, the value is 2.
    • The diff field represents the changes being made to the document. In this case, it contains the following subfield:
      • u: modifies a field of a document and sets the value of a field in the document.
      • d: removes a field from a document.
  • The key o2 represents the row identifier or the WHERE clause field in SQL
  • For simplicity, assume that the _id would always be the updation criteria
  • Assume that no new column will be added via this update operation for now.
  • Assume that the table and the data exist from before (which can be done manually).

Story 3

Parse the delete oplog JSON and convert that into equivalent SQL delete statement.

Sample Input:

{
  "op" : "d",
  "ns" : "test.student",
  "o" : {
    "_id" : "635b79e231d82a8ab1de863b"
  }
}

Expected Output:

DELETE FROM test.student WHERE _id = '635b79e231d82a8ab1de863b';

Assumptions:

  • The d in op key stands for delete.
  • The o key contains the _id of the field to be deleted
  • For simplicity, assume that the _id would always be the deletion criteria
  • Assume that the table exists from before (which can be done manually).

Story 4 (create table with one oplog entry)

This story is the modification of Story 1. In this story, you’ll parse the same insert oplog JSON from Story 1 and convert it to equivalent SQL statements. However, you’ll also generate the create schema and create table statements along with insert into statement.

Features to implement in this story:

  • Generate CREATE SCHEMA SQL statement
  • Generate CREATE TABLE SQL statement
  • Generate INSERT INTO SQL statement

Sample Input:

{
  "op": "i",
  "ns": "test.student",
  "o": {
    "_id": "635b79e231d82a8ab1de863b",
    "name": "Selena Miller",
    "roll_no": 51,
    "is_graduated": false,
    "date_of_birth": "2000-01-30"
  }
}

Expected output:

CREATE SCHEMA test;

CREATE TABLE test.student
  (
     _id           VARCHAR(255) PRIMARY KEY,
     date_of_birth VARCHAR(255),
     is_graduated  BOOLEAN,
     name          VARCHAR(255),
     roll_no       INTEGER
  );

INSERT INTO test.student (_id, name, roll_no, is_graduated, date_of_birth) VALUES ('635b79e231d82a8ab1de863b', 'Selena Miller', 51, false, '2000-01-30');

Assumptions:

  • In the above output, the create table statement is split into multiple lines. This is done only for readability purpose. You should generate the create table statement in a single line.
  • Feel free to modify the code and tests written as part of Story 1. The input to Story 1 and 4 is same, but in the output, we now expect create schema and create table statements as well.

Expectations:

  • You should be able to run all the SQL statement generated by your program into PostgreSQL without any errors.

Story 5 (create table with multiple oplog entries)

Until now, we were handling only one oplog at a time. However, now we need to handle multiple insert oplogs for the same collection. For simplicity, let’s assume that there are no field changes across these two oplogs. The only thing that changes is the value of the fields.

As per previous story, create schema and create table statements are generated for every insert oplog. Now, we need to fix the issue where the create schema and create table statements are generated only once for each collection.

Sample Input:

[
  {
    "op": "i",
    "ns": "test.student",
    "o": {
      "_id": "635b79e231d82a8ab1de863b",
      "name": "Selena Miller",
      "roll_no": 51,
      "is_graduated": false,
      "date_of_birth": "2000-01-30"
    }
  },
  {
    "op": "i",
    "ns": "test.student",
    "o": {
      "_id": "14798c213f273a7ca2cf5174",
      "name": "George Smith",
      "roll_no": 21,
      "is_graduated": true,
      "date_of_birth": "2001-03-23"
    }
  }
]

In the input, there are two insert oplogs for the same database and collection. The only difference in the two oplogs is the values of JSON fields. The type and the number of fields are same for both oplogs.

Expected output:

CREATE SCHEMA test;

CREATE TABLE test.student
  (
     _id           VARCHAR(255) PRIMARY KEY,
     date_of_birth VARCHAR(255),
     is_graduated  BOOLEAN,
     name          VARCHAR(255),
     roll_no       FLOAT
  );

INSERT INTO test.student (_id, date_of_birth, is_graduated, name, roll_no) VALUES ('635b79e231d82a8ab1de863b', '2000-01-30', false, 'Selena Miller', 51.0);

INSERT INTO test.student (_id, date_of_birth, is_graduated, name, roll_no) VALUES ('14798c213f273a7ca2cf5174', '2001-03-23', true, 'George Smith', 21.0);

Assumptions:

  • In the above output, the create table statement is split into multiple lines. This is done only for readability purpose. You should generate the create table statement in a single line.

Expectations:

  • The input to the program has changed from a single oplog JSON to an array of oplogs. Make sure your code is able to handle both.
  • You will have to modify the code and tests written as part of Story 4.

Story 6 (alter table with multiple oplog entries)

The input for this story is very similar to Story 5 above. Except, in the second oplog, there’s a new field -phone. Your job is to generate an alter table statement and then generate an insert into statement for the second oplog.

Thus, you’ll have to generate SQL statements in the following order:

  • Generate CREATE SCHEMA SQL statement
  • Generate CREATE TABLE SQL statement
  • Generate INSERT INTO SQL statement
  • Generate ALTER TABLE SQL statement
  • Generate INSERT INTO SQL statement

Sample Input:

[
  {
    "op": "i",
    "ns": "test.student",
    "o": {
      "_id": "635b79e231d82a8ab1de863b",
      "name": "Selena Miller",
      "roll_no": 51,
      "is_graduated": false,
      "date_of_birth": "2000-01-30"
    }
  },
  {
    "op": "i",
    "ns": "test.student",
    "o": {
      "_id": "14798c213f273a7ca2cf5174",
      "name": "George Smith",
      "roll_no": 21,
      "is_graduated": true,
      "date_of_birth": "2001-03-23",
      "phone": "+91-81254966457"
    }
  }
]

Expected Output:

CREATE SCHEMA test;

CREATE TABLE test.student
  (
     _id           VARCHAR(255) PRIMARY KEY,
     date_of_birth VARCHAR(255),
     is_graduated  BOOLEAN,
     name          VARCHAR(255),
     roll_no       FLOAT
  );

INSERT INTO test.student (_id, date_of_birth, is_graduated, name, roll_no) VALUES ('635b79e231d82a8ab1de863b', '2000-01-30', false, 'Selena Miller', 51.0);

ALTER TABLE test.student ADD COLUMN phone VARCHAR(255);

INSERT INTO test.student (_id, date_of_birth, is_graduated, name, roll_no, phone) VALUES ('14798c213f273a7ca2cf5174', '2001-03-23', true, 'George Smith', 21.0, '+91-81254966457');

Expectation:

  • You will have to modify the code and tests written as part of Story 5.
  • Your code should also handle case of more than two oplogs for the same collection.
  • Your program should assign null values to columns for which the JSON fields are missing.
  • Currently, in the sample input, we are only considering an addition of one field (phone). However, your program should handle addition of any number of new fields and generate those many number of alter table statements.

Story 7 (handle nested mongo document)

So far, we have handled simple JSON documents in MongoDB. In this story, you will have to handle parsing of oplogs of nested JSON objects.

Sample Input:

{
  "op": "i",
  "ns": "test.student",
  "o": {
    "_id": "635b79e231d82a8ab1de863b",
    "name": "Selena Miller",
    "roll_no": 51,
    "is_graduated": false,
    "date_of_birth": "2000-01-30",
    "address": [
      {
        "line1": "481 Harborsburgh",
        "zip": "89799"
      },
      {
        "line1": "329 Flatside",
        "zip": "80872"
      }
    ],
    "phone": {
      "personal": "7678456640",
      "work": "8130097989"
    }
  }
}

In the input, we have phone, which is a single nested JSON object and we have address, which is an array of nested JSON objects. Let’s see how this is represented in the expected SQL below.

Expected Output:

CREATE SCHEMA test;

CREATE TABLE test.address(_id varchar(255) PRIMARY KEY,line1 varchar(255),student__id varchar(255),zip varchar(255));

INSERT INTO test.address (_id,line1,student__id,zip) VALUES ('1642f9ad-3856-4b21-8f85-b6886701a1de','481 Harborsburgh','635b79e231d82a8ab1de863b','89799');

INSERT INTO test.address (_id,line1,student__id,zip) VALUES ('21d4807b-aecf-4a33-8ff3-6b33a7cecec8','329 Flatside','635b79e231d82a8ab1de863b','80872');

CREATE TABLE test.phone(_id varchar(255) PRIMARY KEY,personal varchar(255),student__id varchar(255),work varchar(255));

INSERT INTO test.phone (_id,personal,student__id,work) VALUES ('e32996dd-ee28-416f-b3f9-0a62d4bd0a95','7678456640','635b79e231d82a8ab1de863b','8130097989');

CREATE TABLE test.student(_id varchar(255) PRIMARY KEY,date_of_birth varchar(255),is_graduated boolean,name varchar(255),roll_no float);

INSERT INTO test.student (_id,date_of_birth,is_graduated,name,roll_no) VALUES ('635b79e231d82a8ab1de863b','2000-01-30',false,'Selena Miller',51.000000);

Since we are dealing with relational data, we will have to create multiple tables with foreign key references. For simplicity, we will not create actual foreign key constraints in any of the tables.

We will create the separate tables for nested objects (i.e. address and phone), then insert records in those tables. Note the use of student__id column in the address and phone tables. This is the soft-foreign key to the student table’s id column. Once all the records are inserted in address and phone tables, we’ll insert records in the main parent student table.

Since we are not creating foreign key constraints and reference in the database, the order in which the tables are created and the order in which the records are inserted in those tables does not matter.

Assumptions:

  • For simplicity, assume nesting of JSON documents only at the top level.
  • Assume that_id is a varchar always.
  • We will not deal with the auto generation of _id. It will be a varchar, and will be a randomly generated UUID for foreign tables. _id for main table comes from the _id field of mongo oplog itself.
  • _id of foreign table needs to be created by the program automatically. This value then must be used for any reference to other related tables.
  • All the foreign tables/associated table will have the reference of primary key from main table. And the primary table will not have any referencing key from associated tables.
  • There’s no need to create actual foreign key references in SQL. The SQL JOINS will happen just by soft-references (i.e. values in referenced columns), not via actual foreign key constraints in database.

Expectations:

  • You’ll have to create associated tables with primary key as _id which will be a randomly generated UUID
  • You’ll have to create soft-foreign keys accordingly, so that all the details of employees can be fetched using JOINS.
  • The SQL should be generated such that it can be run on any relational DB without any modifications. For example, note the order of create table above. We generate the phone table first, and then the employees table. The phone table has reference to the employee table using employee__id column.
  • For simplicity, for a single nested object (e.g. phone), always generate a 1:M relationship. Note that the phone table has a foreign key reference to employee table using employee__id.

Story 8 (reading oplogs from a file)

In the above stories, we have parsed different types of MongoDB oplogs and generated equivalent SQL statements. Now, we have to read multiple oplogs from a JSON file containing MongoDB oplogs and convert them into equivalent SQL statement. You can use the example-input.json file as the input file.

Write the generated SQL statements in a file giving the file name as output.sql . You can compare your output file with example-output.sql.

Assumptions:

  • Take the file name which will contain MongoDB oplogs as an input in the argument.
  • Also, give the file name of output file as argument in which we are going to store the generated SQL statements.

Expectations:

  • Make sure while reading the file we process one oplog at a time. This will be helpful to you in further story where you would be getting data infinitely in stream.
  • You are expected to write the generated SQL statements in a file. You can compare your output file with our example-output.sql.

Story 9 (reading oplogs from MongoDB)

In the story 8, we are reading MongoDB oplogs from a JSON file. Now, we expect you to read the oplogs directly from MongoDB server and generate equivalent SQL statements. You should read the oplogs from MongoDb server until a user signal is given for termination.

Write the generated SQL statements directly in relational database.

Assumptions:

  • Assume Postgres as relational database for now when we are writing the generated SQL statements in relational database.

Expectations:

  • As we might not get a specific set of data i.e data can come in stream mode with certain intervals. So make sure while reading oplogs from MongoDB server, we process one oplog at a time.

  • It should terminate reading from MongoDB only when a user signal (ctrl + c) is given.

  • As you have build the code where you are parsing oplogs from a JSON file and writing the generated SQL statements in a SQL file and also reading oplogs directly from MongoDb server and writing the generated SQL statements in a relational database. We expect you to enable your code in a such way that it can accept the possibility of reading the oplogs from MongoDB server or a JSON file and writing the generated SQL statements in a file or in relational database. i.e.:

    JSON File → SQL File

    JSON File → Relational database

    MongoDB server → SQL File

    MongoDB server → Relational database

Instructions

  1. You will have to create a new repo in language of your choice (Java, Go, etc)
  2. Use the example-input.json file as the input to the program.
  3. The output should be an output.sql file which has the correct SQL statements to prepare the SQL database. You can compare your output with example-output.sql.
  4. If something is unclear, try to figure out the solution, given the test cases. If something is still unclear, assume what you think is right and make a note in the readme.

Evaluation Criteria

  1. The code should provide the necessary documentation to run the program.
  2. The code should produce correct output for the given input.
  3. The code should have necessary tests.
  4. Bonus points for code quality and best practices.