
JSON-Plaintext-to-SQL

This is example code for converting a large-scale (16 GB) failed JSON dump into a usable SQL DB.

The Issue

This dataset came from a large-scale Reddit API scrape of the subreddit r/videos, but the scraping script failed to output proper JSON.

Rather than dumping the API responses as a proper JSON document, the script appended each JSON object response to a file as plain text. The appropriate JSON would have been the following -

[{"author": "[deleted]", "author_flair_css_class": null, "author_flair_text": null, "created_utc": 1356998512, "domain": "youtube.com", "full_link": "https://www.reddit.com/r/videos/comments/15qzf0/as_2013_approaches_look_back_on_the_past_year_in/", "gilded": 0, "id": "15qzf0", "is_self": false, "media_embed": {}, "mod_reports": [], "num_comments": 0, "over_18": false, "permalink": "/r/videos/comments/15qzf0/as_2013_approaches_look_back_on_the_past_year_in/", "retrieved_on": 1413242528, "score": 1, "secure_media_embed": {}, "selftext": "", "stickied": false, "subreddit": "videos", "subreddit_id": "t5_2qh1e", "thumbnail": "default", "title": "As 2013 approaches, look back on the past year in movies. \"2012: The Cinescape\" by Matt Shapiro.", "url": "http://www.youtube.com/watch?&v=qVw1_PadudE", "user_reports": []},

{"author": "[deleted]", "author_flair_css_class": null, "author_flair_text": null, "created_utc": 1356998655, "domain": "youtube.com", "full_link": "https://www.reddit.com/r/videos/comments/15qzk2/happy_new_year/", "gilded": 0, "id": "15qzk2", "is_self": false, "media": {"oembed": {"author_name": "DamItsGood808", "author_url": "http://www.youtube.com/user/DamItsGood808", "description": "HAPPY NEW YEAR!!! SEND TO YOUR FRIENDS! music by Neil Cicierega - http://www.youtube.com/user/neilcicierega", "height": 338, "html": "<iframe width=\"600\" height=\"338\" src=\"http://www.youtube.com/embed/nRkVa1-Iog8?feature=oembed\" frameborder=\"0\" allowfullscreen></iframe>", "provider_name": "YouTube", "provider_url": "http://www.youtube.com/", "thumbnail_height": 360, "thumbnail_url": "http://i3.ytimg.com/vi/nRkVa1-Iog8/hqdefault.jpg", "thumbnail_width": 480, "title": "2013 DANCE!!!", "type": "video", "url": "http://www.youtube.com/watch?v=nRkVa1-Iog8", "version": "1.0", "width": 600}, "type": "youtube.com"}, "media_embed": {"content": "<iframe width=\"600\" height=\"338\" src=\"http://www.youtube.com/embed/nRkVa1-Iog8?feature=oembed\" frameborder=\"0\" allowfullscreen></iframe>", "height": 338, "scrolling": false, "width": 600}, "mod_reports": [], "num_comments": 0, "over_18": false, "permalink": "/r/videos/comments/15qzk2/happy_new_year/", "retrieved_on": 1413242522, "score": 0, "secure_media_embed": {}, "selftext": "", "stickied": false, "subreddit": "videos", "subreddit_id": "t5_2qh1e", "thumbnail": "default", "title": "Happy New Year", "url": "http://www.youtube.com/watch?v=nRkVa1-Iog8", "user_reports": []}]

But the file was -

{"author": "[deleted]", "author_flair_css_class": null, "author_flair_text": null, "created_utc": 1356998512, "domain": "youtube.com", "full_link": "https://www.reddit.com/r/videos/comments/15qzf0/as_2013_approaches_look_back_on_the_past_year_in/", "gilded": 0, "id": "15qzf0", "is_self": false, "media_embed": {}, "mod_reports": [], "num_comments": 0, "over_18": false, "permalink": "/r/videos/comments/15qzf0/as_2013_approaches_look_back_on_the_past_year_in/", "retrieved_on": 1413242528, "score": 1, "secure_media_embed": {}, "selftext": "", "stickied": false, "subreddit": "videos", "subreddit_id": "t5_2qh1e", "thumbnail": "default", "title": "As 2013 approaches, look back on the past year in movies. \"2012: The Cinescape\" by Matt Shapiro.", "url": "http://www.youtube.com/watch?&v=qVw1_PadudE", "user_reports": []}

{"author": "[deleted]", "author_flair_css_class": null, "author_flair_text": null, "created_utc": 1356998655, "domain": "youtube.com", "full_link": "https://www.reddit.com/r/videos/comments/15qzk2/happy_new_year/", "gilded": 0, "id": "15qzk2", "is_self": false, "media": {"oembed": {"author_name": "DamItsGood808", "author_url": "http://www.youtube.com/user/DamItsGood808", "description": "HAPPY NEW YEAR!!! SEND TO YOUR FRIENDS! music by Neil Cicierega - http://www.youtube.com/user/neilcicierega", "height": 338, "html": "<iframe width=\"600\" height=\"338\" src=\"http://www.youtube.com/embed/nRkVa1-Iog8?feature=oembed\" frameborder=\"0\" allowfullscreen></iframe>", "provider_name": "YouTube", "provider_url": "http://www.youtube.com/", "thumbnail_height": 360, "thumbnail_url": "http://i3.ytimg.com/vi/nRkVa1-Iog8/hqdefault.jpg", "thumbnail_width": 480, "title": "2013 DANCE!!!", "type": "video", "url": "http://www.youtube.com/watch?v=nRkVa1-Iog8", "version": "1.0", "width": 600}, "type": "youtube.com"}, "media_embed": {"content": "<iframe width=\"600\" height=\"338\" src=\"http://www.youtube.com/embed/nRkVa1-Iog8?feature=oembed\" frameborder=\"0\" allowfullscreen></iframe>", "height": 338, "scrolling": false, "width": 600}, "mod_reports": [], "num_comments": 0, "over_18": false, "permalink": "/r/videos/comments/15qzk2/happy_new_year/", "retrieved_on": 1413242522, "score": 0, "secure_media_embed": {}, "selftext": "", "stickied": false, "subreddit": "videos", "subreddit_id": "t5_2qh1e", "thumbnail": "default", "title": "Happy New Year", "url": "http://www.youtube.com/watch?v=nRkVa1-Iog8", "user_reports": []}

The script also never dumped the JSON responses into a DB, whether SQL or NoSQL; the whole set was appended to a plaintext file. This resulted in a 16 GB plaintext file that was difficult to process with any standard operation.

The Desired Output

The desired operations are -

  1. Convert the plaintext file to an SQLite3 DB
  2. Extract only the necessary keys from the JSON objects and convert them to columns (one possible schema is sketched below)
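
As a rough sketch, the target table might look like the following. The column choice here is an assumption based on the keys visible in the sample objects above, not the repository's actual schema.

import sqlite3

# Hypothetical target schema: one column per JSON key we want to keep.
# Column names are taken from the sample objects above; the selection
# itself is an assumption for illustration.
conn = sqlite3.connect("submissions.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS submissions (
    id           TEXT PRIMARY KEY,
    author       TEXT,
    created_utc  INTEGER,
    title        TEXT,
    url          TEXT,
    score        INTEGER,
    num_comments INTEGER,
    permalink    TEXT
)
""")
conn.commit()
conn.close()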

Tools

For this operation I will be using the following tools -

  1. Scripting language: Python
  2. Python modules: json, os, sqlite3, traceback
  3. DB: SQLite3
  4. Version control: Git, GitHub
  5. Additional tools: terminal applications (head, split)

Process

Breaking down the 16 GB file

The 16 GB file is too large for a standard machine to handle in one piece, so I first break the file into smaller, manageable pieces with the terminal application called split. The command was -

split -l 5000 ./submissions.json

The arguments for the split application -

  1. -l - split by line count
  2. 5000 - start a new file every 5000 lines
  3. submissions.json - the 16 GB plaintext file

This operation resulted in 965 files of 5000 lines each. The files differ in size because I split by lines, not by size.

Traversing and Parsing Data

Now we have to parse the data from each of these files. The structure is the same from file to file.

We can use the loads method of Python's json module to parse the data; it converts the string representation of a JSON object into a Python dictionary. To do that, we first split the file into individual object strings, then append each parsed JSON object to a list.

To traverse the filesystem I am going to use the os.listdir method.

For each file, we parse the objects into dictionaries, collect them into a list of dictionaries, and insert that list into an SQLite3 DB, as sketched below.
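
A minimal sketch of that loop, assuming each JSON object sits on its own line and the split files live in a directory named splits (both the directory name and the helper function are illustrative):

import json
import os

SPLIT_DIR = "splits"  # hypothetical directory holding the split files

def parse_file(path):
    # Each non-blank line is the string representation of one JSON object.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

for name in sorted(os.listdir(SPLIT_DIR)):
    records = parse_file(os.path.join(SPLIT_DIR, name))
    # records is now a list of dictionaries, ready for the batched
    # SQLite3 insert described in the next section.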

SQLite3 DB

Because I am converting an incredible amount of data, in my first attempt I tried to do an execute followed by a commit for each line of data, which was a terrible, terrible idea. Now I am using executemany and doing a single commit per batch.
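
A sketch of the batched insert, reusing the hypothetical submissions table sketched earlier:

import sqlite3

KEYS = ("id", "author", "created_utc", "title", "url",
        "score", "num_comments", "permalink")

def insert_batch(conn, records):
    # One executemany call and one commit per batch is far cheaper
    # than an execute + commit round trip for every single row.
    rows = [tuple(record.get(key) for key in KEYS) for record in records]
    conn.executemany(
        "INSERT OR IGNORE INTO submissions VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()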

End Result

I was able to convert the 16 GB JSON file into a manageable 1.5 GB SQLite3 DB with 4,822,371 rows.


Disclaimer

I had to use some try and except blocks because the JSON file had some irregularities; I logged the erroneous data along with the traceback info, via traceback.format_exc(), to a log file. The amount of erroneous data was negligible.
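
A sketch of that guard; the function and the log file name are illustrative:

import json
import traceback

def safe_parse(line, log_path="errors.log"):
    # Irregular lines are logged with their traceback and skipped.
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(line + "\n" + traceback.format_exc() + "\n")
        return None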

The original working script was written in a plain scripting style and later converted to an OOP style.
