☁️ loading and troubleshooting • 2023
This repository shows how to load publicly available datasets from Kaggle into BigQuery.
Tags: dataset, kaggle, bigquery
📍 Kaggle Dataset used in this repository: « Marketing Funnel by Olist »
Step 1 • Download the dataset from Kaggle
Step 2 • On BigQuery: create a project / select a project
Step 3 • Click [ +ADD ]
Step 4 • Click [ Local file | Upload a local file ]
Step 5 • CREATE NEW DATASET / Choose from loaded datasets
Step 6 • Create dataset | Set a Dataset ID | Set expiration if applicable
Step 7 • CREATE TABLE --> Select file
Step 8 • CREATE TABLE --> Set a Table Name
Step 9 • CREATE TABLE --> Header rows to skip = 1
Step 10 • CREATE TABLE --> Add fields, defining the SCHEMA
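For Step 10, a rough first draft of the schema can be guessed from the file itself before typing the fields in. This is a minimal sketch, not part of the original walkthrough: `guess_type` and `draft_schema` are hypothetical helper names, the sample columns are made up, and the heuristic only tells INTEGER, FLOAT and STRING apart, so the result should still be checked against the Kaggle data card.

```python
import csv
import io


def guess_type(values):
    """Guess a BigQuery field type from sample cell values (rough heuristic)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "STRING"
    for bq_type, cast in (("INTEGER", int), ("FLOAT", float)):
        try:
            for v in non_empty:
                cast(v)
            return bq_type
        except ValueError:
            continue
    return "STRING"


def draft_schema(csv_text, sample_rows=100):
    """Return [(column_name, guessed_type), ...] from the header and first rows."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader)  # the row that "Header rows to skip = 1" skips
    rows = [row for _, row in zip(range(sample_rows), reader)]
    schema = []
    for i, name in enumerate(header):
        column = [row[i] for row in rows if i < len(row)]
        schema.append((name, guess_type(column)))
    return schema


# Made-up sample, only to show the shape of the output.
sample_csv = "id,price,origin\n1,9.5,email\n2,3,social\n"
for name, bq_type in draft_schema(sample_csv):
    print(name, bq_type)
```

Date-like columns come out as STRING here; they need a closer look by hand.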
🧐 TROUBLESHOOTING
✔️ TABLE CREATED
Troubleshooting 1 • « Could not parse 'field_value' as DATE for field @field_name »
Troubleshooting 2 • « CSV table references column position N+1, but line starting at position:X contains only N columns »
Kaggle Datasets used in this repository (License: CC BY-NC-SA 4.0):
↳ « Brazilian E-Commerce Public Dataset by Olist »: 100k orders with product, customer and review info
↳ « Marketing Funnel by Olist »: 8k leads, closed deals and connection to the 100k orders
↳ related to « Joining Marketing Funnel with Brazilian E-Commerce » (Python)
Step 10 • Video (a): the schema is available on the Kaggle data card
Step 10 • Video (b): add fields with [ + ]
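Troubleshooting 1 • « Could not parse 'field_value' as DATE for field @field_name »
Message:
« Could not parse '2018-01-18 0:00:00' as DATE for field review_creation_date (position 5) starting at location 122 with message 'Unable to parse' »
↳ The value carries a time component, so it is not a plain DATE. Try changing the Field Type from DATE to DATETIME.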
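To decide between DATE and DATETIME before creating the table, a sample value can be checked locally. This is a minimal sketch under assumptions: `temporal_type` is a hypothetical helper, and the two formats below are only the ones relevant to the message above, not a full list of what BigQuery accepts.

```python
from datetime import datetime

# Formats tried in order (illustrative shortlist, an assumption).
CANDIDATES = (
    ("DATE", "%Y-%m-%d"),
    ("DATETIME", "%Y-%m-%d %H:%M:%S"),
)


def temporal_type(value):
    """Return 'DATE', 'DATETIME', or None for one sample cell value."""
    for bq_type, fmt in CANDIDATES:
        try:
            datetime.strptime(value, fmt)
            return bq_type
        except ValueError:
            pass  # value does not match this format, try the next one
    return None


print(temporal_type("2018-01-18 0:00:00"))  # value from the error message -> DATETIME
print(temporal_type("2018-01-18"))          # -> DATE
```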
Troubleshooting 2 • « CSV table references column position N+1, but line starting at position:X contains only N columns »
Message:
« CSV table references column position 6, but line starting at position:1765 contains only 5 columns »
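The message points at a position in the file, so a first step can be to look at the line that starts there. This is a minimal sketch, assuming the reported position is a byte offset from the start of the file; `line_at` and `field_count` are hypothetical helpers and the sample bytes are made up.

```python
import csv


def line_at(data: bytes, position: int) -> str:
    """Return the physical line that starts at byte offset `position`."""
    end = data.find(b"\n", position)
    if end == -1:
        end = len(data)
    return data[position:end].decode("utf-8", errors="replace").rstrip("\r")


def field_count(line: str, delimiter: str = ",") -> int:
    """Count the fields of one physical line, keeping quoted commas together."""
    return len(next(csv.reader([line], delimiter=delimiter)))


# Made-up file content: the third record has a line break inside a quoted field.
sample = b'id,comment,score\n1,"good, fast",5\n2,"line one\nline two",4\n'
pos = sample.index(b'2,"line')  # stands in for the position from the message

bad_line = line_at(sample, pos)
print(repr(bad_line))
print(field_count(bad_line), "fields found, 3 expected")
```

With a real file, `data` would be the result of `open(path, "rb").read()` and `pos` the number reported after « position: ».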
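➥ A possible cause is that commas inside free-text fields are confused with the delimiting commas of the comma-separated values (.csv) table. Since deleting commas alone (Attempt 2) did not help, paragraph breaks inside text fields, which split one record across several lines, are another likely cause.
➥ Several approaches were tried; only the last one worked ( if you know a better fix and would like to share it, I would appreciate learning it 💬 )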
↳ ✅ Attempt 8 (finally: worked!): regex find-and-replace on Google Sheets, deleting paragraph breaks, commas, and double quote marks: \n|,|"
↳ ❌ Attempt 1 (didn't work): tab-separated values (.tsv) table (set as CSV with tab delimiter on BigQuery)
↳ ❌ Attempt 2 (didn't work): deleting commas on Google Sheets
↳ ❌ Attempt 3 (didn't work): find and replace blank values with 'unknown' (no blanks were found)
↳ ❌ Attempt 4 (didn't work): auxiliary columns to replace blanks with 'unknown': replaced, but it didn't solve the error
↳ ❌ Attempt 5 (didn't work): changing the column type from number to text, where appropriate
↳ ❌ Attempt 6 (didn't work): formatting datetime as YYYY-MM-DD HH:MM:SS
↳ ❌ Attempt 7 (didn't work): relaxing the requirement for a value in the id field
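The cleanup from Attempt 8 can also be done outside Google Sheets. This is a minimal sketch using the same pattern, \n|,|", applied to every cell; `clean_csv` is a hypothetical helper and the sample text is made up. One deliberate difference from Attempt 8: matches are replaced by a space (and repeated spaces collapsed) instead of being deleted, so that words do not run together.

```python
import csv
import io
import re

UNSAFE = re.compile(r'\n|,|"')  # same pattern as Attempt 8


def clean_csv(text: str) -> str:
    """Rewrite CSV text with line breaks, commas and quotes removed from every cell."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Replace each match with a space, then collapse runs of whitespace.
        writer.writerow([" ".join(UNSAFE.sub(" ", cell).split()) for cell in row])
    return out.getvalue()


# Made-up sample with a quoted comma and a quoted line break.
sample = 'id,comment,score\n1,"good, fast",5\n2,"line one\nline two",4\n'
print(clean_csv(sample))
```

After cleaning, every physical line holds exactly one record. BigQuery's CSV load options also include « Quoted newlines » (`allow_quoted_newlines` in the API), which might make this cleanup unnecessary; that option was not tried here.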
SQL:
-- Count marketing qualified leads per origin, most frequent first
SELECT
  origin
  , COUNT(origin)
FROM `ecommerce-olist.olist.olist_marketing_qualified_leads_dataset`
GROUP BY origin
ORDER BY 2 DESC;
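To cross-check the query result against the local file, the same count can be sketched in Python. This is an illustration only: `count_by` is a hypothetical helper and the sample rows are made up. Empty cells are skipped here, whereas in BigQuery a NULL origin would still appear as its own group with a count of 0.

```python
import csv
import io
from collections import Counter


def count_by(csv_text: str, column: str):
    """Return [(value, count), ...] for one column, highest count first."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    counts = Counter(row[column] for row in reader if row[column] != "")
    return counts.most_common()


# Made-up rows; a real run would pass the text of the cleaned CSV file.
sample = "id,origin\n1,email\n2,social\n3,email\n"
for value, n in count_by(sample, "origin"):
    print(value, n)
```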