Kaggle datasets into BigQuery

☁️ loading and troubleshooting • 2023

This repository shows how to load publicly available datasets from Kaggle into BigQuery.

Tags: dataset, kaggle, bigquery

Steps

📍 Kaggle Dataset used in this repository: « Marketing Funnel by Olist »
Step 1 • Download the dataset from Kaggle
Step 2 • On BigQuery: create a project / select a project
Step 3 • Click [ +ADD ]
Step 4 • Click [ Local file | Upload a local file ]
Step 5 • CREATE NEW DATASET / Choose from loaded datasets
Step 6 • Create dataset | Set a Dataset ID | Set expiration if applicable
Step 7 • CREATE TABLE --> Select file
Step 8 • CREATE TABLE --> Set a Table Name
Step 9 • CREATE TABLE --> Header rows to skip = 1
Step 10 • CREATE TABLE --> Add fields, defining the SCHEMA
🧐 TROUBLESHOOTING
✔️ TABLE CREATED

Troubleshooting items

Troubleshooting 1 • « Could not parse 'field_value' as DATE for field @field_name »
Troubleshooting 2 • « CSV table references column position N+1, but line starting at position:X contains only N columns »

📍 Kaggle Datasets used in this repository: « Marketing Funnel by Olist »

Kaggle Datasets used in this repository (License: CC BY-NC-SA 4.0):
  ↳ « Brazilian E-Commerce Public Dataset by Olist » 100k orders with product, customer and reviews info, and
  ↳ « Marketing Funnel by Olist » 8k leads, closed deals and connection to 100k orders
      ↳ related to « Joining Marketing Funnel with Brazilian E-Commerce » Python

↑

Step 1 • Download the dataset from Kaggle

↑

Step 2 • On BigQuery: create a project / select a project

↑

Step 3 • Click [ +ADD ]

↑

Step 4 • Click [ Local file | Upload a local file ]

↑

Step 5 • CREATE NEW DATASET / Choose from loaded datasets

↑

Step 6 • Create dataset | Set a Dataset ID | Set expiration if applicable

↑

Step 7 • CREATE TABLE --> Select file

↑

Step 8 • CREATE TABLE --> Set a Table Name

↑

Step 9 • CREATE TABLE --> Header rows to skip = 1

↑

Step 10 • CREATE TABLE --> Add fields, defining the SCHEMA

10---a---schema-is-available-on-kaggle-data-card--6MB.mp4

10---b---add-fields+---6MB.mp4

↑

🧐 TROUBLESHOOTING

Troubleshooting 1 • « Could not parse 'field_value' as DATE for field @field_name »

Message:
« Could not parse '2018-01-18 0:00:00' as DATE for field review_creation_date (position 5) starting at location 122 with message 'Unable to parse' »

↳ Try changing Field Type from DATE to DATETIME

Troubleshooting 1 • « Could not parse 'field_value' as DATE for field @field_name » (warning)

↑

Troubleshooting 2 • « CSV table references column position N+1, but line starting at position:X contains only N columns »

Message:
« CSV table references column position 6, but line starting at position:1765 contains only 5 columns »
➥ A possible reason for it is the confusing of rich text commas , with delimiting commas in a comma-separated value (.csv) table.
➥ Some approaches have been tried, none having worked so far ( if you know and would like to share the answer, would appreciate to learn it 💬 )

   ↳ ✅ Attempt 8 (finally: worked!): regex deleting on Google Sheets: paragraph breaks, commas, and double quote marks: \n|,|")
   ↳ ❌ Attempt 1 (didn't work): tab separated values (.tsv) table (set as csv with tab delimiter on BigQuery)
   ↳ ❌ Attempt 2 (didn't work): deleting commas on Google Sheets
   ↳ ❌ Attempt 3 (didn't work): find and replace blank values by 'unknown' (didn't find)
   ↳ ❌ Attempt 4 (didn't work): auxiliary columns to replace blanks by 'unknown': replaced, but didn't solve
   ↳ ❌ Attempt 5 (didn't work): change col type from number to text, where appropriate
   ↳ ❌ Attempt 6 (didn't work): formatting datetime as YYYY-MM-DD HH:MM:SS
   ↳ ❌ Attempt 7 (didn't work): relieving the requirement for a value on the id field

Troubleshooting 2 • « CSV table references column position N+1, but line starting at position:X contains only N columns » (warning)

✅ Attempt 8 (finally: worked!): regex deleting on Google Sheets: paragraph breaks, commas, and double quote marks: \n|,|")

❌ Attempt 1 (didn't work): tab separated values (.tsv) table (set as csv with tab delimiter on BigQuery)

11---TROUBLESHOOTING---2---dw-1-tsv---1-gsheets

11---TROUBLESHOOTING---2---dw-1-tsv---2-bq

11---TROUBLESHOOTING---2---dw-1-tsv---3-bq-field-delimiter-tab

11---TROUBLESHOOTING---2---dw-1-tsv---4-didnt-work

❌ Attempt 2 (didn't work): deleting commas on Google Sheets

11---TROUBLESHOOTING---2---dw-2-no-commas---1-gsheets

11---TROUBLESHOOTING---2---dw-2-no-commas---2-didnt-work

❌ Attempt 3 (didn't work): find and replace blank values by 'unknown' (didn't find)

11---TROUBLESHOOTING---2---dw-3-find-replace---1-gsheets

11---TROUBLESHOOTING---2---dw-3-find-replace---1-gsheets---didnt-find

❌ Attempt 4 (didn't work): auxiliary columns to replace blanks by 'unknown': replaced, but didn't solve

11---TROUBLESHOOTING---2---dw-4-unknown-aux-col---1-gsheets---PASTE-VALUES

❌ Attempt 5 (didn't work): change col type from number to text, where appropriate

11---TROUBLESHOOTING---2---dw-5-col-type-text---1-gsheets

11---TROUBLESHOOTING---2---dw-5-col-type-text---2-gsheets

❌ Attempt 6 (didn't work): formatting datetime as `YYYY-MM-DD HH:MM:SS`

11---TROUBLESHOOTING---2---dw-6-col-type-datetime---1-gsheets

11---TROUBLESHOOTING---2---dw-6-col-type-datetime---2-gsheets

❌ Attempt 7 (didn't work): relieving the requirement for a value on the id field

11---TROUBLESHOOTING---2---dw-7-id-nullable---1-bq

↑

✔️ TABLE CREATED

SQL:

SELECT
  origin
  , count(origin) 
FROM `ecommerce-olist.olist.olist_marketing_qualified_leads_dataset` 
GROUP BY origin 
ORDER BY 2 desc;

File | On BigQuery

↑

Name		Name	Last commit message	Last commit date
Latest commit History 89 Commits
README.md		README.md
bq_kaggle_table_query.sql		bq_kaggle_table_query.sql

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Kaggle datasets into BigQuery

Steps

Troubleshooting items

📍 Kaggle Datasets used in this repository: « Marketing Funnel by Olist »

Step 1 • Download the dataset from Kaggle

Step 2 • On BigQuery: create a project / select a project

Step 3 • Click [ +ADD ]

Step 4 • Click [ Local file | Upload a local file ]

Step 5 • CREATE NEW DATASET / Choose from loaded datasets

Step 6 • Create dataset | Set a Dataset ID | Set expiration if applicable

Step 7 • CREATE TABLE --> Select file

Step 8 • CREATE TABLE --> Set a Table Name

Step 9 • CREATE TABLE --> Header rows to skip = 1

Step 10 • CREATE TABLE --> Add fields, defining the SCHEMA

🧐 TROUBLESHOOTING

Troubleshooting 1 • « Could not parse 'field_value' as DATE for field @field_name »

Troubleshooting 1 • « Could not parse 'field_value' as DATE for field @field_name » (warning)

Troubleshooting 2 • « CSV table references column position N+1, but line starting at position:X contains only N columns »

Troubleshooting 2 • « CSV table references column position N+1, but line starting at position:X contains only N columns » (warning)

✅ Attempt 8 (finally: worked!): regex deleting on Google Sheets: paragraph breaks, commas, and double quote marks: \n|,|")

❌ Attempt 1 (didn't work): tab separated values (.tsv) table (set as csv with tab delimiter on BigQuery)

❌ Attempt 2 (didn't work): deleting commas on Google Sheets

❌ Attempt 3 (didn't work): find and replace blank values by 'unknown' (didn't find)

❌ Attempt 4 (didn't work): auxiliary columns to replace blanks by 'unknown': replaced, but didn't solve

❌ Attempt 5 (didn't work): change col type from number to text, where appropriate

❌ Attempt 6 (didn't work): formatting datetime as `YYYY-MM-DD HH:MM:SS`

❌ Attempt 7 (didn't work): relieving the requirement for a value on the id field

✔️ TABLE CREATED

About

Releases

Packages

isis-santos-costa/kaggle-datasets-in-bigquery

Folders and files

Latest commit

History

Repository files navigation

Kaggle datasets into BigQuery

Steps

Troubleshooting items

📍 Kaggle Datasets used in this repository: « Marketing Funnel by Olist »

Step 1 • Download the dataset from Kaggle

Step 2 • On BigQuery: create a project / select a project

Step 3 • Click [ +ADD ]

Step 4 • Click [ Local file | Upload a local file ]

Step 5 • CREATE NEW DATASET / Choose from loaded datasets

Step 6 • Create dataset | Set a Dataset ID | Set expiration if applicable

Step 7 • CREATE TABLE --> Select file

Step 8 • CREATE TABLE --> Set a Table Name

Step 9 • CREATE TABLE --> Header rows to skip = 1

Step 10 • CREATE TABLE --> Add fields, defining the SCHEMA

🧐 TROUBLESHOOTING

Troubleshooting 1 • « Could not parse 'field_value' as DATE for field @field_name »

Troubleshooting 1 • « Could not parse 'field_value' as DATE for field @field_name » (warning)

Troubleshooting 2 • « CSV table references column position N+1, but line starting at position:X contains only N columns »

Troubleshooting 2 • « CSV table references column position N+1, but line starting at position:X contains only N columns » (warning)

✅ Attempt 8 (finally: worked!): regex deleting on Google Sheets: paragraph breaks, commas, and double quote marks: \n|,|")

❌ Attempt 1 (didn't work): tab separated values (.tsv) table (set as csv with tab delimiter on BigQuery)

❌ Attempt 2 (didn't work): deleting commas on Google Sheets

❌ Attempt 3 (didn't work): find and replace blank values by 'unknown' (didn't find)

❌ Attempt 4 (didn't work): auxiliary columns to replace blanks by 'unknown': replaced, but didn't solve

❌ Attempt 5 (didn't work): change col type from number to text, where appropriate

❌ Attempt 6 (didn't work): formatting datetime as `YYYY-MM-DD HH:MM:SS`

❌ Attempt 7 (didn't work): relieving the requirement for a value on the id field

✔️ TABLE CREATED

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages