stevekwon211/Hello-Kaggle-Guide

Hello Kaggle!:wave:

I summarized what Kaggle is and how to use it after reading Kaggle's official documentation and Kaggle Guide.

I hope it helps those who, like me, are new to Kaggle.

If anything needs to be corrected, please open an Issue.

FYI, Hello Kaggle! rarely deals with Python programming or machine learning theory
and focuses on how to use Kaggle.

For those looking for programming, data science, and machine learning materials, I'll leave some links that have helped me.


Table of contents

  1. What is Kaggle?

  2. How is Kaggle used?

  3. Kaggle Competition?

  4. Getting Started with Kaggle

  5. Getting to know Notebook

  6. Competitions and Notebooks

  7. Competitions Progress Flow

  8. Rule of Competitions

  9. Flow of Technology in Kaggle

  10. Kaggle Dataset and API

  11. Finished!



What is Kaggle?

  • Kaggle is a platform that hosts data analysis competitions.

  • Competitions are commonly hosted by companies, which provide data that needs to be analyzed for their research challenges or key services.

  • The boom in artificial intelligence and machine learning has kept increasing the number of participants, and Kaggle was acquired by Google (whose parent company is Alphabet) in 2017.

  • Since the acquisition, Kaggle has become more than a platform: it is a key site for data scientists and engineers.


Kaggler? Kaggling?

  • Just as searching on Google is called Googling, Kaggle's users are called Kagglers, and taking part in a Competition is called Kaggling.

Kaggle Service and Features

  • Jobs
    • A Jobs service was originally provided, but it ended on December 22, 2020.
      Simply put, it had too few users.
      For more information, see https://www.kaggle.com/jobs-board-closed.

  • Course
    • Provides practical lectures on Python, machine learning, visualization, and so on.
    • Kaggle's courses can be quite useful if you haven't studied these topics step by step, or if you studied them from an outdated course.
    • All lectures are in English and free, and each comes with a certificate of completion.

English

  • Data scientists from all over the world gather here, so English is used by default.
  • Competition notices, Datasets, and Discussions are also in English.
    Below is a screenshot of Discussion and Site Forums. image
  • If you look at the profiles of Competition winners, they come from a variety of countries: the USA, Korea, Russia, China, India, and so on.

  • Programming Language
    • Python and R are the most commonly used.

Required Kaggling Knowledge

    Purpose | Knowledge Required
    --- | ---
    Competition participation | Python, R, data analysis
    Competition organizer | Data analysis, English
    Discussion with Kaggler | English
    Learning through Courses | English

Prepare before becoming Kaggler

  • Required: Internet, Python and R, PC
  • Recommended: a server or workstation with a GPU, and a high-capacity HDD or SSD



How is Kaggle used?

Infrastructure for data analytics

  • Kaggle is web-based and provides a tool for data analysis (Notebook).
  • It also provides a community where a variety of Kagglers can compete and cooperate.

Notebook

  • The programming environment for data analysis provided by Kaggle.
  • A SaaS environment that runs the code written in your Notebook on a server.
  • Because the programming environment is provided, there is no need to build a separate development environment. (No Python installation, Anaconda installation, etc.)
  • It is similar to Jupyter Notebook.
  • Provides a 4-core CPU + 16GB RAM by default. The GPU Server provides a 2-core CPU + GPU + 13GB RAM.
    It is free of charge, and the GPU can be used for 30 hours a week.

Dataset


  • The first thing to do when developing a machine learning-based data analysis program is to prepare a Dataset.
  • Datasets are either opened for academic purposes or created and released by Kagglers.
  • If you don't want to share your Dataset, you can set it to Private so that it is not visible to others.
  • Once a Dataset or Notebook is set to Public, the Apache 2.0 License is applied, so decide carefully.

Company Training

  • Example: staff training on creating neural network-based machine learning programs
      1. Sign up for Kaggle
      2. Employees copy the moderator's Notebook and get it ready to run
      3. Modify the neural network model in the Notebook
      4. Submit the results of the modified model to the Competition and check the score
  • What if we didn't use Kaggle?
      1. Set up a development environment on each training computer
      2. Distribute example machine learning programs (neural network models)
      3. Create a program that evaluates the model's results by converting them into scores
      4. Check the evaluation score of the executed model
      5. Modify the neural network model
      6. Confirm that the score changes depending on the outcome of the run

  • With Kaggle, building a development environment, checking scores, and deployment are much easier and less expensive.

Discussion

  • If you don't know something, you can ask in the Site Forums or in the Competition forums under Communities.
  • Communities image

  • Site Forums image



Kaggle Competition?

Refer to Competitions Documentation.

Featured, the most common Competition


  • Difficult competitions, generally with commercial purposes.
  • Most Kagglers participate in this type. In the competitions held so far, prizes have ranged from $100 to $1,500,000.

Research


  • It mainly deals with research topics and generally does not have prize money or rewards. (That said, all of the currently ongoing Research Competitions have prize money.)
  • Instead, you can do research while discussing with Kagglers who are less competitive and more intellectually curious.

Getting Started for New Kaggler


  • The Competitions shown here are for beginners.
  • In particular, three competitions are the most recommended and helpful for newcomers to machine learning: Titanic: Machine Learning from Disaster, House Prices: Advanced Regression Techniques, and Digit Recognizer.

Playground for data scientists and engineers


  • Competitions are held mainly on topics that data scientists and engineers might find interesting.
  • Playground is not necessarily easy. It usually covers recent academic or technical issues and public social issues.
  • In some cases, the organizers offer prize money or other rewards.

Recruitment for job opportunities


  • Companies host these Competitions, and the prize is usually a job interview opportunity. Participants can upload a resume at the end of the Competition.

Annual Competition held regularly

  • Kaggle has several regularly held Competitions. You can find the current ones on the Kaggle site. image

Analytics to effectively explain the results

  • This type is not explained in the Documentation, so I read the Analytics Competitions that are currently open and wrote this from them.
  • Judging from the evaluation and submission formats of each Competition, an Analytics entry is a Notebook that is submitted directly and scored by a person.
    The analysis should be presented according to the organizers' requirements. It resembles persuading a company's management through a presentation.



Getting Started with Kaggle

Sign Up

  • Before starting Kaggle, click the Register button in the upper right to sign up.

Take a look at Kaggle Courses

  • If you do not have enough knowledge about machine learning or data analytics, it is a good idea to study the areas you need in Courses, as described above.
  • Each course consists of 2 to 8 classes and offers a variety of hands-on examples.

Refer to Kaggle Progression System.
Before I explain how to become a Contributor, I will explain Kaggle Tiers and Medals.

Kaggle Tiers

  • Kaggle has a Progression System, which is simply the Kaggler's Tier.
    This rating is a good indicator of your ability as a data scientist.
    It also shows intuitively how much you've grown.

  • The Kaggle Tiers are divided into five levels, each with conditions that must be met to achieve it.

    • Novice
      image

    • Contributor
      image

    • Expert
      image

    • Master
      image

    • Grandmaster
      image

  • Also, as you can see in the pictures above, the Kaggle Tier is rated separately for Competitions, Datasets, Notebooks, and Discussion.

  • Click on the upper right account icon and select My Profile to go to the profile page.
    Then you can check your profile information and Kaggle activity content and tiers.


Medal

  • Medals show a Kaggler's performance in each field. They are awarded to:
    • Kagglers with excellent results in Competitions
    • Kagglers who write and share popular Notebooks
    • Kagglers who share useful Datasets
    • Kagglers who write good Comments

  • To become a Contributor you only need to satisfy the conditions. From Expert onward, however, you must collect the medals required in each discipline.
  • Competitions have different medal criteria depending on the number of teams participating.
    image

  • Datasets, Notebooks, and Discussion are evaluated by Votes. The more Votes a post has, the more Kagglers have recommended it.
    image
  • Note that each post receives only one medal in its category.
    For example, when a Dataset post reaches 20 Votes, its bronze medal is replaced by a silver medal.
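
The "only the highest medal counts" rule can be sketched in a few lines. This is only an illustration: the vote thresholds below are assumptions based on Kaggle's published progression rules (the actual values were shown in the image above and may change), and `MEDAL_THRESHOLDS` and `medal_for_votes` are hypothetical names, not part of any Kaggle API.

```python
# Assumed vote thresholds per category; check Kaggle's Progression page
# for the current values before relying on them.
MEDAL_THRESHOLDS = {
    "dataset": {"bronze": 5, "silver": 20, "gold": 50},
    "notebook": {"bronze": 5, "silver": 20, "gold": 50},
    "discussion": {"bronze": 1, "silver": 5, "gold": 10},
}


def medal_for_votes(category, votes):
    """Return the single highest medal earned by one post, or None."""
    thresholds = MEDAL_THRESHOLDS[category]
    earned = None
    # Walk from lowest to highest so the last match is the best medal.
    for medal in ("bronze", "silver", "gold"):
        if votes >= thresholds[medal]:
            earned = medal
    return earned


print(medal_for_votes("dataset", 4))
print(medal_for_votes("dataset", 20))
```

Under these assumed thresholds, a Dataset with 20 Votes holds a silver medal only, matching the example above.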

Being Contributor

1. Adding User Profile Information

  • Go to your profile, click Edit Profile, and enter the following:
    • Bio (self-introduction)
    • Occupation
    • Organization
    • City
  • In addition, you can freely set a profile image and social media links.

2. SMS Verification

  • Click Phone Verification on the profile screen.
  • Check the Country Code and Phone Number, tick the Not a Robot box, and click Send Code.
  • Enter the code you receive and click Verify to complete verification.

3. Run Script

  • You can achieve this by taking a Course or by creating your own Notebook and executing any code.
  • Step 4, Participate in the Competition, also runs a Notebook, so you can skip this step.

4. Participate in the Competition

  • Select one Competition in the 'Getting Started' category.
  • Once you are in, you can see the menu below in the middle of the screen.
    image
  • Click 'Notebooks' here and take a look at other people's notebooks.

  • Pick one notebook and open it. In the upper right corner you'll see a button like this: image. Click this button to copy the notebook.

  • Once the copy is complete, click Save Version in the upper right corner.
    • Version Name: You can enter a name.
    • Version Type: There are two options, Quick Save and Save & Run All (Commit). Quick Save saves the Notebook without executing it, and Save & Run All (Commit) executes it and saves the result.

  • Select Save & Run All here and press the Save button.

  • Go back to your profile and click Notebook to see the notebook you just copied.
    When you click on this notebook, there is an Output item in the right menu.
    Press Output, select Submission.csv, and click Submit to Competition on the right.

  • The screen will now move to the Leaderboard menu, and the submitted file will be scored automatically.
    After scoring, you can check your score and click Jump to your position on the leaderboard to see your ranking.

5. Comment on other people's posts or comments and cast an upvote (Make 1 comment & Cast 1 upvote)

  • In Discussion, choose a topic and click any post you are interested in (Getting Started in Site Forums is recommended).
  • Read it carefully and write a comment. If the post is useful or you like it, press Vote as well.

6. Now you are a Contributor!


Wait!

  • Let me add one more thing: Kaggle Rankings.
  • Rankings are kept separately for Competitions, Datasets, Notebooks, and Discussion.
  • The screenshot below shows the ranking in Competitions. You can also check how many people are in each tier. image



Getting to know Notebook


What can you do with your Notebook?

  • The primary purpose is programming for data analysis, and the programs you create run on the Kaggle server.
  • You can submit to a Competition or share your Notebook with other Kagglers. Some Notebooks are shared purely for teaching or demonstrating skills.
  • Use Code cells and Markdown cells to write code along with descriptions of the code, text, images, etc.
    How to use Markdown
    Markdown emoji-cheat-sheet
    I referred to the two links above when I first used Markdown, and I still look up the emoji cheat sheet whenever I need it.

Create and Use Notebook

  • Go to the Notebook menu and look in the upper right corner. There is a button like this: image. Click it.

  • Kaggle Notebook has two types: Script and Notebook.

    • Script is a way of writing and executing code as in a commonly used code editor.
    • Notebook is an interactive development environment similar to Jupyter Notebook. Its characteristic is that you can divide the code into cells and execute only the cells you want.


  • Press File in the upper left corner and hover your cursor over Edit Type to select the type. In addition, you can choose between Python and R in Language.
    Screenshot (1)

  • You can change the name by clicking the field at the top left, which looks like the picture below.
    image

  • The first time you create a Notebook, you will see the following code:
    # This Python 3 environment comes with many helpful analytics libraries installed
    # It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
    # For example, here's several helpful packages to load
    
    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    
    # Input data files are available in the read-only "../input/" directory
    # For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
    
    import os
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))
    
    # You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
    # You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
    The above code loads the NumPy and pandas libraries and then lists every file under the /kaggle/input directory.

  • Let's print Hello Kaggle! in the Notebook. Place the cursor in any code cell and press the + Code button to add a new cell.
  • Then complete the cell as follows:
    image

  • Press the play button image at the top left of the cell, or
    press Ctrl + Enter or Shift + Enter, to execute the code. The output will look like this:
    image

  • These are the functions of the buttons that can be seen in the cell.

    • image : Moves the cell up one position.
    • image : Moves the cell down one position.
    • image : Deletes the cell.
    • image/image : Hides or shows the cell.
    • image : Provides the following additional features:
      image
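
The Hello Kaggle! cell mentioned above survives only as a screenshot, so its exact content is an assumption. A minimal cell that produces that output could look like this:

```python
# A minimal code cell: build the greeting, then print it.
greeting = "Hello Kaggle!"
print(greeting)
```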

Various settings for Notebook

  • Set Public & Private
    • A Notebook can be made public to share it with other Kagglers. If you don't want to share it, or when you work as a team, you can set it to Private or share it with specific users.
    • Press the Share button in the upper right corner to open a window for the public or private setting.
    • If Privacy is set to Public, the Notebook is released under the Apache 2.0 License.
    • Use Collaborators to add users as collaborators.

  • Settings
    • Language : Sets the programming language, Python or R.
    • Environment : Sets the Docker image. Original keeps the development environment from when the Notebook was created, and Latest Available uses the latest development environment provided by Kaggle.
    • Accelerator : Sets whether to use a GPU or TPU.
    • GPU/TPU Quota : Shows the GPU and TPU usage time.
    • Internet : Sets whether the Notebook can connect to the Internet.
      Setting Internet to On lets you install packages. With a Google account, you can also use the BigQuery, Cloud Storage, and AutoML services from GCP (Google Cloud Platform).

How to import Data from Notebook

  • A Kaggle Notebook can use not only Competition data but also the wide variety of shared Datasets.
    To do so, the Dataset must first be added to the Notebook in one of the following ways.

    1. Create a new Notebook from the Dataset
    • Go to the Dataset you want to use and press image New Notebook; the files are set up automatically.

    2. Add to an existing Notebook
    • To add new data to an existing Notebook, first open the Notebook.
      Then click the image + Add Data button in the upper right corner.
      A window appears where you can search for the desired Dataset; choose it and press Add.

    3. Upload your own data
    • Go to the Data menu and click the image + New Data button in the upper right corner.
      Then enter a name in Enter Dataset Title and click Select Files to Upload to upload the files. (Compressed file types such as zip or tar.gz are also possible.)
      Finally, press Create to upload the Dataset. You can then import the uploaded Dataset using method 1 or 2.

    4. Use output data from another Notebook
    • If you follow method 2, a window appears in which you can click the Kernel Output Files tab to use the output data from another Notebook.

Use external packages in Notebook

  • External packages available through pip can be installed by clicking Console at the bottom of the Notebook and running pip install package_name.
    image

  • You can also use pip directly in a code cell, as in the following two examples:
    !pip install package_name
    import os
    os.system('pip install package_name')
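
A third option, offered here only as a sketch, is to call pip through `subprocess` with the interpreter that is running the Notebook. Unlike `os.system`, `subprocess.check_call` raises an error if the installation fails. The helper names `pip_install_command` and `pip_install` are hypothetical, and `package_name` is the same placeholder used above.

```python
import subprocess
import sys


def pip_install_command(package_name):
    """Build the pip command for the interpreter that runs this notebook."""
    return [sys.executable, "-m", "pip", "install", package_name]


def pip_install(package_name):
    """Run pip and raise CalledProcessError if the installation fails."""
    subprocess.check_call(pip_install_command(package_name))


# Only the command is shown here. Call pip_install("package_name") in a
# Notebook with Internet set to On to actually install the package.
print(pip_install_command("package_name"))
```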

Use Source Code from Dataset in Notebook

  • If you add a Dataset (example-dataset here) that contains the package hello_kaggle to your Notebook, you can add the ../input/example-dataset/hello_kaggle directory to the module search path.
    The code to add is as follows:

    import sys
    sys.path.append("../input/example-dataset/hello_kaggle")
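
How the search path relates to the import can be checked locally with a stand-in directory. The sketch below assumes `hello_kaggle` is a package directory containing an `__init__.py` with a hypothetical `greet()` function; in that layout, the entry added to `sys.path` is the directory that contains the package. If the directory instead holds loose `.py` modules, appending the directory itself, as in the snippet above, lets you import those modules by name.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a stand-in for ../input/example-dataset with a hello_kaggle package.
dataset_dir = Path(tempfile.mkdtemp()) / "example-dataset"
package_dir = dataset_dir / "hello_kaggle"
package_dir.mkdir(parents=True)
(package_dir / "__init__.py").write_text(
    "def greet():\n    return 'Hello Kaggle!'\n"
)

# Append the directory that CONTAINS the package, then import it by name.
sys.path.append(str(dataset_dir))
importlib.invalidate_caches()
hello_kaggle = importlib.import_module("hello_kaggle")

print(hello_kaggle.greet())
```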



Competitions and Notebooks

What else can the Notebook be used for besides data analysis Competition?

  • In general, if the goal is to win a prize, Notebooks are shared (Public) only after the Competition is finished.
    However, Kagglers can also share and discuss Notebooks while the Competition is in progress.

How to handle Data File to use in Competition Notebook?

  • When working on a Competition, the Data tab is located in the upper right corner of the Notebook. There are three types of files you can click on, each of which is described as follows.
    • train.csv : Training data with the correct answer labels.
    • test.csv : Test data without the correct answer labels.
    • Sample_submission.csv : An example of the submission file format.

  • Open the Data menu in the Competition to see what data each file contains.
    For example, let's look at Titanic - Machine Learning from Disaster.
    image
    In the picture above, click on the Data menu to read Overview as follows
    image
    If you go down further, you can select each file to view the data and download it as follows
    image

  • Let's use these files to build a model, then create and submit a csv file.
    (The same steps are explained in 4. Participate in the Competition.)

    • Click Save Version in the upper right corner of the Notebook screen. (If the code has not been executed yet, choose Save & Run All (Commit).)
    • In Save & Run All (Commit), Commit has the same meaning as a Git commit on GitHub, where I am writing this guide.
      Therefore, a Kaggle Notebook lets you refer back to previously saved versions of the source code.
  • Now return to your profile and click Notebook to see the notebook you just saved.
    When you click on this notebook, there is an Output item in the right menu.
    Press Output, select Submission.csv, and click Submit to Competition on the right.


  • The screen will now move to the Leaderboard menu, and the submitted file will be scored automatically.
    After scoring, you can check your score and click Jump to your position on the leaderboard to see your ranking.
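
To make the relationship between test.csv and the submission file concrete, here is a rough sketch that writes one prediction row per test row using only the standard library. It is not a real model: it predicts the same value for every row. The column names PassengerId and Survived follow the Titanic competition, the three sample rows are made up, and `write_constant_submission` is a hypothetical helper.

```python
import csv
import tempfile
from pathlib import Path


def write_constant_submission(test_path, submission_path, id_column,
                              target_column, value):
    """Write one row per test row, always predicting `value`."""
    rows_written = 0
    with open(test_path, newline="") as src, \
            open(submission_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst)
        writer.writerow([id_column, target_column])
        for row in reader:
            writer.writerow([row[id_column], value])
            rows_written += 1
    return rows_written


# A tiny made-up stand-in for test.csv so the sketch runs anywhere.
work_dir = Path(tempfile.mkdtemp())
test_csv = work_dir / "test.csv"
test_csv.write_text("PassengerId,Pclass\n892,3\n893,3\n894,2\n")
submission_csv = work_dir / "submission.csv"

rows_written = write_constant_submission(
    test_csv, submission_csv, "PassengerId", "Survived", 0
)
print(rows_written)
print(submission_csv.read_text())
```

In a Kaggle Notebook, the input path would point into /kaggle/input and the output would be written to /kaggle/working so that it appears under Output after Save & Run All.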



Competitions Progress Flow

  • The types and order presented here are the personal opinion of Toshiyuki Sakamoto, author of Kaggle Guide.

Baseline implementing the general-purpose algorithm

  • When you first start analyzing the data, obtain output data with a general-purpose algorithm.
  • Then develop machine learning models in earnest and compare their results with the output of the general-purpose algorithm.
  • If your model performs worse than the general-purpose algorithm, you can assume that the model has a problem.
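
The baseline comparison described above can be sketched as follows. As a stand-in for a "general-purpose algorithm", this example uses the simplest possible baseline, always predicting the most common training label; that choice, the toy labels, and the hypothetical model predictions are all assumptions made for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def majority_class_baseline(train_labels, n_predictions):
    """Predict the most frequent training label for every row."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n_predictions


train_labels = [0, 0, 0, 1, 1]
valid_labels = [0, 1, 0, 0, 1]
model_predictions = [0, 1, 0, 1, 1]  # hypothetical output of your model

baseline_predictions = majority_class_baseline(train_labels, len(valid_labels))
baseline_score = accuracy(valid_labels, baseline_predictions)
model_score = accuracy(valid_labels, model_predictions)

print(baseline_score, model_score)
if model_score < baseline_score:
    print("The model is worse than the baseline; look for a problem.")
```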

Data Analysis Notebook

  • This refers to a Notebook that analyzes the Competition data and visualizes it.
  • The focus is on identifying correlations, rules, and structure in the data, without creating data to submit. It also looks for independent variables that explain the dependent variable well.
  • If you have little Competition experience, a good start is to build knowledge and insight by looking at analyses by other Kagglers.

Fork Notebook

  • For those who are new to machine learning and Kaggle, one option is to fork a public Notebook instead of doing the data analysis or model development yourself.
  • Fork means to copy a version of the source code.
  • Press the image button at the top right of the Notebook you'd like to fork to copy it.

Merge, Blending, Stacking, Ensemble Notebook

  • Notebooks with words such as Merge, Blending, Stacking, or Ensemble in their titles.
  • As the names suggest, these Notebooks combine the results of several Notebooks.
  • Example: image
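
As a rough illustration of the simplest form of blending, the sketch below takes a weighted average of the predicted probabilities from several models, row by row. The function name `blend_predictions` and the numbers are hypothetical; real blending Notebooks typically read each model's submission file first and may use more elaborate schemes such as stacking.

```python
def blend_predictions(prediction_sets, weights=None):
    """Weighted average of several models' predictions, row by row."""
    if weights is None:
        weights = [1.0] * len(prediction_sets)
    total_weight = sum(weights)
    blended = []
    # zip(*...) groups the predictions that belong to the same row.
    for row_values in zip(*prediction_sets):
        weighted_sum = sum(w * v for w, v in zip(weights, row_values))
        blended.append(weighted_sum / total_weight)
    return blended


model_a = [0.25, 0.75]  # hypothetical probabilities from one Notebook
model_b = [0.75, 0.25]  # hypothetical probabilities from another

equal_blend = blend_predictions([model_a, model_b])
weighted_blend = blend_predictions([model_a, model_b], weights=[3, 1])
print(equal_blend)
print(weighted_blend)
```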

Conclusion of Competitions Progress Flow


  • Even when a Competition proceeds in this order, I think it is better to study a variety of Notebooks to understand the process rather than only looking at the winner's Notebook.
  • Also, a Competition is literally a competition, so a Notebook that is shared (public) is one its author does not expect to seriously affect their own score.
    In fact, if you look at winners' Notebooks, you can often see that they used the latest technology or a different solution from the shared Notebooks.



Rule of Competitions

  • Competitions in Kaggle sometimes have specific rules. This is because Competitions are usually hosted by a company or organization, and special rules are often created to achieve the results that the company or organization wants.

What rules should I check?

    1. Rules : To win a Competition, you must first know its rules. Check the Rules menu of each Competition.
    2. Evaluation : On the Evaluation page under Overview, check which evaluation function is applied. Usually, statistics-based functions are used.
    3. Per-person submission limit : If participants could check their score as often as they liked, submitting a result file after every small change, the Competition would not produce meaningful results, so there is usually a limit on the number of scored submissions.
    4. Notebook Only Competition : Results must be submitted using Kaggle Notebook only.
      When only Kaggle Notebook is used, Kagglers are more likely to share their Notebooks, and all participants can easily find good ideas by viewing them.
      Also, all participants have the same computing resources, which helps address the inequality between those who have personal workstations and those who do not.
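
As an example of the kind of statistics-based evaluation function mentioned in item 2, here is a small sketch of root mean squared error (RMSE). RMSE is only one common choice and is used here as an assumption; each Competition states its own metric on its Evaluation page.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: three perfect predictions and one that is off by 4.
score = rmse([1, 2, 3, 4], [1, 2, 3, 8])
print(score)
```

Reproducing the metric locally like this lets you estimate your score on held-out data without using up the limited number of scored submissions.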



Flow of Technology in Kaggle

Exploring in Closed Competition

  • One characteristic of Kaggle is that the Discussions and Notebooks of Competitions that ended long ago are kept.
    By looking at them, you can see which technologies were applied where and how.
  • Example
    Competition | Used Technology | Description
    --- | --- | ---
    Mercari Price Suggestion Challenge (2018.2) | TF-IDF Vector + Fully Connected Neural Network | Learn the frequency of each word with neural networks
    Toxic Comment Classification Challenge (2018.3) | FastText, GloVe + GRU + LightGBM | A combination of word vector dictionaries learned from time series data
    Avito Demand Prediction Challenge (2018.6) | FastText + LSTM + 2D-CNN | Learn data and images of sentences simultaneously with neural networks
    Quora Insincere Questions Classification (2019.1) | GloVe, para + OOV Token + LSTM + 1D-CNN | Learn vocabularies through OOV token
    Jigsaw Unintended Bias in Toxicity Classification (2019.6) | BERT + XLNet + GPT2 | BERT models appear on Kaggle

Winner Solutions at a Glance

  • Data-Science-Competitions is a GitHub repository that presents Competition-winning solutions topic by topic (when I checked, the last commit was 11 months ago).
  • A winning solution is based on the technology of its time, so we need to check whether better technology is available today.
  • For most Competitions, solutions using the latest technology continue to be released on the Private Leaderboard page after the Competition ends.



Kaggle Dataset and API

Use public Dataset

  • When studying common algorithms, it is recommended to test performance with a widely known public Dataset. The UCI Machine Learning Repository is a famous source.
    Its datasets are also used in many academic papers.

Use it as a Data Repository

  • Alongside GitHub, you can use Kaggle as a convenient place to store Datasets and Notebooks (Free!)
  • It also has the advantage that a Dataset can be connected directly to a Notebook.
  • There is a capacity limit of up to 20GB per public Dataset and up to 20GB in total for all private Datasets.

Kaggle API

  • Kaggle API lets you use various Kaggle functions from a variety of development environments.
  • It is developed in Python 3 and is used by entering commands in a terminal.

Install Kaggle API


    1. First, install Kaggle API using pip install kaggle.
    2. Then go to your profile, click the button that looks like this: image, and press Account.
    3. image
       Click Create New API Token here to download the json file.
    4. Save the downloaded json file in your home directory as .kaggle/kaggle.json. Now you are ready to use Kaggle API.
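
Before running any command, it can help to confirm that the token file is where step 4 says it should be. The sketch below is a hypothetical helper, not part of the Kaggle API: it assumes the downloaded kaggle.json is a JSON object with `username` and `key` fields, and it is demonstrated on a temporary file with placeholder values rather than on a real token.

```python
import json
import tempfile
from pathlib import Path

# Location described in step 4 above.
DEFAULT_TOKEN_PATH = Path.home() / ".kaggle" / "kaggle.json"


def check_kaggle_token(token_path=DEFAULT_TOKEN_PATH):
    """Return a list of problems with the token file (empty if none)."""
    token_path = Path(token_path)
    if not token_path.is_file():
        return [f"missing file: {token_path}"]
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return ["file is not valid JSON"]
    if not isinstance(token, dict):
        return ["file does not contain a JSON object"]
    return [
        f"missing field: {field}"
        for field in ("username", "key")
        if not token.get(field)
    ]


# Demo on a temporary file with placeholder values (not a real token).
demo_dir = Path(tempfile.mkdtemp())
demo_token = demo_dir / "kaggle.json"
demo_token.write_text(
    json.dumps({"username": "your-username", "key": "your-api-key"})
)

print(check_kaggle_token(demo_token))
print(check_kaggle_token(demo_dir / "does-not-exist.json"))
```

Calling `check_kaggle_token()` with no argument checks the default location in your home directory. On Linux and macOS it is also worth restricting the file's permissions (for example with chmod 600) so that other users cannot read your key.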

Use Kaggle API

  • You can open a terminal on your PC and run commands.
  • Run the kaggle competitions list command to see which Competitions are currently in progress.
    Screenshot from 2021-01-06 22-15-25
  • To view a Competition's files, run kaggle competitions files COMPETITION_NAME; to download them, run kaggle competitions download COMPETITION_NAME.
  • To learn more about the Kaggle API, please visit Kaggle Public API Documentation.
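
If you prefer to drive these commands from Python, for example inside a setup script, one possible sketch is to build the same command lines and run them with `subprocess`. The helper names are hypothetical, `titanic` is used as an example competition name, and the `-p` option for the download directory is taken from the Kaggle CLI's help text; check `kaggle competitions download -h` on your installed version before relying on it.

```python
import shutil
import subprocess


def competition_files_command(competition_name):
    """Command that lists the files of a Competition."""
    return ["kaggle", "competitions", "files", competition_name]


def competition_download_command(competition_name, download_dir="."):
    """Command that downloads a Competition's files into download_dir."""
    return ["kaggle", "competitions", "download", competition_name,
            "-p", download_dir]


def run_if_cli_available(command):
    """Run the command if the kaggle executable is on PATH, else skip."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, capture_output=True, text=True)


# Only the command lines are printed here; pass one of them to
# run_if_cli_available() once the API token is installed.
print(" ".join(competition_files_command("titanic")))
print(" ".join(competition_download_command("titanic", "data")))
```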

Finished!

First of all, thank you for reading Hello Kaggle!
I studied Python for the first time in April 2020, but I could not fully concentrate on my studies because I started my military service in July of the same year.
That's why I couldn't study data science in depth, and I still need more knowledge to understand it.
Now I'm finally stepping into machine learning and Kaggle.
Writing Hello Kaggle! has improved my understanding of Kaggle, and I'm going to start with a Getting Started Competition.
I'm also eager to keep up with the latest technology by looking at other outstanding Kagglers' Notebooks.
Hopefully, everyone who reads Hello Kaggle! will have the best time in 2021. Let's Keep Going!