Skip to content

Dataset Metadata

Philippe Modard edited this page Jun 30, 2023 · 3 revisions

The Kaggle API follows the Data Package specification for specifying metadata when creating new Datasets and Dataset versions. Next to your files, you have to put a special dataset-metadata.json file in your upload folder alongside the files for each new Dataset (version).

Here's a basic example for dataset-metadata.json:

{
  "title": "My Awesome Dataset", 
  "id": "timoboz/my-awesome-dataset", 
  "licenses": [{"name": "CC0-1.0"}]
}

You can also use the API command kaggle datasets init -p /path/to/dataset to have the API create this file for you.

Here's an example containing file metadata:

{
  "title": "My Awesome Dataset", 
  "subtitle": "My awesomer subtitle",
  "description": "My awesomest description",
  "id": "timoboz/my-awesome-dataset", 
  "id_no": 12345,
  "licenses": [{"name": "CC0-1.0"}],
  "resources": [
    {
      "path": "my-awesome-data.csv",
      "description": "This is my awesome data!",
      "schema": {
        "fields": [
          {
            "name": "StringField",
            "description": "String field description",
            "type": "string"
          },
          {
            "name": "NumberField",
            "description": "Number field description",
            "type": "number"
          },
          {
            "name": "DateTimeField",
            "description": "Date time field description",
            "type": "datetime"
          }
        ]
      }
    },
    {
      "path": "my-awesome-extra-file.txt",
      "description": "This is my awesome extra file!"
    }
  ],
  "keywords": [
    "beginner",
    "tutorial"
  ]
}

Contents

The following metadata is currently supported:

  • kaggle datasets create (create a new Dataset):
    • title: Title of the dataset, must be between 6 and 50 characters in length.
    • subtitle: Subtitle of the dataset, must be between 20 and 80 characters in length.
    • description: Description of the dataset.
    • id: The URL slug of your new dataset, a combination of:
      1. Your username or organization slug (if you are a member of an organization).
      2. A unique Dataset slug, must be between 3 and 50 characters in length.
    • licenses: Must have exactly one entry that specifies the license. Only name is evaluated, all other information is ignored. See below for options.
    • resources: Contains an array of files that are being uploaded. (Note - this is not required, nor if included, does it need to include all of the files to be uploaded.):
      • path: File path.
      • description: File description.
      • schema: File schema (definition below):
        • fields: Array of fields in the dataset. Please note that this needs to include ALL of the fields in the data in order or they will not be matched up correctly. A later version of the API will fix this bug.
          • name: Field name
          • title: Field description
          • type: Field type. A best-effort list of types will be kept at the bottom of this wiki page, but new types may be added that are not documented here.
    • keywords: Contains an array of strings that correspond to an existing tag on Kaggle. If a specified tag doesn't exist, the upload will continue, but that specific tag won't be added.
  • kaggle datasets version (create a new version for an existing Dataset):
    • subtitle: Subtitle of the dataset, must be between 20 and 80 characters in length.
    • description: Description of the dataset.
    • id: The URL slug of the dataset you want to update (see above). You must be the owner or otherwise have edit rights for this dataset. One of id or id_no must be specified. If both are, id_no will be preferred.
    • id_no: The ID of the dataset. One of id or id_no must be specified. You must be the owner or otherwise have edit rights for this dataset. If both are, id_no will be preferred.
    • resources: Contains an array of files that are being uploaded. (Note - this is not required, nor if included, does it need to include all of the files to be uploaded.):
      • path: File path.
      • description: File description.
      • schema: File schema (definition below):
        • fields: Array of fields in the dataset. Please note that this needs to include ALL of the fields in the data in order or they will not be matched up correctly. A later version of the API will fix this bug.
          • name: Field name
          • title: Field description
          • type: Field type. A best-effort list of types will be kept at the bottom of this wiki page, but new types may be added that are not documented here.
    • keywords: Contains an array of strings that correspond to an existing tag on Kaggle. If a specified tag doesn't exist, the upload will continue, but that specific tag won't be added.

We will add further metadata processing in upcoming versions of the API.

Licenses

You can specify the following licenses for your datasets:

Data types

You can specify the following data types

  • string
  • boolean
  • numeric
  • datetime
  • id
  • uuid
  • latitude
  • longitude
  • coordinates
  • country
  • province (these are states in the US)
  • postalcode
  • address
  • email
  • url
  • integer
  • decimal
  • city
Clone this wiki locally