S3 enhanced (#70)

* refactor: s3 submodule methods modification * add: notebook to show how to interact with aws s3
pyronear · Oct 9, 2023 · b3c805e · b3c805e
1 parent 4079545
commit b3c805e
Show file tree

Hide file tree

Showing 2 changed files with 359 additions and 10 deletions.
diff --git a/pyro_risks/notebooks/s3_tutorial.ipynb b/pyro_risks/notebooks/s3_tutorial.ipynb
@@ -0,0 +1,262 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# S3 submodule tutorial"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Welcome to the S3 submodule tutorial. This tutorial will walk you through the basics of using the S3 submodule to interact with the risk S3 buckets."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Introduction"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Our data are stored in aws s3 buckets, which need to be accessed either using the aws cli or the boto3 python library. The S3 submodule provides a simple interface to the s3 buckets, allowing you to easily interact with the buckets using only python."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "First, one needs to add temporary the directory containing the submodule to the path. This can be done by adding the following lines to the beginning of the notebook:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "\n",
+    "sys.path.append(\"../utils\")\n",
+    "from s3 import S3Bucket, read_credentials"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then, we need to load your credentials. These should NEVER be directly written in your code.\n",
+    "Depending on your OS or the way you saved the credentials for the aws cli, the file \"credentials\" should be located in one of the following directories of the next cell. \n",
+    "\n",
+    "Please change the path accordingly if you are not using a linux machine or if you saved the credentials in a different directory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if sys.platform == \"win32\":  # Windows\n",
+    "    credentials_path = r\"C:\\Users\\Jules\\.aws\\\\\"  # Change Jules to your username\n",
+    "else:  # Linux or MacOS\n",
+    "    credentials_path = r\"~/.aws/\"\n",
+    "\n",
+    "credentials = read_credentials(credentials_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then, we can create a S3Bucket object and use it to upload and download files from S3 for instance.\n",
+    "\n",
+    "By default, the bucket's name is \"risk\". "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bucket = S3Bucket(bucket_name=\"risk\", **credentials)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## s3 submodule usage"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This section is not exhaustive, but it should give you a good idea of how to use the S3Bucket class. For more information, please refer directly to the docstrings."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you want to list folders in the bucket"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Example 1 : ['Old/', 'datacube/', 'datasets_grl/']\n",
+      "Example 2 : ['Old/ERA5Land/', 'Old/FWI/', 'Old/Historiques_feux/', 'Old/MODIS_LST/', 'Old/MODIS_NDVI/', 'Old/Old/', 'Old/Roads_distance/', 'Old/SAMPLE/', 'Old/SMI/', 'Old/Waterway_distance/', 'Old/Yearly_population/', 'Old/copernicus_elevation_and_slope/']\n"
+     ]
+    }
+   ],
+   "source": [
+    "_ = bucket.list_folders(prefix=\"\", delimiter=\"/\")\n",
+    "print(\"Example 1 :\", _)\n",
+    "_ = bucket.list_folders(prefix=_[0], delimiter=\"/\")\n",
+    "print(\"Example 2 :\", _)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Knowing the folder structure of the bucket, we can now list the files in a folder. \n",
+    "\n",
+    "For example, we can list the files in the folder \"Old/\", with .csv or .txt in their name."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "50\n"
+     ]
+    }
+   ],
+   "source": [
+    "_ = bucket.list_files(\n",
+    "    prefix=\"Old/\", patterns=[\".csv\", \".txt\"]\n",
+    ")  # patterns can also be a part of the filename\n",
+    "print(len(_))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Following the same procedure, we can list files' metadata such as their size, last modified date. Or get all the metadata only file by file by using prefix for the whole file path."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Example 1 : {'file_name': 'Old/ERA5Land/2022.csv', 'file_size': 611202.47, 'file_last_modified': datetime.datetime(2023, 9, 10, 17, 21, 47, tzinfo=tzutc())}\n",
+      "Example 2 : [{'file_name': 'Old/ERA5Land/2022.csv', 'file_size': 611202.47, 'file_last_modified': datetime.datetime(2023, 9, 10, 17, 21, 47, tzinfo=tzutc())}]\n"
+     ]
+    }
+   ],
+   "source": [
+    "_ = bucket.get_files_metadata(prefix=\"Old/\", patterns=[\".csv\", \".txt\"])\n",
+    "print(\"Example 1 :\", _[0])\n",
+    "_ = bucket.get_files_metadata(prefix=\"Old/ERA5Land/2022.csv\")\n",
+    "print(\"Example 2 :\", _)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One can also donwload a whole folder from the bucket. The prefix will be used to filter the files to download. The whole architecture of the folder and its parent will be recreated locally. \n",
+    "\n",
+    "The ``save_path`` optional argument will be used if you need to save the folder in a specific location. Otherwise, the folder will be saved in the current working directory. Let's say if you need to have multiple versions of the same folder, you can use the ``save_path`` argument to save them in different locations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bucket.download_folder(prefix=\"Old/ERA5Land/\", save_path=\"Version_1/\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Using only the path of a certain file and the name you want to give it, you can download a file with : "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bucket.download_file(\"Old/ERA5Land/2022.csv\", \"2022.csv\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If needed, you need the key of the file to delete it. \n",
+    "By the same way, you can upload a file by specifying its local file path and the key you want to assign. Keep in mind that the key is the path of the file in the bucket and the name you want to give it.\n",
+    "\n",
+    "The functions are ``upload_file`` and ``delete_file``."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "pyronear",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}