Append option for uploads (#651)
- Ability to append an upload to a previously uploaded file/sqlite DB as a new table
- Good cache busting and detection of file changes on uploads
- Separate the upload UI from the 'add connection' UI, as they are materially different
- Fix a small bug with bar chart generation, when values are null
- Ability to refresh a connection's schema and data (if it's an upload) from the connections list view
chrisclark authored Aug 5, 2024
1 parent 3546dbf commit 841549d
Showing 25 changed files with 587 additions and 158 deletions.
44 changes: 43 additions & 1 deletion docs/features.rst
@@ -169,7 +169,49 @@ Multiple Connections
multi-connection setup.
- SQL Explorer also supports user-provided connections in the form
of standard database connection details, or uploading CSV, JSON or SQLite
files. See the 'User uploads' section of :doc:`settings`.
files.

File Uploads
------------

Upload CSV or JSON files, or SQLite databases to immediately create connections for querying.

The base name of the file and the ID of the uploading user are used to form the database name, to prevent collisions when
multiple users upload files with the same name. The base name of the file is also used as the table name (e.g. user 1
uploading customers.csv results in a database file named customers_1.db, with a table named 'customers').
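
A minimal sketch of the naming scheme, simplified from the ``get_names`` helper added in this commit (the real helper
also handles the append case):

.. code-block:: python

    import os

    def get_names(filename, user_id):
        # e.g. "customers.csv" uploaded by user 1 -> table "customers", database "customers_1.db"
        table_name, _ = os.path.splitext(os.path.basename(filename))
        return table_name, f"{table_name}_{user_id}.db"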

You can also append uploaded files to previously uploaded data sources. For example, if you had a 'customers.csv' file
and an 'orders.csv' file, you could upload customers.csv to create a new data source. You could then upload orders.csv
with the 'Append' drop-down set to the newly-created customers database, and the resulting SQLite connection would have
both tables available to query together. If you were to upload a new 'orders.csv' and append it to customers again, the
'orders' table would be *fully replaced* by the new file.
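
As an illustration, a hedged sketch of inspecting the combined database directly with the standard library (the
database file name follows the naming scheme above and is an assumption, not something the docs guarantee):

.. code-block:: python

    import sqlite3

    # Assumes customers.csv was uploaded and orders.csv appended to it, as described above.
    conn = sqlite3.connect("customers_1.db")
    tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    print(tables)  # expected to include ('customers',) and ('orders',)
    conn.close()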

**How it works**

1. Your file is uploaded to the web server. For CSV files, the first row is assumed to be a header.
2. It is read into a Pandas dataframe. At this stage, many fields that are in fact numeric or datetimes end up as strings.
3. If it is a JSON file, the JSON is also 'normalized' during this step; e.g. nested objects are flattened.
4. A custom parser runs type detection on each column to get richer type information.
5. The dataframe is coerced to these more accurate types.
6. The dataframe is written to a SQLite file, which is kept on the server and also uploaded to S3.
7. The SQLite database is added as a new connection to SQL Explorer and is available for querying, just like any
other data source.
8. If the SQLite file is not available locally, it will be pulled from S3 on demand.
9. Local SQLite files are periodically cleaned up by a recurring task after (by default) 7 days of inactivity.

Note that if the upload is a SQLite database, steps 2-5 are skipped and the database is simply uploaded to S3 and made
available for querying. A condensed sketch of steps 2-6 appears below.
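
A simplified sketch of steps 2-6, assuming a CSV upload (pandas' built-in ``convert_dtypes`` stands in for the custom
type-detection parser; the real implementation lives in ``explorer/ee/db_connections/``):

.. code-block:: python

    import sqlite3
    import pandas as pd

    def csv_to_sqlite_sketch(csv_path, table_name, db_path):
        # Step 2: read the upload into a DataFrame (first row is assumed to be the header)
        df = pd.read_csv(csv_path)
        # Steps 4-5: coerce string columns to richer types where possible
        df = df.convert_dtypes()
        # Step 6: write the DataFrame to a local SQLite file, replacing the table if it already exists
        conn = sqlite3.connect(db_path)
        try:
            df.to_sql(table_name, conn, if_exists="replace", index=False)
        finally:
            conn.close()
        # The resulting .db file is then uploaded to S3 and registered as a connection (step 7).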

**File formats**

- Supports well-formed .csv and .json files. Also supports .json files where each line of the file is a separate JSON
  object. See /explorer/tests/json/ in the source for examples of what is supported.
- Supports SQLite files with a .db or .sqlite extension. The validity of the SQLite file is not fully checked until
a query is attempted.
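
For illustration, the line-delimited JSON form can be read with pandas roughly like this (a sketch only; the project's
own parser may behave differently):

.. code-block:: python

    import io
    import pandas as pd

    # Each line of the file is a separate JSON object ("JSON lines" style)
    raw = b'{"id": 1, "name": "Ada"}\n{"id": 2, "name": "Grace"}\n'
    df = pd.read_json(io.BytesIO(raw), lines=True)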

**Configuration**

- See the 'User uploads' section of :doc:`settings` for configuration details.

Power tips
----------
4 changes: 2 additions & 2 deletions docs/settings.rst
@@ -383,7 +383,7 @@ User Uploads
With `EXPLORER_DB_CONNECTIONS_ENABLED` set to `True`, you can also set `EXPLORER_USER_UPLOADS_ENABLED` to allow users
to upload their own CSV and SQLite files directly to explorer as new connections.

Go to connections->Add New and scroll down to see the upload interface. The uploaded files are limited in size by the
Go to connections->Upload File. The uploaded files are limited in size by the
`EXPLORER_MAX_UPLOAD_SIZE` setting which is set to 500mb by default (500 * 1024 * 1024). SQLite files (in either .db or
.sqlite) will simple appear as connections. CSV files get run through a parser that infers the type of each field.
.sqlite) will simply appear as connections. CSV files get run through a parser that infers the type of each field.
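
A minimal example of the relevant settings, using only the setting names mentioned in this section (values shown are
the documented defaults or illustrative):

.. code-block:: python

    # settings.py (sketch)
    EXPLORER_DB_CONNECTIONS_ENABLED = True
    EXPLORER_USER_UPLOADS_ENABLED = True
    EXPLORER_MAX_UPLOAD_SIZE = 500 * 1024 * 1024  # 500mb, the documented default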

2 changes: 1 addition & 1 deletion explorer/charts.py
@@ -35,7 +35,7 @@ def get_chart(result: QueryResult, chart_type: str, num_rows: int) -> Optional[s
bar_positions = []
for idx, col_num in enumerate(numeric_columns):
if chart_type == "bar":
values = [row[col_num] for row in data]
values = [row[col_num] if row[col_num] is not None else 0 for row in data]
bar_container = ax.bar([x + idx * BAR_WIDTH
for x in range(len(labels))], values, BAR_WIDTH, label=result.headers[col_num])
bars.append(bar_container)
41 changes: 30 additions & 11 deletions explorer/ee/db_connections/create_sqlite.py
@@ -1,24 +1,43 @@
import os
from io import BytesIO

from explorer.utils import secure_filename
from explorer.ee.db_connections.type_infer import get_parser
from explorer.ee.db_connections.utils import pandas_to_sqlite
from explorer.ee.db_connections.utils import pandas_to_sqlite, uploaded_db_local_path


def parse_to_sqlite(file) -> (BytesIO, str):
f_name = file.name
f_bytes = file.read()
def get_names(file, append_conn=None, user_id=None):
s_filename = secure_filename(file.name)
table_name, _ = os.path.splitext(s_filename)

# f_name represents the filename of both the sqlite DB on S3, and on the local filesystem.
# If we are appending to an existing data source, then we re-use the same name.
# New connections get a new database name.
if append_conn:
f_name = os.path.basename(append_conn.name)
else:
f_name = f"{table_name}_{user_id}.db"

return table_name, f_name


def parse_to_sqlite(file, append_conn=None, user_id=None) -> (BytesIO, str):

table_name, f_name = get_names(file, append_conn, user_id)

# When appending, make sure the database exists locally so that we can write to it
if append_conn:
append_conn.download_sqlite_if_needed()

df_parser = get_parser(file)
if df_parser:
df = df_parser(f_bytes)
try:
f_bytes = pandas_to_sqlite(df, local_path=f"{f_name}_tmp_local.db")
df = df_parser(file.read())
local_path = uploaded_db_local_path(f_name)
f_bytes = pandas_to_sqlite(df, table_name, local_path)
except Exception as e: # noqa
raise ValueError(f"Error while parsing {f_name}: {e}") from e
# replace the previous extension with .db, as it is now a sqlite file
name, _ = os.path.splitext(f_name)
f_name = f"{name}.db"
else:
return BytesIO(f_bytes), f_name # if it's a SQLite file already, simply cough it up as a BytesIO object
# If it's a SQLite file already, simply cough it up as a BytesIO object
return BytesIO(file.read()), f_name
return f_bytes, f_name

2 changes: 1 addition & 1 deletion explorer/ee/db_connections/mime.py
@@ -42,7 +42,7 @@ def is_json_list(file):


def is_sqlite(file):
if file.content_type != "application/x-sqlite3":
if file.content_type not in ["application/x-sqlite3", "application/octet-stream"]:
return False
try:
# Check if the file starts with the SQLite file header
31 changes: 28 additions & 3 deletions explorer/ee/db_connections/models.py
@@ -1,11 +1,10 @@
import os

from django.conf import settings
from django.core.exceptions import ValidationError
from django.db import models
from django.db.models.signals import pre_save
from django.dispatch import receiver
from explorer.ee.db_connections.utils import user_dbs_local_dir
from explorer.ee.db_connections.utils import uploaded_db_local_path, quick_hash

from django_cryptography.fields import encrypt

@@ -33,18 +32,44 @@ class DatabaseConnection(models.Model):
host = encrypt(models.CharField(max_length=255, blank=True))
port = models.CharField(max_length=255, blank=True)
extras = models.JSONField(blank=True, null=True)
upload_fingerprint = models.CharField(max_length=255, blank=True, null=True)

def __str__(self):
return f"{self.name} ({self.alias})"

def update_fingerprint(self):
self.upload_fingerprint = self.local_fingerprint()
self.save()

def local_fingerprint(self):
if os.path.exists(self.local_name):
return quick_hash(self.local_name)

def _download_sqlite(self):
from explorer.utils import get_s3_bucket
s3 = get_s3_bucket()
s3.download_file(self.host, self.local_name)

def download_sqlite_if_needed(self):
download = not os.path.exists(self.local_name) or self.local_fingerprint() != self.upload_fingerprint

if download:
self._download_sqlite()
self.update_fingerprint()


@property
def is_upload(self):
return self.engine == self.SQLITE and self.host

@property
def local_name(self):
if self.is_upload:
return os.path.join(user_dbs_local_dir(), self.name)
return uploaded_db_local_path(self.name)

def delete_local_sqlite(self):
if self.is_upload and os.path.exists(self.local_name):
os.remove(self.local_name)

@classmethod
def from_django_connection(cls, connection_alias):
81 changes: 50 additions & 31 deletions explorer/ee/db_connections/utils.py
@@ -2,7 +2,7 @@
from django.db.utils import load_backend
import os
import json

import hashlib
import sqlite3
import io

@@ -21,29 +21,23 @@ def upload_sqlite(db_bytes, path):
# to this new database connection. Oops!
# TODO: In the future, queries should probably be FK'ed to the ID of the connection, rather than simply
# storing the alias of the connection as a string.
def create_connection_for_uploaded_sqlite(filename, user_id, s3_path):
def create_connection_for_uploaded_sqlite(filename, s3_path):
from explorer.models import DatabaseConnection
base, ext = os.path.splitext(filename)
filename = f"{base}_{user_id}{ext}"
return DatabaseConnection.objects.create(
alias=f"{filename}",
alias=filename,
engine=DatabaseConnection.SQLITE,
name=filename,
host=s3_path
host=s3_path,
)


def get_sqlite_for_connection(explorer_connection):
from explorer.utils import get_s3_bucket

# Get the database from s3, then modify the connection to work with the downloaded file.
# E.g. "host" should not be set, and we need to get the full path to the file
local_name = explorer_connection.local_name
if not os.path.exists(local_name):
s3 = get_s3_bucket()
s3.download_file(explorer_connection.host, local_name)
explorer_connection.download_sqlite_if_needed()
# Note the order here is important; .local_name checks "is_upload", which relies on .host being set
explorer_connection.name = explorer_connection.local_name
explorer_connection.host = None
explorer_connection.name = local_name
return explorer_connection


@@ -54,6 +48,10 @@ def user_dbs_local_dir():
return d


def uploaded_db_local_path(name):
return os.path.join(user_dbs_local_dir(), name)


def create_django_style_connection(explorer_connection):

if explorer_connection.is_upload:
@@ -87,24 +85,45 @@ def create_django_style_connection(explorer_connection):
raise DatabaseError(f"Failed to create explorer connection: {e}") from e


def pandas_to_sqlite(df, local_path="local_database.db"):
# Write the DataFrame to a local SQLite database
# In theory, it would be nice to write the dataframe to an in-memory SQLite DB, and then dump the bytes from that
# but there is no way to get to the underlying bytes from an in-memory SQLite DB
con = sqlite3.connect(local_path)
try:
df.to_sql(name="data", con=con, if_exists="replace", index=False)
finally:
con.close()
def sqlite_to_bytesio(local_path):
# Write the file to disk. It'll be uploaded to s3, and left here locally for querying
db_file = io.BytesIO()
with open(local_path, "rb") as f:
db_file.write(f.read())
db_file.seek(0)
return db_file


def pandas_to_sqlite(df, table_name, local_path):
# Write the DataFrame to a local SQLite database and return it as a BytesIO object.
# This intentionally leaves the sqlite db on the local disk so that it is ready to go for
# querying immediately after the connection has been created. Removing it would also be OK, since
# the system knows to re-download it if it's not available, but this saves an extra download from S3.
conn = sqlite3.connect(local_path)

# Read the local SQLite database file into a BytesIO buffer
try:
db_file = io.BytesIO()
with open(local_path, "rb") as f:
db_file.write(f.read())
db_file.seek(0)
return db_file
df.to_sql(table_name, conn, if_exists="replace", index=False)
finally:
# Delete the local SQLite database file
# Finally block to ensure we don't litter files around
os.remove(local_path)
conn.commit()
conn.close()

return sqlite_to_bytesio(local_path)


def quick_hash(file_path, num_samples=10, sample_size=1024):
hasher = hashlib.sha256()
file_size = os.path.getsize(file_path)

if file_size == 0:
return hasher.hexdigest()

sample_interval = file_size // num_samples
with open(file_path, "rb") as f:
for i in range(num_samples):
f.seek(i * sample_interval)
sample_data = f.read(sample_size)
if not sample_data:
break
hasher.update(sample_data)

return hasher.hexdigest()