feat: Add parquet upload #14449
@john-bodley @villebro just wanted to follow up on this in case it got lost in the shuffle. How should we proceed on this PR?
@exemplary-citizen sorry for dropping the ball on this - I'll have this reviewed within the next 24 hours
06f9602 to 4cc378a
/testenv up
superset/views/database/views.py
"iterator": True,
"keep_default_na": not form.null_values.data,
"mangle_dupe_cols": form.mangle_dupe_cols.data,
"usecols": form.usecols.data,
This appears to change the behavior of the existing CSV upload functionality by specifying columns. Can you add some tests around this?
Added a scenario to test_import_csv that tests uploading a CSV with specific columns.
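For context, here is a minimal sketch of what `usecols` does in pandas, assuming the form value is passed through to `pandas.read_csv` as the snippet above suggests. The sample data and column names are made up for illustration and are not taken from the Superset test suite.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for an uploaded file.
csv_data = "name,age,city\nalice,30,Paris\nbob,25,Berlin\n"

# With usecols=None (the pandas default), every column is read.
all_columns = pd.read_csv(io.StringIO(csv_data), usecols=None)

# With a list of column names, only those columns are read.
subset = pd.read_csv(io.StringIO(csv_data), usecols=["name", "city"])

print(list(all_columns.columns))
print(list(subset.columns))
```

An empty form field that maps to `None` would therefore leave the existing CSV behavior unchanged, which is the case the reviewer asked to have covered by a test.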
"If not None, only these columns will be read from the file."
),
validators=[Optional()],
)
Can you provide a screenshot of the updated form UI?
Added a screenshot to the summary above
@robdiciuccio Ephemeral environment spinning up at http://34.214.127.48:8080. Credentials are
A few comments:
- While it's convenient to add this functionality to the CSV upload form, I feel this should be in a form of its own, as the majority of the current fields are specific to CSV only (IIUC only usecols is needed for Parquet upload).
- If the CSV upload form will also handle Parquet, the title needs to be updated to reflect this. However, I'd personally prefer moving this into a form of its own.
- It would be nice if the form could handle directories/zip files, as it's fairly common to have partitioned data that is split up into multiple Parquet files. As pandas also supports reading from a directory path, this would be a great feature to avoid having to manually upload and append each file.
superset/views/database/forms.py
config["ALLOWED_EXTENSIONS"].intersection(config["CSV_EXTENSIONS"]),
config["ALLOWED_EXTENSIONS"].intersection(
config["CSV_EXTENSIONS"].union(config["OTHER_EXTENSIONS"])
),
You need to add the union to the error message below.
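To illustrate the point, here is a minimal sketch of keeping the validator and its error message in sync by computing the extension set once. The config values, the helper names `allowed_upload_extensions` and `error_message`, and the message wording are all hypothetical; they are not Superset's actual defaults or strings.

```python
# Hypothetical config values; the real Superset defaults may differ.
config = {
    "CSV_EXTENSIONS": {"csv", "tsv"},
    "OTHER_EXTENSIONS": {"parquet"},
    "ALLOWED_EXTENSIONS": {"csv", "parquet", "xlsx"},
}


def allowed_upload_extensions(config):
    """Extensions that are globally allowed and also handled by this form."""
    form_extensions = config["CSV_EXTENSIONS"].union(config["OTHER_EXTENSIONS"])
    return config["ALLOWED_EXTENSIONS"].intersection(form_extensions)


def error_message(config):
    """Build the message from the same set the validator uses."""
    extensions = ", ".join(sorted(allowed_upload_extensions(config)))
    return f"Only the following file extensions are allowed: {extensions}"


print(allowed_upload_extensions(config))
print(error_message(config))
```

Deriving both from one helper avoids the mismatch flagged here, where the validator accepts the union but the message still lists only the CSV extensions.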
Yeah, I agree that this probably belongs in a separate form. Just went in this direction because creating a new form would mean that we'd effectively be abandoning #13834. I'll go ahead and get started working on a new form for
@villebro can you restart CI?
360e17e to 5b5eb74
@villebro should come back green now
@exemplary-citizen there's a linting error in one of the files. You can either set up pre-commit hooks or apply the diff below to fix the problem:
diff --git a/superset/views/database/views.py b/superset/views/database/views.py
index 8d0a92f6c..3863b165c 100644
--- a/superset/views/database/views.py
+++ b/superset/views/database/views.py
@@ -406,17 +406,23 @@ class ColumnarToDatabaseView(SimpleFormView):
     def form_get(self, form: ColumnarToDatabaseForm) -> None:
         form.if_exists.data = "fail"
-    def form_post(self, form: ColumnarToDatabaseForm) -> Response:  # pylint: disable=too-many-locals
+    def form_post(
+        self, form: ColumnarToDatabaseForm
+    ) -> Response:  # pylint: disable=too-many-locals
         database = form.con.data
         columnar_table = Table(table=form.name.data, schema=form.schema.data)
         files = form.columnar_file.data
         file_type = {file.filename.split(".")[-1] for file in files}
         if file_type == {"zip"}:
-            zipfile_ob = zipfile.ZipFile(form.columnar_file.data[0])  # pylint: disable=consider-using-with
+            zipfile_ob = zipfile.ZipFile(
+                form.columnar_file.data[0]
+            )  # pylint: disable=consider-using-with
             file_type = {filename.split(".")[-1] for filename in zipfile_ob.namelist()}
             files = [
-                io.BytesIO((zipfile_ob.open(filename).read(), filename)[0])  # pylint: disable=consider-using-with
+                io.BytesIO(
+                    (zipfile_ob.open(filename).read(), filename)[0]
+                )  # pylint: disable=consider-using-with
                 for filename in zipfile_ob.namelist()
             ]
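The zip branch in the diff above can be illustrated in isolation. The following is a minimal standard-library sketch, not the code that was merged: `extract_members` is a hypothetical helper, it uses a context manager instead of the pylint disable, and it skips directory entries, which the diff does not do. The member contents are placeholder bytes rather than real Parquet data.

```python
import io
import zipfile


def extract_members(archive):
    """Return (extensions, buffers) for the files inside a zip archive."""
    # The context manager closes the archive once every member is read.
    with zipfile.ZipFile(archive) as zf:
        # Assumption: directory entries (names ending in "/") are skipped.
        names = [name for name in zf.namelist() if not name.endswith("/")]
        extensions = {name.split(".")[-1] for name in names}
        buffers = [io.BytesIO(zf.read(name)) for name in names]
    return extensions, buffers


# Build a small in-memory archive with two placeholder members.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("part-0.parquet", b"first")
    zf.writestr("part-1.parquet", b"second")
archive.seek(0)

extensions, buffers = extract_members(archive)
print(extensions, len(buffers))
```

As in the diff, the set of extensions found inside the archive is what a caller would check to confirm that every member has the same, supported file type before reading the buffers.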
@villebro took care of the code formatting
@exemplary-citizen sorry to bother you again, but we've recently updated the version of
LGTM! While testing I found some edge cases that caused trouble, but those can be improved upon later (I'll try to open up a PR for some of them; will tag you for a review when I do).
Ephemeral environment shutdown and build artifacts deleted.
That sounds great @exemplary-citizen! Oh, and I forgot; thanks so much for your patience with the review process!
* allow csv upload to accept parquet file
* fix mypy
* fix if statement
* add test for specificying columns in CSV upload
* clean up test
* change order in test
* fix failures
* upload parquet to seperate table in test
* fix error message
* fix mypy again
* rename other extensions to columnar
* add new form for columnar upload
* add support for zip files
* undo csv form changes except usecols
* add more tests for zip
* isort & black
* pylint
* fix trailing space
* address more review comments
* pylint
* black
* resolve remaining issues
Hi @villebro and @exemplary-citizen, I am using Superset v2.1.0 with Docker Compose and I couldn't upload a Parquet file to Superset. Has this feature been deprecated in the new version?
I am sorry for the question above, I see the Is there a way to programmatically import Parquet files to the Superset db? Thanks
SUMMARY
Allow the CSV upload form to accept Parquet files. Went in this direction so as not to exacerbate what was brought up in #13834 by adding a new form specifically for Parquet files. I believe small modifications can be made to this PR to accommodate feather and orc files.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TEST PLAN
Added a test similar to the ones already in csv_upload_tests.py
ADDITIONAL INFORMATION
Fixes #14020