Fix encoding in GCS Integration #5447
@@ -140,7 +139,9 @@ def __build_df(self, key: str) -> 'pd.DataFrame':
        if '.parquet' in key:
            df = pd.read_parquet(buffer)
        elif '.csv' in key:
            df = pd.read_csv(buffer)
            with blob.open('rb') as f:
                encoding = chardet.detect(f.read())['encoding']
Not sure we need to read the whole file here. Maybe it's better to use the approach from "Example: Detecting encoding incrementally" in https://chardet.readthedocs.io/en/latest/usage.html#example-using-the-detect-function
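A minimal sketch of what the incremental approach could look like, assuming chardet's `UniversalDetector` (with its `feed()`, `done`, `close()`, and `result` members) and a blob object that exposes `open('rb')` as in the diff above. The helper names `detect_encoding_incrementally` and `detect_blob_encoding` and the 64 KiB chunk size are hypothetical, not part of this PR.

```python
from typing import BinaryIO, Optional


def detect_encoding_incrementally(
    stream: BinaryIO,
    detector,
    chunk_size: int = 64 * 1024,
) -> Optional[str]:
    """Feed `stream` to `detector` in chunks and stop once it is confident.

    `detector` is assumed to behave like chardet's UniversalDetector:
    feed(bytes), a boolean `done` attribute, close(), and a `result` dict.
    """
    # iter(callable, sentinel) keeps reading until read() returns b'' (EOF).
    for chunk in iter(lambda: stream.read(chunk_size), b''):
        detector.feed(chunk)
        if detector.done:
            # The detector has seen enough; skip the rest of the file.
            break
    detector.close()
    return detector.result.get('encoding')


def detect_blob_encoding(blob, chunk_size: int = 64 * 1024) -> Optional[str]:
    """Hypothetical wrapper for a GCS blob; imports chardet only when called."""
    from chardet.universaldetector import UniversalDetector

    with blob.open('rb') as f:
        return detect_encoding_incrementally(f, UniversalDetector(), chunk_size)
```

The detector is passed in as a parameter so the read loop can be exercised without network access. For large CSV files this avoids loading the entire object into memory just to guess the encoding, at the cost of a possibly less reliable guess when the distinguishing bytes appear late in the file.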
Hey @lytkarinskiy
# Description

Bump up version to 0.9.75

## What's Changed

### 🎉 Exciting New Features
* Airtable Destination by @TalaatHasanin in #5454
* [hw] Support Mage in Python 3.11 and 3.12 by @csharplus in #5393
* Add postgres client package by @jx2lee in #5486
* [enhancement] Update Elasticsearch + verify_certs connection option by @syepes in #5462
* [enhancement] adds support of _op_type for bulk operations of destinations by @syepes in #5482
* [enhancement] Elasticsearch support of _op_type for bulk operations by @syepes in #5471
* Add support to multiples webhook endpoints for Microsoft Teams notification service by @messerzen in #5508
* [hw] Update Redshift connector to enable merge load and correct row count by @csharplus in #5522

### 🐛 Bug Fixes
* [xh]Pass OpenAI API key to OpenAI Library by @matrixstone in #5430
* [jk] Fix error detail log parsing by @johnson-mage in #5443
* [jk] Render json objects in block outputs as string instead of nested table by @johnson-mage in #5450
* [jk] Block output table cell overflow by @johnson-mage in #5451
* [Bug] Updated SalesForce source to handle multiple date formats by @tolson17 in #5493
* [jk] Handle encoded page_block_layouts and block_outputs routes by @johnson-mage in #5544
* [jk] Revert name of test environment by @johnson-mage in #5478
* [hw] Correct the SSL settings in the nats configuration template by @csharplus in #5579
* Fix encoding in GCS Integration by @TalaatHasanin in #5447
* [enhancement] Elasticsearch - Align the doc publishing with the Standard (batch) by @syepes in #5510
* [Enhancement] io/base.py: add Excel support by @LucasGrugru in #5542
* [hw] Update deltalake to a recent version `0.20.2` by @csharplus in #5541
* bugfix/#5562 python data exporter failed insert 2d array to postgres by @sugimiyanto in #5563
* Fix unable to pass extra parameters to psycopg2 in Postgres connector by @kanenorman in #5449
* [hw] Avoid throwing exceptions when `block_type` is None by @csharplus in #5439
* Fix load data in GCS integration by @TalaatHasanin in #5467

### 💅 Enhancements & Polish
* [xh] Update python version to fix vulnerabilities by @matrixstone in #5523
* [jk] Display ID in block runs table by @johnson-mage in #5457
* [Oracle DB] Modify inefficient code that converts dataframe to list of tuples by @farmboy-dev in #5502
* [jk] Allow user to set number of lines displayed for block output sample preview by @johnson-mage in #5485
* [hw] Use `self.logger.debug` to replace `print` to make code clean by @csharplus in #5494
* Updated Facebook Ads SDK to 20.0.2 by @jonatansthlmstratlab in #5437

### Other Changes
* Update documentation for Git in Mage in the Getting Started section by @Lennardvb in #5459
* [hw] Use the built-in IntEnum instead of creating a new class by @csharplus in #5433
* [hw] Add time delay to avoid getting the same file timestamp for multiple changes by @csharplus in #5499
* Update text in documentation about enabling HTTPS in AWS by @Lennardvb in #5517
* Fix typos in compute-resource.mdx by @mvillaizan in #5519
* Update alerting-teams.mdx for multiples webhooks. by @messerzen in #5511
* Update README.md by @neubert-analytics in #5441
* Typo fix ai-client.mdx by @MageKai in #5582
* Add information regarding feature dependency to docs by @oscarlofwenhamn in #5584

## New Contributors
* @jonatansthlmstratlab made their first contribution in #5437
* @Lennardvb made their first contribution in #5459
* @syepes made their first contribution in #5471
* @jx2lee made their first contribution in #5486
* @mvillaizan made their first contribution in #5519
* @farmboy-dev made their first contribution in #5502
* @tolson17 made their first contribution in #5493
* @neubert-analytics made their first contribution in #5441
* @kanenorman made their first contribution in #5449
* @LucasGrugru made their first contribution in #5542
* @sugimiyanto made their first contribution in #5563
* @oscarlofwenhamn made their first contribution in #5584

**Full Changelog**: 0.9.74...0.9.75

# How Has This Been Tested?

- [x] Basic tests locally

# Checklist

- [x] The PR is tagged with proper labels (bug, enhancement, feature, documentation)
- [x] I have performed a self-review of my own code
- [ ] I have added unit tests that prove my fix is effective or that my feature works
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation

cc:
Description
This PR adds `chardet` (a Python library for detecting file encodings) so that the detected file encoding can be passed to the pandas `read_csv()` method.
Also added `chardet` and `pyairtable` to `requirements.txt`.
Fix [BUG] Google Cloud Storage UTF-8 error #5446
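For context, a stdlib-only sketch of the general failure mode that encoding detection guards against. It assumes the CSV in the linked issue was stored in a non-UTF-8 encoding, which this page does not confirm; the sample bytes and the `try_decode` helper are hypothetical.

```python
from typing import Optional

# Hypothetical CSV content: "café" encoded as Latin-1 rather than UTF-8.
raw = 'name,count\ncafé,1\n'.encode('latin-1')


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the bytes are invalid for `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The byte 0xE9 ("é" in Latin-1) is not valid UTF-8 at this position,
# so decoding with the default UTF-8 assumption fails.
as_utf8 = try_decode(raw, 'utf-8')
# Decoding with the encoding the file was actually written in succeeds.
as_latin1 = try_decode(raw, 'latin-1')
```

Passing an explicit `encoding` to `read_csv()` plays the same role as the second call here: the reader decodes with the encoding the file was written in instead of assuming UTF-8.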
Test with different file encodings
Checklist
cc:
@wangxiaoyou1993