Fix encoding in GCS Integration #5447

Merged: 3 commits, Oct 8, 2024

TalaatHasanin
Contributor

Description

  • This PR adds chardet (a Python library for detecting file encodings) so that the detected file encoding can be passed to the pandas read_csv() method

  • Also adds chardet and pyairtable to requirements.txt
    Fixes [BUG] Google Cloud Storage UTF-8 error #5446

  • Test with different file encodings

Checklist

  • The PR is tagged with proper labels (bug, enhancement, feature, documentation)
  • I have performed a self-review of my own code

cc:
@wangxiaoyou1993

@@ -140,7 +139,9 @@ def __build_df(self, key: str) -> 'pd.DataFrame':
         if '.parquet' in key:
             df = pd.read_parquet(buffer)
         elif '.csv' in key:
-            df = pd.read_csv(buffer)
+            with blob.open('rb') as f:
+                encoding = chardet.detect(f.read())['encoding']

Not sure we need to read the whole file here. Maybe it's better to use the approach from "Example: Detecting encoding incrementally" in https://chardet.readthedocs.io/en/latest/usage.html#example-using-the-detect-function
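
For reference, a minimal sketch of the incremental approach mentioned above. It is not the code that was merged. It assumes chardet's `UniversalDetector` (with its `feed()`, `done`, `close()` and `result` members); the helper name `detect_encoding_incrementally`, the `detector` parameter, and the 64 KiB chunk size are hypothetical choices made for illustration.

```python
from typing import BinaryIO, Optional


def detect_encoding_incrementally(
    stream: BinaryIO,
    detector=None,
    chunk_size: int = 64 * 1024,  # hypothetical chunk size, tune as needed
) -> Optional[str]:
    """Feed a binary stream to a chardet-style detector chunk by chunk.

    Reading stops as soon as the detector reports it is confident,
    so large files do not have to be read in full.
    """
    if detector is None:
        # Imported lazily so a stand-in detector can be passed in tests.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()

    # iter(callable, sentinel) keeps calling read() until it returns b''.
    for chunk in iter(lambda: stream.read(chunk_size), b''):
        detector.feed(chunk)
        if detector.done:
            break

    # close() finalizes the guess; result may hold None if nothing was detected.
    detector.close()
    return detector.result['encoding']
```

In the hunk above this could be called as `encoding = detect_encoding_incrementally(f)` inside the `with blob.open('rb') as f:` block, assuming the file object returned by `blob.open('rb')` supports `read(size)`.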

@TalaatHasanin
Contributor Author

Hey @lytkarinskiy
I used charset_normalizer instead of chardet because it is more lightweight.
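
A rough sketch of what detection with charset_normalizer can look like, for readers comparing the two libraries. This is not the merged code: it assumes charset_normalizer's `from_bytes(...).best()` call, and the helper name `sniff_encoding`, the injectable `detect` parameter, and the `utf-8` fallback are hypothetical.

```python
from typing import Callable, Optional


def sniff_encoding(
    sample: bytes,
    detect: Optional[Callable[[bytes], Optional[str]]] = None,
    default: str = 'utf-8',  # hypothetical fallback when nothing is detected
) -> str:
    """Guess the encoding of a byte sample, falling back to `default`."""
    if detect is None:
        # Imported lazily so a stand-in detector can be passed in tests.
        from charset_normalizer import from_bytes

        def detect(data: bytes) -> Optional[str]:
            # best() returns the most probable match, or None if there is none.
            best = from_bytes(data).best()
            return best.encoding if best is not None else None

    return detect(sample) or default
```

The result could then be passed along as `pd.read_csv(buffer, encoding=sniff_encoding(sample))`, where `sample` stands for bytes read from the blob. Detecting on a bounded sample rather than the whole file trades some accuracy for less I/O.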

@wangxiaoyou1993 merged commit 24324d9 into mage-ai:master on Oct 8, 2024
9 checks passed
wangxiaoyou1993 added a commit that referenced this pull request Dec 4, 2024
# Description
Bump up version to 0.9.75


## What's Changed
### 🎉 Exciting New Features
* Airtable Destination by @TalaatHasanin in
#5454
* [hw] Support Mage in Python 3.11 and 3.12 by @csharplus in
#5393
* Add postgres client package by @jx2lee in
#5486
* [enhancement] Update Elasticsearch + verify_certs connection option by
@syepes in #5462
* [enhancement] adds support of _op_type for bulk operations of
destinations by @syepes in #5482
* [enhancement] Elasticsearch support of _op_type for bulk operations by
@syepes in #5471
* Add support to multiples webhook endpoints for Microsoft Teams
notification service by @messerzen in
#5508
* [hw] Update Redshift connector to enable merge load and correct row
count by @csharplus in #5522

### 🐛 Bug Fixes
* [xh]Pass OpenAI API key to OpenAI Library by @matrixstone in
#5430
* [jk] Fix error detail log parsing by @johnson-mage in
#5443
* [jk] Render json objects in block outputs as string instead of nested
table by @johnson-mage in #5450
* [jk] Block output table cell overflow by @johnson-mage in
#5451
* [Bug] Updated SalesForce source to handle multiple date formats by
@tolson17 in #5493
* [jk] Handle encoded page_block_layouts and block_outputs routes by
@johnson-mage in #5544
* [jk] Revert name of test environment by @johnson-mage in
#5478
* [hw] Correct the SSL settings in the nats configuration template by
@csharplus in #5579
* Fix encoding in GCS Integration by @TalaatHasanin in
#5447
* [enhancement] Elasticsearch - Align the doc publishing with the
Standard (batch) by @syepes in
#5510
* [Enhancement] io/base.py: add Excel support by @LucasGrugru in
#5542
* [hw] Update deltalake to a recent version `0.20.2` by @csharplus in
#5541
* bugfix/#5562 python data exporter failed insert 2d array to postgres
by @sugimiyanto in #5563
* Fix unable to pass extra parameters to psycopg2 in Postgres connector
by @kanenorman in #5449
* [hw] Avoid throwing exceptions when `block_type` is None by @csharplus
in #5439
* Fix load data in GCS integration by @TalaatHasanin in
#5467

### 💅 Enhancements & Polish
* [xh] Update python version to fix vulnerabilities by @matrixstone in
#5523
* [jk] Display ID in block runs table by @johnson-mage in
#5457
* [Oracle DB] Modify inefficient code that converts dataframe to list of
tuples by @farmboy-dev in #5502
* [jk] Allow user to set number of lines displayed for block output
sample preview by @johnson-mage in
#5485
* [hw] Use `self.logger.debug` to replace `print` to make code clean by
@csharplus in #5494
* Updated Facebook Ads SDK to 20.0.2 by @jonatansthlmstratlab in
#5437


### Other Changes
* Update documentation for Git in Mage in the Getting Started section by
@Lennardvb in #5459
* [hw] Use the built-in IntEnum instead of creating a new class by
@csharplus in #5433
* [hw] Add time delay to avoid getting the same file timestamp for
multiple changes by @csharplus in
#5499
* Update text in documentation about enabling HTTPS in AWS by @Lennardvb
in #5517
* Fix typos in compute-resource.mdx by @mvillaizan in
#5519
* Update alerting-teams.mdx for multiples webhooks. by @messerzen in
#5511
* Update README.md by @neubert-analytics in
#5441
* Typo fix ai-client.mdx by @MageKai in
#5582
* Add information regarding feature dependency to docs by
@oscarlofwenhamn in #5584

## New Contributors
* @jonatansthlmstratlab made their first contribution in
#5437
* @Lennardvb made their first contribution in
#5459
* @syepes made their first contribution in
#5471
* @jx2lee made their first contribution in
#5486
* @mvillaizan made their first contribution in
#5519
* @farmboy-dev made their first contribution in
#5502
* @tolson17 made their first contribution in
#5493
* @neubert-analytics made their first contribution in
#5441
* @kanenorman made their first contribution in
#5449
* @LucasGrugru made their first contribution in
#5542
* @sugimiyanto made their first contribution in
#5563
* @oscarlofwenhamn made their first contribution in
#5584

**Full Changelog**:
0.9.74...0.9.75

# How Has This Been Tested?

- [x] Basic tests locally


# Checklist
- [x] The PR is tagged with proper labels (bug, enhancement, feature,
documentation)
- [x] I have performed a self-review of my own code
- [ ] I have added unit tests that prove my fix is effective or that my
feature works
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation

Development

Successfully merging this pull request may close these issues.

[BUG] Google Cloud Storage UTF-8 error
3 participants