#
# Modifications Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# spark-redshift Changelog
## 6.4.0 (2024-12-02)
- Supports utilizing the Redshift Data API as an alternative mechanism for communicating with Redshift (see parameters "data_api_*" in the README). [Beaux Sharifi]
- Adds DML pushdown support for four Spark SQL operators: INSERT, DELETE, UPDATE, and MERGE. [Hari Kishore Chaparala, Ruei Yang Huang, Xiaoxuan Li, Beaux Sharifi, Brooke White]
- Adds new pushdown support for the following Spark operators: LocalRelation, CreateNamedStruct, Intersect, and Except [Hari Kishore Chaparala, Xiaoxuan Li, Beaux Sharifi]
- Enhances InSubQuery pushdown to support multiple values. [Beaux Sharifi]
- Adds pushdown support for join conditions [Beaux Sharifi, Brooke White]
- Adds sub-field pushdown support to child scalar subqueries. [Beaux Sharifi]
- Converts left semi and left anti joins into intersect and except set operations in some conditions to improve pushdown performance. [Hari Kishore Chaparala]
- Reverses readSidePadding Spark optimization rule to improve pushdown performance. [Atul Payapilly]
- Adds parameter "check_s3_bucket_usage" for disabling S3 bucket checks during operation to reduce network calls. [Armin Najafi]
- Abbreviates generated pushdown SQL for accommodating Redshift Data API’s smaller maximum query length. [Jack Ye]
- Strengthens logic for detecting and preventing cross-cluster pushdowns. [Beaux Sharifi]
- Validates connector tests pass with the latest Spark patch releases 3.3.4, 3.4.4, and 3.5.3 [Beaux Sharifi]
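
The semi/anti join rewrite listed above rests on a set-operation equivalence. The following is a minimal sketch of that equivalence using Python's built-in `sqlite3` module rather than Redshift; it is not the connector's code. The tables `a` and `b` and the helper `rows` are hypothetical, and the sketch assumes the join compares every column and that all rows are distinct and non-null.

```python
import sqlite3

# In-memory SQLite database standing in for Redshift (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER NOT NULL, name TEXT NOT NULL);
    CREATE TABLE b (id INTEGER NOT NULL, name TEXT NOT NULL);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b VALUES (2, 'y'), (3, 'z'), (4, 'w');
""")


def rows(sql):
    """Run a query and return its rows in sorted order (hypothetical helper)."""
    return sorted(conn.execute(sql).fetchall())


# Left semi join: rows of a that have a match in b on every column.
semi = rows(
    "SELECT * FROM a WHERE EXISTS "
    "(SELECT 1 FROM b WHERE b.id = a.id AND b.name = a.name)"
)
# The same result expressed as a set operation.
intersect = rows("SELECT * FROM a INTERSECT SELECT * FROM b")

# Left anti join: rows of a that have no match in b.
anti = rows(
    "SELECT * FROM a WHERE NOT EXISTS "
    "(SELECT 1 FROM b WHERE b.id = a.id AND b.name = a.name)"
)
# The same result expressed as a set operation.
except_ = rows("SELECT * FROM a EXCEPT SELECT * FROM b")

print(semi == intersect, anti == except_)
```

With duplicate rows or NULL values the two forms can return different results, because `INTERSECT` and `EXCEPT` remove duplicates and treat NULLs as equal. That is presumably one reason the rewrite applies only in some conditions.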
## 6.3.0 (2024-07-03)
- Validates connector tests pass with Spark releases 3.4.3 and 3.5.1 [Beaux Sharifi]
- Adds support for three-part table names to allow connector to query Redshift data sharing tables (#153) [Prashant Singh, Beaux Sharifi]
- Corrects mapping of Spark ShortType to use Redshift SMALLINT instead of INTEGER to better match expected data size (#152) [Akira Ajisaka, Ruei Huang]
- Adds pushdown of toprettystring() function to support df.show() operations used by Spark 3.5 [Beaux Sharifi]
- Corrects pushdown of inner joins with no join condition to result in a cross join instead of an invalid inner join [Beaux Sharifi]
- Corrects pushdown when casting null boolean values to strings to result in the value null instead of the string "null". [Beaux Sharifi]
- Upgrades Redshift JDBC driver to the latest available version 2.1.0.29. [Beaux Sharifi]
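
The inner-join correction above can be pictured as a small change in SQL generation. This is a hedged sketch only: `render_join` is a hypothetical helper, not the connector's actual pushdown code, and the relation names are placeholders.

```python
from typing import Optional


def render_join(left: str, right: str, condition: Optional[str] = None) -> str:
    """Render a join between two relations (hypothetical helper)."""
    if condition is None or not condition.strip():
        # An INNER JOIN needs a join condition, so a condition-less join
        # is emitted as an explicit CROSS JOIN instead.
        return f"SELECT * FROM {left} CROSS JOIN {right}"
    return f"SELECT * FROM {left} INNER JOIN {right} ON {condition}"


print(render_join("orders", "customers"))
print(render_join("orders", "customers", "orders.cid = customers.id"))
```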
## 6.2.0 (2024-01-12)
- Validates support for Spark 3.3.4 and Spark 3.4.2
- Upgrades Redshift JDBC driver to version 2.1.0.24
- Fixes issue where CSV writes would trim leading and trailing whitespace on column values.
- Improves logging during Redshift writes to differentiate time spent writing to S3 versus COPYing into Redshift (#148)
- Supports pre-GA AWS regions.
## 6.1.0 (2023-10-06)
- Support Spark 3.4.1 and 3.5.0
- Upgrade sbt to version 1.9
- Upgrade Redshift JDBC driver to version 2.1.0.18
- Resolve CVE-2023-2976 and CVE-2020-8908
- Cancel any running queries upon shutdown.
- Improve traceability between Spark and Redshift with introduction of 'trace_id' Spark configuration.
- Improve performance of integration tests by parallelizing their execution.
- Resolve 'expression_tree_walker' query exceptions for sort-limit-project queries.
- Fix issues with cross-cluster queries not working correctly with pushdown.
## 6.0.0 (2023-07-17)
This major release brings forward several performance, security, and usability improvements contributed by AWS:
- Automatic translation and pushdown of Spark expressions down to Redshift for local execution (via new 'autopushdown' parameter)
- Upgrade JDBC driver from 1.x to 2.x for improved security and compatibility with Redshift
- IAM-based authentication
- AWS Secrets Manager-based authentication
- Parquet support for reading and writing
- S3 result caching (when autopushdown is enabled)
- Spark 3.4.0 support
- Improve cross-region support between Spark cluster and S3 bucket (via new 'tempdir_region' parameter)
- Automatic retry during Redshift COPY operations (via new 'copydelay' and 'copyretrycount' parameters)
- Ability to pass additional options during Redshift UNLOAD operations (via new 'extraunloadoptions' parameter)
- Microsecond precision when reading or writing time-based columns to/from Redshift.
- Complex type support for using Spark ArrayType, MapType, and StructType with Redshift SUPER data type.
- Support for parallel Redshift SQL execution during autopushdown.
- Support for passing JDBC parameters through the connector to the underlying JDBC driver (via new 'jdbc.*' parameter)
- Query group support for annotating Redshift queries from connector (via new 'label' parameter)
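
As an illustration of the automatic COPY retry listed above, here is a minimal retry loop. It is a sketch under stated assumptions, not the connector's implementation: `run_with_retry` is a hypothetical helper, and it assumes 'copyretrycount' is the number of retries after the first attempt and 'copydelay' is a delay in milliseconds between attempts. Check the README for the actual semantics and defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    operation: Callable[[], T],
    copyretrycount: int,
    copydelay_ms: int,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on failure (hypothetical helper).

    Assumes copyretrycount counts retries after the first attempt and
    copydelay_ms is the pause between attempts in milliseconds.
    """
    if copyretrycount < 0:
        raise ValueError("copyretrycount must be non-negative")
    attempts = copyretrycount + 1
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # Out of retries: surface the last error.
            sleep(copydelay_ms / 1000.0)
    raise AssertionError("unreachable")
```

The `sleep` argument is injectable so the loop can be exercised without actually waiting.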
## 5.1.0 (2022-09-22)
- Make manifest file path use s3a/n scheme
- Add catalyst type mapping for LONGVARCHAR
- Upgrade to Spark 3.2
- Fix log4j-apt compatibility with Spark 3.2
## 5.0.5 (2021-11-09)
- Avoid warning when tmp bucket is configured with a lifecycle without prefix.
## 5.0.4 (2021-07-08)
- Upgrade Spark version to 3.0.2 and the test AWS Java SDK version to the latest
## 5.0.3 (2021-05-10)
- Remove sbt-spark-package plugin dependency (#90)
## 5.0.2 (2021-05-06)
- Add sse kms support (#82)
## 5.0.1 (2021-04-30)
- Address low performance issue while reading csv files (#87)
## 5.0.0 (2021-01-13)
- Upgrade spark-redshift to support hadoop3
## 4.2.0 (2020-10-08)
- Make spark-redshift Spark 3.0.1 compatible
## 4.1.1
- Cross publish for scala 2.12 in addition to 2.11
## 4.1.0
- Add `include_column_list` parameter
## 4.0.2
- Trim SQL text for preactions and postactions, to fix empty SQL queries bug.
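
A sketch of the kind of trimming this fix describes, assuming (as the connector's README describes) that `preactions` and `postactions` take a `;`-separated list of SQL commands. The helper `split_sql_actions` is hypothetical and is not the connector's code.

```python
from typing import List


def split_sql_actions(actions: str) -> List[str]:
    """Split a ';'-separated string into trimmed, non-empty statements.

    Hypothetical helper. Splitting on ';' is naive and would break on
    semicolons inside string literals.
    """
    statements = (part.strip() for part in actions.split(";"))
    return [statement for statement in statements if statement]


print(split_sql_actions(" DELETE FROM t ;\n ANALYZE t; "))
```

Dropping the blank fragments means a trailing `;` or stray whitespace no longer produces an empty query.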
## 4.0.1
- Fix bug when parsing microseconds from Redshift
## 4.0.0
This major release makes spark-redshift compatible with Spark 2.4. This was tested in production.
While upgrading the package, we dropped some features due to time constraints.
- Support for hadoop 1.x has been dropped.
- STS and IAM authentication support has been dropped.
- postgresql driver tests are inactive.
- SaveMode tests (or functionality?) are broken. This is a bit scary, but I'm not sure we use the functionality,
and fixing them didn't make it into this version (spark-snowflake removed them too).
- S3Native has been deprecated. We created an InMemoryS3AFileSystem to test S3A.
## 4.0.0-SNAPSHOT
- SNAPSHOT version to test publishing to Maven Central.
## 4.0.0-preview20190730 (2019-07-30)
- The library is tested in production using Spark 2.4
- RedshiftSourceSuite is again among the scala test suites.
## 4.0.0-preview20190715 (2019-07-15)
Move to pre-4.0.0 'preview' releases rather than SNAPSHOT
## 4.0.0-SNAPSHOT-20190710 (2019-07-10)
Remove AWSCredentialsInUriIntegrationSuite test and require s3a path in CrossRegionIntegrationSuite.scala
## 4.0.0-SNAPSHOT-20190627 (2019-06-27)
Baseline SNAPSHOT version working with 2.4
#### Deprecation
In order to get this baseline snapshot out, we dropped some features and package versions,
and disabled some tests.
Some of these changes are temporary, others - such as dropping hadoop 1.x - are meant to stay.
Our intent is to do the best job possible supporting the minimal set of features
that the community needs. Other non-essential features may be dropped before the
first non-snapshot release.
The community's feedback and contributions are vitally important.
* Support for hadoop 1.x has been dropped.
* STS and IAM authentication support has been dropped (so are tests).
* postgresql driver tests are inactive.
* SaveMode tests (or functionality?) are broken. This is a bit scarier but I'm not sure we use the functionality and fixing them didn't make it in this version (spark-snowflake removed them too).
* S3Native has been deprecated. It's our intention to phase it out from this repo. The test util ‘inMemoryFilesystem’ is not present anymore so an entire test suite RedshiftSourceSuite lost its major mock object and I had to remove it. We plan to re-write it using s3a.
#### Commits changelog
- 5b0f949 (HEAD -> master, origin_community/master) Merge pull request #6 from spark-redshift-community/luca-spark-2.4
- 25acded (origin_community/luca-spark-2.4, origin/luca-spark-2.4, luca-spark-2.4) Revert sbt scripts to an older version
- 866d4fd Moving to external github issues - rename spName to spark-redshift-community
- 094cc15 remove in Memory FileSystem class and clean up comments in the sbt build file
- 0666bc6 aws_variables.env gitignored
- f3bbdb7 sbt assembly the package into a fat jar - found the perfect coordination between different libraries versions! Tests pass and can compile spark-on-paasta and spark successfullygit add src/ project/
- b1fa3f6 Ignoring a bunch of tests as did snowflake - close to have a green build to try out
- 95cdf94 Removing conn.commit() everywhere - got 88% of integration tests to run - fix for STS token aws access in progress
- da10897 Compiling - managed to run tests but they mostly fail
- 0fe37d2 Compiles with spark 2.4.0 - amazon unmarshal error
- ea5da29 force spark.avro - hadoop 2.7.7 and awsjavasdk downgraded
- 834f0d6 Upgraded jackson by excluding it in aws
- 90581a8 Fixed NewFilter - including hadoop-aws - s3n test is failing
- 50dfd98 (tag: v3.0.0, tag: gtig, origin/master, origin/HEAD) Merge pull request #5 from Yelp/fdc_first-version
- fbb58b3 (origin/fdc_first-version) First Yelp release
- 0d2a130 Merge pull request #4 from Yelp/fdc_DATALAKE-4899_empty-string-to-null
- 689635c (origin/fdc_DATALAKE-4899_empty-string-to-null) Fix File line length exceeds 100 characters
- d06fe3b Fix scalastyle
- e15ccb5 Fix parenthesis
- d16317e Fix indentation
- 475e7a1 Fix convertion bit and test
- 3ae6a9b Fix Empty string is converted to null
- 967dddb Merge pull request #3 from Yelp/fdc_DATALAKE-486_avoid-log-creds
- 040b4a9 Merge pull request #2 from Yelp/fdc_DATALAKE-488_cleanup-fix-double-to-float
- 58fb829 (origin/fdc_DATALAKE-488_cleanup-fix-double-to-float) Fix test
- 3384333 Add bit and default types
- 3230aaa (origin/fdc_DATALAKE-486_avoid-log-creds) Avoid logging creds. log sql query statement only
- ab8124a Fix double type to float and cleanup
- cafa05f Merge pull request #1 from Yelp/fdc_DATALAKE-563_remove-itests-from-public
- a3a39a2 (origin/fdc_DATALAKE-563_remove-itests-from-public) Remove itests. Fix jdbc url. Update Redshift jdbc driver
- 184b442 Make the note more obvious.
- 717a4ad Notes about inlining this in Databricks Runtime.
- 8adfe95 (origin/fdc_first-test-branch-2) Fix decimal precision loss when reading the results of a Redshift query
- 8da2d92 Test infra housekeeping: reduce SBT memory, update plugin versions, update SBT
- 79bac6d Add instructions on using JitPack master SNAPSHOT builds
- 7a4a08e Use PreparedStatement.getMetaData() to retrieve Redshift query schemas
- b4c6053 Wrap and re-throw Await.result exceptions in order to capture full stacktrace
- 1092c7c Update version in README to 3.0.0-preview1
- 320748a Setting version to 3.0.0-SNAPSHOT
- a28832b (tag: v3.0.0-preview1, origin/fdc_30-review) Setting version to 3.0.0-preview1
- 8afde06 Make Redshift to S3 authentication mechanisms mutually exclusive
- 9ed18a0 Use FileFormat-based data source instead of HadoopRDD for reads
- 6cc49da Add option to use CSV as an intermediate data format during writes
- d508d3e Add documentation and warnings related to using different regions for Redshift and S3
- cdf192a Break RedshiftIntegrationSuite into smaller suites; refactor to remove some redundancy
- bdf4462 Pass around AWSCredentialProviders instead of AWSCredentials
- 51c29e6 Add codecov.yml file.
- a9963da Update AWSCredentialUtils to be uniform between URI schemes.
## 3.0.0-SNAPSHOT (2017-11-08)
Databricks spark-redshift pre-fork, changes not tracked.