
Updated dependencies for Spark 3.0.0 #30

Draft
wants to merge 4 commits into master

Conversation

dovijoel

Updated dependencies to work with Spark 3.0, addressing issue #15.

@ghost

ghost commented Jul 29, 2020

CLA assistant check
All CLA requirements met.

@dovijoel
Author

This is strange; all tests passed in my local dev environment. See the log below:

C:\Users\DoviJoel\.jdks\corretto-1.8.0_262\bin\java.exe -server -Xmx1536M -Dsbt.supershell=false -Didea.managed=true -Dfile.encoding=UTF-8 -Dsbt.log.noformat=true -jar C:\Users\DoviJoel\AppData\Roaming\JetBrains\IntelliJIdea2020.1\plugins\Scala\launcher\sbt-launch.jar early(addPluginSbtFile=\"\"\"C:\Users\DoviJoel\AppData\Local\Temp\idea.sbt\"\"\") "; set ideaPort in Global := 52731 ; idea-shell"
[info] welcome to sbt 1.3.13 (Amazon.com Inc. Java 1.8.0_262)
[info] loading global plugins from C:\Users\DoviJoel\.sbt\1.0\plugins
[info] loading settings for project sql-spark-connector-build from assembly.sbt,idea.sbt ...
[info] loading project definition from C:\dev\sql-spark-connector\project
[info] loading settings for project sql-spark-connector from build.sbt ...
[info] set current project to spark-mssql-connector (in build file:/C:/dev/sql-spark-connector/)
[info] Defining Global / ideaPort
[info] The new value will be used by Compile / compile, Test / compile
[info] Reapplying settings...
[info] set current project to spark-mssql-connector (in build file:/C:/dev/sql-spark-connector/)
[IJ]sbt:spark-mssql-connector> {file:/C:/dev/sql-spark-connector/}sql-spark-connector/test
[info] Compiling 1 Scala source and 1 Java source to C:\dev\sql-spark-connector\target\scala-2.12\classes ...
[warn] C:\dev\sql-spark-connector\src\main\java\com\microsoft\sqlserver\jdbc\spark\DataFrameBulkRecord.java:18:1: com.microsoft.sqlserver.jdbc.ISQLServerBulkRecord in com.microsoft.sqlserver.jdbc has been deprecated
[warn] com.microsoft.sqlserver.jdbc.ISQLServerBulkRecord
[warn] C:\dev\sql-spark-connector\src\main\java\com\microsoft\sqlserver\jdbc\spark\DataFrameBulkRecord.java:20:1: com.microsoft.sqlserver.jdbc.ISQLServerBulkRecord in com.microsoft.sqlserver.jdbc has been deprecated
[warn] ISQLServerBulkRecord
[warn] C:\dev\sql-spark-connector\src\main\java\com\microsoft\sqlserver\jdbc\spark\DataFrameBulkRecord.java:20:1: serializable class com.microsoft.sqlserver.jdbc.spark.DataFrameBulkRecord has no definition of serialVersionUID
[warn] public class DataFrameBulkRecord implements ISQLServerBulkRecord, AutoCloseable {
[warn]     private Iterator<Row> iterator;
[warn]     private ColumnMetadata[] dfColumnMetadata;
[warn]     private Set<Integer> columnOrdinals;
[warn]
[warn]     public DataFrameBulkRecord(Iterator<Row> iterator, ColumnMetadata[] dfColumnMetadata) {
[warn]         this.iterator = iterator;
[warn]         this.dfColumnMetadata = dfColumnMetadata;
[warn]         this.columnOrdinals = IntStream.range(1, dfColumnMetadata.length+1).boxed().collect(Collectors.toSet());
[warn]     }
[warn]
[warn]     @Override
[warn]     public Object[] getRowData() throws SQLServerException {
[warn]
[warn]         Row row = iterator.next();
[warn]         Object[] rowData = new Object[row.length()];
[warn]         for (int idx = 0; idx < dfColumnMetadata.length; idx++) {
[warn]              // Columns may be reordered between SQLTable and DataFrame. dfFieldIndex maps to the
[warn]              // corresponding column in rowData and thus use dfFieldIndex to get the column.
[warn]              int dfFieldIndex = dfColumnMetadata[idx].getDfColIndex();
[warn]              rowData[idx] = row.get(dfFieldIndex);
[warn]         }
[warn]         return rowData;
[warn]     }
[warn]
[warn]     @Override
[warn]     public Set<Integer> getColumnOrdinals() {
[warn]         return columnOrdinals;
[warn]     }
[warn]
[warn]     @Override
[warn]     public String getColumnName(int column) {
[warn]         return dfColumnMetadata[column-1].getName();
[warn]     }
[warn]
[warn]     @Override
[warn]     public int getColumnType(int column) {
[warn]         return dfColumnMetadata[column-1].getType();
[warn]     }
[warn]
[warn]     @Override
[warn]     public int getPrecision(int column) {
[warn]         return dfColumnMetadata[column-1].getPrecision();
[warn]     }
[warn]
[warn]     @Override
[warn]     public int getScale(int column) {
[warn]         return dfColumnMetadata[column-1].getScale();
[warn]     }
[warn]
[warn]     @Override
[warn]     public boolean isAutoIncrement(int column) {
[warn]         return dfColumnMetadata[column-1].isAutoIncrement();
[warn]     }
[warn]
[warn]     @Override
[warn]     public boolean next() throws SQLServerException {
[warn]         return iterator.hasNext();
[warn]     }
[warn]
[warn]     @Override
[warn]     public void close() throws SQLServerException {}
[warn]
[warn]     @Override
[warn]     public void addColumnMetadata(int positionInFile, String name, int jdbcType,
[warn]         int precision, int scale) {}
[warn]
[warn]     @Override
[warn]     public void addColumnMetadata(int positionInFile, String name, int jdbcType,
[warn]         int precision, int scale, DateTimeFormatter dateTimeFormatter) {}
[warn]
[warn]     @Override
[warn]     public DateTimeFormatter getColumnDateTimeFormatter(int column) {
[warn]         return null;
[warn]     }
[warn]
[warn]     @Override
[warn]     public void setTimestampWithTimezoneFormat(String dateTimeFormat) {}
[warn]
[warn]     @Override
[warn]     public void setTimestampWithTimezoneFormat(DateTimeFormatter dateTimeFormatter) {}
[warn]
[warn]     @Override
[warn]     public void setTimeWithTimezoneFormat(String timeFormat) {}
[warn]
[warn]     @Override
[warn]     public void  setTimeWithTimezoneFormat(DateTimeFormatter dateTimeFormatter) {}
[warn] }
[info] Done compiling.
[info] Compiling 1 Java source to C:\dev\sql-spark-connector\target\scala-2.12\test-classes ...
[info] Done compiling.
[debug] Test run started
[debug] Test com.microsoft.sqlserver.jdbc.spark.DataSourceUtilsTest.dataFrameBulkRecordTest started
[debug] Test com.microsoft.sqlserver.jdbc.spark.DataSourceUtilsTest.dataFrameBulkRecordTest finished, took 0.037 sec
[debug] Test com.microsoft.sqlserver.jdbc.spark.DataSourceUtilsTest.columnMetadataTest started
[debug] Test com.microsoft.sqlserver.jdbc.spark.DataSourceUtilsTest.columnMetadataTest finished, took 0.001 sec
[debug] Test run finished: 0 failed, 0 ignored, 2 total, 0.046s
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/29 00:23:21 INFO SparkContext: Running Spark version 3.0.0
20/07/29 00:23:21 INFO ResourceUtils: ==============================================================
20/07/29 00:23:21 INFO ResourceUtils: Resources for spark.driver:

20/07/29 00:23:21 INFO ResourceUtils: ==============================================================
20/07/29 00:23:21 INFO SparkContext: Submitted application: test-sql-context
20/07/29 00:23:22 INFO SecurityManager: Changing view acls to: DoviJoel
20/07/29 00:23:22 INFO SecurityManager: Changing modify acls to: DoviJoel
20/07/29 00:23:22 INFO SecurityManager: Changing view acls groups to:
20/07/29 00:23:22 INFO SecurityManager: Changing modify acls groups to:
20/07/29 00:23:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(DoviJoel); groups with view permissions: Set(); users  with modify permissions: Set(DoviJoel); groups with modify permissions: Set()
20/07/29 00:23:22 INFO Utils: Successfully started service 'sparkDriver' on port 52763.
20/07/29 00:23:22 INFO SparkEnv: Registering MapOutputTracker
20/07/29 00:23:22 INFO SparkEnv: Registering BlockManagerMaster
20/07/29 00:23:22 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/07/29 00:23:22 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/07/29 00:23:22 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
20/07/29 00:23:22 INFO DiskBlockManager: Created local directory at C:\Users\DoviJoel\AppData\Local\Temp\blockmgr-c54378f1-32b0-4f4e-8e1e-ee8305440657
20/07/29 00:23:22 INFO MemoryStore: MemoryStore started with capacity 671.1 MiB
20/07/29 00:23:22 INFO SparkEnv: Registering OutputCommitCoordinator
20/07/29 00:23:23 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/07/29 00:23:23 INFO Utils: Successfully started service 'SparkUI' on port 4041.
20/07/29 00:23:23 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://DESKTOP-5BLC2ED:4041
20/07/29 00:23:23 INFO Executor: Starting executor ID driver on host DESKTOP-5BLC2ED
20/07/29 00:23:23 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52790.
20/07/29 00:23:23 INFO NettyBlockTransferService: Server created on DESKTOP-5BLC2ED:52790
20/07/29 00:23:23 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/29 00:23:23 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, DESKTOP-5BLC2ED, 52790, None)
20/07/29 00:23:23 INFO BlockManagerMasterEndpoint: Registering block manager DESKTOP-5BLC2ED:52790 with 671.1 MiB RAM, BlockManagerId(driver, DESKTOP-5BLC2ED, 52790, None)
20/07/29 00:23:23 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, DESKTOP-5BLC2ED, 52790, None)
20/07/29 00:23:23 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, DESKTOP-5BLC2ED, 52790, None)
20/07/29 00:23:23 INFO DataSourceTest:

===== TEST OUTPUT FOR com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'Schema validation between Spark DataFrame and SQL Server ResultSet' =====

20/07/29 00:23:23 INFO DataSourceTest:

===== FINISHED com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'Schema validation between Spark DataFrame and SQL Server ResultSet' =====




20/07/29 00:23:23 INFO SharedState: loading hive config file: jar:file:/C:/Users/DoviJoel/.ivy2/cache/org.apache.spark/spark-sql_2.12/jars/spark-sql_2.12-3.0.0-tests.jar!/hive-site.xml
20/07/29 00:23:24 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/C:/dev/sql-spark-connector/spark-warehouse//com.microsoft.sqlserver.jdbc.spark.DataSourceTest').
20/07/29 00:23:24 INFO SharedState: Warehouse path is 'file:/C:/dev/sql-spark-connector/spark-warehouse//com.microsoft.sqlserver.jdbc.spark.DataSourceTest'.
20/07/29 00:23:24 INFO DataSourceTest:

===== TEST OUTPUT FOR com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'JdbcBulkOptions should have proper Bulk configurations' =====

20/07/29 00:23:24 INFO DataSourceTest:

===== FINISHED com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'JdbcBulkOptions should have proper Bulk configurations' =====

20/07/29 00:23:24 INFO DataSourceTest:

===== TEST OUTPUT FOR com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'Data pool URL generation' =====




20/07/29 00:23:24 INFO DataSourceTest:

===== FINISHED com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'Data pool URL generation' =====

20/07/29 00:23:24 INFO DataSourceTest:

===== TEST OUTPUT FOR com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'Multi part tablename test' =====

20/07/29 00:23:24 INFO DataSourceTest:

===== FINISHED com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'Multi part tablename test' =====

20/07/29 00:23:24 INFO DataSourceTest:

===== TEST OUTPUT FOR com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'Data pool options test' =====

20/07/29 00:23:24 INFO DataSourceTest:

===== FINISHED com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'Data pool options test' =====

20/07/29 00:23:24 INFO DataSourceTest:

===== TEST OUTPUT FOR com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'Default AAD options are correct.' =====

20/07/29 00:23:24 INFO DataSourceTest:

===== FINISHED com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'Default AAD options are correct.' =====

20/07/29 00:23:24 INFO DataSourceTest:

===== TEST OUTPUT FOR com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'Correct AAD options are set when accessToken is specified' =====

20/07/29 00:23:24 INFO DataSourceTest:

===== FINISHED com.microsoft.sqlserver.jdbc.spark.DataSourceTest: 'Correct AAD options are set when accessToken is specified' =====

20/07/29 00:23:25 INFO SparkUI: Stopped Spark web UI at http://DESKTOP-5BLC2ED:4041
20/07/29 00:23:25 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/29 00:23:25 INFO MemoryStore: MemoryStore cleared
20/07/29 00:23:25 INFO BlockManager: BlockManager stopped
20/07/29 00:23:25 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/29 00:23:25 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
[info] DataSourceTest:
[info] - Schema validation between Spark DataFrame and SQL Server ResultSet
[info] - JdbcBulkOptions should have proper Bulk configurations
[info] - Data pool URL generation
[info] - Multi part tablename test
[info] - Data pool options test
[info] - Default AAD options are correct.
[info] - Correct AAD options are set when accessToken is specified
[info] ScalaTest20/07/29 00:23:25 INFO SparkContext: Successfully stopped SparkContext
20/07/29 00:23:25 WARN DataSourceTest:

===== POSSIBLE THREAD LEAK IN SUITE com.microsoft.sqlserver.jdbc.spark.DataSourceTest, thread names: rpc-boss-3-1, shuffle-boss-6-1 =====


[info] Run completed in 4 seconds, 649 milliseconds.
[info] Total number of tests run: 7
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 7, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 9, Failed 0, Errors 0, Passed 9
[success] Total time: 18 s, completed 29 Jul 2020 9:23:25 AM


@shivsood
Collaborator

@dovijoel Thanks for your work. Did you do an E2E test with Databricks/Spark 3.0? AFAIK that will fail; this needs a source-level fix as well.

val sparkVersion = "2.4.6"
scalaVersion := "2.12.11"
ThisBuild / useCoursier := false
val sparkVersion = "3.0.0"
Collaborator

We would need to support the Spark 2.4/Scala 2.11 combo as well.

Author

Ok, I will look into supporting both scenarios.

Contributor

I think you can use something like
crossScalaVersions := Seq("2.12.10", "2.11.12")
to do this

Contributor

@rajmera3 that alone won't work, because there is no Spark 3.0 build for Scala 2.11. Here we need a combo of (Spark 2.4 + Scala 2.11) and (Spark 3.0 + Scala 2.12); see the sketch below.
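
Not from this PR, but a minimal build.sbt sketch of how such a combo could be wired up, assuming only the Spark modules need to switch with the Scala version (versions are illustrative):

crossScalaVersions := Seq("2.12.11", "2.11.12")

libraryDependencies ++= {
  // Pick the Spark version that matches the Scala binary version being built.
  val sparkVersion = CrossVersion.partialVersion(scalaVersion.value) match {
    case Some((2, 11)) => "2.4.6" // Spark 2.4 line for Scala 2.11
    case _             => "3.0.0" // Spark 3.0 line for Scala 2.12
  }
  Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
  )
}

// Build both artifacts with: sbt +package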

Collaborator

One option here is to make the main line Spark 3.0/Scala 2.12 and move the stable/old version to a separate branch, e.g. a Spark 2.4 branch.

@dovijoel
Author

@dovijoel Thanks for your work. Did you do an E2E test with Databricks/Spark 3.0? AFAIK that will fail; this needs a source-level fix as well.

Can you please point me in the right direction on how to do an E2E test?

After looking through the changes made for Spark 3.0, it seems that some classes are no longer accessible, specifically the internal Logging class.
Someone in the JIRA below suggests pasting the code into the project in order to use it. Do you think that is a viable solution?
https://issues.apache.org/jira/browse/SPARK-13928
https://github.com/apache/spark/pull/11764

@dovijoel dovijoel marked this pull request as draft July 30, 2020 14:24
@rajmera3
Contributor

rajmera3 commented Aug 5, 2020

@shivsood please advise @dovijoel when you get the chance

@shivsood
Collaborator

@dovijoel Thanks for your work. Did you do an E2E test with Databricks/Spark 3.0? AFAIK that will fail; this needs a source-level fix as well.

Can you please point me in the right direction on how to do an E2E test?

After looking through the changes made for Spark 3.0, it seems that some classes are no longer accessible, specifically the internal Logging class.
Someone in the JIRA below suggests pasting the code into the project in order to use it. Do you think that is a viable solution?
https://issues.apache.org/jira/browse/SPARK-13928
https://github.com/apache/spark/pull/11764

I get a 404 when I access these links.

apache/spark#11764

Yes, we will need to remove the dependency on Spark's internal logging and create a logger in the connector. Two possibilities:

  • Get the logger using Logger.getLogger()
  • Implement our own logging class, similar to the Spark trait Logging (see the sketch below)

I am not sure what other issues come up with Spark 3.0 on DBR. Can you please post any errors/findings here?
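
For reference, a rough sketch of what the second option could look like using slf4j (which is already on Spark's classpath); the trait name and method set are illustrative, not the actual change in this PR:

import org.slf4j.{Logger, LoggerFactory}

// Hypothetical connector-local replacement for Spark's internal Logging trait.
trait ConnectorLogging {
  @transient private lazy val log: Logger =
    LoggerFactory.getLogger(this.getClass.getName.stripSuffix("$"))

  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
  protected def logError(msg: => String, e: Throwable): Unit = log.error(msg, e)
}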

@shivsood shivsood mentioned this pull request Aug 11, 2020
@dovijoel
Author


Thank you for the advice on the logging approach! I will incorporate it and let you know.

@nightscape

Hi @dovijoel, did you see my PR against your PR? 😄
dovijoel#1
dovijoel#2

@aravish

aravish commented Aug 13, 2020

I was able to validate the connector successfully with the Databricks DBR 7.1 runtime, built from the forked branch that has PR #30 merged (https://github.com/dovijoel/sql-spark-connector).


print("Use Apache Spark Connector for SQL Server and Azure SQL to write to master SQL instance ")
servername = "jdbc:sqlserver://aravishsqlserver.database.windows.net"
dbname = "custommetastore"
url = servername + ";" + "databaseName=" + dbname + ";"

dbtable = "TBLS"
user = "aravish"
password = "xxxxxxxx" # Please specify password here

print("read data from SQL server table  ")
jdbcDF = spark.read \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", url) \
        .option("dbtable", dbtable) \
        .option("user", user) \
        .option("password", password).load()

jdbcDF.show(5)

(1) Spark Jobs
Job 0 View(Stages: 1/1)
jdbcDF:pyspark.sql.dataframe.DataFrame = [TBL_ID: long, CREATE_TIME: integer ... 9 more fields]
Use Apache Spark Connector for SQL Server and Azure SQL to write to master SQL instance
read data from SQL server table
+------+-----------+-----+----------------+-----+---------+-----+---------------+-------------+------------------+------------------+
|TBL_ID|CREATE_TIME|DB_ID|LAST_ACCESS_TIME|OWNER|RETENTION|SD_ID| TBL_NAME| TBL_TYPE|VIEW_EXPANDED_TEXT|VIEW_ORIGINAL_TEXT|
+------+-----------+-----+----------------+-----+---------+-----+---------------+-------------+------------------+------------------+
| 1| 1550187698| 1| 0| root| 0| 1|hivesampletable|MANAGED_TABLE| null| null|
+------+-----------+-----+----------------+-----+---------+-----+---------------+-------------+------------------+------------------+
Command took 4.55 seconds -- by arravish@microsoft.com at 8/13/2020, 4:48:17 AM on dbr70

Writes

jdbcDF.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", url) \
    .option("dbtable", "TBLS_Spark_SQL_Connector") \
    .option("user", user) \
    .option("password", password) \
    .save()

(1) Spark Jobs
Command took 2.73 seconds -- by arravish@microsoft.com at 8/13/2020, 9:05:57 AM on dbr70

@shivsood
Collaborator

@dovijoel here's the error that @rajmera3 reported when he tested with master some time back:

  1. Runtime 7.0 (beta) (Apache Spark 3.0.0, Scala 2.12)
    a. Error: java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class

@shivsood
Collaborator


Thanks @aravish. Can you test with master and confirm that you get an error, then repeat the test with this PR branch?

@MrWhiteABEX

Any progress on this? I can also confirm that @dovijoel's spark-3.0 branch works. I have compiled it successfully on Windows and used it to read and write data on DBR 7.3.
@aravish getting Spark 3.0 support is really crucial for us as an Azure Databricks customer. We are currently on 6.5 due to needed MLflow features, so going back to 6.4 is not an option, and its support ends in October. I can still update to 6.6, but that is also only a temporary solution until November. I would really like to avoid using a custom build of sql-spark-connector.

@gmdiana-hershey

@MrWhiteABEX - we have the same issue: we are using an old Databricks runtime because this connector doesn't support Spark 3.0. As an Azure Databricks customer, it's important to us that Spark 3.0+ support comes ASAP. Please advise if there's anything I can do to validate this in our DEV environment.

@MrWhiteABEX

@gmdiana-hershey In our testing environment I'm already running DBR 7.2 with the Spark connector built from @dovijoel's adjustments. No issues experienced so far. I think the only problem is that the CI build pipeline is broken by Scala 2.12, but I did not investigate it.
I just compiled it locally using sbt assembly and uploaded the library to my Databricks workspace (see the sketch below). I would like to see official Spark 3.0 support, but I guess I will just switch to my custom library and update to DBR 7. I already see large improvements in my streaming pipeline due to Spark 3.0.
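
For anyone else building locally, a minimal sketch of the sbt-assembly setup assumed in these comments (plugin version is illustrative):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")

// Then build the fat jar from the repo root and upload the resulting
// target/scala-2.12/<name>-assembly-<version>.jar to the Databricks workspace as a library:
//   sbt assembly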

@alexott
Contributor

alexott commented Nov 6, 2020

I just tested a build with Spark 3.0.1 & DBR 7.3, and it works just fine.

@pmooij

pmooij commented Nov 26, 2020


@aravish @shivsood any update on this?
I needed to step back from DBR 6.6 to 6.4 today because 6.6 is EOL now :(

@lmicverm

lmicverm commented Dec 3, 2020

I tried building it with sbt assembly to get a fat jar. I used the jar on DBR 7.3 with Spark 3.0.1, but when writing to a database, I get the following error:

Py4JJavaError: An error occurred while calling o491.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 16, 10.139.64.5, executor 0): com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
	at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:234)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(SQLServerConnection.java:1130)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.rollback(SQLServerConnection.java:3270)
	at com.microsoft.sqlserver.jdbc.spark.BulkCopyUtils$.savePartition(BulkCopyUtils.scala:53)
	at com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$.$anonfun$write$2(BestEffortSingleInstanceStrategy.scala:30)
	at com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$.$anonfun$write$2$adapted(BestEffortSingleInstanceStrategy.scala:29)
	at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1001)
	at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1001)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2371)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
	at org.apache.spark.scheduler.Task.run(Task.scala:117)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:662)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:665)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

It appears that it fails when writing the data to the database, since the table itself is successfully created in the database with all its columns.

@pmooij

pmooij commented Dec 9, 2020

Compiling the fat JAR worked for me, as I described here: #15 (comment)

This has been running smoothly on Databricks Runtime 7.4 | Spark 3.1 over the last few days.

"com.novocode" % "junit-interface" % "0.11" % "test",

//SQLServer JDBC jars
"com.microsoft.sqlserver" % "mssql-jdbc" % "7.2.1.jre8"
"com.microsoft.sqlserver" % "mssql-jdbc" % "8.2.1.jre8"
Collaborator

Why v8.2 for the JDBC driver? Did 7.2 cause an issue, or was this just alignment to the latest version?


Spark 3 is on JDK 11, so you need to use:

// https://mvnrepository.com/artifact/com.microsoft.sqlserver/mssql-jdbc
libraryDependencies += "com.microsoft.sqlserver" % "mssql-jdbc" % "8.4.1.jre11"

@shivsood
Collaborator

@dovijoel Can you please rebase with master and push? I want to standardize Spark 3.0/Scala 2.12 support on this branch. Meanwhile, I have created a new PR based off this PR for Spark 3.0/Scala 2.12 support: #76

@dovijoel
Author

dovijoel commented Jan 12, 2021 via email

@shivsood shivsood mentioned this pull request Feb 2, 2021