Hello,
We just upgraded our stack from Spark 1.6 to Spark 2.2, and with that we moved from `com.databricks:spark-avro_2.10:2.0.1` to `com.databricks:spark-avro_2.11:4.0.0`.
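For context, the dependency change amounts to something like the following (shown as an sbt sketch, assuming a Scala 2.11 build against Spark 2.2; the exact build wiring may differ):

```scala
// build.sbt (sketch) -- "%%" appends the Scala binary version,
// so on Scala 2.11 this resolves to spark-avro_2.11:4.0.0
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.2.0" % "provided",
  "com.databricks"   %% "spark-avro" % "4.0.0"
)
```

The same artifact can also be pulled in at submit time with `spark-submit --packages com.databricks:spark-avro_2.11:4.0.0`.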
We noticed a huge increase in the running time of one of our scripts. Here is the schema of the files we are loading from HDFS:
In Spark 1 our script runs in ~2 minutes vs ~40 minutes in Spark 2.
At first I suspected our script & user-defined functions to be quite slow. But then I updated the script to simply read & write our file:
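Stripped down, that pass is just a load followed by a save through the spark-avro data source; a minimal sketch of it (paths and app name are placeholders) looks like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-read-write").getOrCreate()

// Read the Avro files from HDFS through the spark-avro data source...
val df = spark.read
  .format("com.databricks.spark.avro")
  .load("hdfs:///path/to/input")        // placeholder path

// ...and write them straight back out, with no transformation in between.
df.write
  .format("com.databricks.spark.avro")
  .save("hdfs:///path/to/output")       // placeholder path
```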
And we were still facing the same performance issue: in Spark 1 this runs in ~2 minutes and in Spark 2 this runs in ~40 minutes.
To give you more info on the files we are loading: there are ~2,500,000 entries and the number of struct elements in the array can be quite high:
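To give a sense of how those numbers can be measured, here is a quick profiling sketch (it assumes the DataFrame `df` from the read/write sketch above and a hypothetical array column named `events`; the real column name differs):

```scala
import org.apache.spark.sql.functions.{avg, col, max, size}

// Count the records and look at how large the struct array gets per record.
// "events" is a placeholder name for the array-of-structs column.
println(s"record count: ${df.count()}")   // ~2,500,000 in our data

df.select(
    max(size(col("events"))).as("max_array_length"),
    avg(size(col("events"))).as("avg_array_length"))
  .show()
```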
Could you look into this? If you need any additional information, feel free to ask!