This repository has been archived by the owner on Dec 20, 2018. It is now read-only.

spark-avro 3.2.0 doesn't work with spark 2.2.0 (abstract OutputWriter.write) #240

Closed
gnmerritt opened this issue Jul 12, 2017 · 33 comments

Comments

@gnmerritt

I'm trying to upgrade to the recently released spark 2.2.0. I'm using spark-avro version 3.2.0 and I get the following error when trying to write to an avro file.

org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
...
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriter.write(Lorg/apache/spark/sql/catalyst/InternalRow;)V
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:327)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
	... 8 more

At first glance it appeared to be related to #208, but on closer inspection the problem here is the OutputWriter.write method, not OutputWriterFactory.getFileExtension, which was what caused trouble with the spark 2.1.0 upgrade.

Happy to provide more info or help debug, just let me know!
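
For illustration, a minimal sketch of the kind of write that hits this (the session setup and output path are placeholders, not our actual job):

import org.apache.spark.sql.SparkSession

// Placeholder session and path; any DataFrame write through spark-avro 3.2.0
// on Spark 2.2.0 fails the same way once the task calls OutputWriter.write.
val spark = SparkSession.builder().appName("avro-write-repro").getOrCreate()

spark.range(10).toDF("id")
  .write
  .format("com.databricks.spark.avro")
  .save("/tmp/avro-write-repro")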

@squito
Contributor

squito commented Jul 18, 2017

I'm seeing this too. I think it's from SPARK-19085: apache/spark@b3d3962

You get a compile error if you bump the Spark version to 2.2.0 and rebuild:

[info] Compiling 5 Scala sources to /Users/irashid/github/pub/spark-avro/target/scala-2.11/classes...
[error] /Users/irashid/github/pub/spark-avro/src/main/scala/com/databricks/spark/avro/AvroOutputWriter.scala:41: class AvroOutputWriter needs to be abstract, since method write in class OutputWriter of type (row: org.apache.spark.sql.catalyst.InternalRow)Unit is not defined
[error] (Note that org.apache.spark.sql.catalyst.InternalRow does not match org.apache.spark.sql.Row)
[error] private[avro] class AvroOutputWriter(
[error]                     ^
[error] /Users/irashid/github/pub/spark-avro/src/main/scala/com/databricks/spark/avro/AvroOutputWriter.scala:69: method write overrides nothing.
[error] Note: the super classes of class AvroOutputWriter contain the following, non final members named write:
[error] def write(row: org.apache.spark.sql.catalyst.InternalRow): Unit
[error]   override def write(row: Row): Unit = {
[error]                ^
[error] two errors found
[error] (compile:compileIncremental) Compilation failed

This was discussed in the PR for Spark: apache/spark#16479 (comment)
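
For context, a simplified sketch of the shape of the API change (the real OutputWriter lives in org.apache.spark.sql.execution.datasources and has more members; the class names here are abbreviated for illustration):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow

// Spark 2.1.x shape (what spark-avro 3.2.0's AvroOutputWriter overrides):
abstract class OutputWriter21 {
  def write(row: Row): Unit
  def close(): Unit
}

// Spark 2.2.0 shape after SPARK-19085: the row type changed to InternalRow,
// so the 2.1-compiled override no longer implements the abstract method and
// the JVM throws AbstractMethodError at runtime.
abstract class OutputWriter22 {
  def write(row: InternalRow): Unit
  def close(): Unit
}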

@marcintustin

I'm also hitting this; would love to see the PR merged.

@sathish-io

I got the same problem. Any fix coming soon?

@gnmerritt
Author

We're using the fix proposed in #242 and it seems to be working fine. You'll just have to build spark-avro with the patch applied and point your maven/gradle/whatever at the custom build rather than the released version.
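
For anyone wondering what that looks like concretely, a rough sketch (the version string below is hypothetical; use whatever coordinates your patched build publishes):

// In the patched spark-avro checkout, publish the artifact to the local Ivy repo:
//   sbt +publishLocal
// Then, in the consuming project's build.sbt, depend on the locally published
// coordinates instead of the released 3.2.0 artifact:
libraryDependencies += "com.databricks" %% "spark-avro" % "3.2.0-with-240-fix"  // hypothetical version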

@ritesh-dineout

Is there any confirmation of when this PR will be merged? We are facing this in production and would appreciate a release as soon as possible.

@airawat

airawat commented Aug 17, 2017

We have a similar urgency. Please do share when this PR will be merged.

@ljank

ljank commented Aug 22, 2017

Since the PR has already been merged, is there any roadmap/timeline for the release? A custom build is always an option, but going the clean way feels much better :) Thank you!

@nightscape

If you need it now, you can use the JitPack build like this:

spark-shell --repositories https://jitpack.io --packages com.github.databricks:spark-avro:204864b6cf 
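If you're on sbt rather than spark-shell, the equivalent (as far as I can tell) should be the JitPack resolver plus the commit hash as the version:

// build.sbt
resolvers += "jitpack" at "https://jitpack.io"

// JitPack builds the artifact from this commit of databricks/spark-avro,
// which already contains the merged fix.
libraryDependencies += "com.github.databricks" % "spark-avro" % "204864b6cf"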

@jung-kim

I understand there are workarounds, but what is blocking the release of this fix?

@omervk

omervk commented Sep 3, 2017

+1 waiting for the release. Meanwhile, using @nightscape's suggestion.

@reflog

reflog commented Sep 4, 2017

@nightscape's solution works!

@mateo41

mateo41 commented Sep 6, 2017

I know other people have asked, but is there a timeline for the next spark-avro release?

@rxin
Contributor

rxin commented Sep 7, 2017

We want to make a release, although the eng team is pretty swamped at the moment.

For now please use the workaround provided above.

@dsfarrar

+1 waiting for the release. I appreciate the engineering team's time!

@geek311

geek311 commented Sep 26, 2017

I have built the jar from the master branch; it is the 4.0.0 snapshot. What should my sbt dependency be to use this instead of spark-avro 3.2.0? This error is coming up with Spark 2.2 and Redshift write operations.

@geek311

geek311 commented Sep 27, 2017

Building the jar from the master branch gives me spark-avro-assembly-4.0.0-SNAPSHOT.jar. If, on the CDH cluster, I replace the spark-avro-3.2.0 jar with this newly built jar and then run spark-submit with --jars pointing at the new assembly jar, I get a LinkageError. Can you please tell me how to solve this? What changes do I need to make in my sbt to point to this assembly jar? I have only passed it to the cluster via --jars.

Exception in thread "main" java.lang.LinkageError: loader constraint violation: when resolving method "org.apache.spark.streaming.StreamingContext$.getOrCreate(Ljava/lang/String;Lscala/Function0;Lorg/apache/hadoop/conf/Configuration;Z)Lorg/apache/spark/streaming/StreamingContext;" the class loader (instance of org/apache/spark/util/ChildFirstURLClassLoader) of the current class and the class loader (instance of sun/misc/Launcher$AppClassLoader) for the method's defining class, org/apache/spark/streaming/StreamingContext$, have different Class objects for the type scala/Function0 used in the signature

@marcintustin

marcintustin commented Sep 27, 2017 via email

@geek311

geek311 commented Sep 27, 2017

I have the following in my project/plugins.sbt for building the fat jar.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.4")

And my sbt has -
"com.databricks" % "spark-redshift_2.11" % "3.0.0-preview1"
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.2.0.cloudera1"

And build.sbt has scalaVersion := "2.11.8" set in every project's settings.

I have put the spark-avro-assembly-4.0.0-SNAPSHOT.jar in the [root]/lib directory and also tried putting it in the sub-project/lib directory as my root project has many sub-projects. Is there anything else I need to do in my sbt to point to my local snapshot jar?
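
For reference, a sketch of what the build.sbt pieces described above boil down to (the resolver needed for the cloudera1 artifact and any dependency scoping are left out of this sketch):

// build.sbt (sketch of the setup described above)
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "com.databricks"   %% "spark-redshift"             % "3.0.0-preview1",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0.cloudera1"
)

// sbt treats jars under <project>/lib as unmanaged dependencies, so dropping
// spark-avro-assembly-4.0.0-SNAPSHOT.jar into lib/ puts it on the compile
// classpath without extra settings; the LinkageError reported earlier in this
// thread looks like a runtime classloader clash rather than a missing compile
// dependency.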

My Scala version is 2.11.8. I am using IntelliJ IDEA and I see Scala 2.11.8 jars in my IDEA libraries, but I also see Scala 2.10 in the ~/.sbt and ~/.ivy2/cache folders. I had cleaned them out but they reappear; how can I fix this? I don't get the LinkageError if I don't use the SNAPSHOT jar and just use these:
com.databricks.spark-redshift_2.11-3.0.0-preview1.jar
com.databricks.spark-avro_2.11-3.2.0.jar

But then I'm back to square one with the InternalRow error.

@geek311

geek311 commented Sep 28, 2017

I am not getting any LinkageError with the published spark-avro 3.2.0 jar when using it with the redshift 3.0.0-preview1 jar. But after checking out the master branch, building the jar with sbt assembly, and adding it as an unmanaged dependency in the sbt root/lib, the LinkageError appears. Any advice? Or when can this fix be published?

@geek311

geek311 commented Sep 28, 2017

I was able to resolve that error and now I am getting the following. Is there something else I need to change in the patched master jar for a Spark 2.2 CDH 5.10 cluster?

Caused by: java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
at org.apache.avro.file.SnappyCodec.compress(SnappyCodec.java:43)
at org.apache.avro.file.DataFileStream$DataBlock.compressUsing(DataFileStream.java:361)
at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:394)
at org.apache.avro.file.DataFileWriter.sync(DataFileWriter.java:413)
at org.apache.avro.file.DataFileWriter.flush(DataFileWriter.java:422)
at org.apache.avro.file.DataFileWriter.close(DataFileWriter.java:445)
at org.apache.avro.mapreduce.AvroKeyRecordWriter.close(AvroKeyRecordWriter.java:83)
at com.databricks.spark.avro.AvroOutputWriter.close(AvroOutputWriter.scala:84)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:337)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:330)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
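
Not verified on this setup, but one possible sidestep: the spark-avro README documents a compression-codec setting, so switching the output codec away from Snappy should keep the Avro writer off the native snappy call shown above (whether that avoids the underlying classpath problem is an assumption):

// `spark` is the active SparkSession.
// Assumption: with a non-snappy codec the Avro writer never reaches
// org.xerial.snappy native code.
spark.conf.set("spark.sql.avro.compression.codec", "deflate")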

@beatlevic

+1 waiting for the release. Is there anything blocking it?

@geek311

geek311 commented Oct 8, 2017

After days of trying several options, I found a simple workaround. Hope this saves people tons of time and a lot of frustration!
Solution: just add this line to the spark-redshift write block: .option("tempformat","CSV")
This bypasses the default Avro format that spark-redshift uses and writes CSV instead. And voila! Everything works great with the released spark-redshift-3.0.0-preview1.jar and the released spark-avro jar.
sbt entries are:
"com.databricks" %% "spark-redshift" % "3.0.0-preview1"
"com.databricks" %% "spark-avro" % "3.2.0"

It may not be super efficient, but this one-line change saves the hassle of building from master and applying that patch on the cluster (plus the other problems that go with it), at least until the new jar is released.
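
In code, the one-line change looks roughly like this (a sketch; df, the JDBC URL, table name, and temp dir are placeholders):

// Writing through spark-redshift with CSV as the temp format instead of Avro.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<pass>")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("tempformat", "CSV")  // bypasses the Avro write path entirely
  .mode("append")
  .save()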

@ryanmickler

I'm hitting the same error with spark-avro 3.2.0 against spark 2.1.0

@pmatpadi

pmatpadi commented Oct 24, 2017

Thanks a lot for the CSV workaround, @geek311.

@lokkju

lokkju commented Oct 26, 2017

Hey, question for the engineering team here: how are you handling cross-Spark-version compatibility? I'm running into the same issue with a custom OutputWriter implementation, and I'm not sure how to support both Spark 2.1 and Spark 2.2, as this is a breaking change. I don't really want to maintain two branches...

Suggestions as to how you're planning to handle this, or whether you're just going to have a minimum required Spark version, would be appreciated.

@gatorsmile

cc @gengliangwang Could you take a look at this issue?

@gengliangwang
Contributor

We will make a release soon. Sorry for the wait.

@nemo83

nemo83 commented Oct 27, 2017

Hello, thanks for solving this issue. Any ETA on the release?

Thanks!

@luckyvaliin

How can I get the latest release?

@nemo83

nemo83 commented Nov 1, 2017

@luckyvaliin

Yes, I got that and it's working great!! Thank you.

@gatorsmile

gatorsmile commented Nov 2, 2017

We will send out a release announcement soon. Thanks!

@gnmerritt
Author

thanks guys
