
Error while using spark-redshift jar #315

Open
ghost opened this issue Dec 29, 2016 · 37 comments

@ghost

ghost commented Dec 29, 2016

Hi,

I'm getting the error below while using the jar to integrate Redshift with Spark locally.

Exception in thread "main" java.lang.AbstractMethodError: com.databricks.spark.redshift.RedshiftFileFormat.prepareRead(Lorg/apache/spark/sql/SparkSession;Lscala/collection/immutable/Map;Lscala/collection/Seq;)Lscala/collection/immutable/Map;

at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
	at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:168)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$3.apply(DataSourceStrategy.scala:141)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$3.apply(DataSourceStrategy.scala:141)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:184)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:183)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:257)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:179)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:137)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:55)
	at org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:54)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:77)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:82)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:82)
	at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2462)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:1861)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2078)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:240)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:533)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:493)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:502)
	at simpleSample.RedshiftToSpark$.main(RedshiftToSpark.scala:53)
	at simpleSample.RedshiftToSpark.main(RedshiftToSpark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

I see that the prepareRead method is not present in RedshiftFileFormat.
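For reference, here's a minimal sketch of the kind of read that exercises this code path; the JDBC URL, table name, and tempdir below are placeholders, not my actual values:

import org.apache.spark.sql.SparkSession

object RedshiftReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("redshift-read").master("local[*]").getOrCreate()

    // Read a Redshift table through the spark-redshift data source.
    val df = spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS") // placeholder
      .option("dbtable", "my_table")                                         // placeholder
      .option("tempdir", "s3n://my-bucket/tmp/")                             // placeholder
      .load()

    // The show() call (RedshiftToSpark.scala:53 in the trace above) is where the error surfaces.
    df.show()
  }
}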

Thanks & Regards,
Ravi

@JoshRosen
Contributor

JoshRosen commented Dec 29, 2016

Which version of Spark are you using? If you're using 2.1.x, then I suspect that changes to internal APIs may have broken spark-redshift, in which case we'll need to make a new release.

@JoshRosen
Contributor

Actually, looking a little more closely: since this problem relates to prepareRead, I don't think it's a 2.1.x issue, because that method had been completely removed from Spark by that point (see apache/spark#13698). According to https://issues.apache.org/jira/browse/SPARK-15983, that change went into 2.0.

So: are you using a newer version of spark-redshift with Spark 1.x? You'll need to use a 1.x release of this library with Spark 1.x; newer versions won't work there.
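For reference, keeping the two aligned looks roughly like this in sbt (version numbers below are illustrative, not a specific recommendation):

// build.sbt sketch: pair a 1.x release of this library with a Spark 1.x application.
// Version numbers are illustrative.
val sparkVersion = "1.6.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"      % sparkVersion % "provided",
  "com.databricks"   %% "spark-redshift" % "1.1.0"  // 1.x line of this library for Spark 1.x
)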

@lminer

lminer commented Jan 6, 2017

I'm getting the same exception with a different stack trace, but only when I switch from Spark 2.0.1 to Spark 2.1.0 (Hadoop 2.7, Mesos, spark-redshift_2.11-2.0.1.jar, RedshiftJDBC41-1.1.17.1017.jar).

48f7-81e8-02403dbc2b57-S107): java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

@schwartzmx

schwartzmx commented Jan 10, 2017

I'm getting this error as well with Spark 2.1.0. I've also tried 3.0.0-preview1 of this library; previously I was using 2.0.0.

java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Edit: here's a fuller stack trace that may help.

17/01/09 22:45:34 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 1.0 failed 1 times, most recent failure: Lost task 5.0 in stage 1.0 (TID 6, localhost, executor driver): java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
	at com.databricks.spark.redshift.RedshiftWriter.unloadData(RedshiftWriter.scala:295)
	at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:392)
	at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:108)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
	at org.lucidhq.SFRedshiftETL.SFObject.redshiftLoad(SFObject.scala:115)
	at org.lucidhq.SFRedshiftETL.SFObject.load(SFObject.scala:256)
	at org.lucidhq.SFRedshiftETL.SFRedshiftETL$$anonfun$run$1.apply(main.scala:61)
	at org.lucidhq.SFRedshiftETL.SFRedshiftETL$$anonfun$run$1.apply(main.scala:44)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.lucidhq.SFRedshiftETL.SFRedshiftETL$.run(main.scala:44)
	at org.lucidhq.SFRedshiftETL.SFRedshiftETL$$anonfun$main$1.apply(main.scala:83)
	at org.lucidhq.SFRedshiftETL.SFRedshiftETL$$anonfun$main$1.apply(main.scala:83)
	at scala.Option.map(Option.scala:146)
	at org.lucidhq.SFRedshiftETL.SFRedshiftETL$.main(main.scala:83)
	at org.lucidhq.SFRedshiftETL.SFRedshiftETL.main(main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

@lminer

lminer commented Jan 13, 2017

@JoshRosen Any plans to make a new release soon? It seems like one is needed to use this with 2.1.0.

@elyast

elyast commented Jan 20, 2017

@JoshRosen We hit the same issue: after upgrading from Spark 2.0.2 to Spark 2.1.0, our pipeline started throwing exceptions with the same cause.

Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;

We are using spark-redshift 2.0.1 with https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.17.1017.jar

@carlos-eduardo-gb

@elyast I hit the same issue using Spark 2.1.0.

I asked this question on Stack Overflow.

Do you see the same issue with Spark 2.0.2? I haven't been able to make spark-redshift work on 2.0.2 either; any help would be appreciated.

@elyast

elyast commented Jan 20, 2017

Found the root cause: Spark 2.1 added a new method to the interface:

org.apache.spark.sql.execution.datasources.OutputWriterFactory#getFileExtension(context: TaskAttemptContext): String

which is not implemented in spark-avro, hence the AbstractMethodError.
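Concretely, an OutputWriterFactory compiled against Spark 2.0 has no body for that method, so calling it at runtime blows up. A sketch of the kind of override the Spark 2.1 interface expects follows; the actual spark-avro patch may differ, and the ".avro" extension here is assumed:

import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.sql.execution.datasources.{OutputWriter, OutputWriterFactory}
import org.apache.spark.sql.types.StructType

class PatchedAvroOutputWriterFactory extends OutputWriterFactory {

  // Existing writer construction, unchanged (elided in this sketch).
  override def newInstance(
      path: String,
      dataSchema: StructType,
      context: TaskAttemptContext): OutputWriter = ???

  // The method Spark 2.1 added; without this override, FileFormatWriter.scala:232
  // hits an AbstractMethodError when it asks the factory for a file extension.
  override def getFileExtension(context: TaskAttemptContext): String = ".avro"
}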

@apurva-sharma

Ran into the same issue with Spark 2.1.0. Is there a workaround (besides downgrading Spark)?

@elyast

elyast commented Jan 30, 2017

@apurva-sharma you can build this patch, databricks/spark-avro#206, and replace the spark-avro dependency with that custom build; at least it worked for us.

@apurva-sharma

@elyast thanks for that. I can verify that patching spark-avro as above worked for me with Spark 2.1.0.
It would be great if this were merged.

@elyast

elyast commented Jan 30, 2017

@apurva-sharma +1

@alexander-branevskiy

Looks like spark-avro was fixed. Any updates here?

@sanketvega

Any update on when this issue will be fixed?

@diegorep

^ @JoshRosen

@caeleth

caeleth commented Feb 24, 2017

At the moment this driver is completely unusable...

@hnfmr

hnfmr commented Feb 25, 2017

Fixed mine by adding this line to my sbt project's build.sbt:

dependencyOverrides += spark_avro_320

where

val spark_avro_320: ModuleID = "com.databricks" % "spark-avro_2.11" % "3.2.0"

I am using spark-redshift 3, by the way.

Hopefully this library can be actively supported in the long run; it looks like it has not been updated for several months.

@mrdmnd

mrdmnd commented Mar 9, 2017

I've tried what @hnfmr suggests, but I am still running into this issue.

@hnfmr

hnfmr commented Mar 9, 2017

@mrdmnd To be specific, I am using spark-redshift v3.0.0-preview1 and my build.sbt looks like:

lazy val app = (project in file("app"))
  .settings(commonSettings: _*)
  .settings(
    libraryDependencies += "com.databricks" % "spark-redshift_2.11" % "3.0.0-preview1",
    dependencyOverrides += "com.databricks" % "spark-avro_2.11" % "3.2.0"
  )

BTW, I am using Spark 2.1.0... hope this helps

@wafisher

@elyast Can you please describe what you did? My guess:

  1. Clone the spark-avro repo and check out the commit from that PR (post-merge).
  2. Build the jar.
  3. Use sbt to pick up this jar. (Do you know how to do this offhand? A rough sketch follows below.)

Thank you!
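One way to do step 3 is to publish the patched build to the local Ivy repository with `sbt publishLocal` and then depend on that version; the "3.2.0-SNAPSHOT" version string below is hypothetical — use whatever the patched build.sbt declares:

// build.sbt sketch (assumes the patched spark-avro was published with `sbt publishLocal`)
libraryDependencies ++= Seq(
  ("com.databricks" %% "spark-redshift" % "3.0.0-preview1").exclude("com.databricks", "spark-avro_2.11"),
  "com.databricks" %% "spark-avro" % "3.2.0-SNAPSHOT"  // hypothetical locally published version
)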

@sadowski

Also seeing this issue here. @hnfmr's fix is working for me now, but it would be nice to have this properly fixed. Spark is a popular tool and Redshift usage is only going to grow.

Exact workaround was to add the following to my build.sbt file:

// Temporary fix for: https://github.com/databricks/spark-redshift/issues/315
dependencyOverrides += "com.databricks" % "spark-avro_2.11" % "3.2.0"

@mrdmnd

mrdmnd commented Mar 22, 2017

Yeah, I had a minor typo. Can confirm that this works.

@cockroachzl

I use Zeppelin to do ETL into Redshift and encountered the same AbstractMethodError.

Configuring the Spark interpreter to exclude com.databricks:spark-avro_2.11:3.0.0 from the com.databricks:spark-redshift_2.11:2.0.1 dependency, and then adding a separate dependency on com.databricks:spark-avro_2.11:3.2.0, works for me.

Thanks a lot!
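For anyone doing the same outside Zeppelin, the equivalent exclusion expressed in sbt would look roughly like this (coordinates as mentioned above):

// Exclude the transitive spark-avro 3.0.0 pulled in by spark-redshift 2.0.1,
// and depend on spark-avro 3.2.0 explicitly instead.
libraryDependencies ++= Seq(
  ("com.databricks" % "spark-redshift_2.11" % "2.0.1").exclude("com.databricks", "spark-avro_2.11"),
  "com.databricks" % "spark-avro_2.11" % "3.2.0"
)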

@Aung-Myint-Thein

Yes! Just update or replace spark-avro_2.11-3.1.0.jar with spark-avro_2.11-3.2.0.jar and this problem should be solved.

https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11/3.2.0

@cshintov

Hi, I have the same problem.
I am using Spark 2.1.0 and have tried spark-redshift 3.0.0-preview1, 2.0.1, and 2.0.0. All of them give the same error.

java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/04/24 21:12:15 ERROR TaskSetManager: Task 1 in stage 2.0 failed 1 times; aborting job
17/04/24 21:12:15 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 202, localhost, executor driver): java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

@giaosudau

I have the same problem, and I am using code from the Spark 2.2 branch. spark-avro was already spark-avro_2.11-3.2.0.jar.

Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriter.write(Lorg/apache/spark/sql/catalyst/InternalRow;)V
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:318)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:249)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:252)

@davidzhao

Any updates on this one? It seems that the underlying dependency (spark-avro_2.11 3.2.0) has resolved this issue. Instead of having everyone depend on the workaround, could the owner release a version that depends on spark-avro 3.2.0?

@schwartzmx

It seems this issue and repo are getting stale; I would love to have this updated. @JoshRosen, would it be possible to open this up to new contributors?

@tylermichael

Any updates on this? I'm using this through PySpark and am unable to try the workarounds suggested.

@dnaumenko

Looks like this issue is going to be fixed in the next version of the spark-avro library (databricks/spark-avro#242). It was merged to master 8 days ago.

@pmatpadi

Thanks for the hint on updating the spark-avro dependency version.

I resolved this issue with the spark-submit command below in an AWS EMR environment:

spark-submit --deploy-mode cluster \
  --class <main_class> \
  --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,com.databricks:spark-avro_2.11:3.2.0,com.amazon.redshift:redshift-jdbc42:1.2.8.1005 \
  --repositories http://redshift-maven-repository.s3-website-us-east-1.amazonaws.com/release \
  s3://<path_to_my_spark_application_jar>

@marcintustin

I've updated to spark-avro 4.0.0 and I still have this issue.

@zhassanbey

I faced the same exception with Spark 2.1.1 (Scala 2.11).
The reason was that my project depended on a custom library which in turn depended on the spark-avro v4.0.0 artifact in 'compileOnly' scope, so the spark-avro dependency wasn't actually propagating to my project. After I added the spark-avro v4.0.0 dependency explicitly, the problem was resolved.
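In sbt terms, the fix amounts to declaring the dependency in the application's own build instead of relying on it to propagate from the compile-only upstream library (a sketch, assuming Scala 2.11 artifacts):

// Declare spark-avro explicitly so it lands on the runtime classpath.
libraryDependencies += "com.databricks" % "spark-avro_2.11" % "4.0.0"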

@schwartzmx

I wouldn't expect this to be properly fixed; it seems Databricks has decided not to update this library anymore outside of their own Databricks Runtime, which as far as I can tell requires you to be using their entire platform: 717a4ad#diff-04c6e90faac2675aa89e2176d2eec7d8

@Renien

Renien commented Mar 9, 2018

@hnfmr Thanks for the hint. I faced the same issue while running an ALS model on GCP and storing the output using com.databricks.spark.csv. Initially I was using com.databricks.spark.csv 1.2.0 with Spark 2.2.0 and the issue occurred. I updated to the latest version, 1.5.0, and that solved my issue.

@yggowda

yggowda commented Nov 2, 2018

I was using an older version of spark-redshift_2.11; after changing to 3.0.0-preview1 it started working.

@schwartzmx

schwartzmx commented Nov 2, 2018

If anyone is still having issues and wants to collaborate on this, I forked both this connector and spark-avro; we can get a working group around fixing these. I think this library and spark-avro are dead from an open-source support/contributor perspective (outside of the Databricks Runtime, 717a4ad#diff-04c6e90faac2675aa89e2176d2eec7d8), as the last commits are 2+ years old aside from README updates.

We would be required to adhere to the licensing (Apache 2.0), via a NOTICE file and other means.
If you're interested in collaborating or in updates, email me at schwartzmx@gmail.com.

Love the library, @databricks, but many people use this connector; I asked if it could be opened up to collaborators and didn't hear a response in over a year. It seems like it was quietly moved to closed source, which is understandable from a business perspective.

Thanks for all the initial work on this; it has helped a ton of people, and I personally have built many ETLs and data analysis tools using this connector.

Cheers,
Phil
