
java.lang.ArrayIndexOutOfBoundsException with Google Analytics Data #49

Closed
theouteredge opened this issue May 14, 2015 · 8 comments

@theouteredge

I'm attempting to use spark-avro with Google Analytics Avro data files from one of our clients. I'm new to Spark/Scala, so my apologies if I've got anything wrong or done anything stupid. I'm using Spark 1.3.1.

I'm experimenting with the data in the spark-shell which I'm kicking off like this:

spark-shell --packages com.databricks:spark-avro_2.10:1.0.0

Then I'm running the following commands:

import com.databricks.spark.avro._
import scala.collection.mutable._

val gadata = sqlContext.avroFile("[client]/data")
gadata: org.apache.spark.sql.DataFrame = [visitorId: bigint, visitNumber: bigint, visitId: bigint, visitStartTime: bigint, date: string, totals: struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,transactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScreen:bigint,totalTransactionRevenue:bigint>, trafficSource: struct<referralPath:string,campaign:string,source:string,medium:string,keyword:string,adContent:string>, device: struct<browser:string,browserVersion:string,operatingSystem:string,operatingSystemVersion:string,isMobile:boolean,mobileDeviceBranding:string,flashVersion:string,javaEnabled:boolean,language:string,screenColors:string,screenResolution:string,deviceCategory:string>, geoNetwork: str...

val gaIds = gadata.map(ga => ga.getString(11)).collect()

I get the following error:

[Stage 2:=>                                                                                          (8 + 4) / 430]15/05/14 11:14:04 ERROR Executor: Exception in task 12.0 in stage 2.0 (TID 27)
java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:14:04 WARN TaskSetManager: Lost task 12.0 in stage 2.0 (TID 27, localhost): java.lang.ArrayIndexOutOfBoundsException

15/05/14 11:14:04 ERROR TaskSetManager: Task 12 in stage 2.0 failed 1 times; aborting job
15/05/14 11:14:04 WARN TaskSetManager: Lost task 11.0 in stage 2.0 (TID 26, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 10.0 in stage 2.0 (TID 25, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 9.0 in stage 2.0 (TID 24, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 13.0 in stage 2.0 (TID 28, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 2.0 failed 1 times, most recent failure: Lost task 12.0 in stage 2.0 (TID 27, localhost): java.lang.ArrayIndexOutOfBoundsException

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

I thought this might be to do with the index I was using, but the following statement works OK.

scala> gadata.first().getString(11)
res12: String = 29456309767885

So I thought that maybe some of the records might be empty or have a different number of columns, so I ran the following statement to get a list of all the record lengths:

scala> gadata.map(ga => ga.length).collect()

But I get a similar error:

[Stage 4:=>                                                                                          (8 + 4) / 430]15/05/14 11:20:04 ERROR Executor: Exception in task 12.0 in stage 4.0 (TID 42)
java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:20:04 WARN TaskSetManager: Lost task 12.0 in stage 4.0 (TID 42, localhost): java.lang.ArrayIndexOutOfBoundsException

15/05/14 11:20:04 ERROR TaskSetManager: Task 12 in stage 4.0 failed 1 times; aborting job
15/05/14 11:20:04 WARN TaskSetManager: Lost task 11.0 in stage 4.0 (TID 41, localhost): TaskKilled (killed intentionally)
15/05/14 11:20:04 ERROR Executor: Exception in task 13.0 in stage 4.0 (TID 43)
org.apache.spark.TaskKilledException
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
15/05/14 11:20:04 WARN TaskSetManager: Lost task 9.0 in stage 4.0 (TID 39, localhost): TaskKilled (killed intentionally)
15/05/14 11:20:04 WARN TaskSetManager: Lost task 10.0 in stage 4.0 (TID 40, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 4.0 failed 1 times, most recent failure: Lost task 12.0 in stage 4.0 (TID 42, localhost): java.lang.ArrayIndexOutOfBoundsException

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Is this an issue with spark-avro or Spark?

@theouteredge
Author

Not sure what the underlying issue was, but I've managed to work around the error by breaking my data up into monthly sets. I had 4 months' worth of GA data in a single folder and was operating on all of it. The data ranged from 70MB to 150MB per day.

After creating 4 folders for January, February, March & April and loading them individually, the map succeeds without any issues. Once loaded I can join the datasets together (I've only tried two so far) and work on them without issue.

I'm using Spark on a pseudo-distributed Hadoop setup; I'm not sure if this makes a difference to the volume of data Spark can handle.
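
For reference, a minimal sketch of that workaround with the Spark 1.3 / spark-avro 1.0.0 API used above (the per-month folder names here are hypothetical):

import com.databricks.spark.avro._

// Hypothetical per-month subfolders under the redacted client path
val months = Seq("january", "february", "march", "april")
val frames = months.map(m => sqlContext.avroFile(s"[client]/data/$m"))

// unionAll (the Spark 1.3 API) requires matching schemas across its
// inputs, so this only works while all months share the same schema
val all = frames.reduce(_ unionAll _)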

@rxin
Contributor

rxin commented May 16, 2015

Do you have the stacktrace on the executor side? It should've been logged right above the driver stacktrace.

@theouteredge
Author

Hi, I'm running everything through the spark-shell via an SSH session on the machine itself. Here is my whole session, just in case it's useful:

[cehuser@hadoop15 ~]$ spark-shell --packages com.databricks:spark-avro_2.10:1.0.0
Ivy Default Cache set to: /home/cehuser/.ivy2/cache
The jars for the packages stored in: /home/cehuser/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark-1.3.1/lib/spark-assembly-1.3.1-hadoop2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-avro_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found com.databricks#spark-avro_2.10;1.0.0 in central
        found org.apache.avro#avro;1.7.6 in central
        found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
        found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central
        found com.thoughtworks.paranamer#paranamer;2.3 in central
        found org.xerial.snappy#snappy-java;1.0.5 in central
        found org.apache.commons#commons-compress;1.4.1 in central
        found org.tukaani#xz;1.0 in central
        found org.slf4j#slf4j-api;1.6.4 in central
:: resolution report :: resolve 518ms :: artifacts dl 25ms
        :: modules in use:
        com.databricks#spark-avro_2.10;1.0.0 from central in [default]
        com.thoughtworks.paranamer#paranamer;2.3 from central in [default]
        org.apache.avro#avro;1.7.6 from central in [default]
        org.apache.commons#commons-compress;1.4.1 from central in [default]
        org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
        org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
        org.slf4j#slf4j-api;1.6.4 from central in [default]
        org.tukaani#xz;1.0 from central in [default]
        org.xerial.snappy#snappy-java;1.0.5 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   9   |   0   |   0   |   0   ||   9   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 9 already retrieved (0kB/14ms)
15/05/16 11:50:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> import com.databricks.spark.avro._
import com.databricks.spark.avro._

scala> val gadata = sqlContext.avroFile("[client]/raw/ga")
15/05/16 11:50:33 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/usr/lib/spark/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/usr/lib/spark-1.3.1/lib/datanucleus-api-jdo-3.2.6.jar."
15/05/16 11:50:33 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/usr/lib/spark/lib/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/usr/lib/spark-1.3.1/lib/datanucleus-core-3.2.10.jar."
15/05/16 11:50:33 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/usr/lib/spark-1.3.1/lib/datanucleus-rdbms-3.2.9.jar."
15/05/16 11:50:33 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/05/16 11:50:33 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
gadata: org.apache.spark.sql.DataFrame = [visitorId: bigint, visitNumber: bigint, visitId: bigint, visitStartTime: bigint, date: string, totals: struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,transactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScreen:bigint,totalTransactionRevenue:bigint>, trafficSource: struct<referralPath:string,campaign:string,source:string,medium:string,keyword:string,adContent:string>, device: struct<browser:string,browserVersion:string,operatingSystem:string,operatingSystemVersion:string,isMobile:boolean,mobileDeviceBranding:string,flashVersion:string,javaEnabled:boolean,language:string,screenColors:string,screenResolution:string,deviceCategory:string>, geoNetwork: str...
scala>

scala> gadata.map(ga => ga.getString(11))
res0: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at map at DataFrame.scala:848

scala> res0.collect
[Stage 0:=>                                                                                          (9 + 4) / 430]15/05/16 11:51:38 ERROR Executor: Exception in task 12.0 in stage 0.0 (TID 12)
java.lang.ArrayIndexOutOfBoundsException: 11
        at org.apache.avro.generic.GenericData$Record.get(GenericData.java:135)
        at com.databricks.spark.avro.AvroRelation$$anonfun$com$databricks$spark$avro$AvroRelation$$createConverter$5.apply(AvroRelation.scala:155)
        at com.databricks.spark.avro.AvroRelation$$anonfun$com$databricks$spark$avro$AvroRelation$$createConverter$5.apply(AvroRelation.scala:155)
        at com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1.apply(AvroRelation.scala:83)
        at com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1.apply(AvroRelation.scala:83)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
        at scala.collection.AbstractIterator.to(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
        at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
15/05/16 11:51:38 WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, localhost): java.lang.ArrayIndexOutOfBoundsException: 11
        at org.apache.avro.generic.GenericData$Record.get(GenericData.java:135)
        at com.databricks.spark.avro.AvroRelation$$anonfun$com$databricks$spark$avro$AvroRelation$$createConverter$5.apply(AvroRelation.scala:155)
        at com.databricks.spark.avro.AvroRelation$$anonfun$com$databricks$spark$avro$AvroRelation$$createConverter$5.apply(AvroRelation.scala:155)
        at com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1.apply(AvroRelation.scala:83)
        at com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1.apply(AvroRelation.scala:83)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
        at scala.collection.AbstractIterator.to(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
        at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

15/05/16 11:51:38 ERROR TaskSetManager: Task 12 in stage 0.0 failed 1 times; aborting job
15/05/16 11:51:38 WARN TaskSetManager: Lost task 10.0 in stage 0.0 (TID 10, localhost): TaskKilled (killed intentionally)
15/05/16 11:51:38 WARN TaskSetManager: Lost task 11.0 in stage 0.0 (TID 11, localhost): TaskKilled (killed intentionally)
15/05/16 11:51:38 WARN TaskSetManager: Lost task 9.0 in stage 0.0 (TID 9, localhost): TaskKilled (killed intentionally)
15/05/16 11:51:38 WARN TaskSetManager: Lost task 13.0 in stage 0.0 (TID 13, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 0.0 failed 1 times, most recent failure: Lost task 12.0 in stage 0.0 (TID 12, localhost): java.lang.ArrayIndexOutOfBoundsException: 11
        at org.apache.avro.generic.GenericData$Record.get(GenericData.java:135)
        at com.databricks.spark.avro.AvroRelation$$anonfun$com$databricks$spark$avro$AvroRelation$$createConverter$5.apply(AvroRelation.scala:155)
        at com.databricks.spark.avro.AvroRelation$$anonfun$com$databricks$spark$avro$AvroRelation$$createConverter$5.apply(AvroRelation.scala:155)
        at com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1.apply(AvroRelation.scala:83)
        at com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1.apply(AvroRelation.scala:83)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
        at scala.collection.AbstractIterator.to(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
        at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

If I do this:

scala> val gadata = sqlContext.avroFile("[client]/raw/ga/march")

everything runs OK with the limited dataset.

@rxin
Contributor

rxin commented May 17, 2015

What's your Avro schema? I don't really know much about Avro, but this exception is actually coming from the Avro library itself, not spark-avro. Is it possible your files are corrupted or have heterogeneous schemas?
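
One way to check, as a sketch: the Avro library already on the classpath can read the writer schema embedded in each container file directly. The local directory below is hypothetical; it assumes copies of the .avro files have been pulled down somewhere readable.

import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// Print the field count of each file's embedded writer schema so
// any file with a diverging schema stands out
val dir = new File("/tmp/ga-files")  // hypothetical local copy of the data
for (f <- dir.listFiles if f.getName.endsWith(".avro")) {
  val reader = new DataFileReader[GenericRecord](f, new GenericDatumReader[GenericRecord]())
  println(s"${f.getName}: ${reader.getSchema.getFields.size} fields")
  reader.close()
}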

@theouteredge
Author

The Avro schema is the default Google Analytics one. I loaded up each month's data and printed out the schemas. January and February are identical, but after that a field goes walkabout in March and April's schemas:

root
 |-- visitorId: long (nullable = true)
 |-- visitNumber: long (nullable = true)
 |-- visitId: long (nullable = true)
 |-- visitStartTime: long (nullable = true)
 |-- date: string (nullable = true)
 |-- totals: struct (nullable = true)
 |    |-- visits: long (nullable = true)
 |    |-- hits: long (nullable = true)
 |    |-- pageviews: long (nullable = true)
 |    |-- timeOnSite: long (nullable = true)
 |    |-- bounces: long (nullable = true)
 |    |-- transactions: long (nullable = true)
 |    |-- transactionRevenue: long (nullable = true)
 |    |-- newVisits: long (nullable = true)
 |    |-- screenviews: long (nullable = true)
 |    |-- uniqueScreenviews: long (nullable = true)
 |    |-- timeOnScreen: long (nullable = true)
 |    |-- totalTransactionRevenue: long (nullable = true)
(snipped)

After February, the totalTransactionRevenue field at the bottom (inside the totals struct) is no longer present. So I assume this is what's causing the error, and that it's related to issue #31.
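
A sketch of confirming that programmatically, assuming the hypothetical per-month folders mentioned earlier; it diffs the nested totals struct rather than the top-level columns, since that's where the field disappears:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Names of the fields inside a DataFrame's `totals` struct column
def totalsFields(df: DataFrame): Set[String] =
  df.schema.fields.find(_.name == "totals").get
    .dataType.asInstanceOf[StructType].fields.map(_.name).toSet

val jan = sqlContext.avroFile("[client]/raw/ga/january")  // hypothetical folder name
val mar = sqlContext.avroFile("[client]/raw/ga/march")
println(totalsFields(jan) -- totalsFields(mar))  // expect Set(totalTransactionRevenue)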

@clockfly
Contributor

This should be solved by #155.

@theouteredge
Author

@clockfly what version is this fixed in? Is it version 3?

JoshRosen added this to the 3.1.0 milestone Nov 21, 2016
@JoshRosen
Contributor

Marking this as fixed since it's supposedly fixed in #155, which will be included in the forthcoming 3.1.0 release (I'll make an announcement once it's out).
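
For anyone picking up the fix later, a sketch of the upgrade, assuming a Spark 2.x cluster (the spark-avro 3.x line targets Spark 2.0+) and the Scala 2.11 build:

spark-shell --packages com.databricks:spark-avro_2.11:3.1.0

import com.databricks.spark.avro._

// In spark-avro 3.x, reads go through the DataFrameReader; `spark` is
// the SparkSession the Spark 2.x shell provides
val gadata = spark.read.avro("[client]/raw/ga")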
