java.lang.ArrayIndexOutOfBoundsException with Google Analytics Data #49
Not sure what the underlying issue was, but I've managed to fix the error by breaking my data up into monthly sets. I had four months' worth of GA data in a single folder and was operating on all of it. The data ranged from 70MB to 150MB per day. After creating four folders for January, February, March & April and loading them up individually, the map succeeds without any issues. Once loaded, I can join the data sets together (I've only tried two so far) and work on them without issue. I'm using Spark on a pseudo-distributed Hadoop setup; I'm not sure if this makes a difference to the volume of data Spark can handle.
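A minimal sketch of that workaround, assuming Spark 1.3's generic load API and hypothetical monthly folder paths (sqlContext is predefined in the spark-shell):

```scala
// Hypothetical monthly folders; each is loaded on its own and the
// resulting DataFrames are then combined with unionAll.
val jan = sqlContext.load("/data/ga/january", "com.databricks.spark.avro")
val feb = sqlContext.load("/data/ga/february", "com.databricks.spark.avro")

// unionAll requires the two DataFrames to share the same schema.
val janFeb = jan.unionAll(feb)
janFeb.count()
```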
Do you have the stacktrace on the executor side? It should've been logged right above the driver stacktrace.
Hi, I'm running everything through the spark-shell via an SSH session on the machine itself. Here is my whole session, just in case it's useful:
If I do this:
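(The original command wasn't preserved here; a hypothetical example of loading only a small slice of the data would be along these lines.)

```scala
// Hypothetical: point the loader at a single day's export rather than
// the whole four-month folder.
val oneDay = sqlContext.load("/data/ga/2015-01-01", "com.databricks.spark.avro")
oneDay.count()
```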
Everything runs OK with the limited dataset.
What's your avro schema? I don't really know much about avro, but this exception is actually coming from the avro library itself, not spark-avro. Is it possible your file is corrupted or has a heterogeneous schema?
The avro schema is the default Google Analytics one. I loaded up each month's data and printed out the schemas. January and February are identical, but after that a field goes missing from the March and April schemas:
After February, the totalTransactionRevenue field at the bottom is not present anymore. So I assume this is causing the error and is related to Issue #31.
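A quick way to reproduce that comparison is to load each folder separately and print its inferred schema; this is a sketch assuming hypothetical folder paths:

```scala
// Load each month's folder on its own and print the inferred schema;
// a field present in one month but missing from another shows up here.
for (month <- Seq("january", "february", "march", "april")) {
  val df = sqlContext.load(s"/data/ga/$month", "com.databricks.spark.avro")
  println(s"=== $month ===")
  df.printSchema()
}
```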
This should be solved by #155
@clockfly what version is this fixed in? Is it version 3?
Marking this as fixed since it's supposedly fixed in #155, which will be included in the forthcoming 3.1.0 release (I'll make an announcement once it's out).
I'm attempting to use spark-avro with Google Analytics avro data files from one of our clients. I'm new to Spark/Scala, so my apologies if I've got anything wrong or done anything stupid. I'm using Spark 1.3.1.
I'm experimenting with the data in the spark-shell, which I'm launching like this:
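(The exact launch command was lost in formatting; for Spark 1.3.x the usual approach is to pull in spark-avro via --packages, and the artifact coordinate below is an assumption.)

```
spark-shell --packages com.databricks:spark-avro_2.10:1.0.0
```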
Then I'm running the following commands:
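(The original statements weren't captured; a sketch of the kind of access described, loading the avro files and reading a field by positional Row index, might look like this. The path and the index are assumptions.)

```scala
// Load every avro file in the folder, then pull a single field out of
// each Row by position. A high index can fail if some records carry
// fewer fields than expected.
val ga = sqlContext.load("/data/ga/all", "com.databricks.spark.avro")
val field = ga.map(row => row.get(70))
field.take(10).foreach(println)
```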
I get the following error:
I thought this might be to do with the index I was using, but the following statement works OK.
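(Again the actual statement wasn't captured; for illustration, something like this with a lower index:)

```scala
// Hypothetical lower index; this one did not trigger the exception.
ga.map(row => row.get(0)).take(10).foreach(println)
```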
So I thought that maybe some of the records might be empty or have a different number of columns, so I attempted to run the following statement to get a list of all the record lengths:
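(The statement itself was lost in formatting; assuming Row.length, something along these lines would report the field count of each record:)

```scala
// Count the fields in each record; distinct lengths would reveal rows
// with a different number of columns.
ga.map(row => row.length).distinct().collect().foreach(println)
```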
But I get a similar error:
Is this an issue with spark-avro or Spark?