
## Datasets

* GDELT public dataset - a geo-spatial-political event dataset. Ingest the first 4 million lines (1979-1984), which comes to a CSV file of about 1.1GB. The entire dataset is more than 250 million rows / 250GB. See the discussion in the README about the different modelling options.
  * To ingest:
    ```scala
    // Read the first GDELT CSV with spark-csv, inferring the schema from the header
    val csvDF = sqlContext.read.format("com.databricks.spark.csv").
                  option("header", "true").option("inferSchema", "true").
                  load(pathToGdeltCsv)

    // Write it to FiloDB as the "gdelt" dataset: partition by MonthYear,
    // segment GLOBALEVENTIDs into buckets of 10000
    import org.apache.spark.sql.SaveMode
    csvDF.write.format("filodb.spark").
      option("dataset", "gdelt").
      option("row_keys", "GLOBALEVENTID").
      option("segment_key", ":round GLOBALEVENTID 10000").
      option("partition_keys", ":getOrElse MonthYear -1").
      mode(SaveMode.Overwrite).save()
    ```
* NYC Taxi Trip and Fare Info - really interesting geospatial-temporal public time series data. It contains New York City taxi transactions and is a good example of how to handle time series / IoT data with many entities. Trip data is 2.4GB for one part, roughly 15 million rows, and there are 12 parts.
  * `select(count("medallion")).show` should return 14776615 records for trip_data_1.csv (Part 1). Counting the raw CSV takes around 25 seconds on my machine. (A sketch of re-running this count against the ingested FiloDB dataset follows after this list.)
  * Partitioning by a string prefix of medallion gives a pretty even distribution of all taxi transactions across 676 shards. Note that even with this level of sharding, reading data for one taxi/medallion for a given time range is still pretty fast.
  * Segmenting by pickup_datetime allows range queries by time.
    * 14.78 million records divided by (676 partitions * ~6 six-day segments) is roughly 4000 records per segment, which is about right.
    ```scala
    // Read one part of the NYC Taxi trip data with spark-csv
    val taxiDF = sqlContext.read.format("com.databricks.spark.csv").
                   option("header", "true").option("inferSchema", "true").
                   load(pathToNycTaxiCsv)

    // Write it to FiloDB: partition by a 2-char medallion prefix (676 shards),
    // segment into 6-day timeslices of pickup_datetime
    import org.apache.spark.sql.SaveMode
    taxiDF.write.format("filodb.spark").
      option("dataset", "nyc_taxi").
      option("row_keys", "hack_license,pickup_datetime").
      option("segment_key", ":timeslice pickup_datetime 6d").
      option("partition_keys", ":stringPrefix medallion 2").
      mode(SaveMode.Overwrite).save()
    ```
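
After ingestion, both datasets can be read back through the same `filodb.spark` data source and queried with Spark SQL. A minimal sketch, assuming the `gdelt` and `nyc_taxi` datasets were written as above (read options may vary slightly across FiloDB versions):

```scala
import org.apache.spark.sql.functions.count

// Read the ingested datasets back from FiloDB
val gdeltFiloDF = sqlContext.read.format("filodb.spark").
                    option("dataset", "gdelt").load()
val taxiFiloDF = sqlContext.read.format("filodb.spark").
                   option("dataset", "nyc_taxi").load()

// Sanity checks: counts should match what was ingested from the CSVs,
// e.g. 14776615 medallions for trip_data_1.csv
gdeltFiloDF.select(count("GLOBALEVENTID")).show()
taxiFiloDF.select(count("medallion")).show()
```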

There is a Spark Notebook to analyze the NYC Taxi dataset.

NOTE: for a stress-testing scenario, use `:stringPrefix medallion 3` as the segment key. It creates really tiny segments, a massive number of Futures, and a massive amount of (unnecessary) I/O.
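
For reference, here is a minimal sketch of that stress-test variant, assuming the same `taxiDF` as above (the `nyc_taxi_stress` dataset name is only illustrative):

```scala
// Stress-test write: segment by a 3-char medallion prefix instead of a timeslice.
// This yields very small segments, a huge number of Futures, and lots of unnecessary I/O.
taxiDF.write.format("filodb.spark").
  option("dataset", "nyc_taxi_stress").   // hypothetical dataset name
  option("row_keys", "hack_license,pickup_datetime").
  option("segment_key", ":stringPrefix medallion 3").
  option("partition_keys", ":stringPrefix medallion 2").
  mode(SaveMode.Overwrite).save()
```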

## Reading Material (Mostly for FiloDB research)