How to run the NYC Taxicab analysis notebook with Azure DataBricks #236

Open
gadgetman4u opened this issue Feb 6, 2019 · 3 comments

@gadgetman4u

I would like to run the NYC Taxicab analysis notebook with Azure Databricks, but the data is in S3. How do I get the data into Azure? Should I save it to Azure Data Lake Store and then mount that store in Databricks?

Thanks.

@gadgetman4u
Author

I have already saved the neighborhoods.geojson file to Azure Data Lake Store and passed its path to dbutils.fs.mount. How do I load the neighborhoods and trips as in the code below?

val trips = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("comment", "V")
  .option("mode", "DROPMALFORMED")
  .schema(schema)
  .load("/mnt/nyctaxicabanalysis/trips/*")
  .withColumn("point",
    point($"pickup_longitude", $"pickup_latitude"))
  .cache()

val neighborhoods = sqlContext.read
  .format("magellan")
  .option("type", "geojson")
  .load("/mnt/nyctaxicabanalysis/neighborhoods/")
  .select($"polygon",
    $"metadata"("neighborhood").as("neighborhood"))
  .cache()

Thanks.

@gadgetman4u
Author

Does anybody know how I can upload the data into Azure so I can extract the neighborhoods and trips?

@guiferviz

I found this today and it may be useful to you (I'm not the author): https://lamastex.github.io/scalable-data-science/sds/2/2/db/032_NYtaxisInMagellan.html
It only works for me on the Databricks runtime with Spark 2.1.1.
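
On the mounting step asked about earlier in the thread, here is a minimal sketch, assuming the data sits in Azure Data Lake Store Gen1 and is accessed through an Azure AD service principal. The `AdlsMount` object and its two helpers are hypothetical names made up for illustration, every angle-bracket value is a placeholder, and the `dfs.adls.oauth2.*` keys are the Gen1-style settings as I remember them from the Databricks documentation, so check them against the docs for your runtime.

```scala
// Sketch only: AdlsMount is a hypothetical helper, not part of magellan or Databricks.
object AdlsMount {
  // Build the adl:// URI for a directory in an ADLS Gen1 account.
  def source(storeName: String, directory: String): String =
    s"adl://$storeName.azuredatalakestore.net/$directory"

  // OAuth settings for a service principal (assumed Gen1-style key names).
  def configs(clientId: String, credential: String, tenantId: String): Map[String, String] =
    Map(
      "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
      "dfs.adls.oauth2.client.id" -> clientId,
      "dfs.adls.oauth2.credential" -> credential,
      "dfs.adls.oauth2.refresh.url" ->
        s"https://login.microsoftonline.com/$tenantId/oauth2/token"
    )
}

// In a Databricks Scala notebook (dbutils is only defined there):
// dbutils.fs.mount(
//   source = AdlsMount.source("<store-name>", "<directory>"),
//   mountPoint = "/mnt/nyctaxicabanalysis",
//   extraConfigs = AdlsMount.configs("<client-id>", "<client-secret>", "<tenant-id>"))
```

If the mounted directory contains `trips/` and `neighborhoods/` subfolders, the `/mnt/nyctaxicabanalysis/...` paths in the snippet quoted above should resolve as written. Copying the files over from S3 is a separate step; `dbutils.fs.cp` with `recurse = true` is one option, provided the cluster has credentials for the bucket.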
