
PySpark Tutorial

PySpark is the Python API for Apache Spark. It lets you use Python to perform computations on large datasets or simply to analyze them.

Install PySpark

pip install pyspark
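A quick way to check that the installation worked is to start a local SparkSession and print the Spark version. This is a minimal sketch; the app name is arbitrary.

```python
from pyspark.sql import SparkSession

# Start a local SparkSession using all available cores.
spark = (SparkSession.builder
         .appName("pyspark-check")   # arbitrary application name
         .master("local[*]")
         .getOrCreate())

print(spark.version)  # prints the installed Spark version
spark.stop()
```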

Distributed Processing Power of PySpark

Key Features of PySpark

Real-time computations:

Because PySpark processes data in memory, it delivers low-latency computations.

Polyglot:

Spark is compatible with several languages, such as Scala, Java, Python, and R, which makes it one of the most preferred frameworks for processing huge datasets.

Caching and disk persistence:

The framework provides powerful caching and configurable disk persistence through storage levels, as shown in the sketch after this list.

Fast processing:

The PySpark framework is considerably faster than traditional Big Data processing frameworks.

Works well with RDDs:

Python is dynamically typed, which helps when working with RDDs (Resilient Distributed Datasets).
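Here is a small sketch of caching and disk persistence. The input file name ("data.txt") is a hypothetical example; `persist()` with `MEMORY_AND_DISK` keeps partitions in memory and spills to disk when memory runs out.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("caching-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.textFile("data.txt")              # hypothetical input file
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # cache in memory, spill to disk if needed

print(rdd.count())  # first action materializes and caches the RDD
print(rdd.count())  # second action is served from the cache
rdd.unpersist()     # release the cached partitions
```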

RDDs (Resilient Distributed Datasets)

RDDs are immutable collections of objects. Since we are using PySpark, these objects can be of multiple Python types. This becomes clearer in the sketch below.
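A minimal sketch of creating RDDs with `parallelize()`. Note that transformations such as `map()` return a new RDD rather than modifying the original, because RDDs are immutable.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An RDD holding Python objects of mixed types.
mixed = sc.parallelize([1, "two", 3.0, (4, "four")])
print(mixed.collect())

# map() produces a new RDD; the source RDD is unchanged.
numbers = sc.parallelize([1, 2, 3, 4])
doubled = numbers.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8]
```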

STEPS:

1. Reading the data
2. Cleaning the data
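The sketch below walks through both steps with the DataFrame API. The file name ("people.csv") and the column name ("age") are hypothetical placeholders for your own dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("read-clean").getOrCreate()

# Step 1: reading the data into a DataFrame.
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# Step 2: cleaning the data.
clean = (df.dropDuplicates()             # remove duplicate rows
           .dropna(subset=["age"])       # drop rows where "age" is null
           .filter(F.col("age") >= 0))   # keep only plausible ages

clean.show()
```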