PySpark Algorithms: (PDF version) (Mahmoud Parsian)


GitHub Source Code for PySpark Algorithms book:

Sample chapters:

This is an introductory book on PySpark.

This book is about PySpark, the Python API for Spark.
Apache Spark is an open source analytics engine and
cluster computing system for large-scale data
processing that makes data analytics fast to write
and fast to run. This book provides a large set of
recipes for implementing big data processing and
analytics using Spark and Python. Its goal is to
show working examples in PySpark so that you can do
your ETL and analytics more easily. You may copy and
paste the examples to build your own PySpark
applications.

This book introduces PySpark (the Python API for Spark).
You can use PySpark to tackle big datasets quickly
through simple APIs in Python. You will learn how to
express parallel tasks and computations with just a
few lines of code, and cover applications ranging from
ETL and simple batch jobs to stream processing and
machine learning.

With this book, you may dive into Spark capabilities
such as RDDs (resilient distributed datasets),
DataFrames (data as a table of rows and columns),
in-memory caching, and the interactive PySpark
shell, where you may leverage Spark’s powerful
built-in libraries, including Spark SQL, Spark
Streaming, and MLlib.

In this book, you will learn Spark’s transformations
and actions through a set of well-defined, working
examples. All examples are tested, which means that
you can copy and paste them into your own PySpark
applications. Writing PySpark applications is much
easier than writing Spark applications in Java, and
PySpark code is far less bulky than the equivalent
Java Spark code.

In this book you will learn:

* A short introduction to Spark and PySpark
* RDDs, DataFrames, and SQL through worked examples
* How to use important Spark transformations on RDDs (low-level APIs)
* How to use SQL and DataFrames
* How to read data from many different data sources
and represent it as RDDs and DataFrames
* The power of data design patterns
* The basics of monoids and how to use them in MapReduce
* The basics of GraphFrames for solving graph-related data problems
* How to implement logistic regression algorithms using PySpark
* The basics of data partitioning and reduction transformations
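
The monoid item in the list above can be illustrated with a small sketch. It uses plain Python rather than Spark and is not taken from the book. The `partitions` list and the helper names (`to_pair`, `combine`, `partition_summary`) are hypothetical; they only mimic how a reduction combines partial results per partition. The point is that a mean is not safe to combine partition by partition, while a `(sum, count)` pair with an associative combine and an identity element is.

```python
from functools import reduce

# Hypothetical data, already split into three "partitions".
partitions = [[1, 2, 3], [4, 5], [6]]

# Identity element of the (sum, count) monoid.
IDENTITY = (0, 0)

def to_pair(x):
    """Map a single value into the monoid as (sum, count)."""
    return (x, 1)

def combine(a, b):
    """Associative combine: add sums and add counts."""
    return (a[0] + b[0], a[1] + b[1])

def partition_summary(values):
    """Reduce one partition to a single (sum, count) pair."""
    return reduce(combine, map(to_pair, values), IDENTITY)

# Combine within each partition, then across partitions.
partials = [partition_summary(p) for p in partitions]
total, count = reduce(combine, partials, IDENTITY)
mean = total / count

# Averaging the per-partition means gives a different (wrong) answer
# because the partitions have different sizes.
naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)

print(partials)  # [(6, 3), (9, 2), (6, 1)]
print(mean)      # 3.5
print(naive != mean)  # True
```

In PySpark, the same `combine` function could plausibly be passed to a reduction such as `reduceByKey`, since that transformation expects an associative and commutative function.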

Table of Contents
chap01: Introduction to PySpark
chap02: Hello World
chap03: Data Abstractions
chap04: Getting Started
chap05: Transformations in Spark
chap06: Reductions in Spark
chap07: DataFrames and SQL
chap08: Spark DataSources
chap09: Logistic Regression
chap10: Movie Recommendations
chap11: Graph Algorithms
chap12: Design Patterns and Monoids
Appendix A: How to Install Spark
Appendix B: How to Use Lambda Expressions
Appendix C: Questions and Answers (50+ Q&A)

Future chapters:
chap13: FP-Growth
chap14: LDA
chap15: Linear Regression
