Apache Spark Streaming | Junhuih Notes

Features

Spark Streaming provide batch and realtime processing stream data.
- Website monitor
- Fraud detection
- AD monitor
Much easier than Apache Storm
- Spark provide high-level api
- Storm provide low-level api
Second-scale latencies
Once and Only once processing (per duration)

Overview

Divide up data stream into batches of n seconds
Process each batch in Spark as an RDD
Return results of RDD operaBons in batches

DStream is serval RDDs in a duration
Two types of DStream operations:
- Transformations: Create a new DStream from an existing one
  - map/flatMap/filter
  - reduceByKey/groupByKey/joinByKey
- Output operations: Write data, similar to RDD actions
  - print: print first 10 elements in each RDD
  - saveAsTextFiles
  - saveAsObjectFiles
  - foreachRDD(function(RDD,timestamp):xxxxx)

Running Spark Streaming

when running Spark Streaming, you need to either run the shell on cluster or locally with at least two threads
`spark-shell –master local[2] -i wordcount.scala

otherwise if you use locally with one thread, it will show following error:

15/02/27 10:49:35 WARN BlockManager: Block input-0-1425062975000 already exists on this machine; not re-adding it
15/02/27 10:49:36 WARN BlockManager: Block input-0-1425062975800 already exists on this machine; not re-adding it
15/02/27 10:49:37 WARN BlockManager: Block input-0-1425062976800 already exists on this machine; not re-adding it

Using Window or State need CheckPoint

cd to directory contains pom.xml
use mvn package to compile scala

https://databricks-training.s3.amazonaws.com/index.html

http://databricks.gitbooks.io/databricks-spark-reference-applications/content/index.html

http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

Broadcast

https://github.com/JerryLead/SparkInternals/blob/master/markdown/7-Broadcast.md

https://github.com/JerryLead/SparkInternals/tree/master/markdown