
  • Spark Streaming provides batch-style and near-real-time processing of stream data.
    • Website monitoring
    • Fraud detection
    • Ad monitoring
  • Much easier to use than Apache Storm
    • Spark provides a high-level API
    • Storm provides a low-level API
  • Second-scale latencies
  • Exactly-once processing (per batch duration)


  • Divide up data stream into batches of n seconds
  • Process each batch in Spark as an RDD
  • Return results of RDD operations in batches
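
The batching steps above can be sketched roughly as follows. This is a minimal sketch, not a reference implementation: the 5-second batch duration, the object name MicroBatchSketch, and the socket source on localhost:9999 (for example fed by `nc -lk 9999`) are all assumptions made for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  // Assumed batch duration: the stream is cut into 5-second batches
  val batchInterval = Seconds(5)

  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatchSketch")
    val ssc = new StreamingContext(conf, batchInterval)

    // Hypothetical source: text lines read from a local socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each batch is processed as an RDD; print the line count of every batch
    lines.count().print()

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // block until the job is stopped
  }
}
```

With this sketch, one count is printed per batch duration, covering only the lines received in that batch.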

  • A DStream is a sequence of RDDs, one per batch duration
  • Two types of DStream operations:
    • Transformations: Create a new DStream from an existing one
      • map/flatMap/filter
      • reduceByKey/groupByKey/join
    • Output operations: Write data, similar to RDD actions
      • print: print first 10 elements in each RDD
      • saveAsTextFiles
      • saveAsObjectFiles
      • foreachRDD(func): call func(RDD, timestamp) on each RDD of the stream
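
The two kinds of operations can be combined as in the word-count sketch below. It is only an illustration under assumptions: the socket source on localhost:9999, the 5-second batch duration, the object name DStreamOpsSketch, and the tokenize helper are hypothetical and do not come from the original wordcount.scala.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

object DStreamOpsSketch {
  // Hypothetical helper: lower-case a line and split it on whitespace
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamOpsSketch")
    val ssc = new StreamingContext(conf, Seconds(5)) // assumed batch duration
    val lines = ssc.socketTextStream("localhost", 9999) // assumed source

    // Transformations: each one creates a new DStream from an existing one
    val counts = lines
      .flatMap(line => tokenize(line))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Output operation: print the first 10 elements of each RDD
    counts.print()

    // Output operation: run custom code on each RDD together with its batch time
    counts.foreachRDD { (rdd: RDD[(String, Int)], time: Time) =>
      val sample = rdd.take(3)
      println(s"batch at ${time.milliseconds} ms: ${sample.mkString(", ")}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Nothing is computed until an output operation is registered and ssc.start() is called, which mirrors how RDD transformations stay lazy until an action runs.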

Running Spark Streaming

When running Spark Streaming, either run the shell on a cluster or run it locally with at least two threads:
`spark-shell --master local[2] -i wordcount.scala`

Otherwise, if you run locally with only one thread, the receiver occupies that thread and none is left to process the batches. The log then shows warnings like the following:

15/02/27 10:49:35 WARN BlockManager: Block input-0-1425062975000 already exists on this machine; not re-adding it
15/02/27 10:49:36 WARN BlockManager: Block input-0-1425062975800 already exists on this machine; not re-adding it
15/02/27 10:49:37 WARN BlockManager: Block input-0-1425062976800 already exists on this machine; not re-adding it

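Window operations (with an inverse reduce function) and stateful operations (updateStateByKey) require checkpointing; set a checkpoint directory with ssc.checkpoint.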
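
A rough sketch of checkpointing combined with state and a window is shown below. The checkpoint path /tmp/spark-streaming-checkpoint, the socket source, the durations, and the names StatefulCountSketch and updateTotal are assumptions chosen for illustration; on a cluster the checkpoint directory would normally be on a fault-tolerant file system such as HDFS.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

object StatefulCountSketch {
  // State update: add this batch's counts for a key to its running total
  def updateTotal(newCounts: Seq[Int], total: Option[Int]): Option[Int] =
    Some(total.getOrElse(0) + newCounts.sum)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StatefulCountSketch")
    val ssc = new StreamingContext(conf, Seconds(5)) // assumed batch duration

    // Hypothetical checkpoint directory; needed by the operations below
    ssc.checkpoint("/tmp/spark-streaming-checkpoint")

    val pairs = ssc.socketTextStream("localhost", 9999) // assumed source
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Stateful: running total per word across all batches
    val totals = pairs.updateStateByKey[Int](updateTotal _)
    totals.print()

    // Windowed: counts over the last 30 seconds, recomputed every 10 seconds.
    // The inverse function subtracts the counts that slide out of the window.
    val windowed = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,
      (a: Int, b: Int) => a - b,
      Seconds(30),
      Seconds(10))
    windowed.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The window and slide durations are kept as multiples of the batch duration, which Spark Streaming requires.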

cd to the directory that contains pom.xml
Run mvn package to compile the Scala code





