Spark Stack

Spark was created to complement, not replace, Hadoop

  • Uses HDFS
  • Runs on YARN
  • Integrates well with the Hadoop ecosystem (Flume, Sqoop, HBase, and, in the future, Kafka)


Hadoop introduced two key concepts:

  • Distribute data
  • Run computation where the data is

Spark takes this to the next level by keeping the distributed data in memory.
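The two ideas above, plus in-memory reuse, can be sketched in plain Python. This is only an illustration of the concept, not Spark's API: the partitions stand in for data blocks on worker nodes, and the names `partition`, `count_words_locally`, and `merge` are made up for this example.

```python
from collections import Counter

def partition(records, num_partitions):
    """Distribute records round-robin across partitions (stand-ins for nodes)."""
    parts = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        parts[i % num_partitions].append(record)
    return parts

def count_words_locally(part):
    """Runs "where the data is": each partition is processed on its own."""
    counts = Counter()
    for line in part:
        counts.update(line.split())
    return counts

def merge(partials):
    """Only the small per-partition results are combined centrally."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

lines = ["spark runs on yarn", "spark uses hdfs", "hadoop uses hdfs"]

# The partitioned data stays in memory, so a second computation
# can reuse it without reading the source again.
cached_partitions = partition(lines, num_partitions=2)

word_counts = merge(count_words_locally(p) for p in cached_partitions)
line_lengths = [len(line) for p in cached_partitions for line in p]
```

Both `word_counts` and `line_lengths` are computed from the same `cached_partitions`, which is roughly the benefit of in-memory data when several computations run over one dataset.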

Spark Framework

  • Cluster computing
    • Application processes are distributed across a cluster of worker nodes
    • Managed by a single “master”
    • Scalable and fault tolerant
  • Distributed storage
  • Data in memory

Common Use Cases

  • ETL
  • Text mining
  • Index building
  • Graph creation and analysis
  • Pattern recognition
  • Collaborative filtering
  • Prediction models
  • Sentiment analysis
  • Risk assessment

Spark vs. Hadoop MapReduce

  • Hadoop MapReduce
    • Widely used, huge investment already made
    • Supports and supported by many complementary tools
    • Mature, stable, well-tested technology
    • Skilled developers available
  • Spark
    • Flexible
    • Elegant
    • Fast
    • Changing rapidly

Hadoop Ecosystem

Data Storage: HBase – The HDFS Database

  • HBase: the big benefit is that you can modify your data
  • HBase: database layered on top of HDFS
    • Provides interactive access to data
  • Stores massive amounts of data
    • Petabytes+
  • High throughput
    • Thousands of writes per second (per node)
  • Handles sparse data well
    • No wasted space for a row with empty columns
  • Limited access model
    • Optimized for lookup of a row by key rather than full queries
    • No transactions: single row operations only
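The access model described above can be illustrated with a toy in-memory table. This is a minimal sketch and not the HBase client API: the class `SparseTable`, its methods, and the sample rows are hypothetical; only the `family:qualifier` column naming follows HBase convention.

```python
class SparseTable:
    """Toy in-memory model of a sparse, row-keyed table (not the HBase API)."""

    def __init__(self):
        self._rows = {}  # row key -> {column name: value}

    def put(self, row_key, column, value):
        # Single-row operation: only one row is touched per call.
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        # Lookup by key is the fast path; returns a copy of the stored cells.
        return dict(self._rows.get(row_key, {}))

    def scan_where(self, column, value):
        # Anything that is not a key lookup has to walk every row.
        return sorted(k for k, cells in self._rows.items()
                      if cells.get(column) == value)

table = SparseTable()
table.put("user1", "info:name", "Ada")
table.put("user1", "info:city", "London")
table.put("user2", "info:name", "Linus")  # no city cell is stored for user2
table.put("user1", "info:city", "Paris")  # existing data can be modified
```

Here `table.get("user2")` holds only the one cell that was written, which mirrors the "no wasted space" point, while `scan_where` shows why anything other than a key lookup is the slow path.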

Data Analysis: Hive

Hive is a tool built on Hadoop MapReduce that provides SQL-like access to data stored in Hadoop.
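To give a rough feel for how an SQL-like query can run as MapReduce, the sketch below evaluates the equivalent of `SELECT page, COUNT(*) FROM page_views GROUP BY page` as map, shuffle, and reduce steps in plain Python. It is a conceptual illustration only, not how Hive actually plans queries; the `page_views` data and the function names are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical table rows: (page, user)
page_views = [("home", "ann"), ("docs", "bob"), ("home", "cat"), ("home", "ann")]

def map_phase(rows):
    # Emit the grouping key with a count of 1 for every row.
    return [(page, 1) for page, _user in rows]

def shuffle_phase(pairs):
    # GROUP BY page: bring all pairs with the same key together.
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

def reduce_phase(grouped):
    # COUNT(*): sum the values collected for each key.
    return {key: sum(values) for key, values in grouped}

result = reduce_phase(shuffle_phase(map_phase(page_views)))
```

With the sample rows, `result` maps each page to its number of views, the same answer the SQL query would give.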

Data Analysis: Impala

Impala is an open source, high-speed SQL query engine developed by Cloudera.

  • High-performance SQL engine for vast amounts of data
    • Similar query language to HiveQL
    • 10 to 50+ times faster than Hive or MapReduce
  • Impala runs on Hadoop clusters
    • Data stored in HDFS
    • Dedicated SQL engine; does not depend on Spark, MapReduce, or Hive

Data Integration: Flume

Flume: A service to move large amounts of data in real-time

Spark Streaming is integrated with Flume

Data Integration: Sqoop

Check to see the basic usage for Sqoop