Apache Spark Overview

2015-06-11 | 分类于 bigdata

Spark Stack

Spark was created to complement, not replace, Hadoop

Uses HDFS
Runs on YARN
Integrate well with Hadoop ecosystem (flume/sqoop/hbase, in future and kafka)

spark stack

Hadoop introduced two key concepts:

Distribute data
Run computation where the data is

Spark take it to the next level and make data distributed in memory.

Spark Framework

Cluster computing
- Application processes are distributed across a cluster of worker nodes
- Managed by a single “master”
- Scalable and fault tolerant
Distribute storage
Data in memory

Common Use Case

ETL
Text mining
Index building
Graph creation and analysis
Pattern recognition
Collaborativce filtering
Prediction models
Sentiment analysis
Risk assessment

Spark VS Hadoop MapReduce

Hadoop MapReduce
- Widely used, huge investment already made
- Supports and supported by many complementary tools
- Mature, stable, wellWtested technology
- Skilled developers available
Spark
- Flexible
- Elegant
- Fast
- Changing rapidly

Hadoop Ecosystem

Data Storage: HBase – The HDFS Database

HBase: big benifit is you can modify your data
HBase: database layered on top of HDFS
- Provides interactive access to data
Stores massive amounts of data
- Petabytes+
High throughput
- Thousands of writes per second (per node)
Handles sparse data well
- No wasted space for a row with empty columns
Limited access model
- Optimized for lookup of a row by key rather than full queries
- No transactions: single row operations only

Data Analysis: Hive

Built on Hadoop MapReduce, an SQL-like access to Hadoop data tool.

Data Analysis: Impala

Open source project, developed by Cloudera, high-speed SQL query engine

High-performance SQL engine for vast amounts of data
- Similar query language to HiveQL
- 10 to 50+ Gmes faster than Hive or MapReduce
Impala runs on Hadoop clusters
- Data stored in HDFS
- Dedicated SQL engine; does not depend on Spark, MapReduce, or Hive

Data Integration: Flume

Flume: A service to move large amounts of data in real-time

Spark Streaming is integrated with Flume

Data Integration: Sqoop

Check to see the basic usage for Sqoop