Sqoop is an open source tool (originally written at Cloudera) designed to transfer data between Hadoop and relational databases (RDBMSs).

Comment

  • It’s possible to read data directly from an RDBMS in a Spark application
    • May overload the RDBMS (effectively a DDoS against it)
    • In practice - don’t do it!
    • Import the data into HDFS beforehand
  • Uses the JDBC interface, so it works with any JDBC-compatible database
  • Imports data to HDFS as delimited text files or SequenceFiles
    • Default is comma-delimited text files
  • Can be used for incremental data imports (see the sketch after this list)
    • First import retrieves all rows in a table
    • Subsequent imports retrieve just the rows created since the last import
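
An incremental import is expressed with Sqoop's --incremental, --check-column, and --last-value options. The sketch below is only an illustration: it assumes a hypothetical orders table with a monotonically increasing id column, and reuses the MySQL connection details from the examples further down.

$ sqoop import \
 --connect jdbc:mysql://localhost/customer \
 --username xxx --password xxxx \
 --table orders \
 --incremental append \
 --check-column id \
 --last-value 12345

Only rows whose id is greater than the supplied --last-value are imported; at the end of the run Sqoop reports the new high-water mark, which can be passed as --last-value to the next incremental import.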

Syntax

Use sqoop help to list the basic commands. Each command also has its own help, e.g. sqoop help import.

$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

Examples

1. List Databases

$ sqoop list-databases \
 --connect jdbc:mysql://localhost \
 --username xxx --password xxxx

2. List Tables

The connection string should include a specific database (here, customer).

$ sqoop list-tables \
 --connect jdbc:mysql://localhost/customer \
 --username xxx --password xxxx

3. Import table to HDFS

$ sqoop import \
 --connect jdbc:mysql://localhost/customer \
 --username xxx --password xxxx \
 --fields-terminated-by '\t' --table address
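
As noted in the comments above, Sqoop can also write SequenceFiles instead of delimited text. A minimal variant of the import above, assuming the same address table, adds --as-sequencefile; the --target-dir value here is just a placeholder output path.

$ sqoop import \
 --connect jdbc:mysql://localhost/customer \
 --username xxx --password xxxx \
 --table address \
 --as-sequencefile \
 --target-dir /user/hive/address_seq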

4. Verify HDFS file

$ hdfs dfs -ls address
$ hdfs dfs -tail address/part-m-00000
