Sqoop is an open source tool (originally written at Cloudera) designed to transfer data between Hadoop and relational databases (RDBMSs).

Comment

  • It’s possible to read data directly from an RDBMS in a Spark application
    • May overload the RDBMS (effectively a DDoS against it)
    • In practice - don’t do it!
    • Import the data into HDFS beforehand
  • Uses the JDBC interface, so it works with any JDBC-compatible database
  • Imports data to HDFS as delimited text files or SequenceFiles
    • Default is comma-delimited text files
  • Can be used for incremental data imports (see the sketch after this list)
    • First import retrieves all rows in a table
    • Subsequent imports retrieve just the rows created since the last import
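
An incremental import is expressed with Sqoop's --incremental, --check-column, and --last-value options. The sketch below is only an illustration: it assumes a hypothetical orders table with a monotonically increasing id column, and reuses the MySQL connection details from the examples further down.

$ sqoop import \
 --connect jdbc:mysql://localhost/customer \
 --username xxx --password xxxx \
 --table orders \
 --incremental append \
 --check-column id \
 --last-value 12345

Only rows whose id is greater than the supplied --last-value are imported; at the end of the run Sqoop reports the new high-water mark, which can be passed as --last-value to the next incremental import.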

Syntax

Use sqoop help to list the basic commands. Each command also has its own help, e.g. sqoop help import.

$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

Examples

1. List Databases

$ sqoop list-databases \
 --connect jdbc:mysql://localhost \
 --username xxx --password xxxx

2. List Tables

The connection string should include a specific database (here, customer).

$ sqoop list-tables \
 --connect jdbc:mysql://localhost/customer \
 --username xxx --password xxxx

3. Import table to HDFS

$ sqoop import \
 --connect jdbc:mysql://localhost/customer \
 --username xxx --password xxxx \
 --fields-terminated-by '\t' --table address
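
As noted in the comments above, Sqoop can also write SequenceFiles instead of delimited text. A minimal variant of the import above, assuming the same address table, adds --as-sequencefile; the --target-dir value here is just a placeholder output path.

$ sqoop import \
 --connect jdbc:mysql://localhost/customer \
 --username xxx --password xxxx \
 --table address \
 --as-sequencefile \
 --target-dir /user/hive/address_seq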

4. Verify HDFS file

$ hdfs dfs -ls address
$ hdfs dfs -tail address/part-m-00000
