Course Duration
- 4 days
Course Benefits
- How the open source ecosystem of big data tools addresses challenges not met by traditional RDBMSs
- How Apache Hive and Apache Impala are used to provide SQL access to data
- How Hive and Impala syntax and data formats, including functions and subqueries, help answer questions about data
- How to create, modify, and delete tables, views, and databases; load data; and store results of queries
- How to create and use partitions and different file formats
- How to combine two or more datasets using JOIN or UNION, as appropriate
- What analytic and windowing functions are, and how to use them
- How to store and query complex or nested data structures
- How to process and analyze semi-structured and unstructured data
- Different techniques for optimizing Hive and Impala queries
- How to extend the capabilities of Hive and Impala using parameters, custom file formats and SerDes, and external scripts
- How to determine whether Hive, Impala, an RDBMS, or a mix of these is best for a given task
Course Outline
- Apache Hadoop Fundamentals
  - The Motivation for Hadoop
  - Hadoop Overview
  - Data Storage: HDFS
  - Distributed Data Processing: YARN, MapReduce, and Spark
  - Data Processing and Analysis: Hive and Impala
  - Database Integration: Sqoop
  - Other Hadoop Data Tools
  - Exercise Scenario Explanation
- Introduction to Apache Hive and Impala
  - What Is Hive?
  - What Is Impala?
  - Why Use Hive and Impala?
  - Schema and Data Storage
  - Comparing Hive and Impala to Traditional Databases
  - Use Cases
- Querying with Apache Hive and Impala
  - Databases and Tables
  - Basic Hive and Impala Query Language Syntax
  - Data Types
  - Using Hue to Execute Queries
  - Using Beeline (Hive's Shell)
  - Using the Impala Shell
- Common Operators and Built-In Functions
  - Operators
  - Scalar Functions
  - Aggregate Functions
- Data Management
  - Data Storage
  - Creating Databases and Tables
  - Loading Data
  - Altering Databases and Tables
  - Simplifying Queries with Views
  - Storing Query Results
- Data Storage and Performance
  - Partitioning Tables
  - Loading Data into Partitioned Tables
  - When to Use Partitioning
  - Choosing a File Format
  - Using Avro and Parquet File Formats
- Working with Multiple Datasets
  - UNION and Joins
  - Handling NULL Values in Joins
  - Advanced Joins
- Analytic Functions and Windowing
  - Using Analytic Functions
  - Other Analytic Functions
  - Sliding Windows
- Complex Data
  - Complex Data with Hive
  - Complex Data with Impala
- Analyzing Text
  - Using Regular Expressions with Hive and Impala
  - Processing Text Data with SerDes in Hive
  - Sentiment Analysis and n-grams in Hive
- Apache Hive Optimization
  - Understanding Query Performance
  - Bucketing
  - Hive on Spark
- Apache Impala Optimization
  - How Impala Executes Queries
  - Improving Impala Performance
- Extending Apache Hive and Impala
  - Custom SerDes and File Formats in Hive
  - Data Transformation with Custom Scripts in Hive
  - User-Defined Functions
  - Parameterized Queries
- Choosing the Best Tool for the Job
  - Comparing Hive, Impala, and Relational Databases
  - Which to Choose?
- Conclusion
- Apache Kudu
  - What Is Kudu?
  - Kudu Tables
  - Using Impala with Kudu
Class Materials
Each student will receive a comprehensive set of materials, including course notes and all the class examples.
Prerequisites
Experience in the following is required for this Hadoop class:
- Some knowledge of SQL
- Basic familiarity with the Linux command line
Instructor-led courses are offered via a live Web connection, at client sites throughout Europe, and at our Geneva Training Center.