Course Duration
- 4 days
Course Benefits
Participants in this course learn:
- How the Apache Hadoop ecosystem fits in with the data processing lifecycle
- How data is distributed, stored, and processed in a Hadoop cluster
- How to write, configure, and deploy Apache Spark applications on a Hadoop cluster
- How to use the Spark shell and Spark applications to explore, process, and analyze distributed data
- How to query data using Spark SQL, DataFrames, and Datasets
- How to use Spark Streaming to process a live data stream
Course Outline
- Introduction to Apache Hadoop and the Hadoop Ecosystem
  - Apache Hadoop Overview
  - Data Processing
  - Introduction to the Hands-On Exercises
- Apache Hadoop File Storage
  - Apache Hadoop Cluster Components
  - HDFS Architecture
  - Using HDFS
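
As a taste of the HDFS material, here is a minimal sketch that lists a directory through Hadoop's `FileSystem` API; the path is a placeholder, and in class the same operations are usually driven from the `hdfs dfs` command line.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListHdfsDir {
  def main(args: Array[String]): Unit = {
    // Picks up core-site.xml / hdfs-site.xml from the classpath
    val fs = FileSystem.get(new Configuration())

    // List a (hypothetical) user directory, printing size and path of each entry
    fs.listStatus(new Path("/user/training")).foreach { s =>
      println(s"${s.getLen}\t${s.getPath}")
    }
    fs.close()
  }
}
```
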
- Distributed Processing on an Apache Hadoop Cluster
  - YARN Architecture
  - Working with YARN
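
A sketch of one way a client can ask the YARN ResourceManager about running applications, using the `YarnClient` API; configuration is assumed to come from `yarn-site.xml` on the classpath.

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

object ListYarnApps {
  def main(args: Array[String]): Unit = {
    val yarn = YarnClient.createYarnClient()
    yarn.init(new YarnConfiguration())  // reads yarn-site.xml from the classpath
    yarn.start()

    // One line per application the ResourceManager knows about
    yarn.getApplications.asScala.foreach { app =>
      println(s"${app.getApplicationId}\t${app.getName}\t${app.getYarnApplicationState}")
    }
    yarn.stop()
  }
}
```
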
- Apache Spark Basics
  - What is Apache Spark?
  - Starting the Spark Shell
  - Using the Spark Shell
  - Getting Started with Datasets and DataFrames
  - DataFrame Operations
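
The kind of first session this module builds toward, assuming the Spark shell's predefined `spark` session and a hypothetical JSON file:

```scala
// In the shell, `spark` (a SparkSession) is already defined.
// The file name is a placeholder; any JSON file on HDFS works the same way.
val people = spark.read.json("/user/training/people.json")

people.printSchema()                  // schema inferred from the data
people.select("name", "age").show(5) // a first DataFrame operation
```
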
- Working with DataFrames and Schemas
  - Creating DataFrames from Data Sources
  - Saving DataFrames to Data Sources
  - DataFrame Schemas
  - Eager and Lazy Execution
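
A sketch of reading with an explicit schema and saving to Parquet; the file paths are placeholders:

```scala
import org.apache.spark.sql.types._

// Supplying a schema avoids the extra pass Spark needs to infer one
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)
))

val people = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("/user/training/people.csv")

// Nothing is read until an action runs (lazy execution); writing is one such action
people.write.mode("overwrite").parquet("/user/training/people_parquet")
```
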
- Analyzing Data with DataFrame Queries
  - Querying DataFrames Using Column Expressions
  - Grouping and Aggregation Queries
  - Joining DataFrames
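
Illustrative queries against the hypothetical `people` DataFrame from the sketches above:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val adults = people.where(col("age") >= 18)                 // column expression
val counts = people.groupBy("age").agg(count("*").alias("n")) // grouped aggregation

// A small inline DataFrame to join against
val cities = Seq(("Alice", "Geneva"), ("Bob", "Zurich")).toDF("name", "city")
people.join(cities, "name").show()
```
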
- RDD Overview
  - RDD Overview
  - RDD Data Sources
  - Creating and Saving RDDs
  - RDD Operations
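
A minimal RDD sketch, assuming the shell's `spark` session and a placeholder input file:

```scala
// The SparkContext underlies the SparkSession and is the entry point for RDDs
val sc = spark.sparkContext

// RDDs from a text file and from a local collection
val lines = sc.textFile("/user/training/sample.txt")
val nums  = sc.parallelize(1 to 100)

println(lines.count())                      // an action triggers computation
lines.saveAsTextFile("/user/training/copy") // fails if the directory already exists
```
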
- Transforming Data with RDDs
  - Writing and Passing Transformation Functions
  - Transformation Execution
  - Converting Between RDDs and DataFrames
- Aggregating Data with Pair RDDs
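
Continuing from the `lines` RDD above, a classic pair-RDD word count and a conversion back to a DataFrame:

```scala
// Map each word to a (key, value) pair, then reduce by key
val words  = lines.flatMap(_.split("""\s+""")).filter(_.nonEmpty)
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
counts.take(10).foreach(println)

// Converting an RDD back to a DataFrame
import spark.implicits._
val countsDF = counts.toDF("word", "count")
```
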
- Querying Tables and Views with SQL
  - Querying Tables in Spark Using SQL
  - Querying Files and Views
  - The Catalog API
  - Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
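
A sketch of the SQL side, reusing the hypothetical `people` DataFrame and Parquet path from earlier:

```scala
// Register a DataFrame as a temporary view, then query it with SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19").show()

// Files can be queried directly, and the Catalog API exposes metadata
spark.sql("SELECT * FROM parquet.`/user/training/people_parquet`").show(5)
spark.catalog.listTables().show()
```
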
- Working with Datasets in Scala
  - Datasets and DataFrames
  - Creating Datasets
  - Loading and Saving Datasets
  - Dataset Operations
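
A small Dataset sketch in shell-style Scala; `Person` is an illustrative case class:

```scala
import org.apache.spark.sql.Dataset
import spark.implicits._

case class Person(name: String, age: Int)

// A typed Dataset; field access in the lambda is checked at compile time
val ds: Dataset[Person] = Seq(Person("Alice", 34), Person("Bob", 19)).toDS()
ds.filter(_.age >= 18).show()
```
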
- Writing, Configuring, and Running Spark Applications
  - Writing a Spark Application
  - Building and Running an Application
  - Application Deployment Mode
  - The Spark Application Web UI
  - Configuring Application Properties
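
A minimal self-contained application of the sort written in this module; the class name, paths, and spark-submit options are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Unlike the shell, an application creates its own SparkSession
    val spark = SparkSession.builder.appName("Word Count").getOrCreate()
    import spark.implicits._

    spark.read.textFile(args(0))
      .flatMap(_.split("""\s+"""))
      .groupBy("value").count()
      .write.mode("overwrite").csv(args(1))

    spark.stop()
  }
}

// Packaged as a JAR and submitted to YARN with something like:
//   spark-submit --class WordCount --master yarn --deploy-mode cluster \
//     wordcount.jar input_dir output_dir
```
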
- Spark Distributed Processing
  - Review: Apache Spark on a Cluster
  - RDD Partitions
  - Example: Partitioning in Queries
  - Stages and Tasks
  - Job Execution Planning
  - Example: Catalyst Execution Plan
  - Example: RDD Execution Plan
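
Two quick ways to inspect distribution and planning, reusing the placeholder Parquet path from earlier:

```scala
val df = spark.read.parquet("/user/training/people_parquet")

// Number of RDD partitions backing the DataFrame
println(df.rdd.getNumPartitions)

// explain() prints the Catalyst physical plan; shuffles mark stage boundaries
df.groupBy("age").count().explain()
```
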
- Distributed Data Persistence
  - DataFrame and Dataset Persistence
  - Persistence Storage Levels
  - Viewing Persisted RDDs
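
A persistence sketch, continuing with the `df` DataFrame from the previous example:

```scala
import org.apache.spark.storage.StorageLevel

// Persist a DataFrame that several later queries will reuse
val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()      // the first action materializes the cache

// Persisted data appears on the Storage tab of the Spark application web UI
cached.unpersist()
```
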
- Common Patterns in Spark Data Processing
  - Common Apache Spark Use Cases
  - Iterative Algorithms in Apache Spark
  - Machine Learning
  - Example: k-means
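
A k-means sketch using `spark.ml` (the in-class example may differ); the points are made up:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

// A tiny hypothetical dataset of 2-D points
val points = Seq((0.0, 0.0), (0.1, 0.2), (9.0, 9.1), (9.2, 8.9)).toDF("x", "y")

// k-means expects a vector column; VectorAssembler builds one from x and y
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")

val model = new KMeans().setK(2).fit(assembler.transform(points))
model.clusterCenters.foreach(println)
```
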
- Introduction to Structured Streaming
  - Apache Spark Streaming Overview
  - Creating Streaming DataFrames
  - Transforming DataFrames
  - Executing Streaming Queries
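
A minimal Structured Streaming sketch over a directory of arriving CSV files, reusing the hypothetical `schema` defined earlier (streaming file sources require an explicit schema):

```scala
val stream = spark.readStream.schema(schema).csv("/user/training/incoming")

val query = stream.groupBy("age").count()
  .writeStream
  .outputMode("complete")   // emit the full updated counts on each trigger
  .format("console")
  .start()

query.awaitTermination()
```
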
- Structured Streaming with Apache Kafka
  - Overview
  - Receiving Kafka Messages
  - Sending Kafka Messages
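
A sketch of receiving from and sending to Kafka with Structured Streaming; it needs the spark-sql-kafka package on the classpath, and the broker, topics, and checkpoint path are placeholders:

```scala
// Receive: subscribe to a topic; keys and values arrive as binary
val fromKafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()

val messages = fromKafka.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Send: a streaming DataFrame with a string `value` column can be written back
messages.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "processed")
  .option("checkpointLocation", "/user/training/checkpoints")
  .start()
```
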
- Aggregating and Joining Streaming DataFrames
  - Streaming Aggregation
  - Joining Streaming DataFrames
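
A windowed streaming aggregation sketch; the event stream is simulated here with Spark's built-in rate source, which emits (timestamp, value) rows:

```scala
import org.apache.spark.sql.functions._

val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .withColumn("type", (col("value") % 3).cast("string"))

// Count events per 5-minute window per type; the watermark bounds how long
// Spark waits for late data before finalizing a window
val windowed = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"), col("type"))
  .count()
```
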
- Conclusion
- Message Processing with Apache Kafka
  - What Is Apache Kafka?
  - Apache Kafka Overview
  - Scaling Apache Kafka
  - Apache Kafka Cluster Architecture
  - Apache Kafka Command Line Tools
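
Outside Spark, a bare-bones Kafka producer using the standard Java client from Scala; the broker address and topic are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProduceOne {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    // Send one (key, value) message to the "events" topic, then flush and exit
    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord("events", "key1", "hello"))
    producer.close()
  }
}
```
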
Class Materials
Each student will receive a comprehensive set of materials, including course notes and all the class examples.
Experience in the following is required for this Hadoop class:
- The ability to program in Scala or Python.
- Basic familiarity with the Linux command line.
Experience in the following would be useful for this Hadoop class:
- Basic knowledge of SQL.
Instructor-led courses are offered via a live Web connection, at client sites throughout Europe, and at our Geneva Training Center.