Course duration
- 3 days
Course Benefits
- Python essentials
- Capabilities of the Apache Spark platform and its machine learning module
- Terminology, concepts, and algorithms used in machine learning
Course Outline
- Defining Data Science
- Data Science, Machine Learning, AI?
- The Data-Related Roles
- Data Science Ecosystem
- Business Analytics vs. Data Science
- Who is a Data Scientist?
- The Break-Down of Data Science Project Activities
- Data Scientists at Work
- The Data Engineer Role
- What is Data Wrangling (Munging)?
- Examples of Data Science Projects
- Data Science Gotchas
- Summary
- Machine Learning Life-cycle Phases
- Data Analytics Pipeline
- Data Discovery Phase
- Data Harvesting Phase
- Data Priming Phase
- Data Cleansing
- Feature Engineering
- Data Logistics and Data Governance
- Exploratory Data Analysis
- Model Planning Phase
- Model Building Phase
- Communicating the Results
- Production Roll-out
- Summary
- Quick Introduction to Python Programming
- Module Overview
- Some Basic Facts about Python
- Dynamic Typing Examples
- Code Blocks and Indentation
- Importing Modules
- Lists and Tuples
- Dictionaries
- List Comprehension
- What is Functional Programming (FP)?
- Terminology: Higher-Order Functions
- A Short List of Languages that Support FP
- Lambda
- Common Higher-Order Functions in Python 3
- Summary
- Introduction to Apache Spark
- What is Apache Spark
- Where to Get Spark?
- The Spark Platform
- Spark Logo
- Common Spark Use Cases
- Languages Supported by Spark
- Running Spark on a Cluster
- The Driver Process
- Spark Applications
- Spark Shell
- The spark-submit Tool
- The spark-submit Tool Configuration
- The Executor and Worker Processes
- The Spark Application Architecture
- Interfaces with Data Storage Systems
- Limitations of Hadoop's MapReduce
- Spark vs MapReduce
- Spark as an Alternative to Apache Tez
- The Resilient Distributed Dataset (RDD)
- Datasets and DataFrames
- Spark SQL
- Spark Machine Learning Library
- GraphX
- Summary
- The Spark Shell
- The Spark Shell
- The Spark v2+ Shells
- The Spark Shell UI
- Spark Shell Options
- Getting Help
- The Spark Context (sc) and Spark Session (spark)
- The Shell Spark Context Object (sc)
- The Shell Spark Session Object (spark)
- Loading Files
- Saving Files
- Summary
- Quick Intro to Jupyter Notebooks
- Python Dev Tools and REPLs
- IPython
- Jupyter
- Jupyter Operation Modes
- Basic Edit Mode Shortcuts
- Basic Command Mode Shortcuts
- Summary
- Data Visualization in Python using matplotlib
- Data Visualization
- What is matplotlib?
- Getting Started with matplotlib
- The matplotlib.pyplot.plot() Function
- The matplotlib.pyplot.scatter() Function
- Labels and Titles
- Styles
- The matplotlib.pyplot.bar() Function
- The matplotlib.pyplot.hist() Function
- The matplotlib.pyplot.pie() Function
- The Figure Object
- The matplotlib.pyplot.subplot() Function
- Selecting a Grid Cell
- Saving Figures to a File
- Summary
- Data Science and ML Algorithms with PySpark
- In-Class Discussion
- Types of Machine Learning
- Supervised vs Unsupervised Machine Learning
- Supervised Machine Learning Algorithms
- Classification (Supervised ML) Examples
- Unsupervised Machine Learning Algorithms
- Clustering (Unsupervised ML) Examples
- Choosing the Right Algorithm
- Terminology: Observations, Features, and Targets
- Representing Observations
- Terminology: Labels
- Terminology: Continuous and Categorical Features
- Continuous Features
- Categorical Features
- Common Distance Metrics
- The Euclidean Distance
- What is a Model
- Model Evaluation
- The Classification Error Rate
- Data Split for Training and Test Data Sets
- Data Splitting in PySpark
- Hold-Out Data
- Cross-Validation Technique
- Spark ML Overview
- DataFrame-based API is the Primary Spark ML API
- Estimators, Models, and Predictors
- Descriptive Statistics
- Data Visualization and EDA
- Correlations
- Hands-on Exercise
- Feature Engineering
- Scaling of the Features
- Feature Blending (Creating Synthetic Features)
- Hands-on Exercise
- The 'One-Hot' Encoding Scheme
- Example of 'One-Hot' Encoding Scheme
- Bias-Variance (Underfitting vs Overfitting) Trade-off
- The Modeling Error Factors
- One Way to Visualize Bias and Variance
- Underfitting vs Overfitting Visualization
- Balancing Off the Bias-Variance Ratio
- Linear Model Regularization
- ML Model Tuning Visually
- Linear Model Regularization in Spark
- Regularization, Take Two
- Dimensionality Reduction
- PCA and Isomap
- The Advantages of Dimensionality Reduction
- Spark Dense and Sparse Vectors
- Labeled Point
- Python Example of Using the LabeledPoint Class
- The LIBSVM format
- LIBSVM in PySpark
- Example of Reading a File In LIBSVM Format
- Life-cycles of Machine Learning Development
- Regression Analysis
- Regression vs Correlation
- Regression vs Classification
- Simple Linear Regression Model
- Linear Regression Illustration
- Least-Squares Method (LSM)
- Gradient Descent Optimization
- Locally Weighted Linear Regression
- Regression Models in Excel
- Multiple Regression Analysis
- Evaluating Regression Model Accuracy
- The R² Model Score
- The MSE Model Score
- Hands-on Exercise
- Linear Logistic (Logit) Regression
- Interpreting Logistic Regression Results
- Hands-on Exercise
- Naive Bayes Classifier (SL)
- Naive Bayesian Probabilistic Model in a Nutshell
- Bayes Formula
- Classification of Documents with Naive Bayes
- Hands-on Exercise
- Decision Trees
- Decision Tree Terminology
- Properties of Decision Trees
- Decision Tree Classification in the Context of Information Theory
- The Simplified Decision Tree Algorithm
- Using Decision Trees
- Random Forests
- Hands-on Exercise
- Support Vector Machines (SVMs)
- Hands-on Exercise
- Unsupervised Learning Type: Clustering
- k-Means Clustering (UL)
- k-Means Clustering in a Nutshell
- k-Means Characteristics
- Global vs Local Minimum Explained
- Hands-on Exercise
- Time-Series Analysis
- Decomposing Time-Series
- A Better Algorithm or More Data?
- Summary
Class Materials
Each student will receive a comprehensive set of materials, including course notes and all the class examples.
Experience in the following is required for this Spark class:
- General knowledge of statistics and programming.
Instructor-led courses are offered via a live Web connection, at client sites throughout Europe, and at our Geneva Training Center.