S9K Training


 

Consulting & Training by Randall Nagy


Soft9000.com - Products & Services


Apache Big Data: Hadoop and Spark

Community Discount: Available Soon
Training: Coming Soon!
Big Data is here, but it has been done before. More evolutionary than revolutionary, discover what "Big Data" has in common with traditional data-processing activities.
Beginners Welcome!
Designed for students with absolutely no programming experience, our Big Data training opportunity explains not only how to use Apache's Hadoop and Spark, but also what Big Data is, and why both professionals and non-professionals can use Apache's Big Data offerings.

Training Objectives:

  • Understand what Big Data is, and how Apache's Big Data offerings solve big-data problems.
  • Discover how to operate, as well as create, programs designed to use Hadoop's 'Core' Components.
  • Learn how to download and use a copy of Hadoop9000 - a complete MapReduce Virtual Machine.
  • Learn how to create and debug MapReduce programs both inside and outside of a Hadoop Cluster.
  • Discover how to use Hive, Pig, Zookeeper, YARN, and other Hadoop components.
  • Understand how to use browser-based user interfaces into Hadoop.
  • Review Hadoop's RESTful support, common ports, and API-level job management / tracking interfaces.

Lesson 1000: Problem Domain Overview

Normal Data
Big Data
Speed and Feed
Storage Concept Review
Problems Suitable for Big Data
Primary Keys
Speed and Feed
Distributed Computing

Lesson 1010: Hadoop Overview

Apache Hadoop
Apache Hadoop Logo
Typical Hadoop Applications
Hadoop Clusters
Hadoop Design Principles
Hadoop's Core Components
Hadoop Simple Definition
High-Level Hadoop Architecture
Hadoop-based Systems for Data Analysis
Other Hadoop Ecosystem Projects
Hadoop Caveats
Hadoop Distributions

Lesson 1020: Hortonworks VM

VM Concepts
* Appliance Creation
* Backup and Cloning
* VM Interfaces / IP Addresses
Browser Location (Ambari)
* Core Product Pages
* RESTful API Interfaces
Browser Access
* Apache Hive
* Apache Pig
* User-Defined Functions
* HDFS Overview

Lesson 1030: MapReduce with Hadoop

MapReduce Overview
Problem Domain Discussion
* Conceptual Approach
* High-Level Concepts
* Hadoop Terminology NOT Used!
Concepts are Ancient
* Architectures Come and Go
* Graphical Overview
Map Reduce Operational Review
* Maps and Keys
* Sort and Shuffle
* Reduce
* Merge/Update
Hadoop Architecture: Roles and Responsibilities
* High-Level Problems
* Speed and Feed
* System Faults
* Security
Speed and Feed
* Solution: Proximity Management
* Distribute Data
* Near Machine / Cluster
* Localize Processing
System Faults
* Problem: Hardware Failure
* Load-Balancing
* Master / Slave Computing
* Data Redundancy
* Cluster Redundancy
Load Balancing
* Worker Node Monitoring
* Heartbeats
* Task Management
* Hadoop's Resource Manager (YARN)
Security
* Pro-Active
* Known Threats (Black Duck, etc.)
* New Threats (Forensics)
* Logging and Encryption
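The map, sort-and-shuffle, and reduce phases listed above can be sketched in a few lines of plain Python - no cluster required. This is an illustrative single-machine word count, not the course's lab material:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (key, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Sort and shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is here", "big data has been done before"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"], counts["data"])  # 2 2
```

Hadoop performs the same three steps, but distributes the map and reduce work across the cluster and moves the shuffle over the network.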

Lesson 1040: MapReduce with Hadoop (Part II)

Part One: Hadoop's Core Competencies
* Most Re-Used
* MapReduce
* Ongoing Write-Only
* Ongoing Read-Only
* Framework
Networked File System (HDFS)
* File Metadata
* Name Servers
* File Content
* Data Servers
Hadoop's MapReduce
* Two Architectures
* Job and Task Trackers
* Node Managers and YARN
Topic: Addressing Known Problems
* MapReduce1 -v- MapReduce2
* Problem: Hadoop 1 Failures
* Map / Reduce Slots
* Dynamic Re-Purposing
* In-Memory Datasets
YARN as Data Operating System
* YARN's RM History
* Non-Sensitive & generic data
* YARN Timeline Store
* Application Start / Stop
Job / Users Information
* Operational Telemetry
* YARN History
* Timeline Server -v- Application History Server (AHS)
MapReduce Job Information
* Job Names
* User Information
* Application-attempts and Details
* Containers Used and Related Information
* Pantomime: Java Usage
YARN: Command Options
* List Example
* Container Management
* Automatic Start and Stop
* Processing Logic

Lesson 1050: HDFS

Hadoop as Distributed Operating System?
* File Name (Name Node)
* File Data (Data Nodes)
* File Access (Access Node)
HDFS Data-Access Review
* HDFS High Availability
* HDFS Rack-Awareness
* HDFS Security
Prior to Hadoop 2.0.0
* NameNode SPOF
* An Active/Passive Configuration
* Data-Block Placement
HDFS Security
* Hadoop Non-Secure Mode
* Hadoop Secure Mode

Lesson 1060: MapReduce API

MapReduce with Hadoop
Java's MapReduce Job
* Conceptual Review
* Key MR Components
YARN API
* Application Submission
* Container Allocation
* Container Launch
Processing API
* 3rd-Party APIs
* Class IPC
* HDFS
YARN Development
* Resource Manager Communications
* Event Callback Model
* Job Lifecycle Events
* Multi-Node Communications
* Container Events
Application Operations
* Configuration
* Monitoring
* The Mapper Class
* The Reducer Class
* The Driver Class
* Combiner Pass (Optional)
* Piping Data
Required Streaming Options
* Mapper Input Location
* Reducer Output Location
* Mapper Executable
* Reducer Executable
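In a Hadoop Streaming job the mapper and reducer are plain executables wired together by the `-input`, `-output`, `-mapper`, and `-reducer` options above: the mapper reads raw lines and writes tab-separated key/value records, Hadoop sorts them, and the reducer reads the sorted records. A minimal Python sketch of both executables (written over iterables here so the logic is testable; in a real job each would read `sys.stdin`):

```python
def streaming_mapper(lines):
    """Mapper body: read raw input lines, emit 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(sorted_lines):
    """Reducer body: records arrive sorted by key, so a running
    total per key is all that is needed."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.rsplit("\t", 1)
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

# Hadoop performs the sort between the two phases; sorted() stands in here.
mapped = sorted(streaming_mapper(["b a", "a c"]))
reduced = list(streaming_reducer(mapped))
print(reduced)  # ['a\t2', 'b\t1', 'c\t1']
```

The same two scripts could then be submitted with the streaming jar, e.g. `hadoop jar hadoop-streaming.jar -input <dir> -output <dir> -mapper mapper.py -reducer reducer.py` (paths illustrative).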

Lesson 1070: Hands-On Pig

Why Pig?
* Hadoop Scripting Language
* Pig Latin - Simple SQL-Like Scripting
* Appeals to Non-Programmers
Pig Scripts Are
* Translated Into MapReduce Jobs
* Jobs Run on Hadoop Cluster
Pig Power
* Can Do All Hadoop Data Manipulations
* Can Call User Defined Functions (UDF)
* Can Embed Pig Scripts in Other Languages
Pig Data Sources
* Structured / Unstructured Data
* HDFS Results
Pig Operations
* Pig View on Ambari
* Pig Execution Types
* Pig's Two Execution Type / Modes
Running Existing Jobs
* Distributed Operations
* Pig Runs
* Local File System
Command Line Options

Lesson 1080: Hadoop Job Control

Job Control Basics
* Job Types
* Simple Serialized Order
* Complex
* Alternate Flows
Directed Analytic Graphs
* Hadoop's JobControl Class
* Oozie XML
Understanding Oozie
* Oozie Advantages
* Integrated Hadoop
* Shell Executions
Oozie Defined
* Oozie Execution
* Ordering
* Pre Conditions
* Post Conditions
* Parallel Execution
* Events
* Errors and Timings
Oozie Workflow
* Triggers Workflow
* Load Balancing
* Clustering
* Fail-over
Oozie Jobs
* Common Job Types
* Interfaces and Callbacks
* Command-Line Interface
* Servlet Interface
* Oozie Servlet
Processing Flow
* XML Namespaces
* Hadoop-Types
* Action Tags
* Oozie Job Management
* Hadoop Job Control
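The Oozie XML mentioned above describes a job's actions, ordering, and error flow declaratively. A minimal workflow sketch - the application name, properties, and transitions are illustrative only:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="wordcount-wf">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>default</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

The `ok` / `error` transitions are Oozie's pre- and post-conditions in action: success moves the flow forward, failure routes to the kill node.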

Lesson 1090: Hive Revisited

Pig... or Hive?
* Pig Audience / Hive Audience
* Problem Domain Comparison
* Request Walk Through
Command Line for Hive Server
* HiveServer 1 and 2
* Hive (CLI1)
* Beeline (CLI 2)
HiveServer2
* Disabling Legacy Mode
* Subsets
* Beeline Commands
Files and Datasets
* HiveServer Record Columnar File
* Partitioning Data
* Table Query Improvement
Table Optimizations
* Partitioning
* Caveats
* Table Bucketing
* Clustering
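The partitioning and bucketing optimizations above can be shown in one short HiveQL sketch; the table and column names are hypothetical:

```sql
-- Partition by load date (one HDFS directory per day),
-- bucket rows by user_id into a fixed number of files.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (load_date STRING)
CLUSTERED BY (user_id) INTO 16 BUCKETS
STORED AS ORC;
```

Queries that filter on `load_date` then scan only the matching partitions, and bucketing on `user_id` speeds joins and sampling on that key.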

Lesson 2000: Apache Spark

A Short History of Spark
Spark - Change Velocity
What is Spark?
* Integrates with Hadoop
* Works with HDFS
* Uses Existing Data / File Legacies
* YARN Support
Workload Integration
* Cluster Management
* Data Parallelism
* Fault-Tolerance
Machine Learning
* AI Responses
* APIs and Algorithms
Hadoop / Clustering
* Stand-Alone
* Alternative File Systems
* Alternative Schedulers
API and Framework
* Spark Contexts? Sessions?
* Spark 1.x
* Spark 2.x
* Builder Design Pattern
Spark SQL?
* Unified Data Access
* RDDs
* Hive Compatibility
* Standard Connectivity
* Scalability
Spark Architecture
* Language API
* Data Sources
* Schema RDD
Hive Tables
* Parquet file
* Cassandra Database
* Spark Core Data Structure
* Homogeneous Text / Serialized File
Schemas, Tables, Rows and Records
* RDD as DataFrame
* Review: DataFrame Features
* Data Size
Open-Ended Formats
* Hadoop Web Services
* Web Based REST API
* HDFS
Distributable Artifacts

Lesson 2010: Spark's Resilient Datasets

RDDs In Action
* Parallelized Processing
* RDD Serialization
* Java Object Serialization
RDD Management
* RDD Transformation ...
* Create Random Samples
* RDD Containing Union
* RDD Containing Intersection
* RDD Contains Enumerated Dataset Elements
RDD Transformations ...
* Perform Reductions
* Perform Aggregations
* Ad-Hoc Sorting
* External Command / Piping
RDD Actions
* .reduce(func)
* .collect()
* .count()
* .first()
* .take(n)
* .takeSample(...)
* .takeOrdered(...)
* .saveAsTextFile(path)
* .saveAsSequenceFile(path)
* .saveAsObjectFile(path)
More RDD API Actions
* .countByKey(...)
* .foreach(func)
Asynchronous RDD
* foreachAsync for foreach
* Returns FutureAction
Spark DataFrames
* Closely Related to RDD
* Share Mapping Capabilities
* RDD Reductions
Live data (TCP Sockets, etc.)
Associated Key-Value Operations
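Each RDD action listed above has a single-machine analogue, which is a useful way to reason about them before running on a cluster. A local sketch (plain Python standing in for the Spark calls named in the comments; no cluster required):

```python
from functools import reduce as fold

data = [3, 1, 4, 1, 5, 9]

total     = fold(lambda a, b: a + b, data)  # rdd.reduce(func)
collected = list(data)                      # rdd.collect()
count     = len(data)                       # rdd.count()
first     = data[0]                         # rdd.first()
taken     = data[:2]                        # rdd.take(2)
ordered   = sorted(data)[:3]                # rdd.takeOrdered(3)

print(total, count, first, taken, ordered)
# 23 6 3 [3, 1] [1, 1, 3]
```

The difference on a real RDD is that the transformations are lazy and distributed; only the actions above force evaluation and pull results back to the driver.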

Lesson 2020: Spark Ecosystem

The Spark Platform
* Spark Core
* Spark SQL
* Spark Streaming
* MLlib (Machine Learning Library)
* GraphX
Honorable Mention
* Network Security
* Security providers
* Real-Time
Spark Streaming
* MLlib Analysis
* Detect New and Evolving Threats
Cluster Managers
Spark Application Architecture
* Context Management
* Cluster Managers
Cluster Context
* Spark Applications
* SparkContext
Cluster Manager Options ...
* Native Cluster Manager
* Multiple Clustered Resources
* MRv2 Cluster Manager
* Apache Mesos
Standalone Cluster Master
* Port 7077
* spark-submit
* Job Monitoring
* Sandbox Port 8088

Lesson 3020: Spark MapReduce

Data Storage System Interfaces
Legacies
* Hive Structured Tables, Block Storage (S3, etc.)
* Textual Formats (CSV, TDF, XML, JSON, etc.)
Queries
* HQL, Hbase
* Spark SQL
* Streaming Data
* Real-Time stdin
Spark vs MapReduce
* Hadoop's MapReduce Limitations
* Spark as Alternative to Apache Tez
Spark Streaming
* Streaming Jobs
* Language-Based API Bindings
* Micro-Batching
* Fault Tolerance
Spark SQL
* Selects, Joins, Primitive Types, and Queries
* Grouping, Ordering, Clustering and Sort
* UDFs, DDL, Views (etc.)
Spark Machine Learning Library
GraphX

Lesson 2035: Zookeeper

Zookeeper Defined
IPC Paradigm
Notification and Node Hierarchy
Ephemeral Nodes
Persistent Nodes
Node Watching
Zookeeper Services
Zookeeper API

Lesson 2100: Apache Sqoop (SQL-to-Hadoop)

Bulk Data Transfer
Sqoop Architecture
DIY Mappers?
* Serialization Formats
* Partition / Bucket Mappings
* Delimiters (etc.)
Data Importation
* Importing to HDFS
* Automatic Mappers
Hive Metastore
* Table Metadata
* HQL Partitions, Buckets, Delimiters (etc.)
Automatic Mappers
* Hive Native Delimiter
* Newline / Other Hive Delimiter
* Pre Processing Data into Hive Delimiters
Data Exportation
* Sqoop Connectors
* Plugin Components
* Sqoop's Extension Framework
Sqoop Connectors
* Specialized Connectors
* Connect External Systems
* Optimized Import / Export Facilities
Advanced Features
* Different Data Formats
* Compression
* Query-Based Tables
Download
* Installing
* Operations
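A typical Sqoop bulk import pulls one RDBMS table into HDFS with parallel map tasks. The sketch below assembles an illustrative command line in Python (the JDBC URL, table, and target directory are hypothetical):

```python
# Hypothetical Sqoop import: copy the 'orders' table into HDFS
# using four parallel mappers, with compressed output.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",  # source database (illustrative)
    "--table", "orders",                       # table to import
    "--target-dir", "/user/demo/orders",       # HDFS destination
    "--num-mappers", "4",                      # parallel map tasks
    "--compress",                              # compress the output files
]
print(" ".join(sqoop_import))
```

Each mapper imports one partition of the table's key range, which is how Sqoop turns a serial JDBC pull into a MapReduce-style bulk transfer.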