Topics
Unit I: Introduction to Big Data and Hadoop
- Types of Digital Data:
- Structured Data
- Semi-structured Data
- Unstructured Data
- Introduction to Big Data:
- Definition and Characteristics of Big Data
- Volume, Velocity, and Variety of Big Data
- Sources and Applications of Big Data
- Big Data Analytics:
- Definition and Process of Big Data Analytics
- Tools and Techniques for Big Data Analytics
- Applications of Big Data Analytics
- Apache Hadoop:
- Overview of Hadoop Architecture
- Hadoop Core Components - HDFS, MapReduce, YARN
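To make the three core components concrete from a client's point of view, the minimal sketch below shows how a Hadoop Configuration points at HDFS for storage and at YARN for resource management, with MapReduce submitted as YARN applications. The host names are placeholders, not real cluster addresses.

```java
import org.apache.hadoop.conf.Configuration;

public class ClusterView {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // HDFS: the default filesystem clients read from and write to.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");      // placeholder host

        // YARN: the ResourceManager that schedules containers for jobs.
        conf.set("yarn.resourcemanager.hostname", "rm-host");       // placeholder host

        // MapReduce: run on YARN rather than in local mode.
        conf.set("mapreduce.framework.name", "yarn");

        System.out.println("Default FS: " + conf.get("fs.defaultFS"));
    }
}
```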
- Hadoop Streaming:
- Concept and Applications of Hadoop Streaming
- Processing Data with Hadoop Streaming
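Hadoop Streaming drives arbitrary executables through a line-oriented contract: the mapper reads raw input lines from stdin and writes key-TAB-value pairs to stdout, and the reducer receives the sorted pairs the same way. The sketch below is a word-count mapper obeying that contract; it is written in Java only to keep this outline to one language (streaming mappers are more often shell or Python scripts, and a Java class like this would need to be wrapped in a launchable command).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Streaming mapper: reads text lines from stdin and emits "word<TAB>1"
// pairs on stdout. Hadoop Streaming handles the shuffle and sort between
// this process and the reducer executable.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}
```

Such a mapper would typically be submitted with the streaming jar, roughly `hadoop jar hadoop-streaming-*.jar -input ... -output ... -mapper ... -reducer ...` (paths illustrative).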
- IBM Big Data Strategy:
- IBM's Approach to Big Data Analytics
- IBM Big Data Solutions and Products
- InfoSphere BigInsights and BigSheets:
- Overview of InfoSphere BigInsights and BigSheets
- Features and Benefits of InfoSphere BigInsights and BigSheets

Unit II: HDFS (Hadoop Distributed File System)
- The Design of HDFS:
- HDFS Architecture and Components
- Data Storage and Replication in HDFS
- HDFS Concepts:
- NameNode, DataNode, and Block
- HDFS File System Interface
- Data Flow and Block Management
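One way to see NameNode, DataNodes, blocks, and replication from the client side is to ask the NameNode for a file's block metadata. A minimal sketch, assuming placeholder cluster address and file path:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode answers metadata queries; DataNodes hold the block replicas.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

        Path file = new Path("/data/sample.txt");   // placeholder path
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size (bytes): " + status.getBlockSize());

        // Each BlockLocation lists the DataNodes holding one block's replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " on hosts " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```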
- Command Line Interface:
- Hadoop Commands for HDFS Operations
- Navigating HDFS Filesystem
- Managing Data with Hadoop Commands
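The `hdfs dfs` / `hadoop fs` commands (for example `-mkdir`, `-put`, `-ls`) are implemented by the FsShell class, so the same operations can also be driven from Java. A minimal illustrative sketch, with placeholder cluster address and paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class ShellDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder cluster

        FsShell shell = new FsShell(conf);

        // Equivalent of: hdfs dfs -mkdir -p /user/demo
        ToolRunner.run(shell, new String[] {"-mkdir", "-p", "/user/demo"});

        // Equivalent of: hdfs dfs -ls /user
        ToolRunner.run(shell, new String[] {"-ls", "/user"});
    }
}
```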
- Hadoop File System Interfaces:
- Java API for HDFS
- Working with HDFS Files in Java
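A minimal sketch of the FileSystem API named above, writing a small text file to HDFS and reading it back; the cluster URI and path are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
        Path path = new Path("/user/demo/greeting.txt");   // placeholder path

        // Write: create() returns a stream backed by the DataNode write pipeline.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello from the HDFS Java API\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: open() streams the file's blocks back from the DataNodes.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```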
- Data Ingest with Flume and Sqoop:
- Introduction to Flume and Sqoop
- Data Collection and Ingestion with Flume and Sqoop
- Hadoop Archives:
- Creating and Managing Hadoop Archives
- Archiving and Retrieving Data with Hadoop
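A Hadoop Archive (HAR) is created with the `hadoop archive -archiveName <name>.har -p <parent> <dirs> <target>` command and then read back through the `har://` filesystem scheme. A minimal read-side sketch, with placeholder archive and file paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder cluster

        // An archive created earlier is exposed as a read-only filesystem
        // under the har:// scheme (placeholder archive path below).
        Path archive = new Path("har:///user/demo/archives/logs.har");
        Path inHar = new Path("har:///user/demo/archives/logs.har/part-00000");

        FileSystem harFs = archive.getFileSystem(conf);

        // Listing and reading work just like ordinary HDFS paths.
        for (FileStatus status : harFs.listStatus(archive)) {
            System.out.println(status.getPath());
        }
        try (FSDataInputStream in = harFs.open(inHar)) {
            System.out.println("first byte: " + in.read());
        }
    }
}
```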
- Hadoop I/O:
- Compression Techniques in HDFS
- Serialization and Avro
- File-Based Data Structures in HDFS
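Of the file-based data structures listed above, the SequenceFile is the simplest to show in code, and it also demonstrates built-in compression. A minimal sketch that writes a block-compressed SequenceFile of Text/IntWritable pairs and reads it back (the path is a placeholder; with a default Configuration it lands on the local filesystem, on a configured cluster it would go to HDFS; Avro follows a similar pattern with its own writer and reader classes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("pairs.seq");   // placeholder path

        // Write a block-compressed SequenceFile of Text -> IntWritable pairs.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            for (int i = 0; i < 5; i++) {
                writer.append(new Text("key-" + i), new IntWritable(i));
            }
        }

        // Read the pairs back; keys and values are deserialized Writables.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            Text key = new Text();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                System.out.println(key + " = " + value);
            }
        }
    }
}
```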
Unit III: MapReduce
- Anatomy of a MapReduce Job Run:
- MapReduce Job Flow and Execution
- Job Configuration and Parameters
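The usual way to see a job run end to end is the word-count example: the client configures a Job, submits it, and the framework runs map tasks, shuffles and sorts their output, and runs reduce tasks. A minimal sketch using the standard org.apache.hadoop.mapreduce API, with input and output paths taken from the command line:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: one call per input line, emitting (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: receives each word with all its counts after shuffle and sort.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    // Driver: configures the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion submits the job and polls until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```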
- Failures, Job Scheduling, Shuffle, and Sort:
- Handling Failures in MapReduce Jobs
- Job Scheduling and Resource Management
- Shuffle and Sort in MapReduce
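Failure handling, scheduling, and the shuffle are mostly controlled through job configuration. The sketch below shows the relevant knobs; the property names are standard MapReduce settings, but the values are illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class JobTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Failure handling: how many times a failed task attempt is retried
        // before the whole job is declared failed.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Job scheduling: the queue the job is submitted to.
        conf.set("mapreduce.job.queuename", "default");

        // Shuffle and sort: size of the in-memory sort buffer on the map side.
        conf.setInt("mapreduce.task.io.sort.mb", 256);

        Job job = Job.getInstance(conf, "tuned job");
        // The partitioner decides which reducer each map output key is shuffled to;
        // the reducer count decides how many sorted partitions are produced.
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(4);

        System.out.println("Configured " + job.getJobName());
    }
}
```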