Topics
Unit I: Introduction to Big Data and Hadoop
- Types of Digital Data:
- Structured Data
- Semi-structured Data
- Unstructured Data
- Introduction to Big Data:
- Definition and Characteristics of Big Data
- Volume, Velocity, and Variety of Big Data
- Sources and Applications of Big Data
- Big Data Analytics:
- Definition and Process of Big Data Analytics
- Tools and Techniques for Big Data Analytics
- Applications of Big Data Analytics
- Apache Hadoop:
- Overview of Hadoop Architecture
- Hadoop Core Components - HDFS, MapReduce, YARN
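To make the three core components concrete from a client's point of view, the minimal sketch below shows how a Hadoop Configuration points at HDFS for storage and at YARN for resource management, with MapReduce submitted as YARN applications. The host names are placeholders, not real cluster addresses.

```java
import org.apache.hadoop.conf.Configuration;

public class ClusterView {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // HDFS: the default filesystem clients read from and write to.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");      // placeholder host

        // YARN: the ResourceManager that schedules containers for jobs.
        conf.set("yarn.resourcemanager.hostname", "rm-host");       // placeholder host

        // MapReduce: run on YARN rather than in local mode.
        conf.set("mapreduce.framework.name", "yarn");

        System.out.println("Default FS: " + conf.get("fs.defaultFS"));
    }
}
```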
- Hadoop Streaming:
- Concept and Applications of Hadoop Streaming
- Processing Data with Hadoop Streaming
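Hadoop Streaming drives arbitrary executables through a line-oriented contract: the mapper reads raw input lines from stdin and writes key-TAB-value pairs to stdout, and the reducer receives the sorted pairs the same way. The sketch below is a word-count mapper obeying that contract; it is written in Java only to keep this outline to one language (streaming mappers are more often shell or Python scripts, and a Java class like this would need to be wrapped in a launchable command).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Streaming mapper: reads text lines from stdin and emits "word<TAB>1"
// pairs on stdout. Hadoop Streaming handles the shuffle and sort between
// this process and the reducer executable.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}
```

Such a mapper would typically be submitted with the streaming jar, roughly `hadoop jar hadoop-streaming-*.jar -input ... -output ... -mapper ... -reducer ...` (paths illustrative).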
- IBM Big Data Strategy:
- IBM's Approach to Big Data Analytics
- IBM Big Data Solutions and Products
- InfoSphere BigInsights and BigSheets:
- Overview of InfoSphere BigInsights and BigSheets
- Features and Benefits of InfoSphere BigInsights and BigSheets

Unit II: HDFS (Hadoop Distributed File System)
- The Design of HDFS:
- HDFS Architecture and Components
- Data Storage and Replication in HDFS
- HDFS Concepts:
- NameNode, DataNode, and Block
- HDFS File System Interface
- Data Flow and Block Management
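One way to see NameNode, DataNodes, blocks, and replication from the client side is to ask the NameNode for a file's block metadata. A minimal sketch, assuming placeholder cluster address and file path:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode answers metadata queries; DataNodes hold the block replicas.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

        Path file = new Path("/data/sample.txt");   // placeholder path
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size (bytes): " + status.getBlockSize());

        // Each BlockLocation lists the DataNodes holding one block's replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " on hosts " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```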
- Command Line Interface:
- Hadoop Commands for HDFS Operations
- Navigating HDFS Filesystem
- Managing Data with Hadoop Commands
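The `hdfs dfs` / `hadoop fs` commands (for example `-mkdir`, `-put`, `-ls`) are implemented by the FsShell class, so the same operations can also be driven from Java. A minimal illustrative sketch, with placeholder cluster address and paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class ShellDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder cluster

        FsShell shell = new FsShell(conf);

        // Equivalent of: hdfs dfs -mkdir -p /user/demo
        ToolRunner.run(shell, new String[] {"-mkdir", "-p", "/user/demo"});

        // Equivalent of: hdfs dfs -ls /user
        ToolRunner.run(shell, new String[] {"-ls", "/user"});
    }
}
```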
- Hadoop File System Interfaces:
- Java API for HDFS
- Working with HDFS Files in Java
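A minimal sketch of the FileSystem API named above, writing a small text file to HDFS and reading it back; the cluster URI and path are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
        Path path = new Path("/user/demo/greeting.txt");   // placeholder path

        // Write: create() returns a stream backed by the DataNode write pipeline.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello from the HDFS Java API\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: open() streams the file's blocks back from the DataNodes.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```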
- Data Ingest with Flume and Sqoop:
- Introduction to Flume and Sqoop
- Data Collection and Ingestion with Flume and Sqoop
- Hadoop Archives:
- Creating and Managing Hadoop Archives
- Archiving and Retrieving Data with Hadoop
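A Hadoop Archive (HAR) is created with the `hadoop archive -archiveName <name>.har -p <parent> <dirs> <target>` command and then read back through the `har://` filesystem scheme. A minimal read-side sketch, with placeholder archive and file paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder cluster

        // An archive created earlier is exposed as a read-only filesystem
        // under the har:// scheme (placeholder archive path below).
        Path archive = new Path("har:///user/demo/archives/logs.har");
        Path inHar = new Path("har:///user/demo/archives/logs.har/part-00000");

        FileSystem harFs = archive.getFileSystem(conf);

        // Listing and reading work just like ordinary HDFS paths.
        for (FileStatus status : harFs.listStatus(archive)) {
            System.out.println(status.getPath());
        }
        try (FSDataInputStream in = harFs.open(inHar)) {
            System.out.println("first byte: " + in.read());
        }
    }
}
```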
- Hadoop I/O:
- Compression Techniques in HDFS
- Serialization and Avro
- File-Based Data Structures in HDFS
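Of the file-based data structures listed above, the SequenceFile is the simplest to show in code, and it also demonstrates built-in compression. A minimal sketch that writes a block-compressed SequenceFile of Text/IntWritable pairs and reads it back (the path is a placeholder; with a default Configuration it lands on the local filesystem, on a configured cluster it would go to HDFS; Avro follows a similar pattern with its own writer and reader classes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("pairs.seq");   // placeholder path

        // Write a block-compressed SequenceFile of Text -> IntWritable pairs.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            for (int i = 0; i < 5; i++) {
                writer.append(new Text("key-" + i), new IntWritable(i));
            }
        }

        // Read the pairs back; keys and values are deserialized Writables.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            Text key = new Text();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                System.out.println(key + " = " + value);
            }
        }
    }
}
```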
Unit III: MapReduce
- Anatomy of a MapReduce Job Run:
- MapReduce Job Flow and Execution
- Job Configuration and Parameters
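The usual way to see a job run end to end is the word-count example: the client configures a Job, submits it, and the framework runs map tasks, shuffles and sorts their output, and runs reduce tasks. A minimal sketch using the standard org.apache.hadoop.mapreduce API, with input and output paths taken from the command line:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: one call per input line, emitting (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: receives each word with all its counts after shuffle and sort.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    // Driver: configures the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion submits the job and polls until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```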
- Failures, Job Scheduling, Shuffle, and Sort:
- Handling Failures in MapReduce Jobs
- Job Scheduling and Resource Management
- Shuffle and Sort in MapReduce
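Failure handling, scheduling, and the shuffle are mostly controlled through job configuration. The sketch below shows the relevant knobs; the property names are standard MapReduce settings, but the values are illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class JobTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Failure handling: how many times a failed task attempt is retried
        // before the whole job is declared failed.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Job scheduling: the queue the job is submitted to.
        conf.set("mapreduce.job.queuename", "default");

        // Shuffle and sort: size of the in-memory sort buffer on the map side.
        conf.setInt("mapreduce.task.io.sort.mb", 256);

        Job job = Job.getInstance(conf, "tuned job");
        // The partitioner decides which reducer each map output key is shuffled to;
        // the reducer count decides how many sorted partitions are produced.
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(4);

        System.out.println("Configured " + job.getJobName());
    }
}
```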