YouTube Comment Sentiment ETL Pipeline

The project code and process are on GitHub:

Tableau visualization below

This project is a scalable ETL pipeline that extracts, processes, and analyzes sentiment from YouTube comments in real time. It uses Apache Kafka for streaming, Apache Spark for processing, Python NLP libraries for sentiment analysis, and MongoDB for storage. Results are visualized in Tableau, the workflow is orchestrated by Apache Airflow, and the components are containerized with Docker.

Here’s an overview of the pipeline process:

  • Data Extraction: The YouTube Data API is used to automatically fetch comments from specific videos or channels in real time, giving the pipeline a continuous flow of fresh data to analyze.

  • Data Streaming: Apache Kafka handles the real-time transmission of comments, streaming them through the system for processing. Kafka buffers the stream, so downstream consumers can keep up even as the volume of comments grows.

  • Data Processing: Apache Spark processes the incoming comment data, performing transformations such as cleaning the text and filtering out irrelevant information. Its distributed processing prepares the data for sentiment analysis at scale.

  • Sentiment Analysis: Python’s natural language processing libraries analyze each comment and classify it as positive, negative, or neutral. This step provides the core insight into how audiences feel about the content.

  • Data Storage: Once analyzed, the comments and their associated sentiment scores are stored in MongoDB, a flexible NoSQL database that accommodates unstructured data. MongoDB allows for efficient querying and retrieval of results.

  • Workflow Orchestration: Apache Airflow orchestrates the entire process, ensuring that tasks are executed in the correct order, monitoring progress, and handling retries in case of failures.

  • Data Visualization: In the final step, Tableau is used to create visualizations from the sentiment data. Stakeholders can explore trends and patterns in audience reactions through interactive dashboards.
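
To make the processing, sentiment analysis, and storage steps above more concrete, here is a minimal sketch of what happens to a single comment. It is not the project's actual code: the word lists are a hypothetical toy lexicon standing in for a real NLP library, the function names (`clean_comment`, `classify`, `to_document`) and the document fields are assumptions chosen for illustration, and the result is only a plain dictionary of the kind that could be stored in MongoDB. Kafka and Spark are left out so the example runs with the Python standard library alone.

```python
import re
from datetime import datetime, timezone

# Hypothetical toy lexicon; a real pipeline would rely on an NLP library instead.
POSITIVE = {"love", "great", "awesome", "good", "amazing"}
NEGATIVE = {"hate", "bad", "boring", "awful", "terrible"}

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z']+")


def clean_comment(text: str) -> str:
    """Cleaning step: drop URLs and collapse runs of whitespace."""
    without_urls = URL_RE.sub("", text)
    return " ".join(without_urls.split())


def classify(text: str) -> tuple[str, int]:
    """Return (label, score), where score = positive hits - negative hits."""
    tokens = TOKEN_RE.findall(text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, score


def to_document(video_id: str, raw_text: str) -> dict:
    """Build a MongoDB-style document (field names are assumptions)."""
    cleaned = clean_comment(raw_text)
    label, score = classify(cleaned)
    return {
        "video_id": video_id,
        "text": cleaned,
        "sentiment": label,
        "score": score,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
```

Calling `to_document("abc123", "I love this, great work")` yields a dictionary whose `sentiment` field is `"positive"`. In the real pipeline the same shape of record would come out of the Spark and NLP stages rather than from a word count.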

This pipeline ensures a seamless flow from comment extraction to sentiment analysis and visualization, offering an efficient and scalable solution for understanding audience sentiment on YouTube.
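
The ordering and retry behaviour described under Workflow Orchestration can be illustrated without Airflow itself. The sketch below is an assumption-laden stand-in, not an Airflow DAG: the task names are hypothetical labels for the stages listed above, `run_with_retries` and `run_pipeline` are illustrative helpers, and the standard library's `graphlib.TopologicalSorter` (Python 3.9+) plays the role of the scheduler that works out a valid execution order from the declared dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical task names mirroring the stages above.
# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "extract": set(),
    "stream": {"extract"},
    "process": {"stream"},
    "analyze": {"process"},
    "store": {"analyze"},
    "visualize": {"store"},
}


def run_with_retries(task, retries=2):
    """Call task(); on failure, try again up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the failure


def run_pipeline(tasks, dependencies=DEPENDENCIES, retries=2):
    """Run every task in dependency order and return the order used."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        run_with_retries(tasks[name], retries)
    return order
```

Here `tasks` is a dictionary from task name to a zero-argument callable. A task that fails once and then succeeds is simply run a second time, while a task that keeps failing stops the run, which is roughly the behaviour the orchestration step describes.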