  • 1,592 views
  • Published 1 month ago by Data with Marc

The Reddit Project (Data engineering project with Airflow, Slack, Snowflake and OpenAI)

Learn how we build and run a real-world data pipeline at Astronomer! In this deep-dive tutorial, we'll walk through creating a complete Airflow DAG that scrapes data from Reddit, processes it with a Large Language Model (LLM) for categorization, loads it into Snowflake, and sends notifications via Slack. This is more than just a theoretical example.

Here's what you'll learn:
✅ Designing a complex data pipeline with multiple branches and tasks.
✅ Setting up Snowflake connections and creating tables from Airflow.
✅ Fetching data from the Reddit API.
✅ Using an LLM (like OpenAI or similar) directly within tasks.
✅ Implementing branching logic to skip tasks when no data is found.
✅ Filtering and transforming data.
✅ Sending automated Slack notifications upon pipeline completion.
And more! Get ready to level up your Airflow skills with a practical project you can adapt for your own use cases! A minimal code sketch of this pipeline structure follows the timestamps below.

🤖 Try the Astro IDE:
👨‍💻 The Code:
🏆 BECOME A PRO:

👍 Smash the like button to become an Airflow Super Hero!
❤️ Subscribe to my channel to become a master of Airflow

00:00:00 Introduction & Project Goal
00:00:28 Airflow Data Pipeline Overview
00:01:34 Slack Notification Example
00:02:44 Starting the Project: Local vs. Astro IDE
00:03:26 Step 1: Create Airflow Project
00:05:09 Step 2: Install Python Dependencies
00:06:02 Step 3: Create Reddit API Connection
00:07:52 Step 4: Create the Airflow DAG
00:09:07 Task 1: Create Snowflake Table
00:13:00 Testing the First Task
00:14:31 Task 2: Fetch Reddit Posts
00:17:39 Task 3: Check for Posts (Branching)
00:18:47 Task 4: Categorize Posts with LLM
00:22:24 Define Task Dependencies
00:26:45 Task 5: Filter DevRel Posts
00:28:13 Task 6: Send Slack Notifications
00:30:27 Conclusion
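
As a rough illustration only, here is a minimal sketch of a DAG with the shape described above (Snowflake table creation, Reddit fetch, branching, LLM categorization, DevRel filtering, Slack notification). It is not the code from the video: the connection IDs, subreddit, table schema, model name, and category label are placeholder assumptions, and it assumes Airflow 2.x with the Snowflake and Slack providers plus the praw and openai packages installed.

```python
from pendulum import datetime

from airflow.decorators import dag, task
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def reddit_pipeline():
    # Task 1: create the target table in Snowflake if it does not exist yet.
    create_table = SnowflakeOperator(
        task_id="create_snowflake_table",
        snowflake_conn_id="snowflake_default",  # placeholder connection ID
        sql="""
            CREATE TABLE IF NOT EXISTS reddit_posts (
                id STRING, title STRING, body STRING, category STRING
            );
        """,
    )

    # Task 2: fetch recent posts from the Reddit API.
    @task
    def fetch_reddit_posts() -> list[dict]:
        import praw  # credentials below are placeholders; the video keeps them in an Airflow connection

        reddit = praw.Reddit(
            client_id="...", client_secret="...", user_agent="airflow-demo"
        )
        return [
            {"id": p.id, "title": p.title, "body": p.selftext}
            for p in reddit.subreddit("dataengineering").new(limit=50)
        ]

    # Task 3: branch so the rest of the pipeline is skipped when nothing was fetched.
    @task.branch
    def check_for_posts(posts: list[dict]) -> str:
        return "categorize_posts" if posts else "no_posts_found"

    @task
    def no_posts_found():
        print("No new posts; skipping the rest of the pipeline.")

    # Task 4: categorize each post with an LLM (OpenAI shown as one option).
    @task
    def categorize_posts(posts: list[dict]) -> list[dict]:
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        for post in posts:
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{
                    "role": "user",
                    "content": f"Categorize this Reddit post in one word: {post['title']}",
                }],
            )
            post["category"] = response.choices[0].message.content.strip()
        return posts

    # Task 5: keep only posts the LLM labeled as DevRel-related.
    @task
    def filter_devrel_posts(posts: list[dict]) -> list[dict]:
        return [p for p in posts if p["category"].lower() == "devrel"]

    # Task 6: send a Slack message once the main path has finished.
    notify = SlackWebhookOperator(
        task_id="send_slack_notification",
        slack_webhook_conn_id="slack_webhook",  # placeholder connection ID
        message="Reddit pipeline finished: new categorized posts are in Snowflake.",
    )

    posts = fetch_reddit_posts()
    branch = check_for_posts(posts)
    categorized = categorize_posts(posts)
    filtered = filter_devrel_posts(categorized)

    create_table >> posts
    branch >> [categorized, no_posts_found()]
    filtered >> notify


reddit_pipeline()
```

Dropped into the dags/ folder of an Astro project, this sketch reproduces the overall branch-and-notify shape shown in the video, though the real DAG's task names, connections, and prompts will differ.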