By Randy Armknecht in AI — 11 May 2026

My Watch Later Playlist Was Out of Control. I Fixed It With Open Tools and a Local LLM.

I have a habit. I scroll through YouTube's home page a few times a week, and when something catches my eye, I add it to Watch Later. The algorithm knows what it's doing; the content is generally good, and the videos I'm adding are almost always substantial, 20-plus minutes of dense technical or educational material. The backlog compounds faster than I can watch it.

The problem isn't that the content is bad. The problem is that I don't have 30 minutes to give every video to find out whether it's worth 30 minutes. I needed a triage layer between "this looks interesting" and "I will actually sit down and watch this."

The Solution

Mia (my AI assistant, described in a previous post) researched the available tooling, read through the relevant documentation, and helped me assemble something that works exactly the way I need it to work.

The pipeline is straightforward. Every morning at 8:30, a cron job fires a Python script. It reads my Watch Later playlist using yt-dlp (which authenticates via my browser's existing cookie store, so no separate login is required). For any video it hasn't already summarized, it downloads the audio track and runs it through Whisper locally to produce a transcript. That transcript goes to a local LLM, which produces a concise summary. The summary lands on my phone via Telegram.
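A rough sketch of how the first half of that flow could be wired together is below. It is not the actual script: the browser name, the state file, the working directory, and the Whisper model size are all assumptions chosen for illustration, and it presumes the `yt-dlp` and `whisper` command-line tools (plus ffmpeg) are installed and on the PATH.

```python
import json
import subprocess
from pathlib import Path

# A crontab entry along these lines would run it at 8:30 each morning
# (the script path is hypothetical):
#   30 8 * * * /usr/bin/python3 /home/me/watch_later.py

# Assumed values, not taken from the original script.
WATCH_LATER_URL = "https://www.youtube.com/playlist?list=WL"
BROWSER = "firefox"
SEEN_FILE = Path("seen_videos.json")


def list_playlist_cmd(browser: str = BROWSER) -> list[str]:
    """yt-dlp command that prints one video ID per line without downloading."""
    return ["yt-dlp", "--cookies-from-browser", browser,
            "--flat-playlist", "--print", "id", WATCH_LATER_URL]


def download_audio_cmd(video_id: str, out_dir: Path, browser: str = BROWSER) -> list[str]:
    """yt-dlp command that extracts the audio track as <video_id>.mp3."""
    return ["yt-dlp", "--cookies-from-browser", browser,
            "-x", "--audio-format", "mp3",
            "-o", str(out_dir / "%(id)s.%(ext)s"),
            f"https://www.youtube.com/watch?v={video_id}"]


def transcribe_cmd(audio_path: Path, out_dir: Path, model: str = "base") -> list[str]:
    """Whisper CLI command that writes a plain-text transcript into out_dir."""
    return ["whisper", str(audio_path), "--model", model,
            "--output_format", "txt", "--output_dir", str(out_dir)]


def load_seen(path: Path = SEEN_FILE) -> set[str]:
    """Read the IDs that were already summarized; empty on first run."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def save_seen(seen: set[str], path: Path = SEEN_FILE) -> None:
    path.write_text(json.dumps(sorted(seen)))


def new_videos(playlist_ids: list[str], seen: set[str]) -> list[str]:
    """Keep playlist order, drop anything already summarized."""
    return [vid for vid in playlist_ids if vid not in seen]


def run(work_dir: Path = Path("work")) -> None:
    work_dir.mkdir(exist_ok=True)
    seen = load_seen()
    listing = subprocess.run(list_playlist_cmd(), check=True,
                             capture_output=True, text=True)
    for vid in new_videos(listing.stdout.split(), seen):
        subprocess.run(download_audio_cmd(vid, work_dir), check=True)
        subprocess.run(transcribe_cmd(work_dir / f"{vid}.mp3", work_dir), check=True)
        # The transcript is now at work_dir/<vid>.txt; summarization and
        # delivery would go here, before the video is marked as seen.
        seen.add(vid)
        save_seen(seen)
```

Calling `run()` performs one pass. Saving the seen set after each video means a crash partway through does not cause earlier videos to be summarized again the next morning.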

By the time I'm having my first cup of coffee, my phone has already told me what each video is actually about. I can decide in ten seconds whether to watch the whole thing or skip it.

Why Local Models

Token costs at scale are real. A daily summarizer hitting five or ten videos, each with a 20-minute transcript, adds up. Routing all of that through a commercial API would be both expensive and unnecessary, given that the task doesn't require frontier-model reasoning. It requires clean extraction of main points from spoken content. gemma3:12b via Ollama works perfectly well.

The entire pipeline runs on my hardware: Whisper for transcription, a locally-served LLM (Gemma or similar) for summarization. The only external service involved is Telegram for delivery, which is free. No per-token cost, and apart from the finished summaries sent to Telegram, no data leaving the house.
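For a sense of what the summarize-and-deliver half can look like, here is a minimal sketch using only the Python standard library. It assumes Ollama is serving on its default local port with gemma3:12b already pulled, and that a Telegram bot token and chat ID are available. The prompt wording, the `num_ctx` value, and the function names are illustrative assumptions, not the original script.

```python
import json
import urllib.parse
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "gemma3:12b"
TELEGRAM_LIMIT = 4096  # Telegram's maximum text length per message


def build_prompt(title: str, transcript: str) -> str:
    """Assumed prompt wording; tune to taste."""
    return (
        "Summarize the main points of this video transcript in five bullet "
        "points or fewer, so the reader can decide whether to watch it.\n\n"
        f"Title: {title}\n\nTranscript:\n{transcript}"
    )


def summarize(title: str, transcript: str) -> str:
    """Send the transcript to the local model and return its summary text."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": build_prompt(title, transcript),
        "stream": False,  # return one JSON object instead of a token stream
        # Assumed value: a larger context window so long transcripts
        # are less likely to be truncated.
        "options": {"num_ctx": 8192},
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())["response"].strip()


def split_message(text: str, limit: int = TELEGRAM_LIMIT) -> list[str]:
    """Cut text into pieces that fit within Telegram's per-message limit."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def send_telegram(token: str, chat_id: str, text: str) -> None:
    """Deliver text through the Bot API's sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    for chunk in split_message(text):
        data = urllib.parse.urlencode({"chat_id": chat_id, "text": chunk}).encode()
        with urllib.request.urlopen(urllib.request.Request(url, data=data), timeout=30) as resp:
            resp.read()
```

A call such as `send_telegram(token, chat_id, summarize(title, transcript))` covers one video. A concise summary should fit in a single message, so the splitting is only a guard against an unusually long model response.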

The Broader Point

What I find interesting about this project isn't the implementation. It's what made the implementation possible.

yt-dlp is open-source and actively maintained. Whisper was released publicly by OpenAI. Local LLM serving (via llama.cpp, Ollama, and similar tools) is mature. Telegram's bot API is well-documented and free to use. Every component in this pipeline came from an ecosystem that prioritizes openness and composability.

None of these tools were built with "personal YouTube summarizer" in mind. They were built to solve more general problems, and they expose enough surface area to be composed into something specific. That composability is what makes customization like this possible at all. It's the Unix philosophy, specifically Doug McIlroy's pipe design, freed from a single OS and applied across a global network of modern social media and AI tools.

The ability to wire together open tools into something that fits exactly one person's workflow, without needing to convince a product team that the feature is worth building, is genuinely one of the most valuable properties of the current technical ecosystem. It's not a feature. It's an architecture choice that the open-source community has been making for decades, and AI has made it significantly more accessible to people who aren't full-time engineers.

My Watch Later backlog is under control. The solution is ~500 lines of Python, runs on a cron job, and costs nothing to operate. If the need changes, I can change the tool. No one at YouTube needed to engineer a new feature, and that's the point.
