Long-tail bugs

June 1, 2026

I call these long-tail bugs because they are very hard to catch without a huge sample size, and probably impossible to catch by manually looking through the output of a process.

Cleaning up transcripts

A lot of the problems I’ve encountered come from trying to clean up transcripts. By default, they have no punctuation or paragraph breaks, and they are filled with typos (e.g. Mag 7 shows up as Max 7, which is hard to catch).

Prohibited content

Many videos contain profanity or other content that many LLMs will refuse to process during cleanup. I had to add some extra instructions to the cleanup prompt to try to get around this…

Music

Some videos are mostly just music, often with barely any spoken words. These could be music videos or promotional marketing videos. Accounting for them is low priority right now, since they are few and far between (as long as you don’t import some EDM music video channel or something).

Shorts

Not exactly a bug, but shorts are often just recycled clips of longer videos. Currently, videos that are only 1-2 minutes long should be excluded, but there will always be exceptions. There is also no reliable way to check whether a video is a short or not.

Anything to do with LLMs

LLMs are powerful, but they fundamentally introduce a lot of variance into any process, even when the “temperature” setting is turned down to be nominally deterministic.

Also, because this relies on third-party service providers, there is always a risk of “503 service unavailable” errors, of a model being deprecated without my noticing (Gemini 2.0 Flash Lite was deprecated as of today, Cerebras deprecated their Llama 3.1-8B in May, etc.), or simply of my credits not being auto-topped up.

To mitigate this, there should always be backup options.
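
One way to picture "backup options" is an ordered chain of providers that is walked until one succeeds. This is only a minimal sketch, not the actual pipeline: clean_with_fallback, AllProvidersFailed, and the (name, callable) provider pairs are hypothetical names made up for illustration, and each callable is assumed to wrap one third-party LLM call.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def clean_with_fallback(
    transcript: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order and return (name, cleaned_text).

    Each `call` is assumed (hypothetically) to wrap one LLM provider and to
    raise an exception on failure, e.g. a 503, a removed model, or an
    exhausted credit balance.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(transcript)
        except Exception as exc:
            # Remember why this provider failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the text makes it possible to log which backup actually handled a transcript, which helps when output quality differs between models.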

Malformed transcripts

It turns out some really old videos, such as the one below, have a totally messed up transcript (and no, this video is not in Russian 😂):

These videos with malformed transcripts have now been marked appropriately, and will require a manual transcription (speech-to-text) in order to generate a proper transcript.

This kind of long-tail bug was only caught by running an aggregate health check that flags any cleaned-up transcript whose characters mismatch the raw transcript by more than 10%. It turns out the LLMs had not recognized this case as garbled/malformed before.
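
A rough sketch of what such a health check could look like, using difflib from the Python standard library. The exact metric used in the real check is not specified here, so treat this as one plausible interpretation: the function names are hypothetical, and stripping punctuation and case before comparing is my assumption (otherwise the punctuation added by cleanup would itself count as a mismatch).

```python
import difflib
import re

MISMATCH_THRESHOLD = 0.10  # the 10% threshold mentioned above


def _normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits (assumption)."""
    return re.sub(r"[\W_]+", "", text.lower())


def mismatch_ratio(raw: str, cleaned: str) -> float:
    """Return the share of characters that differ, from 0.0 to 1.0."""
    a, b = _normalize(raw), _normalize(cleaned)
    if not a and not b:
        return 0.0
    # autojunk=False stops difflib from ignoring very frequent characters.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return 1.0 - matcher.ratio()


def needs_review(raw: str, cleaned: str) -> bool:
    """Flag a transcript pair whose mismatch exceeds the threshold."""
    return mismatch_ratio(raw, cleaned) > MISMATCH_THRESHOLD
```

SequenceMatcher can be slow on very long inputs, so for full-length transcripts it may be more practical to compare a sample or to compare word by word.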

US English not available, falls back to Arabic…

This video is in English, but it turns out old transcripts might somehow be in the completely wrong language. Or not.

There is no en (US English) default transcript, only an en-GB (UK English) one. Thus the transcript pulled was the first available one, which in this case defaulted to Arabic (it came up first when sorted A-Z).
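
A safer selection rule would be to look for a regional variant of the preferred language before giving up, and to return nothing rather than whatever sorts first. The sketch below assumes the list of available language codes has already been fetched; pick_transcript_language is a hypothetical helper, not part of any transcript library.

```python
from typing import Optional, Sequence


def pick_transcript_language(
    available: Sequence[str], preferred: str = "en"
) -> Optional[str]:
    """Pick the best language code from `available`, or None if none fits."""
    wanted = preferred.lower()
    base = wanted.split("-")[0]

    # 1. Exact match, e.g. "en" when "en" is requested.
    for code in available:
        if code.lower() == wanted:
            return code

    # 2. Regional variant with the same base language, e.g. "en-GB" for "en".
    for code in available:
        if code.lower().split("-")[0] == base:
            return code

    # 3. No match: return None instead of the first code alphabetically.
    return None
```

With the codes from the case above, pick_transcript_language(["ar", "en-GB"]) returns "en-GB" instead of "ar". A None result can then be treated like a malformed transcript and marked for manual transcription.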