Two transcoders, one workDir: the race I didn't see coming

A few weeks back I wrote about pointing a second transcoder at my shared job queue: a Windows desktop running WSL with an RTX 4060 Ti, claiming jobs alongside the Hetzner production box. The queue lives in Azure SQL. Both workers compete for the next pending video, claim it via an EF Core optimistic-concurrency check, encode the HLS ladder, upload the segments to Blob Storage, and move on.

It worked. For weeks.

Then I uploaded a 16 GB 4K feature. Two hours and forty-two minutes of video. The encode took just over an hour on the GPU. The status went ready. The thumbnail rendered. I clicked play.

The video played for one second and stopped.

The smoking gun in the logs

I pulled the latest worker log and grepped for the file's blob name. Two entries, back to back, for the same source file:

info: Uploaded 19473 HLS files to hls/{guid}/
fail: HLS transcode failed
      System.IO.FileNotFoundException:
      Could not find file '/tmp/hls-{guid}/v2/seg_2487.ts'

Read those carefully. "Uploaded 19473 HLS files" is the success message logged at the end of the upload loop, so one run over this video completed its upload. Then a second run over the same source threw because a segment file vanished while it was still uploading.

Two encodes on the same video. Same GUID, same /tmp/hls-{guid} workDir. One finished, and its finally { Directory.Delete(workDir, recursive: true) } block cleaned up. The other was still calling File.OpenRead on a segment that no longer existed on disk.
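The collision can be reproduced in a few lines. This is a minimal sketch, not the worker's real code: the directory layout and segment name are copied from the log, everything else is a stand-in, and the two "runs" are interleaved by hand so the outcome is deterministic.

```csharp
using System;
using System.IO;

// Stand-in for the per-video scratch directory both runs derive from the GUID.
string videoId = Guid.NewGuid().ToString();
string workDir = Path.Combine(Path.GetTempPath(), $"hls-{videoId}");

// Both runs encode into the same tree.
Directory.CreateDirectory(Path.Combine(workDir, "v2"));
File.WriteAllBytes(Path.Combine(workDir, "v2", "seg_2487.ts"), new byte[188]);

// Run B has listed its segments and is partway through the upload loop...
string[] pendingUploads =
    Directory.GetFiles(workDir, "*.ts", SearchOption.AllDirectories);

// ...when run A finishes and its finally block removes the shared directory.
Directory.Delete(workDir, recursive: true);

foreach (string path in pendingUploads)
{
    try
    {
        using FileStream stream = File.OpenRead(path);
        Console.WriteLine($"uploaded {path}");
    }
    catch (IOException ex)
    {
        // FileNotFoundException or DirectoryNotFoundException, depending on
        // how much of the tree is gone when the read happens.
        Console.WriteLine($"{ex.GetType().Name}: {path}");
    }
}
```

Nothing in either run is wrong on its own. The failure only exists because the path is a pure function of the video's GUID, so two runs of the same video on one host always land in the same place.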

The "plays one second" symptom turned out to be a consequence of the same race. The first run wrote a complete set of playlists — 4867 segments per rendition. The second run, having been interrupted mid-encode, wrote truncated 3562-entry playlists and uploaded them over the top of the first run's good ones. The 1080p rendition happened to survive intact; the others were truncated to roughly two hours of a 2h42m source. hls.js picked one of the lower renditions during its bandwidth probe, hit the cliff almost immediately, and gave up.
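The numbers are consistent with each other if the segments are two seconds long and the ladder has four renditions. Both values are assumptions; they are simply the ones that make the counts above match the 2h42m runtime and the 19473 files in the log. A quick check:

```csharp
using System;

const int fullSegments = 4867;       // per rendition, first run
const int truncatedSegments = 3562;  // per rendition, second run
const double segmentSeconds = 2.0;   // assumption: not stated in the post
const int renditions = 4;            // assumption: not stated in the post

// Playable length of a complete playlist and of a truncated one.
Console.WriteLine(TimeSpan.FromSeconds(fullSegments * segmentSeconds));      // 02:42:14
Console.WriteLine(TimeSpan.FromSeconds(truncatedSegments * segmentSeconds)); // 01:58:44

// Segments, plus one playlist per rendition, plus one master playlist.
Console.WriteLine(renditions * fullSegments + renditions + 1);               // 19473
```

So a truncated rendition ends about 43 minutes short of the real runtime.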

The guard that wasn't enough

The puzzle was that I already had a guard against this. The relevant field in the Videos model:

[ConcurrencyCheck]
public string? TranscodeStatus { get; set; }

[ConcurrencyCheck] makes EF Core include the original column value in the WHERE clause of generated UPDATEs. The claim path is a SELECT followed by an UPDATE. If two workers both see 'pending' and both try to flip it to 'processing', the loser's UPDATE matches zero rows, EF Core throws DbUpdateConcurrencyException, and a catch block returns null. That guard has been in place since the day I went multi-worker, and it works for the obvious race.
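For reference, the claim path looks roughly like this. It is a sketch reconstructed from the description above, not the production code: the Video and AppDbContext types, the Id property, and the ordering are hypothetical, and only the Transcode* property names and TryClaimAsync come from the real model.

```csharp
using System;
using System.ComponentModel.DataAnnotations;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Hypothetical entity; only the Transcode* names are taken from the post.
public class Video
{
    public Guid Id { get; set; }

    [ConcurrencyCheck]
    public string? TranscodeStatus { get; set; }

    public string? TranscodeMachine { get; set; }
    public DateTime? TranscodeStartedAt { get; set; }
}

// Hypothetical context; provider configuration is omitted.
public class AppDbContext : DbContext
{
    public AppDbContext(DbContextOptions<AppDbContext> options) : base(options) { }
    public DbSet<Video> Videos => Set<Video>();
}

public static class Claims
{
    public static async Task<Video?> TryClaimAsync(AppDbContext db, CancellationToken ct)
    {
        // SELECT: the change tracker records "pending" as the original value.
        Video? video = await db.Videos
            .Where(v => v.TranscodeStatus == "pending")
            .OrderBy(v => v.Id) // ordering is arbitrary in this sketch
            .FirstOrDefaultAsync(ct);
        if (video is null) return null;

        video.TranscodeStatus = "processing";
        video.TranscodeMachine = Environment.MachineName;
        video.TranscodeStartedAt = DateTime.UtcNow;

        try
        {
            // UPDATE ... WHERE Id = @id AND TranscodeStatus = 'pending'
            await db.SaveChangesAsync(ct);
            return video;
        }
        catch (DbUpdateConcurrencyException)
        {
            // Zero rows matched: another worker flipped the status first.
            db.Entry(video).State = EntityState.Detached;
            return null;
        }
    }
}
```

The guard only compares the status the worker read against the status in the row at write time. It knows nothing about who set that status or why.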

So how did two workers end up encoding the same video?

The startup recovery is the hole

Each worker's background service had a recovery routine that ran on boot. The intent was reasonable: "if a row says processing and the worker just started, the host must have crashed mid-encode last time, so reset it and let the queue re-attempt." The implementation was a LINQ query and a short loop:

var stuck = await db.Videos
    .Where(v => v.TranscodeStatus == "processing")
    .ToListAsync();
foreach (var v in stuck)
    v.TranscodeStatus = "pending";
await db.SaveChangesAsync();

What it actually says, in distributed terms, is: "if any row in the entire database is processing, reset it." When the Hetzner box restarted for a deploy, it reset the WSL box's in-flight rows back to pending. The WSL box's ffmpeg kept going, blissfully unaware. Some other worker claimed the now-pending row a few seconds later and started a fresh encode. Two ffmpegs, same workDir, the race I just described.

The [ConcurrencyCheck] guard never had a chance, because the recovery write sidestepped it: the row really did say processing when the recovery read it, so its own UPDATE passed the check. By the time the second worker called TryClaimAsync, the row legitimately said pending again, and the SELECT-then-UPDATE pair was clean. The system did exactly what the code told it to do.

The fix

Add a machine filter, and use ExecuteUpdateAsync (available from EF Core 7) so the recovery is one atomic statement instead of a SELECT-then-iterate-then-SaveChanges:

var count = await db.Videos
    .Where(v => v.TranscodeStatus == "processing"
             && v.TranscodeMachine == Environment.MachineName)
    .ExecuteUpdateAsync(s => s
        .SetProperty(v => v.TranscodeStatus, "pending")
        .SetProperty(v => v.TranscodeStartedAt, (DateTime?)null)
        .SetProperty(v => v.TranscodeFinishedAt, (DateTime?)null),
    stoppingToken);

That's it. The recovery now only touches rows the rebooting machine itself was responsible for. Other workers' in-flight encodes are untouched. Each box can deploy, crash, or be rebooted without yanking work out from under its siblings.
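The difference between the two predicates is easy to see against a toy table. This sketch uses an in-memory list instead of the database, and the row ids and machine names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var rows = new List<Row>
{
    new Row("video-a", "processing", "WSL-BOX"),      // in flight on the other worker
    new Row("video-b", "processing", "HETZNER-BOX"),  // orphaned by this host's restart
    new Row("video-c", "pending", null),
};

// The machine running its startup recovery.
string restarting = "HETZNER-BOX";

// Old recovery: every processing row, whoever owns it.
IEnumerable<string> unfiltered = rows
    .Where(r => r.Status == "processing")
    .Select(r => r.Id);

// New recovery: only rows this machine claimed.
IEnumerable<string> filtered = rows
    .Where(r => r.Status == "processing" && r.Machine == restarting)
    .Select(r => r.Id);

Console.WriteLine(string.Join(", ", unfiltered)); // video-a, video-b
Console.WriteLine(string.Join(", ", filtered));   // video-b

record Row(string Id, string Status, string? Machine);
```

The old predicate resets video-a while its ffmpeg is still running on the other box. The new one leaves it alone.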

What this fix is, and what it isn't

It's a fix for the bug that was burning me on a Tuesday. It's not a fix for the bigger shape of the problem. The next failure mode in line is "machine dies and never restarts": its processing rows now stay pinned forever, because only the dead machine would have reset them on its own startup. A CLI command, transcode reset <id>, exists to recover them manually, but that's an operator action.

The proper answer is to stop owning the lock at all, and rent it from Azure Service Bus instead. Receive a message with peek-lock, run the encode while the SDK's processor auto-renews the lock in the background (its MaxAutoLockRenewalDuration defaults to five minutes, so it has to be raised to cover the longest encode), call CompleteMessageAsync when done. If the worker dies mid-job, the lock expires within minutes and another worker picks the same message up. It's the same heartbeat pattern I would write by hand, but Microsoft owns the code and the failure modes.
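A rough sketch of what that worker could look like with the Azure.Messaging.ServiceBus package. None of this is written yet: the queue name, the message body format, the three-hour renewal window, and the EncodeAndUploadAsync placeholder are all assumptions for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

public static class TranscodeWorker
{
    public static async Task RunAsync(string connectionString, CancellationToken stoppingToken)
    {
        await using var client = new ServiceBusClient(connectionString);
        await using ServiceBusProcessor processor = client.CreateProcessor(
            "transcode-jobs", // hypothetical queue name
            new ServiceBusProcessorOptions
            {
                AutoCompleteMessages = false, // settle only after the upload succeeds
                MaxConcurrentCalls = 1,       // one encode at a time per worker
                // Assumed ceiling; must exceed the longest expected encode.
                MaxAutoLockRenewalDuration = TimeSpan.FromHours(3),
            });

        processor.ProcessMessageAsync += async args =>
        {
            // Assumes the message body is just the video id.
            string videoId = args.Message.Body.ToString();
            await EncodeAndUploadAsync(videoId, args.CancellationToken);
            await args.CompleteMessageAsync(args.Message);
        };

        processor.ProcessErrorAsync += args =>
        {
            Console.Error.WriteLine(args.Exception);
            return Task.CompletedTask;
        };

        await processor.StartProcessingAsync(stoppingToken);
        try
        {
            // Keep the worker alive until shutdown is requested.
            await Task.Delay(Timeout.Infinite, stoppingToken);
        }
        catch (OperationCanceledException)
        {
            // Normal shutdown path.
        }
        await processor.StopProcessingAsync();
    }

    // Placeholder for the existing ffmpeg and Blob Storage pipeline.
    private static Task EncodeAndUploadAsync(string videoId, CancellationToken ct)
        => Task.CompletedTask;
}
```

The point of the shape is that ownership lives in the broker's lock rather than in a status column, so there is no startup recovery routine left to get wrong.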

That's the next move on the roadmap: Service Bus to manage dispatch, Azure Container Apps to run the workers. Ephemeral containers, autoscaled on queue depth, with no machine identity to leak in the first place. The fix I shipped today buys me the runway to do that properly, rather than in a hurry.

The lesson

Optimistic concurrency guards the race you're thinking about. It doesn't guard the race next door, where some other piece of well-intentioned code rewrites the state you were guarding.

Look at every piece of code that writes to a status column in your distributed system. If any of it is a bulk update without a per-owner filter — a "tidy up the stragglers" loop, a periodic janitor task, a startup recovery routine — that's where your race lives.

The fix is usually four lines. The work is in noticing the loop exists at all.

— Mícheál.