You have to actually decode the MPEG stream... motion vectors, MBAFF, deinterlacing, etc.
Yes, but you don't actually run a decode loop on each byte; you operate on long run lengths within each frame. So while the brute-force reduction to "12 instructions per byte is probably too slow" may be pessimistic, "128 instructions per byte is probably fast enough" is likely accurate. Even operating on only a few pixels at a time leaves time for more than about a thousand instructions per iteration, which should be plenty, and I expect the actual budget is higher. But a budget of mere hundreds of instructions per processing iteration is probably too slow.
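For what it's worth, the arithmetic behind those per-byte budgets is easy to sketch. All the numbers here are assumptions for illustration (a 3 GHz core retiring roughly one instruction per cycle, a 10 Mbit/s compressed stream, 1080p30 output), not figures from the thread:

```python
# Back-of-envelope instruction budget for software decode.
# All constants below are assumed/illustrative, not measured.

CLOCK_HZ = 3e9       # assumed core clock
IPC = 1.0            # assumed instructions retired per cycle
BITRATE_BPS = 10e6   # assumed compressed bitrate, bits/s
W, H, FPS = 1920, 1080, 30  # assumed output format

bytes_per_sec = BITRATE_BPS / 8
instr_per_byte = CLOCK_HZ * IPC / bytes_per_sec

pixels_per_sec = W * H * FPS
instr_per_pixel = CLOCK_HZ * IPC / pixels_per_sec

print(f"{instr_per_byte:.0f} instructions per compressed byte")
print(f"{instr_per_pixel:.1f} instructions per output pixel")
```

Under those assumptions you get on the order of a couple of thousand instructions per compressed byte but only a few dozen per output pixel, which is why budgeting per byte looks comfortable while budgeting per pixel is the part that actually pinches.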
This kind of processing is exactly what video hardware is for. Especially modern PC display chips, which are not only fast, with instruction sets tuned for MPEG workloads, but also have lots of parallelism. I wonder whether there's a way for a Core to harness a bank of nVidia chips to decode video into frames that aren't displayed locally but streamed out to a relatively slow MD for display. Gonna need a fat, fat LAN though.
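"Fat fat LAN" is right: shipping decoded frames instead of the compressed stream costs a lot of bandwidth. A quick sketch, assuming 1080p at 30 fps (frame size, rate, and pixel formats are my assumptions, not from the thread):

```python
# Rough LAN bandwidth needed to stream decoded (uncompressed) frames.
# Resolution, frame rate, and pixel formats are assumed for illustration.

W, H, FPS = 1920, 1080, 30

rgb24_bps  = W * H * 3 * FPS * 8           # 3 bytes/pixel, 24-bit RGB
yuv420_bps = int(W * H * 1.5) * FPS * 8    # 1.5 bytes/pixel, 4:2:0 planar

print(f"RGB24:  {rgb24_bps / 1e9:.2f} Gbit/s")
print(f"YUV420: {yuv420_bps / 1e9:.2f} Gbit/s")
```

Under those assumptions, raw RGB exceeds what gigabit Ethernet can carry, and even chroma-subsampled 4:2:0 takes most of a gigabit link per stream, so a bank of decoders feeding multiple displays would indeed need serious aggregate capacity.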