Thread: DCT-based video codec on SH2 experiment

  1. #1
    Sports Talker
    Join Date
    Sep 2013
    Posts
    39
    Rep Power
    0

    Default DCT-based video codec on SH2 experiment

    It's almost-sort-of feasible. I modified a JPG encoder to skip the variable length huffman codes and just output DCT coefficients and zero-runs with a simple byte encoding. Then I encoded a bunch of individual video frames and lumped them together in a 32x ROM. It's 160x216 (double width pixels), monochrome, uses both CPUs, and it gets about 10fps.
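    The post doesn't spell out the byte format, so the following is only a toy illustration of what "coefficients and zero-runs with a simple byte encoding" could look like. The token layout (0x80 = end of block, 0x81..0xFF = run of zeros, 0x00..0x7F = 7-bit signed coefficient) and the name decode_block are assumptions made up for this sketch, not the format used in the ROM.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical token layout (NOT the format from the post):
 *   0x80        end of block
 *   0x81..0xFF  skip (b - 0x80) zero coefficients
 *   0x00..0x7F  one coefficient, 7-bit two's complement (-64..63)
 */
#define TOK_EOB 0x80u

/* Decodes one 64-coefficient block; returns the number of bytes consumed. */
size_t decode_block(const uint8_t *src, size_t len, int16_t coef[64])
{
    size_t pos = 0;
    int idx = 0;

    memset(coef, 0, 64 * sizeof coef[0]);
    while (pos < len) {
        uint8_t b = src[pos++];
        if (b == TOK_EOB)
            break;                      /* rest of the block stays zero */
        if (b > TOK_EOB) {
            idx += b - TOK_EOB;         /* zero run: just advance the index */
        } else {
            if (idx >= 64)
                break;                  /* malformed stream, stop early */
            coef[idx++] = (int16_t)((b < 64) ? b : b - 128);
        }
    }
    return pos;
}
```

    The point of a layout like this is that the decoder never touches individual bits, which is the part of Huffman decoding that costs the most on a CPU like the SH2.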

    http://www.hyakushiki.net/misc/ldvid3.32x

    (palette doesn't get set correctly sometimes on real 32x, just reset and try again)

  2. #2
    WCPO Agent LinkueiBR's Avatar
    Join Date
    Oct 2013
    Posts
    982
    Rep Power
    41

    Default

    This is amazing and looks really great!

    I've always wanted to see an app like "Video 2 SEGA CD" or "SEGA CD 32x"... The 32x can show really good FMVs, even more so in SEGA CD 32x mode...

    Keep it up!
    Last edited by LinkueiBR; 07-25-2019 at 11:15 AM.
    VISUAL SHOCK!
    SPEED SHOCK!
    SOUND SHOCK!
    NOW IS TIME TO THE 68000 HEART ON FIRE!


    Shadow of the Beast II - Enhanced Colors:
    http://www.romhacking.net/hacks/2275/

    Sunset Riders - Enhanced Colors:
    http://www.romhacking.net/hacks/2287/

    Turrican - Fixed:
    http://www.romhacking.net/hacks/2535/

  3. #3
    Sports Talker
    Join Date
    Sep 2013
    Posts
    39
    Rep Power
    0

    Default

    I've been wondering how to speed this up. The CPU load comes from having to do 64 multiply+adds for every nonzero coefficient. Pre-scaled lookup tables can be used to avoid the multiply, although this doesn't help by itself because multiplying doesn't take long on the SH2 to begin with. But it can be combined with another trick: a 32-bit CPU can add two pairs of 16-bit words at the same time (for a restricted set of values that keeps the low word from overflowing into the high word). Almost like a SIMD instruction.
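    Here is a rough sketch of the packed-add idea in C. It is one possible way to do it, not necessarily what the ROM does: two signed 16-bit values share one 32-bit word, a plain 32-bit add sums both lanes, and the borrow from a negative low lane is undone when the lanes are read back. The names pack2, lane_lo, lane_hi and add_coefficient are made up for this example, and it only works while every lane sum stays inside the signed 16-bit range.

```c
#include <stdint.h>

/* Pack two signed 16-bit lanes as hi * 65536 + lo (modulo 2^32). */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(int32_t)hi << 16) + (uint32_t)(int32_t)lo;
}

/* Low lane: bottom 16 bits, sign-extended by hand. */
int32_t lane_lo(uint32_t w)
{
    int32_t lo = (int32_t)(w & 0xFFFFu);
    return (lo >= 0x8000) ? lo - 0x10000 : lo;
}

/* High lane: remove the low lane first so its borrow is cancelled. */
int32_t lane_hi(uint32_t w)
{
    int32_t hi = (int32_t)((w - (uint32_t)lane_lo(w)) >> 16);
    return (hi >= 0x8000) ? hi - 0x10000 : hi;
}

/* acc: 64 pixel accumulators stored as 32 packed words.
 * row: 64 pre-scaled basis values (basis * coefficient), also packed.
 * One nonzero coefficient costs 32 adds and no multiplies. */
void add_coefficient(uint32_t acc[32], const uint32_t row[32])
{
    for (int i = 0; i < 32; i++)
        acc[i] += row[i];
}
```

    For example, pack2(3, -1) + pack2(-2, 5) reads back as 1 in the high lane and 4 in the low lane. How the row for a given coefficient is found (which table, which index) is left out here on purpose.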

    Of course there isn't enough RAM to contain lookup tables for all 2K possible coefficient values so the range has to be limited with a special quantization table, and separate routines are used for positive/negative and big/small coefficients.
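    To put a rough number on the RAM problem, here is a back-of-the-envelope calculation. It assumes 16-bit table entries and one 64-entry row per (coefficient value, frequency position) pair; the real table layout isn't described in the post, so table_bytes and its parameters are illustrative only.

```c
#include <stdint.h>

enum {
    BLOCK_PIXELS = 64,  /* one pre-scaled value per pixel of an 8x8 block */
    ENTRY_BYTES  = 2    /* assumed 16-bit table entries */
};

/* Bytes needed for pre-scaled rows covering `values` coefficient values
 * at `positions` distinct frequency positions. */
uint32_t table_bytes(uint32_t values, uint32_t positions)
{
    return values * positions * BLOCK_PIXELS * ENTRY_BYTES;
}
```

    Under those assumptions, table_bytes(2048, 1) is 262144 bytes (256 KB) for a single frequency position, and table_bytes(2048, 64) is 16777216 bytes (16 MB) for all 64. Either way the range of values has to shrink a lot before the tables fit.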

    This routine would probably be a good bit faster on the Saturn with 32-bit SDRAM and not having to wait for framebuffers to swap before beginning the next frame. (and there would be more room for lookup tables)

    same video clip but higher FPS: http://www.hyakushiki.net/misc/ldvid8.32x
    different video clip: http://www.hyakushiki.net/misc/ldvidz.32x

  4. #4
    Hedgehog-in-Training Hedgehog-in-TrainingRoad Rasher
    Join Date
    Sep 2016
    Posts
    315
    Rep Power
    7

    Default

    DCT is far too slow for older processors. Array lookups aka vector quantization aka cinepak-style, like sega did, is about the only viable approach according to my research. I looked into what the N64 could play for FMV, and it could just barely play xvid 320x136@25fps (using nearly all cpu, meaning optimization would be required to add sound). I also tried fast wavelets from the libpgf project, as the image quality of compressed wavelets is more pleasing to the eye, but it only achieved about 10fps, too low.

    N64 is faster than Saturn and way more so for 32x. If N64 can barely handle DCT, it'd be advisable to use VQ for slower systems.

  5. #5
    Hero of Algol TrekkiesUnite118's Avatar
    Join Date
    May 2010
    Age
    31
    Posts
    8,087
    Rep Power
    121

    Default

    Quote Originally Posted by roce View Post
    DCT is far too slow for older processors. Array lookups aka vector quantization aka cinepak-style, like sega did, is about the only viable approach according to my research. I looked into what the N64 could play for FMV, and it could just barely play xvid 320x136@25fps (using nearly all cpu, meaning optimization would be required to add sound). I also tried fast wavelets from the libpgf project, as the image quality of compressed wavelets is more pleasing to the eye, but it only achieved about 10fps, too low.

    N64 is faster than Saturn and way more so for 32x. If N64 can barely handle DCT, it'd be advisable to use VQ for slower systems.
    Well, there is MPEG Sofdec for the Saturn that some games use with so-so results:
    https://www.youtube.com/watch?v=nz-XTb6PhpY

  6. #6
    Raging in the Streets Sik's Avatar
    Join Date
    Jan 2011
    Posts
    3,353
    Rep Power
    63

    Default

    As said earlier, multiply on SH-2 barely costs anything, and in fact look-up tables are problematic as you risk lots of cache misses (so they're only useful for things that are pretty hard to compute, like sines). Also, RAM on the 32X is pretty limited (256KB between both SH-2s, and ROM accesses are a tad slower, so you may want to avoid them for anything that's accessed a lot). I'd outright advise against look-up tables unless they shave off lots of calculations at once.

  7. #7
    Sports Talker
    Join Date
    Sep 2013
    Posts
    39
    Rep Power
    0

    Default

    Quote Originally Posted by roce View Post
    DCT is far too slow for older processors. Array lookups aka vector quantization aka cinepak-style, like sega did, is about the only viable approach according to my research. I looked into what the N64 could play for FMV, and it could just barely play xvid 320x136@25fps (using nearly all cpu, meaning optimization would be required to add sound). I also tried fast wavelets from the libpgf project, as the image quality of compressed wavelets is more pleasing to the eye, but it only achieved about 10fps, too low.

    N64 is faster than Saturn and way more so for 32x. If N64 can barely handle DCT, it'd be advisable to use VQ for slower systems.
    Conventional wisdom was that you needed a Pentium-90 to do MPEG1, which is far beyond 32x/Saturn although I don't know about N64. What I tried to do is like a simplified MJPEG, just to see how far the dual SH2s could be pushed. But it could still use a bit more oomph.

    A VQ-type codec is already proven on systems of this generation, so there's not much to explore there.

    Now that I think about it, there's no law that says DCTs have to use an 8x8 block though is there? Smaller blocks would take less CPU time, although compression wouldn't be as effective. But less time spent doing multiply-adds is more time to spend on a better packing scheme...

  8. #8
    Hedgehog-in-Training Hedgehog-in-TrainingRoad Rasher
    Join Date
    Sep 2016
    Posts
    315
    Rep Power
    7

    Default

    Recent codecs use larger DCT blocks indeed, IIRC up to 64x64, and some older ones used 4x8 or 8x4.

  9. #9
    Raging in the Streets Sik's Avatar
    Join Date
    Jan 2011
    Posts
    3,353
    Rep Power
    63

    Default

    Quote Originally Posted by bakemono View Post
    Now that I think about it, there's no law that says DCTs have to use an 8x8 block though is there? Smaller blocks would take less CPU time, although compression wouldn't be as effective. But less time spent doing multiply-adds is more time to spend on a better packing scheme...
    But if you do smaller blocks then you need to compute more blocks to get the same resolution, so you'd negate the advantage I imagine.

    Lowering the resolution may actually work better…

  10. #10
    Hero of Algol TrekkiesUnite118's Avatar
    Join Date
    May 2010
    Age
    31
    Posts
    8,087
    Rep Power
    121

    Default

    Not sure how relevant this is, but EA also made their own codec for some of their Saturn games that's supposedly MPEG based:

    https://wiki.multimedia.cx/index.php...ronic_Arts_TGQ

    Warcraft II and Crusader No Remorse use it. The results don't seem to be too bad.

  11. #11
    Sports Talker
    Join Date
    Sep 2013
    Posts
    39
    Rep Power
    0

    Default

    Quote Originally Posted by Sik View Post
    But if you do smaller blocks then you need to compute more blocks to get the same resolution, so you'd negate the advantage I imagine.

    Lowering the resolution may actually work better…
    When you convert between pixel values and DCT coefficients, the number of computations actually scales with the square of the data size (at least in the worst-case scenario). That means an 8x8=64 block might need 64x64=4096 calculations. A 4x8 block would need at most 32x32=1024 calculations, so two of those blocks added together (2048) is still less than a single 8x8 block. The reason is that each value in the destination data is affected by EVERY value in the source data. Take a look at the wiki page and you can see all the scary math: https://en.wikipedia.org/wiki/Discrete_cosine_transform
    This is also why I use lookup tables, since cosine and division would be required otherwise. The table doesn't fit in the cache, so it does miss, but on the plus side the data is always used in order, so the burst read from SDRAM works effectively to get the needed values.
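    The counting argument above can be written out as a couple of tiny helpers. This only models the direct method, where every nonzero coefficient touches every pixel of its block; worst_case_macs and sparse_macs are names invented for this sketch.

```c
#include <stdint.h>

/* Worst case for the direct method: every one of the w*h coefficients
 * is nonzero and each contributes to all w*h pixels. */
uint32_t worst_case_macs(uint32_t w, uint32_t h)
{
    uint32_t n = w * h;
    return n * n;
}

/* Typical case after quantization: only `nonzero` coefficients survive,
 * and each still costs one multiply-add per pixel. */
uint32_t sparse_macs(uint32_t w, uint32_t h, uint32_t nonzero)
{
    return nonzero * w * h;
}
```

    worst_case_macs(8, 8) gives 4096, while 2 * worst_case_macs(4, 8) gives 2048 for the same pixel area. With only a handful of nonzero coefficients per block the real cost is much lower in both cases, e.g. sparse_macs(8, 8, 5) is 320.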
    Quote Originally Posted by TrekkiesUnite118 View Post
    Not sure how relevant this is, but EA also made their own codec for some of their Saturn games that's supposedly MPEG based:

    https://wiki.multimedia.cx/index.php...ronic_Arts_TGQ
    They use 8x8 DCTs and variable-length codes although different than the JPEG ones. Decoding that at 15fps with YUV-RGB conversion seems like no small feat.
