This may be kind of a dumb idea, but how about saving a ton of triangles (or maybe quadrilaterals?) at varying angles/proportions in ROM to try and pre-render the polygons instead?
It would be an enormous waste of space but the Neo can handle 1024mb of graphics without bankswitching and ROM is cheap today.
The 68k would then only need to set the sprites' priority, zoom, and palette (out of 256). Even if you include a single background layer and a few bitmaps, you would still have around 300 "polygons" left to play with, which is probably more than what StarFox used in normal gameplay.
Could this work or is it just impossible?
On another note, in my opinion SNK missed a huge opportunity by not putting a 32-bit/3D processor in the Neo CD. Imagine all that memory used for textured polygons...
Scraping together 1GB of NOR flash or some EPROMs will cost a nice chunk of money...
Death To MP3,
:3
What are you reading? You won't understand it anyway. "Gnirts test is a shit" New and growing website of total jawusumness!
Is it an asynchronicity issue? (differing clock speeds meaning mismatched memory accesses . . . the same thing that would prevent interleaving with the MD VDP's DMA transfer engine -unless they'd included a mode where the DMA mechanism ran at a speed separate from the rest of the VDP, or at a different speed in vblank -of course, you'd need fast enough ROM to allow such an interleaved mode)
Anyway, yeah, double buffering would make things a lot easier in that case . . . except I thought the Neo's VDP ran at 6 or 12 MHz, similar to the CPU. (which, again, should facilitate such interleaving -except similar clock speeds with total asynchronicity would still prevent the exact timing needed for such sharing)
You could use 1 bank with contention and only have the CPU access in vblank, but there's little need to even consider that anyway. (for that matter, an interleaving scheme would be a bit unnecessary to consider even if it did work since we're talking about the Neo Geo and thus don't need to consider special low-cost concerns, even if they were using it in a game back then)
Actually, if they WERE using it in a game back then, they probably wouldn't burden the CPU with so much and probably would have put a bunch more coprocessor support on-cart instead. (a fast CPU, TMS340, DSP, and/or blitter, or something totally custom)
Actually, I'm a little surprised that (with all the expense of Neo Geo games) they didn't push something like that for certain games that would benefit from such effects.
That includes VRAM on-cart? On NES you just do your write through the PPU into VRAM, no extra headache or anything else...
Even for solid filling or such (that could be slower with the ASIC), could there be any advantage to offloading that drawing to the ASIC to free up the 68k for other things? (primitive rendering pipelining of sorts: send commands to the ASIC to render, go back to processing other things, and then send another command as needed -especially if the ASIC supports lists/chains of commands to be sent at a time)
Would that method eat up more RAM (for buffering the textures) than the "normal" line by line method?
From the sound of it, that same method would also be useful for the Jaguar (the blitter's scaling/rotation/texture feature is rather like the Sega CD ASIC). Or maybe even for a triangle rasterizer on the Saturn or 3DO. (as an alternative to folded quads or line by line rasterization)
Yeah, for a while I thought the small polygons (ships/objects/etc) were realtime polygons, but looking carefully it's pretty obvious that they're scaled animated sprites (carefully optimized for flips at that).
It makes sense too given that would give a much faster/smoother result and leave CPU time for other things. (especially since it's constantly streaming video data, meaning the sub-CPU is going to be occupied much of the time)
That and the original Silpheed also used pre-rendered polygonal objects as "sprites" (really software blitter objects given the platforms it was released for -and the original PC8801 in particular), and probably not realtime scaled either, but plain animation due to resource limitations. (it also uses a mostly solid color BG for simplified blitting and 8 pixel movement increments in the PC88 version -not sure about EGA- to work easily on byte boundaries -it's a planar bitmap)
No command lists for the ASIC... it's one of the improvements for VDP1 on the Saturn.
The 68000 stores to the appropriate registers, one of which starts the operation. When done, the ASIC sets a flag and (if enabled) generates a Level 1 interrupt. So lists of operations can be handled asynchronously... via interrupt handler, but they still require CPU intervention.
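To make the "no command lists, but you can chain from the interrupt handler" point concrete, here is a toy Python simulation. It is only a sketch: `FakeAsic`, `start`, `finish`, and `run_queue` are made-up names for illustration and do not correspond to real Sega CD registers or any existing library.

```python
from collections import deque

class FakeAsic:
    """Toy stand-in for the ASIC: one operation at a time, then a 'done' callback."""
    def __init__(self):
        self.busy = False
        self.current = None
        self.on_done = None      # plays the role of the Level 1 interrupt handler
        self.completed = []

    def start(self, op):
        # On real hardware, storing to the last register would start the operation.
        assert not self.busy, "only one operation can be in flight"
        self.busy = True
        self.current = op

    def finish(self):
        # Simulates the hardware finishing: record the result, raise the "interrupt".
        self.completed.append(self.current)
        self.busy = False
        if self.on_done:
            self.on_done()

def run_queue(asic, ops):
    pending = deque(ops)

    def handler():
        # CPU intervention: the handler itself has to load the next operation.
        if pending:
            asic.start(pending.popleft())

    asic.on_done = handler
    handler()                    # the first operation is started by hand
    while asic.busy:             # the main program would do other work here
        asic.finish()
    return asic.completed
```

The list of operations lives in CPU memory and the handler feeds them in one by one, which is the CPU intervention that a hardware command list would remove.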
Yes. No packing the textures. However, remember that the "texture" in the SCD is a stamp map - you can reuse tiles; the tiles can also be flipped. So you just need to be more creative about your textures.
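A rough Python sketch of why a stamp map can save RAM: the texture is a grid of references to a small set of stamps, and a reference can be reused or flipped. The 2x2 stamp size, the `(index, hflip)` entry layout, and the names `stamps`, `stamp_map`, `texel`, and `render` are assumptions chosen for brevity, not the actual Sega CD format.

```python
STAMP = 2  # stamp size in pixels for this toy example (real stamps are larger)

stamps = [
    [[0, 0], [0, 0]],   # stamp 0: blank
    [[1, 2], [3, 4]],   # stamp 1: some pattern
]

# Each map entry is (stamp index, horizontally flipped?).
stamp_map = [
    [(1, False), (1, True)],
    [(0, False), (1, False)],
]

def texel(x, y):
    """Look up one texture pixel through the stamp map."""
    idx, hflip = stamp_map[y // STAMP][x // STAMP]
    sx, sy = x % STAMP, y % STAMP
    if hflip:
        sx = STAMP - 1 - sx
    return stamps[idx][sy][sx]

def render():
    """Expand the whole stamp map into a flat 2D texture."""
    size = STAMP * len(stamp_map)
    return [[texel(x, y) for x in range(size)] for y in range(size)]
```

Here a 4x4 texture is described by two stored stamps instead of four, because stamp 1 is used three times (once mirrored).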
Yeah, I could see the Saturn using this in particular to avoid needing to reprogram triangle based engines. It's probably easier to split the textures into triangles than to make pre-distorted textures for quads. You'll have essentially the same render time as they're both overdrawing, but overdrawing 0s is probably just slightly faster.
You know, how many pixels does the ASIC write per access? Because memory is 16-bit, and having to access the same word four times (once per pixel) seems... stupid.
You'd have to do four writes with the CPU (or set aside a register to accumulate four pixels) if you do anything other than a straight copy. Remember that the pixels can come from non-adjacent locations in the texture. The four pixels may be in a row in the destination, but probably not in the source.
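A minimal Python sketch of the "set aside a register to accumulate four pixels" idea: gather 4-bit pixels (which may have come from anywhere in the texture) and emit one 16-bit word per four pixels. The function name `pack_pixels` and the choice of putting the leftmost pixel in the top nibble are assumptions for illustration.

```python
def pack_pixels(pixels):
    """Pack 4-bit pixels into 16-bit words, four per word, leftmost pixel in the top nibble."""
    words = []
    acc = 0        # the "register" accumulating pixels
    count = 0
    for p in pixels:
        acc = (acc << 4) | (p & 0xF)
        count += 1
        if count == 4:
            words.append(acc)   # one 16-bit write instead of four
            acc, count = 0, 0
    if count:
        # Pad a partial word with zero pixels on the right.
        words.append(acc << (4 * (4 - count)))
    return words
```

For example, `pack_pixels([1, 2, 3, 4])` produces the single word `0x1234`.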
But that's one reason GPUs went VERY quickly to only supporting 16 or 32 bit output - the pixels match the bus better. Not the only reason, of course, but if you're only going to write one pixel, might as well match the bus width.
Touché, but then shouldn't we take into account the calculation time for both reading and writing? Also, what's the clock speed of the ASIC? How many cycles does it take to draw four pixels?
Technically that isn't true anymore since everything goes through a cache that's using a larger data bus. It's more something to do with alignment than with bus size (also, 8-bit graphics mode had the issue of being paletted, which gave lots of trouble with graphics calculations...).
The timing for the ASIC does take into account writes as well as reads... as well as CPU accesses of word ram, and refresh cycles. The ASIC clock is the same as the CPU, 12.5MHz. The exact timing formulas are illegible in the docs that exist, but at a guess, I'd say it's 3 clocks for every access made, be it reading a pixel, writing a pixel, CPU access, or refresh. There are some boundary issues that affect the clock, and also some differences for the memory mode, but like I said, the current scan of the docs can't be read for this one area. I do wish someone came up with a clean scan as I'd love to see the exact timing.
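To put a rough number on that guess, here is a back-of-the-envelope calculation in Python. It assumes the guessed 3 clocks per access, counts only one texture read and one destination write per pixel, and ignores CPU accesses, refresh, and any stamp map lookups, so treat it as an optimistic ceiling rather than a measured figure. The function names are made up for this sketch.

```python
CLOCK_HZ = 12_500_000
CLOCKS_PER_ACCESS = 3      # the guess from the post, not a documented figure

def pixels_per_second(accesses_per_pixel=2):
    """Peak pixel rate if every pixel costs the given number of memory accesses."""
    return CLOCK_HZ / (CLOCKS_PER_ACCESS * accesses_per_pixel)

def pixels_per_frame(fps=60, accesses_per_pixel=2):
    """Same ceiling expressed per displayed frame."""
    return pixels_per_second(accesses_per_pixel) / fps
```

Under those assumptions the ceiling is about 2.08 million pixels per second, or roughly 34,700 pixels per 60 Hz frame.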
Yes, modern GPUs don't have this issue anymore... among many others. It was more an issue in early generations of accelerators.
The only use of the 16-bit wide memory is for the CPU and Sub-CPU to access.
The ASIC does pixel-by-pixel rendering. It can't work on bytes or words, just nybbles, and it doesn't support more flexible blitting operations, just copying/moving texture stamps along with the scaling/rotation algorithm to stretch and rotate those stamps.
If it supported simple block copy or block fill (or copy/fill with hardware masking support for that matter), that could have been implemented on 16-bit words. The same goes if the scaling/rotation rendering mechanism supported modes for working with 8 or 16-bit pixels rather than just 4-bit ones (which, of course, would only be useful for drawing dithered objects on the Genesis -with pairs or groups of 4 pixels- or for future add-ons like the 32x).
It's a bit of a shame it didn't at least have simple DMA block copy like the MD VDP (but a bit faster), especially with connectivity to program RAM to allow much faster updates to word RAM than the CD's 68k can manage. (a 16-bit 12.5 MHz block copy engine should manage at least 4.17 MB/s assuming 3 cycle long random accesses -cycle times for 80 ns FPM RAM should be about 180 ns, so 3 cycles is as fast as you can go- with no special timing/pipelining to set the destination address concurrently and cut out wait states for writes, and perhaps 6.25 MB/s if such pipelining is used, 2x that peak if fast page mode was supported)
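The bandwidth figures above can be reproduced with a little arithmetic. This Python sketch assumes each copied 16-bit word costs one read plus one write; the particular cycle splits used for the pipelined and page mode cases are my own guesses at what yields the quoted numbers, not documented timings.

```python
CLOCK_HZ = 12_500_000
BYTES_PER_WORD = 2

def copy_rate_mb_s(read_cycles, write_cycles):
    """Block copy throughput in MB/s for one read plus one write per 16-bit word."""
    words_per_second = CLOCK_HZ / (read_cycles + write_cycles)
    return words_per_second * BYTES_PER_WORD / 1_000_000

plain = copy_rate_mb_s(3, 3)       # full random access for both read and write
pipelined = copy_rate_mb_s(3, 1)   # assumed: write overlaps and costs 1 extra cycle
page_mode = copy_rate_mb_s(1, 1)   # assumed: 2 cycles per word with page mode hits
```

That gives roughly 4.17 MB/s, 6.25 MB/s, and 12.5 MB/s respectively.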
Using the texture mapping feature with 16-bit reads or writes with 4-bit pixels would require a significantly more advanced chip with read/write buffers supporting fetching up to 4 pixels at a time and building a row of up to 4 pixels to spit back out. (or just a write buffer to make multiple single pixel reads, but still allow up to 4 pixels to be buffered for output)
The Saturn, Jaguar, and (I think) 3DO all lack such a feature, but the Playstation and Jaguar II (unreleased) do, and I believe most/all later platforms do as well. (along with far more sophisticated caching and pipelining)
In fact, the Jaguar's texture mapping feature works very much like the Sega CD's ASIC (rendering scaled/rotated rectangles) except it adds 8bpp to 16bpp indexing as well as the ability to work with variable pixel depths of 8, 16, or 32-bits (it might do 4-bits too, but that would be to a 16 color framebuffer since I don't think there's any 16-color indexing support, unfortunately -there is for "sprite" objects rendered by the object processor, but not for the blitter, and no indexing for 32-bits either, only 8 to 16 bit indexing).
So in the jaguar's case, you'd obviously be making use of 16-bit writes most of the time (and either 8 or 16-bit reads -probably 8-bit indexed to save space), but still only a fraction of the main bus, and also unbuffered to make use of fast page mode, so each read and write takes 5 cycles. (much like how the MCD's ASIC takes 3 cycles per read/write -though it may have some automatic clearing support for overwrite, and that could make use of fast page mode for the second access and only add 1 more cycle rather than 3 -same for the Jaguar, if overwrites occur)
However, that's not the only area you could speed things up, the 3DO and Saturn both separate source (texture memory) and destination (framebuffer) into different banks/buses of RAM to allow fast page mode much of the time (without more advanced design to allow heavy on-chip line buffers and caching like the PSX did -or the jaguar for many other operations, just not texture mapping -the Jaguar also could have done that and actually supports 2-bank interleaving so it wouldn't even need the complexity of an added bus, but would need the cost of more RAM chips for the 2nd bank, though the CoJag did it and with VRAM at that -as it is, the 4k GPU scratchpad can be used to accelerate textures somewhat).
Anyway, for the MCD, that's pretty significant as you already have a 2nd bus to work with like the 3DO (halt the CPU to do a texture fetch in CPU RAM), or an even more useful option could have been allowing the 2 word RAM chips (2 64kx16-bit DRAMs) to support interleaved fast page mode accesses (employ a DRAM controller capable of holding 1 page open on each chip simultaneously) and thus allow considerable use of fast page mode (80 ns) reads/writes from one bank to another. (one as source and another as destination)
Not only would that be very useful for accelerating rendering in general (with textures in one bank and buffer in the other), but it also could have been useful for the 2-pass rendering method Chilly Willy and I were discussing a while back. (use the ASIC for column rendered games sideways and then do a 2nd pass to rotate the lines to columns and complete a final pass with the line based floor/ceiling -for a doom-like game- ) Of course, for such a scheme, you'd need to have some textures in each bank and buffer space in each as well. (so RAM use is a bit tighter and updates by the 68k more frequent -still probably more efficient than contending over the 68k's bus)
Yes, and that's what I meant by a write buffer (a register to accumulate pixels for longer words/phrases -and same issue with the Jaguar, except you'd want a 64-bit write buffer there).
And yes, a write buffer would be more often needed than for reads, though having both could still be useful (multiple pixels and work with them internally to fill the write buffer and do additional fetches as needed -more useful in some cases than other). You could also do buffers longer than the bus width to take advantage of page mode. (if you had a 64-bit write buffer on the CD ASIC -or 256 bits in the jaguar-, that would mean 1 16-bit random write followed by 3 page mode writes -of course, only at peak use in cases where the destination would be at least 16 pixels wide)
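A small Python sketch of what a wider write buffer would buy on the destination side. The cycle costs (3 for a random write, 1 for a page mode write) are assumptions in line with the figures discussed in this thread, and `write_cycles` is a made-up helper that only counts destination writes for a run of adjacent 4-bit pixels.

```python
def write_cycles(pixels, buffer_bits, pixel_bits=4, bus_bits=16,
                 random_cycles=3, page_cycles=1):
    """Cycles spent on destination writes only, for a run of adjacent pixels."""
    if buffer_bits < bus_bits:
        # No useful buffer: one random write per pixel.
        return pixels * random_cycles
    pixels_per_flush = buffer_bits // pixel_bits
    words_per_flush = buffer_bits // bus_bits
    flushes = -(-pixels // pixels_per_flush)   # ceiling division
    # First word of each flush is a random write, the rest are page mode writes.
    per_flush = random_cycles + (words_per_flush - 1) * page_cycles
    return flushes * per_flush
```

For a 16-pixel run this comes out to 48 cycles unbuffered, 12 with a 16-bit buffer, and 6 with a 64-bit buffer (1 random write followed by 3 page mode writes).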
Except they also moved very quickly to 64-bits (and not just for faster framebuffer reads and 2D blitting), and that requires additional buffers and caching to make proper use of. (just as systems working with 4/8/16-bit pixels on a 32-bit bus would -like the Playstation)
And then moved to 128-bits in the late 1990s, requiring even heavier buffering of phrases of pixels to make full use of the bandwidth. (not to mention buffering/caching to make optimal use of page mode accesses to DRAM -even more so if working with shared memory with the CPU, like in some laptops and the Xbox or 360 for that matter)
And more recently 256 bits, but with 128 and occasional 64-bit examples still floating around some newer stuff. (especially looking at ATi and NVidia's offerings over the last 2 decades)
The N64's RSP (not sure about RDP) worked on 128-bits internally, but 8 (or 9) externally.
Dreamcast went 64-bit, PS2 was 16-bit for main with RDRAM (which was also 16 bits as standard on PCs -not sure what the width of the on-die GPU memory is in the PS2), Xbox was 128-bit (dual channel), and I'm not sure what the GC used. (either for external DRAM or the on-die PSRAM for the CPU or GPU -same for Wii)
I think both the 360 and PS3's GPUs use 256-bit external buses.
The Saturn was one of the very few examples of a major 3D accelerator product where the blitter/GPU bus was the width of the pixels (assuming there's not a write buffer for 8bpp mode). The Jaguar had buffering for some 3D and most 2D operations to work on 64-bits and use fast page mode (but not for texture mapping, which Flare guessed would be more of a "spice" used sparingly when they were laying down the design in 1990), and the PSX GPU definitely buffers heavily for both 32-bit read/write and page mode use.
The original Rage accelerator also seems to have buffered for 16 (not sure about 8) bit pixels/texels with 32-bit reads/writes and page mode use. (actually, wiki's specs on the ATi GPU page lists 320 MB/s max bandwidth, but with 40 MHz EDO DRAM, that would need a 64-bit bus and not 32-bit as mentioned . . . odd -it also mentions 40 M pixels/texels per second, though that figure would conform more to 32-bits at 40 MHz for 16-bit pixels -that would be like the PSX's 33M 16-bit texels per second on a 33 MHz 32-bit bus, with that also being the theoretical peak and not real-world performance with refresh, page breaks, and framebuffer scanning overhead taken into account -granted, you WOULD reach that speed for short bursts in real-world performance, if not actually faster for cached textures)
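The bus width reasoning above is easy to check with a couple of lines of Python. This sketch assumes one transfer per clock at peak, and that each textured pixel costs one texel read plus one pixel write of the same size; the function names are made up for illustration.

```python
def peak_mb_s(bus_bits, clock_mhz):
    """Peak bandwidth in MB/s with one full-width transfer per clock."""
    return bus_bits / 8 * clock_mhz

def peak_mtexels_s(bus_bits, clock_mhz, texel_bits=16):
    """Peak textured pixels per second (millions): one read plus one write per pixel."""
    return peak_mb_s(bus_bits, clock_mhz) * 8 / (2 * texel_bits)
```

With these assumptions, 320 MB/s at 40 MHz does need a 64-bit bus (`peak_mb_s(64, 40)`), while 40 M texels/s matches a 32-bit bus at 40 MHz and 33 M texels/s matches a 32-bit bus at 33 MHz.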
Yes, assuming the approximate 180 ns memory cycle time for 80 ns FPM DRAM is accurate, that would mean no fewer than 3 cycles per complete read/write. (aside from possible cases of page mode use -which may have been employed for drawing on top of other graphics -where a few consecutive reads/writes would be useful for clearing and writing -it would be useful for masking if you ever did more than 1 pixel at a time, but that's a bit moot with single pixels-ie just don't draw transparent pixels at all)
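The "no fewer than 3 cycles" figure follows from rounding the memory cycle time up to whole clocks. A minimal Python sketch, assuming the approximate 180 ns cycle time quoted above and a 12.5 MHz clock (80 ns per cycle); `cycles_per_access` is a hypothetical helper name.

```python
import math

def cycles_per_access(cycle_time_ns, clock_hz):
    """Whole clock cycles needed to cover one memory cycle."""
    clock_period_ns = 1e9 / clock_hz
    return math.ceil(cycle_time_ns / clock_period_ns)
```

`cycles_per_access(180, 12_500_000)` gives 3, while an 80 ns page mode access would fit in a single cycle.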
Depending how refresh is handled, it might not be a significant hit at all, but another consideration would also be any additional cycles lost to processing between reads and writes. (I believe the jaguar's blitter loses one more cycle between reads and writes for its textures, no idea what the Sega CD is like though -or the Saturn VDP1 for that matter)
Except the 68000 simply can't deal with memory that doesn't have a 16-bit data bus, period (Z80 RAM has an 8-bit data bus, but the hack needed to get that to work means the 68000 can't do a word access to it). The ASIC uses the same memory the 68000 accesses, so it must be 16-bit too. If the ASIC only did 4-bit accesses, that means the memory must support 4-bit writes too. I really doubt that's the case, as far as I know all 16-bit RAM only provided at most 8-bit writes (with the /USB and /LSB lines). So, it must be accessing more than 4 bits at once.
Also here's what I mean. With your logic, it'd go like this:
- Read word from bitmap
- Read word from texture
- Write word into bitmap
- Read word from bitmap
- Read word from texture
- Write word into bitmap
- Read word from bitmap
- Read word from texture
- Write word into bitmap
- Read word from bitmap
- Read word from texture
- Write word into bitmap
What I mean is that the ASIC is probably doing this:
- Read word from texture
- Read word from texture
- Read word from texture
- Read word from texture
- Write word into bitmap
See my point?
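Counting the accesses in the two lists above makes the difference explicit. This Python sketch just tallies memory accesses per run of pixels under each model; the function names are made up, and the 4 pixels per word figure assumes 4-bit pixels in 16-bit memory.

```python
def accesses_rmw(pixels):
    """First model: read bitmap word, read texture word, write bitmap word, per pixel."""
    return 3 * pixels

def accesses_buffered(pixels, pixels_per_word=4):
    """Second model: one texture read per pixel, one bitmap write per full word."""
    words = -(-pixels // pixels_per_word)   # ceiling division
    return pixels + words
```

For four pixels that is 12 accesses versus 5.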
If doing actual 4-bit aligned addressing and individual 4-bit reads/writes are an issue, it might just be working on 4 bits and then masking that to a word (or possibly byte) aligned output.
To do the latter it would need more advanced logic to support a multi-pixel wide write buffer, that's something the jaguar also lacked that would have helped a lot with texture mapping bandwidth.
And those are things I already mentioned . . . they're actually things that came up on Atariage in context of the Jaguar. (again, the Jag's texture mapping works rather like the ASIC with single pixels at a time for scaled/rotated rectangular stamps -with added CPU/GPU grunt for warped 3D- but it can work on different pixel depths up to 32-bits, but all of those depths are still single pixel reads/writes -usually 8 or 16 bit reads are used and 16-bit writes)
It was explicitly mentioned (in that Jaguar discussion) that adding a multi-pixel write buffer would have been one of the simpler changes to the blitter to dramatically improve texture mapping performance on the Jaguar. (for 16-bit pixels, a 64-bit write buffer would be about 2.5x faster than single pixel texture mapping)
I don't think the Saturn's VDP1 even supports a word buffer for writes in 8bpp mode (though that would make 8bpp rendering more attractive), not sure about the 3DO. (it's a 32-bit bus and normally using a 16bpp framebuffer, so 2 pixels buffered maximum)
Also, why would you need to read a word from the bitmap if you positively knew it was going to be zeros (for non overwritten scaled objects -like for non overlapping objects on sprite or BG cells)?
And for cases where you DO want overwrites, wouldn't it be more efficient to read the bitmap (destination) after reading the texture? (since the next access will be to that same address for the bitmap and thus facilitate page mode operation)
You're going by the assumption the ASIC is just 4-bit. In fact, we say ASIC to refer to the renderer, but if I recall correctly the ASIC actually contained all the custom hardware in the Mega CD (much like its ASIC counterpart in the Mega Drive side), so it's already pretty complex for starters. It's still less complex than the ASICs in the MD though, meaning it should have had enough die space for 12 more bits...
Because of your own flawed assumption. If the ASIC only manipulates 4 bits at once, but memory needs 16-bit, it needs to buffer 16-bit of data from memory, modify the 4-bit that are affected, then write them back. This also means the ASIC already has a 16-bit buffer to hold the data... by which point you may as well just modify all 16-bit at once. See my logic?
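The read-modify-write sequence described above can be sketched in a few lines of Python. This assumes 4-bit pixels packed four to a 16-bit word with the leftmost pixel in the top nibble; `write_pixel_rmw` and the list-of-words memory model are purely illustrative.

```python
def write_pixel_rmw(memory, pixel_index, value):
    """Change one 4-bit pixel in 16-bit word memory via read, modify, write."""
    word_index, slot = divmod(pixel_index, 4)
    shift = 4 * (3 - slot)                 # slot 0 is the top nibble
    word = memory[word_index]              # read the whole 16-bit word
    word = (word & ~(0xF << shift) & 0xFFFF) | ((value & 0xF) << shift)  # modify 4 bits
    memory[word_index] = word              # write the whole word back
```

Every single-pixel write costs a full word read and a full word write, which is why holding the word and changing all four pixels before writing it back is the cheaper option when the pixels are adjacent.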
My point still stands, it's still doing more reads than it could when not doing overwrites.