

Thread: Cycles, cycles, and cycles!

  1. #1
    ding-doaw Raging in the Streets tomaitheous's Avatar
    Join Date
    Sep 2007
    Location
    Sonoran Desert
    Age
    41
    Posts
    3,981
    Rep Power
    74

    Default Cycles, cycles, and cycles!

    Alright, you mothersuckas! Time to get down to brass tacks.

Stef: Touko said you have some pixel clear / pixel fill calculations for the MD. How fast can you clear a 4-bit-per-pixel screen buffer in local CPU RAM (say, something like 256x192)?


Also, this thread is for discussing cycle times, effects, tricks, etc. - anything related, whether it's for the MD or just possible on the MD. Both the MD and PCE are super lacking in the demo department. Well, Overdrive came out, but the list is still pretty small and not that impressive (excluding Overdrive). So... yeah.

  2. #2
    Outrunner Stef's Avatar
    Join Date
    Aug 2011
    Location
    France
    Posts
    602
    Rep Power
    25

    Default

I think I can start by copy/pasting this:

    Quote Originally Posted by Stef
Just to give you an idea of where I am with that:

- SGDK is able to transform about 10000 vertices/second (3D rotation + translation plus the 2D projection).
That gives a maximum of 500 vertices per frame at a 20 FPS refresh rate, but that is if you do only that, without any rendering or anything XD
And when you manipulate objects, a lot of overhead comes in: you need to set up several contexts, combine transformations... so the usable, realistic performance is lower than that. I guess we can probably use scenes with a maximum of 100 to 150 vertices per frame. Note that these numbers are for real, complete 3D transformations, and maybe we can find tricks to do only partial transformations where possible. For instance, the Starfox case does not need any rotation for the world (only for objects).

- The polygon filling code is fast too, but I can hardly give an estimate of the fill rate. It really depends on the size of the polygon, the clipping overhead and other stuff. Tiny polygons are time killers, as you process many things only to render very few pixels...
I guess in the ideal case (a large or very large polygon) the fill rate can be up to 5 MP/s (very rough estimate), but realistically I think 2 MP/s is a better average estimate. If we take the base resolution, 256 * 160 = 40960 pixels, that means we can fill the whole screen about 50 times per second without overdraw... Of course, again, that is very dependent on polygon size.

The reality is that we have to share the CPU time with other heavy tasks as well:
- Transfer and convert the main memory bitmap buffer to video memory tiles.
- Clear the bitmap buffer each frame.
- Game logic, 3D collisions...

For the transfer and conversion, I am doing it on the 68k. I would really like to find a trick in the tilemap arrangement and bitmap rendering to be able to use the DMA here; that would give an important speed boost (a factor > 2x). Currently the transfer and conversion code eats all the blank time, which represents almost 40% of the total CPU time!
Also, the bitmap clear operation eats many 68k cycles. I tried to implement a Fast Fill Bitmap mode some time ago, which was really nice for fast clear operations, but the drawing part brought so much overhead that it ended up being too complex and slower than optimized "brute force" classic rendering :-/

    Quote Originally Posted by gasega68k
Wow, that's a lot of information, Stef.
Well, for the rotation and projection, from what you say I think your code is faster, but in my case I can tell you that in terms of cycle count it is close to 1000 cycles per vertex (depending on the multiplications).
The polygon filling code, I can also tell you in terms of cycles, is around 722 cycles per line in the worst case (large polygons), and about 266 cycles in the best case (small polygons).
I've done demos at 256 * 128 and 256 * 160; in fact the Starfox demo I did uses 256 * 160, like the demo you did, but I use a resolution of 320 * 224 with a "window" of 256 * 160.

To transfer to VRAM I don't use DMA, because the bitmap buffer is not organized in a way that allows DMA, but I use a little trick to transfer the bitmap buffer to VRAM as fast as possible. This is what I do: first I split the bitmap buffer into small 1 KB blocks (for 256 * 128 there are 16 blocks and for 256 * 160 there are 20 blocks). When I am about to move one of these blocks to VRAM, I first read Vcount to see whether we are in the "blanking" period (256 * 160 gives 102 lines of "blanking"). If there is enough time to transfer one of these 1 KB blocks (about 11 lines are needed), I disable the display to transfer the block at maximum speed and then enable the display again; but if there is not enough time, I do not disable the display and the block is transferred more slowly.
To clear the buffer, I use the MOVEM instruction; it takes about 74 lines in 256 * 128 mode and about 92 lines in 256 * 160 mode.
I'm sure I can improve the code for filling polygons, and the code for the rotation and projection.

I'll create a new thread in a few hours to show some of the demos I have done; the first will be the Starfox demo.
The first version of this Starfox was from October 2011 and the last modification I made was in June 2013, but this demo is fully done in 3D, i.e. you can move in any direction.

    Quote Originally Posted by Stef
I tried to count the number of cycles taken by my 3D transform methods and I obtained 814 cycles just for the vertex transformation and almost 400 cycles for the 2D vertex projection, which is actually more than yours! But I counted 70 cycles for the MULS instruction and 160 cycles for DIVS, which is pessimistic and probably more than the average case (actually it is, as I benchmarked about 10000 vertices transformed per second, which means close to 800 cycles instead of the 1200 counted here).
    You can see the source here : https://code.google.com/p/sgdk/sourc...rc/maths3D_a.s

About the polygon filling, I tried to do the count on a per-line basis. I think I have a minimum of 250/300 cycles per line and a maximum of 650/700 cycles... It's funny to see we have really close numbers here; I guess our code is somehow similar
Same for the bitmap clear stuff: I heavily used the MOVEM instruction to make it as fast as possible.
    Again, you can see the code here : https://code.google.com/p/sgdk/sourc...nk/src/bmp_a.s


I think there is no way to use the DMA for the transfer, as it seems you cannot use it either, or not without heavy changes in the bitmap rendering code.
I'm also using the extended blank area to transfer to VRAM at full speed (at least, full 68k speed). That's why the Bitmap engine uses a 256x160 resolution; I actually split the transfer over 3 frames on NTSC and 2 frames on PAL systems, which also means you are limited to a maximum of 20 FPS on NTSC and 25 FPS on PAL.
See the bitmap engine description in the bmp.h header file for more details:
    https://code.google.com/p/sgdk/sourc.../include/bmp.h

Funny enough, it seems we have done very similar stuff after all; I'm not the only crazy guy attempting 3D on the Sega Genesis
I'm really impatient to see your own Starfox / 3D demo stuff, by the way
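To give a concrete picture of the per-vertex work being benchmarked above, here is a minimal C sketch of a fixed-point rotate + translate + project step. The 16.16 format, the type names and the function names are assumptions for illustration only, not SGDK's actual API:

    #include <stdint.h>
    #include <stdio.h>

    /* 16.16 fixed point, purely illustrative -- not SGDK's real types or API. */
    typedef int32_t fix16;
    #define FIX(x)     ((fix16)((x) * 65536))
    #define FMUL(a, b) ((fix16)(((int64_t)(a) * (b)) >> 16))

    typedef struct { fix16 x, y, z; } Vec3;

    /* 3x3 rotation matrix plus a translation vector. */
    typedef struct { fix16 m[3][3]; Vec3 t; } Transform;

    /* Rotate + translate one vertex: the "vertex transformation" part,
       i.e. nine fixed-point multiplies and a few adds. */
    static Vec3 transform_vertex(const Transform *tr, Vec3 v)
    {
        Vec3 r;
        r.x = FMUL(tr->m[0][0], v.x) + FMUL(tr->m[0][1], v.y) + FMUL(tr->m[0][2], v.z) + tr->t.x;
        r.y = FMUL(tr->m[1][0], v.x) + FMUL(tr->m[1][1], v.y) + FMUL(tr->m[1][2], v.z) + tr->t.y;
        r.z = FMUL(tr->m[2][0], v.x) + FMUL(tr->m[2][1], v.y) + FMUL(tr->m[2][2], v.z) + tr->t.z;
        return r;
    }

    /* Perspective projection to screen space: the "2D projection" part,
       where the expensive division shows up. */
    static void project_vertex(Vec3 v, fix16 focal, int cx, int cy, int *sx, int *sy)
    {
        if (v.z <= 0) v.z = 1;          /* crude near clip to avoid dividing by zero */
        *sx = cx + (int)(((int64_t)v.x * focal / v.z) >> 16);
        *sy = cy - (int)(((int64_t)v.y * focal / v.z) >> 16);
    }

    int main(void)
    {
        /* Identity rotation, object pushed 4 units in front of the camera. */
        Transform tr = { { { FIX(1), 0, 0 }, { 0, FIX(1), 0 }, { 0, 0, FIX(1) } },
                         { 0, 0, FIX(4) } };
        Vec3 v = { FIX(1), FIX(1), 0 };
        int sx, sy;
        project_vertex(transform_vertex(&tr, v), FIX(128), 128, 80, &sx, &sy);
        printf("(%d, %d)\n", sx, sy);   /* -> (160, 48) */
        return 0;
    }

Each vertex costs nine multiplies for the rotation plus one division for the projection, which is why the MULS/DIVS timings dominate the per-vertex cycle counts discussed in this thread.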

  3. #3
    Shake well before use Master of Shinobi Robotwo's Avatar
    Join Date
    Sep 2011
    Location
    Sundsvall , Sweden
    Age
    27
    Posts
    1,017
    Rep Power
    28

  4. #4
    Outrunner Stef's Avatar
    Join Date
    Aug 2011
    Location
    France
    Posts
    602
    Rep Power
    25

    Default

    Quote Originally Posted by tomaitheous View Post
Stef: Touko said you have some pixel clear / pixel fill calculations for the MD. How fast can you clear a 4-bit-per-pixel screen buffer in local CPU RAM (say, something like 256x192)?
Yeah, I had some talks with Touko and I gave him those numbers several times, which he never admitted
Anyway, that's pretty easy to calculate after all: basically a 68000 can transfer a word every 4 cycles... of course, because of the instruction flow you don't get that exact number, but for a simple fast fill operation you can use the movem instruction and you will get close enough.

For instance, my bitmap clear method, which you can find here:
    http://code.google.com/p/sgdk/source...nk/src/bmp_a.s

consumes:

init:
    12 + 8             = 20
    8 + (8 * 11)       = 96
    14 * 4             = 56
    subtotal           = 172

loop:
    (((8 + (8 * 13)) * 10) + 10) * 38 = 42940

end:
    (8 + (8 * 13)) * 3 = 336
    8 + (8 * 11)       = 96
    12 + (8 * 11)      = 100
    16                 = 16
    subtotal           = 548

total: 43660 cycles for the complete method

The bitmap size is 256*160 (4bpp), so that gives about 20480 bytes cleared in 43660 cycles.
If we put that on the 7.67 MHz 68000, that gives you a fill rate of ~3.6 MB/s, i.e. ~4.26 cycles/word, which is close to the theoretical 4 cycles/word.
This is exactly what I like about the 68000: its instruction set is powerful enough to let you optimize the code to its maximum strength, even if every instruction eats many cycles in general
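As a sanity check on those figures, here is the same arithmetic as a small C program (the 7.67 MHz clock and the 43660-cycle total are taken from the post above):

    #include <stdio.h>

    int main(void)
    {
        const double cpu_hz = 7670000.0;        /* MD 68000 clock (NTSC)                  */
        const double cycles = 43660.0;          /* total cycles of the clear method above */
        const double bytes  = 256 * 160 / 2;    /* 4bpp buffer, 2 pixels per byte = 20480 */

        double seconds  = cycles / cpu_hz;
        double mb_per_s = (bytes / seconds) / 1000000.0;  /* ~3.6 MB/s         */
        double cyc_word = cycles / (bytes / 2.0);         /* ~4.26 cycles/word */

        printf("fill rate: %.2f MB/s, %.2f cycles/word\n", mb_per_s, cyc_word);
        return 0;
    }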

Take the 6502, for instance: at 1 MHz it should be able to transfer 1 MB/s, as it accesses memory at 1 MHz. But actually it is very far from that because of its inefficient and limited instruction set and logic. For instance, it takes 2 cycles for INX, which is a single-byte instruction, and 2 cycles for ADC #imm, which is 2 bytes long. That does not make any sense to me; INX should take 1 cycle. I guess that is because of the limited multiplexed internal stages (fetch / decode / execute...), but in the end it is really inefficient. I do not know the maximum memory fill capability of the 6502 (I could probably calculate it, but I'm not an expert on this CPU), but I guess it's really far from 1 MB/s
    Last edited by Stef; 12-20-2013 at 05:21 AM.

  5. #5
    ding-doaw Raging in the Streets tomaitheous's Avatar
    Join Date
    Sep 2007
    Location
    Sonoran Desert
    Age
    41
    Posts
    3,981
    Rep Power
    74

    Default

    This is what I have for the 6280:

    Code:
      (code list)
    1bit
            stz abs,x   ;5 cycles for 8 pixels. 
            
            0.625 cycles per pixel
    
    2bit 
                  
            stz abs,x   ;5 cycles
            stz abs,x   ;5 cycles. 10 cycles for 8 pixels
    
            1.250 cycles per pixel
      
    3bit
    
          
            stz abs,x   ;5 cycles
            stz abs,x   ;5 cycles
            stz abs,x   ;5 cycles. 15 cycles for 8 pixels
            
            1.875 cycles per pixel
    
    4bit
    
            stz abs,x   ;5 cycles
            stz abs,x   ;5 cycles
            stz abs,x   ;5 cycles
            stz abs,x   ;5 cycles. 20 cycles for 8 pixels
            
            2.500 cycles per pixel
    That's for a code list.

    The PCE VDC is planar, so you can do any one of those combinations.

You can also put the VDC in 2bpp mode, for either the background or sprites. I've used this to do a neat trick on the BG to simulate weird transparency or transition effects on a scanline basis; colors per tile need to be set in a specific way. When you set this for the BG (2bpp mode), there's a control reg bit that selects which 2bpp pixels to read (first set or second set). For sprites, it's done in the SAT (the cell address is 2bpp aligned, not 4bpp). Mednafen fully supports this IIRC (other emulators either have code hacks or only partial support).

    The VDC only supports 16bit values and WORD addressing (no bytes). Same for VRAM to VRAM DMA (copies WORDs only). You can't just write a single byte to vram (which made my NES PPU emulation code on the PCE a pain in the ass). But there are two tricks you can do for BYTE writes. One is fast and simple. You write $00 (or whatever pad value) to the LSB of the vram write reg. When you write to the MSB of that reg, it activates the latch and transfers the WORD buffer to vram. If you write only to the latch/MSB, the last value left in the LSB gets transferred along with it. I.e. you don't need to write LSB more than once - only write to the MSB. The other trick is slower; you write to the LSB instead of the MSB to the write port, read the MSB from the vram read port and write that back to the MSB write port. It's slower, but allows for more complicated tricks (transparency effects or such).

You can read/write VRAM during active display without delays or such (though you might get a small delay during hblank if the VDC is fetching sprite pixel data. This is good if you need to sync the CPU for code in a scanline interrupt. I have a few CPU sync effects just waiting for a demo or game to put them in). But the PCE doesn't quite have the bandwidth that the MD has for updating VRAM. DMA instructions were added to the 6280 (more commonly referred to as block transfer instructions), but they halt the CPU for that period, which can cause sync problems for hsync effects; and for audio you need to break them down into multiple small transfers so as not to affect interrupt-driven sample playback. There's a faster solution, though. The 6280 also has special instructions: ST0, ST1, ST2. They are store-immediate-to-port instructions, and the ports are fixed to the VDC port addresses: reg port, LSB port, and MSB port. You can embed graphics into these opcodes, though it doubles the storage space (fine for RAM though). The ST1/ST2 opcodes take 5 cycles to store a byte to a VDC port (4 cycles + 1 penalty cycle). That's 2 cycles faster than the Txx DMA instructions (6 cycles + 1 penalty cycle per byte to a VDC port destination) and doesn't stall the CPU, but it's still not as fast as the MD's VDP DMA. There are 455 CPU cycles per scanline, so at most 91 bytes are transferable per scanline. A maximum of 23,842 bytes can be copied in a single NTSC frame (though that means using 100% of vblank time too). The practical limit will be lower, though, if you need to update audio regs, SAT entries, tilemap updates, etc.
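To make that budget concrete, here is the arithmetic as a small C check (the 455 cycles/line and 5 cycles/byte come from the post; 262 scanlines per NTSC frame is assumed for the 100%-of-frame case):

    #include <stdio.h>

    int main(void)
    {
        const int cycles_per_line = 455;  /* HuC6280 CPU cycles per scanline     */
        const int cycles_per_byte = 5;    /* ST1/ST2: 4 cycles + 1 penalty cycle */
        const int lines_per_frame = 262;  /* NTSC, assuming every line is usable */

        int bytes_per_line  = cycles_per_line / cycles_per_byte;  /* 91    */
        int bytes_per_frame = bytes_per_line * lines_per_frame;   /* 23842 */

        printf("%d bytes/line, %d bytes/frame max\n", bytes_per_line, bytes_per_frame);
        return 0;
    }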

But I counted 70 cycles for the MULS instruction and 160 cycles for DIVS, which is pessimistic and probably more than the average case
Why not use mul and div LUTs (lookup tables) to bring those cycle times down?

  6. #6
    Outrunner Stef's Avatar
    Join Date
    Aug 2011
    Location
    France
    Posts
    602
    Rep Power
    25

    Default

The MD does not natively support anything but 4bpp; still, you can easily emulate 2bpp or 1bpp by playing with the palette (I emulated 2bpp in the Bad Apple demo this way). Native 2bpp / 1bpp support would be a bit more convenient for data arrangement, though; I think I would have gotten a better compression ratio with native 2bpp tiles (instead of merging 2 frames).
Touko told me about the block transfer instructions on the 6280, but as you pointed out, they bring hsync and audio problems if you use them for large transfers, so you have to split them. Actually, we have the same problem with the movem instruction (or div), which can delay the HInt on the 68000; you have to take care of that... fortunately it can't delay it by more than 170 cycles in the worst case, which is less than the number of cycles per scanline (488), so you will never miss an HInt.
Given your fastest instruction, it seems you can do a VRAM fill at ~5.25 cycles/byte (if we count the loop overhead), which is not bad! But you cannot do that on a main RAM bitmap buffer, where you would use the block transfer instruction instead, I guess.
On the MD you have a DMA VRAM fill command, so you would prefer to use it when possible
Something which is really sad on the MD is that internally, VRAM transfers are done at byte width even though the VDP port is word width (as on the PCE). I guess that limitation comes from the internal VRAM bus connection, but because of it the VRAM bandwidth is limited to 3.23 MB/s, while CRAM or VSRAM transfers can be up to 6.46 MB/s...
It would have been nice to have the same 16-bit capability for VRAM

A big advantage of the PCE is that you can transfer at full speed during the active period, which is not the case on the MD. You have to transfer during VBlank; transferring during the active period is a waste of 68000 cycles (or you have to limit it to HBlank only). So if you want to transfer more than what the VBlank allows, you have to extend the VBlank or accept a big penalty in 68000 cycles.

About the div / mul instructions, you cannot always use lookup tables. The mul instruction does 16x16=32 bits, so you would require a 16 GB lookup table to cover it
And it's even worse for the division, as it does 32/16=16:16 ...
Still, in some cases where an 8x8=16 mul or an 8/8=8 div is enough, a lookup table can be useful, but this is rarely the case! For 3D calculations, 16-bit fixed-point accuracy is a minimum.
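The table-size arithmetic behind that 16 GB figure, as a quick C check (one 32-bit result per pair of 16-bit operands):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t entries = (uint64_t)65536 * 65536;  /* every (a, b) pair of 16-bit operands */
        uint64_t bytes   = entries * 4;              /* 4 bytes per 32-bit product           */
        printf("%llu entries, %llu GB\n",
               (unsigned long long)entries, (unsigned long long)(bytes >> 30));  /* 16 GB */
        return 0;
    }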

  7. #7
    ding-doaw Raging in the Streets tomaitheous's Avatar
    Join Date
    Sep 2007
    Location
    Sonoran Desert
    Age
    41
    Posts
    3,981
    Rep Power
    74

    Default

Given your fastest instruction, it seems you can do a VRAM fill at ~5.25 cycles/byte (if we count the loop overhead), which is not bad! But you cannot do that on a main RAM bitmap buffer, where you would use the block transfer instruction instead, I guess.
VRAM fill? You mean CPU writes to VRAM? If the buffer or graphics are embedded opcodes, the loop can be unrolled to huge lengths. I've had embedded bitmaps the size of 2k chunks (large dynamic tile areas), i.e. it would transfer 2048 bytes before the RTS back to the loop. And there's no re-prep work on the ST1/ST2 opcodes (unlike with self-modifying code). You can bring that 0.25-cycle overhead down to 0.01 cycles or lower. The same goes for buffer clears on any of the 1bpp to 4bpp local buffer maps. If the buffer is at a fixed location, you don't even need indexing - just a machine-generated list of STZ opcodes (5 cycles each) and an RTS at the end, as sketched below.
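Here is a hedged C sketch of such a generator, emitting an unrolled clear routine for a fixed buffer. The opcode values are the standard 65C02/HuC6280 encodings for STZ absolute ($9C) and RTS ($60); the buffer address and the way the output is used are invented for the example:

    #include <stdint.h>
    #include <stdio.h>

    /* Emit an unrolled clear routine: one STZ abs per byte of the buffer,
       terminated by RTS.  The generated bytes are meant to be placed in RAM
       (or assembled into the ROM) by your own build step. */
    static size_t gen_stz_clear(uint8_t *out, uint16_t buf_addr, uint16_t buf_len)
    {
        size_t n = 0;
        for (uint16_t i = 0; i < buf_len; i++) {
            uint16_t addr = buf_addr + i;
            out[n++] = 0x9C;                   /* STZ absolute (5 cycles on the 6280) */
            out[n++] = (uint8_t)(addr & 0xFF); /* address low byte                    */
            out[n++] = (uint8_t)(addr >> 8);   /* address high byte                   */
        }
        out[n++] = 0x60;                       /* RTS                   */
        return n;                              /* bytes of code emitted */
    }

    int main(void)
    {
        static uint8_t code[3 * 1024 + 1];
        /* Hypothetical 1 KB buffer at $3800 in the CPU address space. */
        size_t len = gen_stz_clear(code, 0x3800, 1024);
        printf("emitted %zu bytes of clear code (~%d cycles to run, plus the RTS)\n",
               len, 1024 * 5);
        return 0;
    }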



There's a trick you can do on the PCE VDC that negates the usage of a buffer in CPU RAM, though. It plays on a number of advantages of the VDC. The first is that you can write to VRAM during active display. The second is that tiles and sprites are planar. There are actually two different planar formats; sprites aren't stored in composite or linear planar form. They're stored as segmented planar bitmaps. That is to say, each plane is a whole 1-bit 16x16 bitmap cell, followed by another 32 bytes after it, etc. Each format has its pros and cons for bitmap/pixel type effects.

The third advantage is that you can set the VRAM write incrementer to something other than +1. And the fourth is that you can set up the tiles in a way that makes it easier to transfer to VRAM in a non-tile-format layout. You can do these last two parts on the MD for column rendering (plus the MD has that nice advantage of linear/packed pixels), but on the PCE you can set up the tiles and VRAM incrementer to write to horizontal 'lines' of pixels. You just point the VDC to whichever 'line' of the bitmap you want to render to. So you can actually avoid ~any~ local buffer in CPU RAM and just use VRAM directly. Of course, you don't get direct random pixel access. But you ~do~ get linear rendering on a horizontal bitmap line basis, i.e. it's pretty good for polygon rendering. I'm in the middle of writing an engine that parses objects and just stores their start and stop points into an array, to be rendered directly to VRAM.

That removes the need to clear a local buffer and transfer a local buffer. Plus, a bitmap 'line' is faster to render to than local RAM because the VDC has an auto-incrementer (and some other logic is cut out as well). And there are also some speed-ups you can do (longer lines can use the 'only write to the MSB/latch' trick to cut the byte writes in half). The PCE also has the same advantage that if you use less than 4bpp mode, you get faster 'rendering' to VRAM (fewer writes). I never told Touko about this. Actually, I don't think I've shown anyone this trick before, at least not in detail. It's great for polygon rendering, or anything similar. Matter of fact, I'm working on an effect that uses this for an upcoming demo (but they aren't 3D polygons).

AFAIK, the reason the ~primary~ part of this particular trick (the VRAM incrementer) doesn't work on the MD is that the tiles on the VDP aren't planar (1 byte != 8 pixels), i.e. you can't have the VDP increment after writing a LONG (4 bytes). Whereas on the PCE, you do it in two passes, i.e. you write the second 'line' to the second planar bank for 3bpp/4bpp bitmap modes (if you're using that many colors). Though I guess you could do a weird interleaved/gap/segmented setup as two passes on the VDP, i.e. a 'bitmap line' in VRAM would have 4-pixel gaps, thus requiring a second pass on the same line. But then the problem lies in the fact that you don't have full access to VRAM during active display. Brute forcing it during active display would be inefficient - but still doable. And for vblank (whether forced/clipped for more vblank lines or not) the CPU would have full access to VRAM. Do you think that would give you enough cycle savings to negate using a local buffer on the MD? Maybe even have the VDP VRAM-to-VRAM transfer optimization for large runs of same-color pixels? Though I can see that being a pain to set up.

  8. #8
    Outrunner Stef's Avatar
    Join Date
    Aug 2011
    Location
    France
    Posts
    602
    Rep Power
    25

    Default

    Quote Originally Posted by tomaitheous View Post
VRAM fill? You mean CPU writes to VRAM? If the buffer or graphics are embedded opcodes, the loop can be unrolled to huge lengths. I've had embedded bitmaps the size of 2k chunks (large dynamic tile areas), i.e. it would transfer 2048 bytes before the RTS back to the loop. And there's no re-prep work on the ST1/ST2 opcodes (unlike with self-modifying code). You can bring that 0.25-cycle overhead down to 0.01 cycles or lower.
Yeah, I was referring to a simple VRAM clear, as when you want to clear your bitmap buffer. Of course you can reduce the overhead above 5 cycles to almost nothing, but I just took an example with light unrolling so as not to waste ROM.

The same goes for buffer clears on any of the 1bpp to 4bpp local buffer maps. If the buffer is at a fixed location, you don't even need indexing - just a machine-generated list of STZ opcodes (5 cycles each) and an RTS at the end.
I see; I guess STZ uses 16-bit addressing? But the problem on the PCE is the limited main RAM: you just can't use a local bitmap buffer, or the resolution would be really limited...

There's a trick you can do on the PCE VDC that negates the usage of a buffer in CPU RAM, though. It plays on a number of advantages of the VDC. The first is that you can write to VRAM during active display. The second is that tiles and sprites are planar. There are actually two different planar formats; sprites aren't stored in composite or linear planar form. They're stored as segmented planar bitmaps. That is to say, each plane is a whole 1-bit 16x16 bitmap cell, followed by another 32 bytes after it, etc. Each format has its pros and cons for bitmap/pixel type effects.

The third advantage is that you can set the VRAM write incrementer to something other than +1. And the fourth is that you can set up the tiles in a way that makes it easier to transfer to VRAM in a non-tile-format layout. You can do these last two parts on the MD for column rendering (plus the MD has that nice advantage of linear/packed pixels), but on the PCE you can set up the tiles and VRAM incrementer to write to horizontal 'lines' of pixels. You just point the VDC to whichever 'line' of the bitmap you want to render to. So you can actually avoid ~any~ local buffer in CPU RAM and just use VRAM directly. Of course, you don't get direct random pixel access. But you ~do~ get linear rendering on a horizontal bitmap line basis, i.e. it's pretty good for polygon rendering. I'm in the middle of writing an engine that parses objects and just stores their start and stop points into an array, to be rendered directly to VRAM.
If I understand correctly, you configure 16x16 tiles this way (for a 256-pixel-wide screen):

    00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
    10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
    ...

Then, by using a 32-byte VRAM increment step, you can actually write in this order:
- first bitplane of the first line of tile 00
- second bitplane of the first line of tile 00
- third bitplane of the first line of tile 00
- last bitplane of the first line of tile 00
- first bitplane of the first line of tile 01
- second bitplane of the first line of tile 01
- third bitplane of the first line of tile 01
- last bitplane of the first line of tile 01
- ...

So, except that the memory is bitplaned, you have a sort of bitmap buffer, right?
Some time ago I wrote a scanline bitmap engine which took a list of segments for each scanline; then I only used fill operations.
The implementation was pretty complete and working, but in the end I did not use it, as performance was not that great :-/
It was really fast for simple polygon rendering without overdraw, but as soon as you had more complex operations (segment insertion, merging...) the overhead of handling the complex structure (you have to use a linked segment list with a minimalist dynamic segment allocation system) killed the gain. Also, primitives such as line or pixel plot became very slow compared to the classical method! That's why I finally preferred to go back to traditional bitmap rendering. If you are interested I can give you the source code, I should have it sitting somewhere
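For anyone curious, here is a minimal C sketch of the general idea of such a segment/span engine. The names, the fixed-size span storage and the byte-per-pixel output are invented for illustration; the real experiment used a linked segment list with its own allocator, which is exactly where the overhead came from:

    #include <stdint.h>
    #include <string.h>

    #define SCREEN_W  256
    #define SCREEN_H  160
    #define MAX_SPANS 32   /* per scanline, arbitrary for the sketch */

    /* One horizontal segment of a single colour on one scanline. */
    typedef struct { uint8_t x0, x1, color; } Span;

    typedef struct {
        Span    spans[SCREEN_H][MAX_SPANS];
        uint8_t count[SCREEN_H];
    } SpanBuffer;

    /* Polygon rasterisation only records spans instead of plotting pixels. */
    static void add_span(SpanBuffer *sb, int y, int x0, int x1, uint8_t color)
    {
        if (y < 0 || y >= SCREEN_H || x0 > x1 || sb->count[y] >= MAX_SPANS) return;
        Span *s = &sb->spans[y][sb->count[y]++];
        s->x0 = (uint8_t)x0; s->x1 = (uint8_t)x1; s->color = color;
    }

    /* Final pass: the only per-pixel work is straight fills, which is what
       made the approach attractive in the first place. */
    static void render_spans(const SpanBuffer *sb, uint8_t *fb /* SCREEN_W*SCREEN_H */)
    {
        for (int y = 0; y < SCREEN_H; y++)
            for (int i = 0; i < sb->count[y]; i++) {
                const Span *s = &sb->spans[y][i];
                memset(fb + y * SCREEN_W + s->x0, s->color, (size_t)(s->x1 - s->x0 + 1));
            }
    }

    int main(void)
    {
        static SpanBuffer sb;                        /* zero-initialised: no spans yet  */
        static uint8_t    fb[SCREEN_W * SCREEN_H];
        add_span(&sb, 10, 20, 120, 3);               /* one flat-shaded span on line 10 */
        render_spans(&sb, fb);
        return fb[10 * SCREEN_W + 20] == 3 ? 0 : 1;  /* quick self-check */
    }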

That removes the need to clear a local buffer and transfer a local buffer. Plus, a bitmap 'line' is faster to render to than local RAM because the VDC has an auto-incrementer (and some other logic is cut out as well). And there are also some speed-ups you can do (longer lines can use the 'only write to the MSB/latch' trick to cut the byte writes in half). The PCE also has the same advantage that if you use less than 4bpp mode, you get faster 'rendering' to VRAM (fewer writes). I never told Touko about this. Actually, I don't think I've shown anyone this trick before, at least not in detail. It's great for polygon rendering, or anything similar. Matter of fact, I'm working on an effect that uses this for an upcoming demo (but they aren't 3D polygons).
That seems promising and I'm really looking forward to it. Still, I believe the structure to put in place (a sort of list of color segments) is definitely not trivial to implement and can eat more time than expected. My implementation was 100% C though, so it could have been faster, but I preferred not to invest any more in it as the traditional rendering was actually simpler and faster in almost all cases :-/

AFAIK, the reason the ~primary~ part of this particular trick (the VRAM incrementer) doesn't work on the MD is that the tiles on the VDP aren't planar (1 byte != 8 pixels), i.e. you can't have the VDP increment after writing a LONG (4 bytes). Whereas on the PCE, you do it in two passes, i.e. you write the second 'line' to the second planar bank for 3bpp/4bpp bitmap modes (if you're using that many colors). Though I guess you could do a weird interleaved/gap/segmented setup as two passes on the VDP, i.e. a 'bitmap line' in VRAM would have 4-pixel gaps, thus requiring a second pass on the same line. But then the problem lies in the fact that you don't have full access to VRAM during active display. Brute forcing it during active display would be inefficient - but still doable. And for vblank (whether forced/clipped for more vblank lines or not) the CPU would have full access to VRAM. Do you think that would give you enough cycle savings to negate using a local buffer on the MD? Maybe even have the VDP VRAM-to-VRAM transfer optimization for large runs of same-color pixels? Though I can see that being a pain to set up.
Actually there are so many issues with having the buffer in VRAM on the MD that you definitely want it in main RAM; just think about read operations... or the fact that we cannot easily set up a way to do contiguous writes (or maybe only by consuming 2 planes) as you can on the PCE.
But with the local buffer we still have the frustration that we cannot use the DMA to transfer it to VRAM. It's really a shame, as the transfer / conversion itself consumes close to 40% of the CPU time! Of course that allows the 3D rendering to be optimized as much as possible, but it's like we are using a 4.7 MHz 68000. Using the DMA would lower the CPU usage for the transfer to a bit less than 20%, which would be a nice gain. Actually I discussed a bit with gasega68k to see how to modify the buffer arrangement for that, but in every case it requires some important modifications in the bitmap rendering code itself, which is not acceptable imo (possible for simple polygon rendering, much more problematic for points, lines or even texturing).
Anyway, I think you can always find special tricks or workarounds for very specific cases, like the one you are trying to do. Just have a look at the latest Overdrive demo version: now it does 60 FPS 3D rendering and the objects are much bigger than they were in the first version. Maybe they use some tricks you could not use for a real 3D engine, but still, it is very impressive to see that on a Megadrive =)

  9. #9
    ding-doaw Raging in the Streets tomaitheous's Avatar
    Join Date
    Sep 2007
    Location
    Sonoran Desert
    Age
    41
    Posts
    3,981
    Rep Power
    74

    Default

    Quote Originally Posted by Stef View Post
About the div / mul instructions, you cannot always use lookup tables. The mul instruction does 16x16=32 bits, so you would require a 16 GB lookup table to cover it
And it's even worse for the division, as it does 32/16=16:16 ...
Still, in some cases where an 8x8=16 mul or an 8/8=8 div is enough, a lookup table can be useful, but this is rarely the case! For 3D calculations, 16-bit fixed-point accuracy is a minimum.
I was thinking of something along the lines of: a*b can be represented as f(a+b) - f(a-b) with f(x) = x^2/4. On the 65x, they use a 9-bit LUT for f(x) because they do the multiplication in 8-bit stages and accumulate them (where A and B are 8-bit values). On a stock 6502, you can get a 16bit*16bit->32bit mul in less than 190 cycles. You could speed that up on the 68k, though 70 cycles is pretty damn fast already. I could have sworn it was slower than that on the 68k - heh.
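For reference, here is the quarter-square identity in plain C - an illustrative model of the LUT trick, not actual 65x code. a*b = f(a+b) - f(|a-b|) with f(x) = x*x/4 holds exactly for integers because a+b and a-b always have the same parity, so an 8x8->16 multiply needs a 511-entry table, two lookups and one subtraction:

    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    /* Quarter-square multiplication: a*b = f(a+b) - f(|a-b|), f(x) = x*x/4. */
    static uint16_t qsq[511];           /* f(x) for x = 0..510 (a, b are 8-bit) */

    static void init_qsq(void)
    {
        for (uint32_t x = 0; x < 511; x++)
            qsq[x] = (uint16_t)((x * x) / 4);
    }

    /* 8x8 -> 16 multiply using two table lookups and a subtraction. */
    static uint16_t mul8x8(uint8_t a, uint8_t b)
    {
        return qsq[a + b] - qsq[abs((int)a - (int)b)];
    }

    int main(void)
    {
        init_qsq();
        /* brute-force check against the compiler's multiply */
        for (int a = 0; a < 256; a++)
            for (int b = 0; b < 256; b++)
                if (mul8x8((uint8_t)a, (uint8_t)b) != a * b)
                    printf("mismatch at %d*%d\n", a, b);
        printf("mul8x8(200, 123) = %u\n", mul8x8(200, 123));   /* 24600 */
        return 0;
    }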

And there are a few ways to cut down on mul/div tables. One, for A*B, is array[A>>1][B], then test the dropped bit; if it is set, add B. I.e. it cuts the table in half by only storing even values. You can cut it down more with more shifts and adds. It gets more complicated than that, and the speed is specifically dependent on the CPU's ISA (on the 65x, LUTs are the magic elixir for its shortcomings - though at the expense of space/storage. Indexing is free on the 65x, even if small in offset. It allows for some really crazy code optimization with small and large pre-calc LUTs).

    43660 cycles for the complete method
Plus you still have to write to the buffer and then transfer it. If you could cut out the transfer-to-vram and clear-buffer parts, I'm sure that would be a decent speed-up. True, there is overhead to the method, but so far it's shaping up to be pretty speedy in comparison to the other method. It doesn't lend itself well to pixel pattern shading or textures, but the goal is to put flat-shaded polygons or that style of graphics to VRAM as fast as possible.

If I understand correctly, you configure 16x16 tiles this way (for a 256-pixel-wide screen):

    00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
    10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
    ...

Then, by using a 32-byte VRAM increment step, you can actually write in this order:
Yeah, if you use sprites. But the incrementer is in words: +1, +32, +64, +128 words. So it's a more convoluted layout. Each has its advantages, but I tend to use tiles instead of sprites for the pseudo-bitmap in VRAM. The tile format is composite, meaning the first byte of a tile row is its 8 pixels for plane 1 and the second byte is the same 8 pixels for plane 2. So a WORD transfer writes 8 pixels across two planes. The tile is a single 2bpp tile. The next 2bpp 8x8 tile immediately following it in VRAM holds the pixels for planes 3 and 4 - just like the SNES tile format (composite). The tiles are set up in a way that I just continuously write 2-bit pixels in a line, then I repoint the VDC to the second set for the last 2-bit pixel line to write (assuming 4bpp).

    Edit: Just saw this:
I see; I guess STZ uses 16-bit addressing? But the problem on the PCE is the limited main RAM: you just can't use a local bitmap buffer, or the resolution would be really limited...
Yeah, STZ is 16-bit addressing. The system RAM is 8k, but there are options. One option is to use the "Populous" ROM mapping. That HuCard has an additional 32k of RAM on it for doing buffer/bitmap effects. But no flashcart supports that AFAIK (not even the Turbo Everdrive) - emulators do, though. The next option is to make it a CD project. That gives you the base 8k along with either 64k, 256k (Super CD), or 256k + 2 megabytes (Arcade Card). The third option is to use a HuCard that requires a CD system, i.e. when you plug a HuCard into a CD system, the base 64k of original CD RAM is always there to use. Of course you could do something else, but that requires building a custom card and somehow getting an emulator to support it.

Also, while most PCE setups have the 8k hardware bank mapped to address range $0000-1fff, you can map anything you want there (excluding the base 64k of CD RAM, but the 192k of Super CD RAM works fine there). That's especially true when using the ST1/ST2 opcodes, since you don't need the hardware bank mapped in just to select registers and write to VRAM (I did this with my first NES2PCE game - NES Dragon Warrior running on SuperCD). A 256x200x4bpp bitmap buffer = 25,600 bytes. That easily fits into the 64k CPU logical address range.
    Last edited by tomaitheous; 12-22-2013 at 11:39 PM.

  10. #10
    Outrunner Stef's Avatar
    Join Date
    Aug 2011
    Location
    France
    Posts
    602
    Rep Power
    25

    Default

    Quote Originally Posted by tomaitheous View Post
I was thinking of something along the lines of: a*b can be represented as f(a+b) - f(a-b) with f(x) = x^2/4. On the 65x, they use a 9-bit LUT for f(x) because they do the multiplication in 8-bit stages and accumulate them (where A and B are 8-bit values). On a stock 6502, you can get a 16bit*16bit->32bit mul in less than 190 cycles. You could speed that up on the 68k, though 70 cycles is pretty damn fast already. I could have sworn it was slower than that on the 68k - heh.
I just found the method you described here:
    http://everything2.com/title/Fast+6502+multiplication

But that would not work well on a 68000, where using that many lookup tables and steps would eat many cycles! Better to use a single, larger lookup table.
Also, it looks like they describe a method to do an 8x8 --> 16-bit mul. In this case I wonder if the conventional software way (shift and add) would be that much slower? I guess it is, because of the required 16-bit arithmetic...
On the 68000, 70 cycles is a maximum when using direct registers (http://oldwww.nvg.ntnu.no/amiga/MC68...mstandard.HTML), so except in the specific 8x8=16 case, using lookup tables won't help! But that is definitely not true for division, which can take up to 140 cycles! I guess you use a 1/x lookup table and apply a multiplication on the 6502, exactly as we try to do on the 68000 when that is possible
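Here is a minimal C sketch of that "reciprocal table + multiply" replacement for division. The table size, the 0.16 fixed-point format and the names are assumptions for illustration; on the 68000 the multiply would be a single MULU against the 1/x entry. Note the result can be off by one compared to a true division, which is usually acceptable for things like 3D projection:

    #include <stdint.h>
    #include <stdio.h>

    /* Replace x / d by x * (1/d), with 1/d precalculated as a 0.16 fixed-point
       reciprocal for d = 1..255.  Sizes and formats are illustrative only. */
    static uint16_t recip_tab[256];

    static void init_recip(void)
    {
        recip_tab[0] = 0xFFFF;   /* guard value: never divide by 0                       */
        recip_tab[1] = 0xFFFF;   /* 1.0 does not fit in 0.16; clamp (off by one at most) */
        for (uint32_t d = 2; d < 256; d++)
            recip_tab[d] = (uint16_t)(65536u / d);  /* truncated reciprocal */
    }

    /* Approximate x / d (x up to 16 bits, d = 1..255) with one multiply and a shift. */
    static uint16_t div_by_table(uint16_t x, uint8_t d)
    {
        return (uint16_t)(((uint32_t)x * recip_tab[d]) >> 16);
    }

    int main(void)
    {
        init_recip();
        printf("%u (exact %u)\n", div_by_table(1000, 7), 1000u / 7);       /* 142 vs 142             */
        printf("%u (exact %u)\n", div_by_table(40960, 160), 40960u / 160); /* 255 vs 256: off by one */
        return 0;
    }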

And there are a few ways to cut down on mul/div tables. One, for A*B, is array[A>>1][B], then test the dropped bit; if it is set, add B. I.e. it cuts the table in half by only storing even values. You can cut it down more with more shifts and adds.
Cutting the lookup table length gets you closer to the software implementation; I think it's a trade-off depending on what you need to save

It gets more complicated than that, and the speed is specifically dependent on the CPU's ISA (on the 65x, LUTs are the magic elixir for its shortcomings - though at the expense of space/storage. Indexing is free on the 65x, even if small in offset. It allows for some really crazy code optimization with small and large pre-calc LUTs).
It is still nice to have that fast lookup table possibility. When you say indexing is free, is it 16-bit indexing?


Plus you still have to write to the buffer and then transfer it. If you could cut out the transfer-to-vram and clear-buffer parts, I'm sure that would be a decent speed-up. True, there is overhead to the method, but so far it's shaping up to be pretty speedy in comparison to the other method. It doesn't lend itself well to pixel pattern shading or textures, but the goal is to put flat-shaded polygons or that style of graphics to VRAM as fast as possible.
True, and it also brings the big advantage of avoiding overdraw when dealing with many polygons (you have to take care of Z information in your scanline buffer). I think I could have pushed it further, and maybe I will give it another shot later.

Yeah, STZ is 16-bit addressing. The system RAM is 8k, but there are options. One option is to use the "Populous" ROM mapping. That HuCard has an additional 32k of RAM on it for doing buffer/bitmap effects. But no flashcart supports that AFAIK (not even the Turbo Everdrive) - emulators do, though. The next option is to make it a CD project. That gives you the base 8k along with either 64k, 256k (Super CD), or 256k + 2 megabytes (Arcade Card). The third option is to use a HuCard that requires a CD system, i.e. when you plug a HuCard into a CD system, the base 64k of original CD RAM is always there to use. Of course you could do something else, but that requires building a custom card and somehow getting an emulator to support it.

Also, while most PCE setups have the 8k hardware bank mapped to address range $0000-1fff, you can map anything you want there (excluding the base 64k of CD RAM, but the 192k of Super CD RAM works fine there). That's especially true when using the ST1/ST2 opcodes, since you don't need the hardware bank mapped in just to select registers and write to VRAM (I did this with my first NES2PCE game - NES Dragon Warrior running on SuperCD). A 256x200x4bpp bitmap buffer = 25,600 bytes. That easily fits into the 64k CPU logical address range.
Too bad the Turbo Everdrive does not support the additional 32k of RAM from Populous! That would be handy, and emulators support it so you could easily test it
Anyway, you always have the CD RAM extension solution.
