Wow, that's a lot of information.

Stef,
Well, for the rotation and projection, from what you say I think your code is faster. In my case the cost is close to 1000 cycles per vertex (it depends on the multiplications).
The polygon filling code takes around 722 cycles per line in the worst case (large polygons) and about 266 cycles in the best case (small polygons).
I've done demos at 256 * 128 and 256 * 160. In fact the Starfox demo I did uses 256 * 160, as in the demo you did, but I set the resolution to 320 * 224 and draw into a "window" of 256 * 160.
To transfer to the VRAM I don't use DMA either, because the bitmap buffer is not organized in a way that allows DMA, but I use a little trick to move the bitmap buffer to the VRAM as fast as possible. First I split the bitmap buffer into small blocks of 1k (16 blocks for 256 * 128 and 20 blocks for 256 * 160). Before moving each block to the VRAM, I read the Vcount to see whether we are in the "blanking" area (256 * 160 gives 102 lines of "blanking") and whether there is enough time left to transfer one 1k block (about 11 lines are needed). If there is, I disable the video, transfer the block at maximum speed, and enable the video again. If there is not enough time, I leave the video enabled and the block is transferred more slowly.
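The per-block decision described above can be sketched in C. This is only a simplified model, not my actual 68000 code: it assumes NTSC timing with 262 scanlines per frame (which is where 102 = 262 - 160 comes from) and a 4 bits-per-pixel bitmap (which matches the 16 and 20 block counts). The function names and the idea of counting the scanline from 0 at the top of the window are hypothetical; the real V counter on the hardware behaves less simply.

```c
#include <stdbool.h>

/* Assumed values: 262 scanlines per NTSC frame, 4 bits per pixel.
   LINES_PER_BLOCK is the "about 11 lines" figure quoted in the post. */
enum {
    LINES_PER_FRAME = 262,
    BLOCK_BYTES     = 1024,
    LINES_PER_BLOCK = 11
};

/* Number of 1k blocks in a width x height bitmap at 4 bits per pixel. */
int block_count(int width, int height)
{
    return (width * height / 2) / BLOCK_BYTES;
}

/* Scanlines per frame that are not covered by the window. */
int blanking_lines(int window_height)
{
    return LINES_PER_FRAME - window_height;
}

/* Hypothetical model: vcount is 0 on the first window line.
   Returns true when the beam is past the window and at least one
   full block transfer still fits before the frame ends, so the
   video can be disabled for a full-speed copy. */
bool can_use_fast_transfer(int vcount, int window_height)
{
    if (vcount < window_height) {
        return false; /* still drawing the window: slow transfer */
    }
    return (LINES_PER_FRAME - vcount) >= LINES_PER_BLOCK;
}
```

For example, block_count(256, 160) gives 20 and blanking_lines(160) gives 102, and with a 160-line window the fast path is allowed from line 160 up to line 251.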
To clear the buffer I use the MOVEM instruction; it takes about 74 lines in 256 * 128 mode and about 92 lines in 256 * 160 mode.
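As a rough cross-check of those clear times, here is a small C estimate. Everything in it is an assumption rather than a measurement: it supposes each MOVEM.L stores 15 registers (60 bytes) at 8 + 8n cycles, about 488 CPU cycles per scanline, a 4 bits-per-pixel buffer, and no loop overhead at all. The function name is hypothetical.

```c
/* Assumed figures, not taken from the post. */
enum {
    REGS_PER_MOVEM   = 15,                      /* registers stored per MOVEM.L  */
    BYTES_PER_MOVEM  = REGS_PER_MOVEM * 4,      /* 60 bytes                      */
    CYCLES_PER_MOVEM = 8 + 8 * REGS_PER_MOVEM,  /* 128 cycles, register to memory */
    CYCLES_PER_LINE  = 488                      /* approx. 68000 cycles per line */
};

/* Estimated scanlines (rounded up) needed to clear a width x height
   buffer at 4 bits per pixel using back-to-back MOVEM.L stores. */
int clear_lines_estimate(int width, int height)
{
    int bytes  = width * height / 2;
    int movems = (bytes + BYTES_PER_MOVEM - 1) / BYTES_PER_MOVEM;
    int cycles = movems * CYCLES_PER_MOVEM;
    return (cycles + CYCLES_PER_LINE - 1) / CYCLES_PER_LINE;
}
```

Under these assumptions the estimate comes out at 72 lines for 256 * 128 and 90 lines for 256 * 160, a little under the measured 74 and 92, which loop overhead could plausibly account for.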
I'm sure I can improve the code for filling polygons, and the code for the rotation and projection.
I'll create a new thread in a few hours to show some of the demos I have done; the first will be the Starfox demo.
I made the first version of this Starfox demo in October 2011 and the last modification in June 2013. The demo is done fully in 3D, i.e. you can move in any direction.