Thread: 68000 Assembly Optimization - Useful Links, Practical Examples and Discussion

  1. #1
    Hero of Algol
    Join Date
    Aug 2010
    Posts
    7,611
    Rep Power
    168

    68000 Assembly Optimization - Useful Links, Practical Examples and Discussion

    Last edited by Barone; 09-23-2015 at 12:48 AM.

  2. #2
    Hero of Algol
    Join Date
    Aug 2010
    Posts
    7,611
    Rep Power
    168

    Corporation (MD) (1992) (FPS)

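    Sample #01 (image not preserved)
    Sample #02 (image not preserved)
    Sample #03 (image not preserved)
    Sample #04 (image not preserved)
    Sample #05 (image not preserved)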

    Last edited by Barone; 09-23-2015 at 01:13 AM.

  3. #3
    Mastering your Systems Shining Hero TmEE's Avatar
    Join Date
    Oct 2007
    Location
    Estonia, Rapla City
    Age
    29
    Posts
    10,081
    Rep Power
    108

    Don't forget to take advantage of the autoincrement on address registers; it allows some insane optimisations when you're willing to rearrange data. Decrement is less useful as it isn't free, but it has its uses.
    In your second optimisation you could get rid of the SUBQ and have a MOVE.W D0, -(A6). Should win you 2 cycles.
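
    A minimal sketch of the autoincrement point, assuming a0 points at source data, a1 at a destination buffer and d7 holds the word count minus one (all three register choices are illustrative, not taken from the samples above):

    ```asm
    ; Copy d7+1 words from (a0) to (a1).
    ; Post-increment costs nothing extra on the 68000:
    ; move.w (a0)+, (a1)+ takes 12 cycles, the same as move.w (a0), (a1).
    CopyLoop:
        move.w  (a0)+, (a1)+    ; copy one word, advance both pointers
        dbra    d7, CopyLoop    ; loop until d7 wraps to -1

    ; Pre-decrement on the source side is where the extra cost shows up:
    ; move.w -(a0), (a1) takes 14 cycles instead of 12.
    ```

    Laying the data out so it can be walked front to back is what makes the free increment usable.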
    Death To MP3, :3
    Mida sa loed ? Nagunii aru ei saa "Gnirts test is a shit" New and growing website of total jawusumness !
    If any of my images in my posts no longer work you can find them in "FileDen Dump" on my site ^

  4. #4
    Hero of Algol
    Join Date
    Aug 2010
    Posts
    7,611
    Rep Power
    168

    Quote Originally Posted by TmEE View Post
    Don't forget to take advantage of the autoincrement on address registers; it allows some insane optimisations when you're willing to rearrange data. Decrement is less useful as it isn't free, but it has its uses.
    Yep, it's very useful. I've used it in the samples above as much as I could figure out.

    Quote Originally Posted by TmEE View Post
    In your second optimisation you could get rid of the SUBQ and have a MOVE.W D0, -(A6). Should win you 2 cycles.
    But the SUBQ is decrementing 4 bytes, using the pre-decrement on a move.w will give me only 2 bytes of decrement, no?
    I think I would need to use MOVE.l instead for it to be applicable.

  5. #5
    Road Rasher l_oliveira's Avatar
    Join Date
    Jul 2010
    Posts
    258
    Rep Power
    12

    On the given examples what was the goal? Pack the code for memory efficiency or gain CPU cycles?
    Depending on what you want to do (between these two characteristics) it completely changes the approach.

    With modern compilers people no longer care about the size of the compiled code. They only care about cache hits, branch prediction and similar stuff, which obviously doesn't apply to this architecture.

    I suggest you comment the examples you used so the goals for each specific change are laid bare for understanding. Some people (like me for example) aren't yet on the cycle count phase of ASM programming learning.

    Thanks for bringing this topic up for discussion, too.
    I'd like to share some wisdom:
    Quote Originally Posted by TmEE
    I'm too lazy to be unstoppable.
    you can have shame, as long as you don't feel it

  6. #6
    Hero of Algol
    Join Date
    Aug 2010
    Posts
    7,611
    Rep Power
    168

    Quote Originally Posted by l_oliveira View Post
    On the given examples what was the goal? Pack the code for memory efficiency or gain CPU cycles?
    Depending on what you want to do (between these two characteristics) it completely changes the approach.
    The main goal was to save CPU cycles.
    But in several cases when you're hacking stuff and trying to avoid the overhead of injecting additional jumps, like in Corporation, you first have to tighten up your code in order to open up space for other optimizations.

    Quote Originally Posted by l_oliveira View Post
    I suggest you comment the examples you used so the goals for each specific change are laid bare for understanding. Some people (like me for example) aren't yet on the cycle count phase of ASM programming learning.
    Thanks for bringing this topic up for discussion, too.
    My intention was to provide those samples so people could have ideas for optimizations which the available Internet references don't cover.
    I'd like to comment each step, but it would slow down the development of the performance improvement for this game a lot, and I don't want that to happen. I have very limited time, so I try to focus on what will benefit more people and what I'm more motivated to work on.

  7. #7
    Road Rasher
    Join Date
    Apr 2013
    Location
    SF Bay Area, California
    Posts
    301
    Rep Power
    20

    Quote Originally Posted by Barone View Post
    But the SUBQ is decrementing 4 bytes, using the pre-decrement on a move.w will give me only 2 bytes of decrement, no?
    I think I would need to use MOVE.l instead for it to be applicable.
    Pre-decrement will only decrement by 2 in this case, but the address $C00002 is functionally equivalent to $C00000 (unless you're doing a long-word write to it).

    Also, you can optimize the first example further by replacing
    Code:
        lea $FFFF0016, a0
        lea $FFFF0017, a1
        lea $FFFF001E, a2
        lea $FFFF0002, a3
        lea $FFFF0003, a4
        moveq #4, d0
        move.w (a6), d0
        btst.l #1, d0
    with
    Code:
        lea $FFFF0002, a3
        lea (1, a3), a4
        lea ($14, a3), a0
        lea ($15, a3), a1
        lea ($1C, a3), a2
        moveq #4, d0
        btst.b #1, (1, a6)
    That shaves off 18 cycles.
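
    For reference, the 18 cycles can be tallied from the standard 68000 instruction timings (as listed in the Motorola user manual). The cycle counts in the comments are annotations added here, not part of the original post:

    ```asm
    ; Before: 5 * 12 + 4 + 8 + 10 = 82 cycles
    ;   lea $FFFF0016, a0 (and the four like it)   12 each, absolute long
    ;   moveq #4, d0                                4
    ;   move.w (a6), d0                             8
    ;   btst.l #1, d0                              10

    ; After: 12 + 4 * 8 + 4 + 16 = 64 cycles
        lea     $FFFF0002, a3       ; 12  absolute long, paid only once
        lea     (1, a3), a4         ;  8  16-bit displacement from a3
        lea     ($14, a3), a0       ;  8
        lea     ($15, a3), a1       ;  8
        lea     ($1C, a3), a2       ;  8
        moveq   #4, d0              ;  4
        btst.b  #1, (1, a6)         ; 16  bit 1 of the low byte of the word at (a6)
    ```

    Most of the gain comes from deriving four of the five addresses from a3 instead of fetching a full 32-bit address each time.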

    General rules of thumb:

    - Access to memory is expensive so try to keep things in registers and prefer smaller (in number of words) instructions
    - The memory addressing modes are pretty powerful and have generally good performance, learn when to use them
    - Simpler modes are still generally faster though so if you're going to use a complicated effective address multiple times it's often better to calculate it once with lea
    - The M68000 is mostly 16-bit internally so it's optimal to stick to 16-bit values when you can
    - However, doing one 32-bit operation is still generally preferable to two 16-bit ones, so use the 32-bit form if you actually need 32 bits for an operation or you can manage to stuff two 16-bit operations into one 32-bit one
    - Shifts and rotates are kind of expensive (6 cycles + 2 * number of bits shifted for byte/word sizes in a register, 8 + 2 * number of bits for long), but swap is cheap
    - Multiplication and division are super expensive; if multiplying/dividing by a constant, use a combination of shifts and add/sub
    - Stick commonly used globals in the high-32K of work RAM so you can use the absolute short addressing mode on them, stick the stack somewhere in the low 32K
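
    Two of these rules can be sketched in a few lines. The registers and the constant 10 are arbitrary choices for illustration, and the result of the multiply is assumed to fit in 16 bits:

    ```asm
    ; d0.w = d0.w * 10 using shifts and adds instead of mulu (clobbers d1)
        add.w   d0, d0      ; d0 = 2x     4 cycles
        move.w  d0, d1      ; d1 = 2x     4 cycles
        lsl.w   #2, d0      ; d0 = 8x     6 + 2*2 = 10 cycles
        add.w   d1, d0      ; d0 = 10x    4 cycles
                            ; 22 cycles total; mulu.w #10, d0 takes more than 40

    ; Bring the high word of d2 down into the low word
        swap    d2          ; 4 cycles; the old low word ends up in the high word
                            ; shifting instead needs lsr.l #8 twice, 24 cycles each
    ```

    Note that swap exchanges the two halves rather than clearing the upper one, so it only replaces a 16-bit shift when the contents of the other half don't matter.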

  8. #8
    Hero of Algol
    Join Date
    Aug 2010
    Posts
    7,611
    Rep Power
    168

    Wow, amazing stuff, man! Thanks a lot!
    I'll fix and improve the code based on that.

    Going by the instruction timing table I linked, I had understood (maybe I've misunderstood it?) that lea $FFFF0016, a0 was just 4 cycles. I found it weird since it's a lengthy instruction with a direct memory address, but I was trying to trust those tables.
    How many cycles does it take in reality? Do you have a more reliable/complete instructions timing table to share with us?



    Quote Originally Posted by Mask of Destiny View Post
    - Stick commonly used globals in the high-32K of work RAM so you can use the absolute short addressing mode on them, stick the stack somewhere in the low 32K
    Could you share an example, please?

  9. #9
    Road Rasher
    Join Date
    Apr 2013
    Location
    SF Bay Area, California
    Posts
    301
    Rep Power
    20

    Quote Originally Posted by Barone View Post
    Going by the instruction timing table I linked, I had understood (maybe I've misunderstood it?) that lea $FFFF0016, a0 was just 4 cycles. I found it weird since it's a lengthy instruction with a direct memory address, but I was trying to trust those tables.
    How many cycles does it take in reality?
    It takes 12 cycles.

    Quote Originally Posted by Barone View Post
    Do you have a more reliable/complete instructions timing table to share with us?
    I generally just use the Motorola 68000 User Manual, but the tables you linked appear to have the correct values. I think you might be looking at the wrong entry. lea (An) is indeed 4 cycles, but An here means an address register (address register indirect). The variant you were using is xxx.L/(xxx).L (absolute long), which is listed as 12 cycles.

    Quote Originally Posted by Barone View Post
    Could you share an example, please?
    This is probably more useful when writing code from scratch rather than trying to optimize an existing game. Generally I'll set the initial stack pointer to $FFFF8000 (growing down of course) and start putting variables at that same address (growing up). In vasm, you can allocate globals like so:

    Code:
        rsset $FFFF8000
    some_byte_var   rs.b 1
    other_byte      rs.b 1
    some_word       rs.w 1
    and then using them with the absolute short mode would look like this:
    Code:
        move.w (some_word).w, d0
        move.b (some_byte_var).w, d1
    Some assemblers will optimize a long absolute address reference into a short one automatically where possible, so the .w is not necessarily required. Anyway, the short form is one word shorter and as a result 4 cycles faster. The short form works for the top and bottom 32K of the 32-bit address space (so $0-$7FFF and $FFFF8000 to $FFFFFFFF) as it's just sign extending the 16-bit operand to a 32-bit one.
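
    To make the sign extension concrete, here is a small sketch reusing the rsset layout from the example above, which places some_word at $FFFF8002. The encodings and cycle counts in the comments are annotations based on the standard 68000 timings:

    ```asm
        rsset $FFFF8000
    some_byte_var   rs.b 1      ; $FFFF8000
    other_byte      rs.b 1      ; $FFFF8001
    some_word       rs.w 1      ; $FFFF8002

        move.w  (some_word).w, d0   ; 2 words: opcode, $8002           12 cycles
        move.w  (some_word).l, d0   ; 3 words: opcode, $FFFF, $8002    16 cycles

    ; $8002 sign-extends to $FFFF8002, so both forms read the same word.
    ; An address such as $FFFF0016 is out of reach of the short form:
    ; its low word $0016 would sign-extend to $00000016 instead.
    ```

    This is why variables kept at addresses like $FFFF0002 always need the long form, while anything from $FFFF8000 upwards does not.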

  10. #10
    Hero of Algol
    Join Date
    Aug 2010
    Posts
    7,611
    Rep Power
    168

    I knew what (An) and all the nomenclature meant, but I was really reading the LEA's table completely wrong for some reason, (facepalm).

    Anyway, that's one of the reasons why I like to share anything that I do, because there are always people out there who know it better and will point out your mistakes or misconceptions. Awesome, really, thanks a lot.
    And this is also good news to me because it means I can further improve the code in terms of reducing the cycles it takes and expect more performance improvement.



    Yeah, trying to modify something which is already written and probably designed in a non-optimal way is frustrating at times, but I actually prefer to hack stuff than do it from scratch. Both because I can then try to improve/fix the games I've always loved, and also because I have developed a lot of stuff from scratch in my professional life, while hacking is still something "new" to me which I really enjoy doing.
    Last edited by Barone; 09-23-2015 at 06:14 PM.

  11. #11
    Mastering your Systems Shining Hero TmEE's Avatar
    Join Date
    Oct 2007
    Location
    Estonia, Rapla City
    Age
    29
    Posts
    10,081
    Rep Power
    108

    Quote Originally Posted by Barone View Post
    But the SUBQ is decrementing 4 bytes, using the pre-decrement on a move.w will give me only 2 bytes of decrement, no?
    I think I would need to use MOVE.l instead for it to be applicable.
    The point is to get to VDP Data port, and C00002 is that also. Praise mirrors :P
    C00000 and C00002 are Data port, C00004 and C00006 are Control port.
    EDIT: MoD already covered this.
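
    A sketch of the change being discussed, assuming a6 holds $C00004 (the control port) on entry. That starting value is inferred from the 4-byte SUBQ mentioned above and is not confirmed by the original samples:

    ```asm
    ; Before (hypothetical reconstruction)
        subq.l  #4, a6          ; a6 = $C00000, data port
        move.w  d0, (a6)        ; write one word to the data port

    ; After
        move.w  d0, -(a6)       ; a6 = $C00002, data port mirror
                                ; a pre-decrement destination on MOVE costs
                                ; the same 8 cycles as a plain (a6) write
    ```

    The catch is that a6 ends up at $C00002 rather than $C00000, so any later code that uses a6 (long-word writes, or offsets from it) needs to be checked.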
    Last edited by TmEE; 09-23-2015 at 07:45 PM.

  12. #12
    Hero of Algol
    Join Date
    Aug 2010
    Posts
    7,611
    Rep Power
    168

    This is awesome, thanks for the info!

    I'll fix all the stuff you guys have highlighted and I'll update the images after that.
