Using Assembly: AKA - Now we're cooking with GAS
Let me start by saying inline assembly is a joke - don't bother. If you're going to use some assembly in your project, do it right and make a separate assembly file. It's actually easier than trying to remember all those ridiculous formats inline assembly requires. It's also much more powerful.
Let's start with 68000 assembly; you may use it for Genesis or 32X projects. The standard assembler in the GNU toolchain is AS; it is part of binutils. You use "standard" Motorola syntax; the main differences from more common assemblers are in the directives. Let's look at a few.
The .text directive sets the current section to the code section. With the linker script I supply with my Genesis toolchain, this puts any code or data following this directive into the rom.
The .data directive sets the current section to the data section. This puts any code or data following the directive into ram... with the proper startup code. In actuality, the code and data after this directive are still in the rom, but all references are set so that once the startup code copies this code and data into ram, everything is referenced properly. If you look at the crt0.s file in my example, you'll see the loop that copies the code and data in the data section to ram.
Let's briefly look at a few handy directives you'll commonly use.
Code:
.byte 0x00,0x01,0x02,0x03
.word 0x0001,0x0203
.long 0x00010203
These directives make tables of bytes, words, and longs, respectively. You can have one value (as in the line with .long), or multiple values (as in the other two). Values are stored in big-endian format since that is the format the 68000 uses. The directives do NOT handle alignment of the data, so be careful with words and longs, which must be on at least a word boundary to avoid an address error exception. Which brings us to our next directive.
The .align directive advances the current address to the next boundary specified. For the 68000 version of AS, the value of the align directive is in bytes: 2 means align to two bytes, 4 means align to 4 bytes, etc. When in doubt, use an align directive. Be liberal - it doesn't hurt, and it is needed before words, longs, and instructions to avoid an exception. Only byte data can be on a byte boundary. Most hacks or homebrew that work in emulators but fail on real hardware have data or code on a byte boundary somewhere because the author forgot to use the align directive.
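To make the byte order and the padding concrete, here is a small host-side C model of what the data directives and the align directive put into a section. It is only a sketch: section_t, emit_byte, emit_word, emit_long and align_bytes are made-up names for illustration, and filling the padding with zero is an assumption of the model.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an output section: a byte buffer plus the
   "current address" (offset) that each directive advances. */
typedef struct {
    uint8_t bytes[64];
    size_t  pos;
} section_t;

/* .byte - one byte, no alignment requirement. */
void emit_byte(section_t *s, uint8_t v)
{
    if (s->pos < sizeof s->bytes)      /* the model simply stops when full */
        s->bytes[s->pos++] = v;
}

/* .word - 16 bits, most significant byte first (big-endian). */
void emit_word(section_t *s, uint16_t v)
{
    emit_byte(s, (uint8_t)(v >> 8));
    emit_byte(s, (uint8_t)(v & 0xFF));
}

/* .long - 32 bits, most significant byte first (big-endian). */
void emit_long(section_t *s, uint32_t v)
{
    emit_word(s, (uint16_t)(v >> 16));
    emit_word(s, (uint16_t)(v & 0xFFFF));
}

/* .align n, 68000 flavour: n is a byte count (2, 4, 16, ...).
   Pads until the current address is a multiple of n.
   Zero fill is an assumption of this model. */
void align_bytes(section_t *s, size_t n)
{
    while (s->pos % n != 0 && s->pos < sizeof s->bytes)
        emit_byte(s, 0);
}
```

Emitting one byte, then aligning to 2, then emitting the long 0x00010203 leaves six bytes in the section: the byte, one pad byte, then 00 01 02 03. Without the align step the long would start at an odd address.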
Here are two more handy directives.
Code:
.ascii "SEGA MD Example "
.asciz "SEGA MD Example "
These turn the string into a byte table of the ASCII values of the string. The difference is the asciz directive always adds a byte of 0x00 to the end of the string. That would be necessary for standard C strings, which MUST be null terminated.
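The size difference between the two is easy to see in a host-side C model. This is only a sketch; emit_ascii and emit_asciz are made-up helper names that mimic what the directives store.

```c
#include <stddef.h>
#include <string.h>

/* Model of .ascii: store the characters only.
   Returns the number of bytes emitted. */
size_t emit_ascii(unsigned char *out, const char *s)
{
    size_t n = strlen(s);
    memcpy(out, s, n);
    return n;
}

/* Model of .asciz: the same characters plus a trailing 0x00 byte.
   Returns the number of bytes emitted. */
size_t emit_asciz(unsigned char *out, const char *s)
{
    size_t n = emit_ascii(out, s);
    out[n] = 0x00;
    return n + 1;
}
```

For the 16-character string "SEGA MD Example " the first helper emits 16 bytes and the second 17. That fixed 16-byte size is handy for fixed-width header fields, while anything handed to C string functions needs the extra terminator.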
Now let's look at labels. AS for the 68000 allows you to use labels exactly the same as in C. There are no special characters needed. The labels are case sensitive, so watch the capitalization! Any label you use in assembly that isn't defined in the file is assumed to be defined in another file, so you don't need an extern declaration like you do in C. To make a label visible to other files, use the global directive.
Code:
.global gTicks
gTicks:
.long 0
The global directive tells AS that the specified label may be seen by other files. Not using the global directive makes any other defined labels local, so there is no static directive like you see in C. Labels may refer to code or data. It's up to the programmer to use what the labels refer to properly. There is no type safety - you can easily use a C array of longs as bytes in an assembly file. You could use C code as data, or vice versa.
So how do we write assembly that interacts properly with C? The rules for this are called the Application Binary Interface, or ABI. The 68000 ABI says that an assembly language function called from C must preserve registers D2 to D7 and A2 to A6. D0, D1, A0, and A1 may be modified freely without saving or restoring them. If you can, use only those registers in your assembly for better speed. When you have to use more registers, you must push them on the stack before you do so, and pop them off the stack before returning.
When the code is entered, the stack pointer points to a long that is the return address. The long at the stack pointer + 4 is the first argument passed to the function, if any. The long at the stack pointer + 8 is the second argument, if any. Additional arguments are accessed on the stack in the same manner, at + 12, + 16, + 20, etc. First, in this case, means the left-most argument, with arguments further to the right at higher addresses. All arguments are pushed as long values regardless of size! If you pass a byte from C to assembly, it's pushed on the stack as a long.
The return value, if any, should be left in D0.
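The stack layout described above can be modeled on a host machine in C. This is a sketch, not real 68000 code: m68k_stack, push_long, long_at and model_call2 are made-up names, and the return address value used with them is arbitrary.

```c
#include <stdint.h>

/* Hypothetical model of the 68000 stack: it grows downward and every
   slot is one long (4 bytes). mem[i] stands for the long at byte 4*i. */
typedef struct {
    uint32_t mem[16];
    uint32_t sp;            /* byte offset of the current top of stack */
} m68k_stack;

void stack_init(m68k_stack *s)
{
    s->sp = sizeof s->mem;  /* empty stack: sp just past the end */
}

/* move.l value,-(sp) */
void push_long(m68k_stack *s, uint32_t v)
{
    s->sp -= 4;
    s->mem[s->sp / 4] = v;
}

/* Reads the long at offset(sp), e.g. long_at(s, 4) models 4(sp). */
uint32_t long_at(const m68k_stack *s, uint32_t offset)
{
    return s->mem[(s->sp + offset) / 4];
}

/* Models a C call f(a, b): arguments are pushed right to left, each
   widened to a long, then the call pushes the return address. */
void model_call2(m68k_stack *s, int16_t a, int8_t b, uint32_t return_address)
{
    push_long(s, (uint32_t)(int32_t)b);   /* right-most argument first */
    push_long(s, (uint32_t)(int32_t)a);   /* left-most argument last   */
    push_long(s, return_address);
}
```

After model_call2, the long at offset 0 is the return address, offset 4 holds the first argument and offset 8 the second. Pushing one more long (a saved register, for example) moves every argument 4 bytes further from the stack pointer.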
Let's look at a simple example.
Code:
.align 2
| short set_sr(short new_sr);
| set SR, return previous SR
| entry: arg = SR value
| exit: d0 = previous SR value
.global set_sr
set_sr:
moveq #0,d0
move.w sr,d0
move.l 4(sp),d1
move.w d1,sr
rts
The "|" character indicates the start of a comment. You can also use /* */ for comments. Notice how we put an align directive before the function. Do this if your code ever follows byte data to avoid exceptions. Notice that since we only use D0 and D1, we don't need to save or restore any registers. Notice that the argument (short new_sr) is accessed at the stack pointer + 4. Notice how we put the old version of the status register in D0 as the return value.
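From the C side, a typical use is a short critical section: raise the interrupt mask, do the work, then put the old status register back. The sketch below is hypothetical. On the Genesis, set_sr would be the assembly routine above; here it is replaced by a host-side stand-in so the example compiles off-target. fake_sr, shared_counter and bump_counter_atomically are made-up names.

```c
#include <stdint.h>

/* Host-side stand-in for the assembly routine (an assumption for
   illustration only). On real hardware you would just declare
   "short set_sr(short new_sr);" and link against the assembly file. */
static uint16_t fake_sr = 0x2000;   /* supervisor mode, interrupt mask 0 */

short set_sr(short new_sr)
{
    short old = (short)fake_sr;
    fake_sr = (uint16_t)new_sr;
    return old;                     /* previous SR, like d0 in the asm */
}

/* Hypothetical variable that is also touched by an interrupt handler. */
volatile long shared_counter;

void bump_counter_atomically(void)
{
    short old_sr = set_sr(0x2700);  /* supervisor bit + interrupt mask 7 */
    shared_counter++;               /* nothing can interrupt this update */
    set_sr(old_sr);                 /* restore whatever mask was in use  */
}
```

Restoring the saved value, rather than writing a fixed "interrupts on" value, keeps the function safe to call from code that already had interrupts masked.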
Now let's look at a more complex example.
Code:
| short get_pad(short pad);
| return buttons for selected pad
| entry: arg = pad index (0 or 1)
| exit: d0 = pad value (0 0 0 1 M X Y Z S A C B R L D U) or (0 0 0 0 0 0 0 0 S A C B R L D U)
.global get_pad
get_pad:
move.l d2,-(sp)
move.l 8(sp),d0 /* first arg is pad number */
cmpi.w #1,d0
bhi no_pad
add.w d0,d0
addi.l #0xA10003,d0 /* pad control register */
movea.l d0,a0
bsr.b get_input /* - 0 s a 0 0 d u - 1 c b r l d u */
move.w d0,d1
andi.w #0x0C00,d0
bne.b no_pad
bsr.b get_input /* - 0 s a 0 0 d u - 1 c b r l d u */
bsr.b get_input /* - 0 s a 0 0 0 0 - 1 c b m x y z */
move.w d0,d2
bsr.b get_input /* - 0 s a 1 1 1 1 - 1 c b r l d u */
andi.w #0x0F00,d0 /* 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 */
cmpi.w #0x0F00,d0
beq.b common /* six button pad */
move.w #0x010F,d2 /* three button pad */
common:
lsl.b #4,d2 /* - 0 s a 0 0 0 0 m x y z 0 0 0 0 */
lsl.w #4,d2 /* 0 0 0 0 m x y z 0 0 0 0 0 0 0 0 */
andi.w #0x303F,d1 /* 0 0 s a 0 0 0 0 0 0 c b r l d u */
move.b d1,d2 /* 0 0 0 0 m x y z 0 0 c b r l d u */
lsr.w #6,d1 /* 0 0 0 0 0 0 0 0 s a 0 0 0 0 0 0 */
or.w d1,d2 /* 0 0 0 0 m x y z s a c b r l d u */
eori.w #0x1FFF,d2 /* 0 0 0 1 M X Y Z S A C B R L D U */
move.w d2,d0
move.l (sp)+,d2
rts
| 3-button/6-button pad not found
no_pad:
move.w #0xF000,d0 /* SEGA_CTRL_NONE */
move.l (sp)+,d2
rts
| read single phase from controller
get_input:
move.b #0x00,(a0)
nop
nop
move.b (a0),d0
move.b #0x40,(a0)
lsl.w #8,d0
move.b (a0),d0
rts
Notice how we use D2, so we must save it at the start, and restore it before we return. Because we pushed D2 onto the stack, any arguments are now one long further back on the stack; that is why the argument is now accessed at the stack pointer + 8 instead of + 4. Notice that we use both kinds of comments in this function. Be sure to comment your code well enough that you can figure out what you were doing when you look at it years later. Again, the return value is left in D0.
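On the C side, the value returned by get_pad can be picked apart with bit masks taken straight from the format in the function's header comment (0 0 0 1 M X Y Z S A C B R L D U). The following is a sketch; the enum and helper names are made up for illustration, and it assumes a set bit means "pressed", which is what the final eori in the routine suggests.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions read off the header comment, right-most letter = bit 0.
   All names here are hypothetical. */
enum {
    PAD_UP    = 0x0001, PAD_DOWN  = 0x0002,
    PAD_LEFT  = 0x0004, PAD_RIGHT = 0x0008,
    PAD_B     = 0x0010, PAD_C     = 0x0020,
    PAD_A     = 0x0040, PAD_START = 0x0080,
    PAD_Z     = 0x0100, PAD_Y     = 0x0200,
    PAD_X     = 0x0400, PAD_MODE  = 0x0800,
    PAD_SIX_BUTTON = 0x1000,   /* the "1" in front of M X Y Z */
    PAD_NONE       = 0xF000    /* value returned at no_pad    */
};

/* get_pad returns a short; cast it to uint16_t before calling these
   so 0xF000 is not seen as a negative number. */
bool pad_present(uint16_t v)
{
    return v != PAD_NONE;
}

bool pad_is_six_button(uint16_t v)
{
    return pad_present(v) && (v & PAD_SIX_BUTTON) != 0;
}

bool pad_pressed(uint16_t v, uint16_t button)
{
    return pad_present(v) && (v & button) != 0;
}
```

A call would look like pad_pressed((uint16_t)get_pad(0), PAD_START). Checking pad_present first matters because 0xF000 shares no bits with any button, but it is clearer to treat "no pad" as its own case than to rely on that.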
We are not going to cover more in this post as this should be enough to get you going. If you wish to know more of the directives, especially macros, consult the binutils documentation for AS.
Now let's look at AS for the SH2. It's very nearly the same as AS for the 68000, as all platforms use most of the same directives, so we mainly need to concern ourselves with the differences. The data directives are the same, but alignment differs: words need to be on word boundaries, like the 68000, but longs need to be on long boundaries. The 68000 can take longs on word or long boundaries; the SH2 cannot. Which brings us to the first difference in the directives: for the SH2, the value used with the align directive is the power of two to align to, not a byte count, so .align n aligns the current address to 2^n bytes. For example, the following are the same.
68000
Code:
.align 2
.align 4
.align 16
SH2
Code:
.align 1
.align 2
.align 4
SH2 code must be word aligned, but is slightly faster if aligned to a 16 byte boundary (.align 4).
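If you keep mixing the two conventions up, this small C sketch shows the mapping. The function names are made up for illustration; align_up assumes the boundary is a power of two.

```c
#include <stddef.h>

/* 68000 AS: the .align argument is already a byte count. */
size_t align_68000_bytes(unsigned arg)
{
    return arg;
}

/* SH2 AS: the .align argument is a power of two, so .align 4 = 16 bytes. */
size_t align_sh2_bytes(unsigned arg)
{
    return (size_t)1 << arg;
}

/* Next address at or above addr that is a multiple of boundary.
   boundary must be a power of two. */
size_t align_up(size_t addr, size_t boundary)
{
    return (addr + boundary - 1) & ~(boundary - 1);
}
```

So ".align 2" moves address 0x1003 to 0x1004 under either assembler, but for different reasons: a 2-byte boundary on the 68000 and a 4-byte boundary on the SH2.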
Our next difference is in the labels: with the SH2 toolchain, a label needs a "_" (underscore) character in front of it if it is defined in C code, or if you wish to access the label from C. The name used in the C source has no underscore. For example.
Code:
! Cache clear line function
! On entry: r4 = ptr - should be 16 byte aligned
.align 4
.global _CacheClearLine
_CacheClearLine:
mov.l _cache_flush,r0
or r0,r4
mov #0,r0
mov.l r0,@r4
rts
nop
This function would be called from C like this.
Code:
CacheClearLine(&data_variable);
Some more differences are immediately apparent. You use "!" for a comment instead of "|"; you can still use /* */ for comments. The next difference is in how arguments are passed in: the first argument is passed in R4, the second in R5, the third in R6, and the fourth in R7. Any other arguments are pushed on the stack. Use no more than four arguments for best speed. On the SH2, you must save R8 to R14 if you use them, and the PR register if your function calls another function, since the call overwrites the return address held in PR. For example.
Code:
! Entry: r4 = sound buffer to fill
.global _fill_buffer
_fill_buffer:
sts.l pr,@-r15 /* save return address */
mov.l r8,@-r15
mov.l r9,@-r15
mov.l r10,@-r15
mov.l r11,@-r15
mov.l r12,@-r15
mov.l r13,@-r15
mov.l r14,@-r15
mov r4,r14
! lots of code left out here as it isn't needed for the example
mov.l @r15+,r14
mov.l @r15+,r13
mov.l @r15+,r12
mov.l @r15+,r11
mov.l @r15+,r10
mov.l @r15+,r9
mov.l @r15+,r8
lds.l @r15+,pr /* restore return address */
rts
nop
A practical example of an assembly routine is my code to stretch the frame buffer contents - it takes half a line of data and doubles each pixel to fill an entire line.
Code:
! void ScreenStretch(int src, int width, int height, int interp);
! On entry: r4 = src pointer, r5 = width, r6 = height, r7 = interpolate
.align 4
.global _ScreenStretch
_ScreenStretch:
cmp/pl r7
bt ss_interp
! stretch screen without interpolation
0:
mov r5,r3
shll r3
mov r3,r2
shll r2
add r4,r3
add r4,r2
1:
add #-2,r3
mov.w @r3,r0
extu.w r0,r1
shll16 r0
or r1,r0
mov.l r0,@-r2
cmp/eq r3,r4
bf 1b
/* next line */
mov.w ss_pitch,r0
dt r6
bf/s 0b
add r0,r4
rts
nop
ss_interp:
! stretch screen with interpolation
0:
mov r5,r3
shll r3
mov r3,r2
shll r2
add r4,r3
add r4,r2
mov #0,r7
1:
add #-2,r3
mov.w @r3,r0
mov.w ss_mask,r1
and r0,r1 /* masked curr pixel */
shll16 r0
add r1,r7 /* add to masked prev pixel */
shlr r7 /* blended pixel */
or r7,r0 /* curr pixel << 16 | blended pixel */
mov r1,r7 /* masked prev pixel = masked curr pixel */
mov.l r0,@-r2
cmp/eq r3,r4
bf 1b
/* next line */
mov.w ss_pitch,r0
dt r6
bf/s 0b
add r0,r4
rts
nop
ss_mask:
.word 0x7BDE
ss_pitch:
.word 640
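If the SH2 code is hard to follow, a plain C model of the same idea may help. This is a sketch written from reading the assembly above, not a drop-in replacement: screen_stretch_model is a made-up name, the pitch is passed in as a parameter in pixels (the assembly hard-codes 640 bytes), and pixels are assumed to be 16 bits each.

```c
#include <stdint.h>

#define SS_MASK 0x7BDE  /* same mask as ss_mask in the assembly */

/* Doubles the first `width` pixels of each line in place.
   Walks each line from right to left so no source pixel is
   overwritten before it has been read. */
void screen_stretch_model(uint16_t *fb, int width, int height,
                          int pitch_pixels, int interp)
{
    for (int y = 0; y < height; y++) {
        uint16_t *line = fb + (long)y * pitch_pixels;
        uint32_t prev = 0;              /* masked pixel to the right */

        for (int i = width - 1; i >= 0; i--) {
            uint16_t curr = line[i];
            uint16_t second = curr;     /* plain doubling by default */

            if (interp > 0) {
                /* Masking off bits lets the add and shift average
                   the fields without carries leaking between them. */
                uint32_t masked = curr & SS_MASK;
                second = (uint16_t)((prev + masked) >> 1);
                prev = masked;
            }

            line[2 * i]     = curr;     /* original pixel            */
            line[2 * i + 1] = second;   /* copy or blend with right  */
        }
    }
}
```

Note that the first pixel handled on each line (the right-most one) is blended with zero when interpolating, which matches the assembly clearing r7 at the start of every line.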
As far as the sections go, with my 32X toolchain, the text section still goes to the rom, while data goes to the SDRAM.
Okay, that's enough for now. I encourage people to look at the assembly files in the Tic Tac Toe examples, as well as the Yeti3D example.