

What a Compiler Must Get Right (That You Don't)

When you write assembly, you know the context. You know which registers hold what, whether a pointer is aligned, whether the loop count fits in a word. You know because you put it there.

Consider a demo screen where you reserve a6 for the background rasters. You update the pointer in the VBL interrupt and just advance with a minimal (a6)+ in the HBL. All your other code simply does not touch a6, because you wrote all of it. Need to update a screen pointer? Just write the global. No function call, no overhead, no uncertainty.

A compiler has no such luxury. It must be correct for every possible input the language allows. It cannot "just know" that a pointer is word-aligned, or that two buffers never overlap, or that a register is free. It must prove it, or assume the worst. And not all code it calls may even have been compiled by it. Maybe it was built with an older compiler, a different language, or maybe you wrote it in assembly yourself.

Word Size and -mshort

To an Atari ST assembly programmer, word size is the natural default. Reading and writing longs costs 4 extra bus cycles. Almost every integer you need fits comfortably in ±32,767. Just as you naturally type int a = b in C, you almost instinctively reach for move.w var_b,d1 in assembly.

But GCC defaults to 32-bit integers. So a + b becomes add.l d0,d1, costing 4 extra cycles. Every integer on the stack costs 4 bytes instead of 2. And every integer multiply on 68000 becomes a library call to __mulsi3: three multiply operations plus overhead, costing 270–360 cycles where a single muls.w would cost 40–72. Who remembers to type short instead of int every time? The C standard library certainly does not.

The Atari compilers of old, PureC and HiSoft C, used a 16-bit int. The C standard allows it (short ≤ int ≤ long), and it is far more natural for our CPU. GCC's -mshort flag restores this by making int 16 bits wide, and both mintlib and libcmini are available in short integer-ready versions from Thorsten Otto's download page. Even on the Falcon030, Atari in their infinite wisdom kept the ST memory data bus 16 bits wide, so short integers win there too.

long is still 32-bit with -mshort, and so are size_t and ptrdiff_t. Use them when you need the range. ±32K covers screen coordinates in all ST and Falcon resolutions, but the byte offset to the last pixel of a 320×200×4-bitplane screen is 32,000, right at the edge. Add a few more rows, say by opening that lower border, and long is required to reach them.
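To make the edge concrete, here is a sketch; the exact number of extra lines gained by opening the border varies by effect, so treat 247 as illustrative:

/* 320x200 in 4 bitplanes: 160 bytes per scanline, 32,000 bytes per screen */
short top    = 199 * 160;    /* 31,840: the last full row still fits in a word */
long  bottom = 247L * 160;   /* 39,520: lower border open, a word overflows */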

When you do mix widths, the CPU must sometimes extend. Sign extension is cheap: ext.w and ext.l take 4 cycles. But there is no zero-extend instruction, so unsigned values need an and.l #$ffff,d0 to clear those upper bits, costing 16 cycles. This favors signed types when the range allows it, and staying in one width to avoid extension altogether.
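A minimal illustration of the asymmetry, widening both flavors of a 16-bit value to a long (the exact instructions depend on register allocation, but the pattern holds):

short          s = -1234;
unsigned short u = 1234;
long a = s;    /* ext.l %d0          - sign extend,  4 cycles */
long b = u;    /* and.l #0xffff,%d0  - zero extend, 16 cycles */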

Signed Division: Two Paths Where You Expected One

In assembly, dividing by 16 is natural: asr.l #4,d0, done. You know the value is a screen coordinate, always positive, so there is nothing more to think about. And when writing C, you know the compiler will optimize divide by 16 into a shift for you, so you write x / 16 for readability. Right?

Not quite. The C standard says signed integer division truncates toward zero: 13 / 16 and -13 / 16 must both give zero. But an arithmetic right shift truncates toward negative infinity: -13 >> 4 gives -1, not zero. They are not interchangeable for negative numbers.

So a divide by 16 compiles to both shifts, one for positive and one for negative values, with a branch to choose between them. Here is what stock GCC-15.2 produces in put_pixel:

    tst.l %d0                  | is x negative?
    jlt .L9                    | yes — take the fixup path
    asr.l #4,%d0               | no — simple shift, the common case
    ...
.L9:
    moveq #15,%d4              | bias for rounding toward zero
    add.l %d4,%d0
    asr.l #4,%d0               | shift with bias applied

That is an extra branch, an extra add, and an entire duplicate code path — all to handle the case of negative screen coordinates that will never happen.

The fix is simple: make a habit of either using unsigned types when you can (even casting just for the computation works), or using >> directly, as you would in asm. For my game in the making, replacing divides with explicit shifts alone reduced the binary size by almost 4 KB.
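Here are the three spellings side by side. They intentionally differ for negative x, which is exactly why the compiler cannot substitute one for another on its own:

short div_plain(short x)    { return x / 16; }                  /* test, branch, bias */
short div_unsigned(short x) { return (unsigned short)x / 16; }  /* single lsr.w */
short div_shift(short x)    { return x >> 4; }                  /* single asr.w */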

A third option: __builtin_assume(x >= 0) promises the compiler that x is never negative, so it emits the simple shift for the divide without changing the expression itself. If the promise is false, the compiler may generate invalid code.

I am in the habit of writing my own assert macros that crash when run on the macOS host, and emit __builtin_assume for the Atari target. Correctness enforced on host, performance ensured on target.
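A minimal sketch of such a macro, assuming a build-defined TARGET_ATARI; the target side uses the classic if (!cond) __builtin_unreachable() idiom, which GCC treats as the same promise as an assume hint:

#include <assert.h>

#ifdef TARGET_ATARI
/* A promise, never a check: the optimizer may exploit it freely. */
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
/* On the host build it is a real check that crashes loudly. */
#define ASSUME(cond) assert(cond)
#endif

Used as ASSUME(x >= 0); before the divide, GCC drops the negative fixup path just as the cast would.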

While on compiler hints: __builtin_expect(condition, 1) tells the compiler which way a branch will likely go, so it can lay out the common path as fall-through. On 68000 the difference is modest (12 vs 8 cycles), but on a 68060 a correctly predicted branch is essentially free while a misprediction costs more than three multiply instructions. Since C++20 the prettier [[likely]] and [[unlikely]] attributes work too, though __builtin_expect has the benefit of also being available in plain C. If you find either syntax ugly, hide them behind a #define. I do :).
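The usual wrappers keep call sites readable; handle_overrun here is a stand-in for your own error path:

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

extern void handle_overrun(void);

void consume(short bytes_left) {
    if (UNLIKELY(bytes_left == 0))
        handle_overrun();    /* cold path, laid out away from the fall-through */
}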

Side Effects and Pointer Aliasing

A C compiler must assume that any function call can modify any global or pointed-to memory. You can help with __attribute__((pure)) annotations (reads memory but does not modify it, think strcmp) and the even stronger __attribute__((const)) (result depends only on arguments, no memory reads at all, think sqrt). I wish the names had been the other way around, but here we are.
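Both are declaration-site annotations. These prototypes are illustrative, not from any real library:

/* pure: may read globals and pointed-to memory but writes nothing,
   so two calls with the same arguments can be merged */
__attribute__((pure)) int palette_index(const unsigned short *pal, unsigned short rgb);

/* const: touches no memory at all, result depends only on the arguments,
   so calls can be hoisted out of loops entirely */
__attribute__((const)) short sin_q14(unsigned short angle);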

Even without function calls, writes through one pointer can invalidate reads through another unless the compiler can prove they point to different memory. Consider a matrix-vector multiply:

void matrix_mul(short *a, short *b, short *r, unsigned short n) {
    /* r = A * b, where a is an n-by-n matrix stored row-major */
    for (unsigned short i = 0; i < n; i++) {
        r[i] = 0;
        for (unsigned short j = 0; j < n; j++) {
            r[i] += a[i * n + j] * b[j];
        }
    }
}

You would keep the sum in a register and write it once after the loop. But a, b, and r are all short* pointers, and the compiler cannot prove they do not overlap. If r[i] and a[i * n + j] happen to be the same location, each write to r[i] could change what the next read of a[...] returns. So the accumulator must be written back every iteration.

Here is the inner loop that GCC generates (compiled with -O2 -mshort -mfastcall):

.L3:
    moveq #0,%d1
    move.w %d2,%d1
    add.l %d1,%d1
    move.w (%a2,%d1.l),%d1    | load a[i*n+j]
    muls.w (%a0)+,%d1          | multiply by b[j]
    add.w %d1,%d0              | accumulate sum
    move.w %d0,-2(%a1)         | write sum back to r[i] EVERY iteration
    addq.w #1,%d2
    cmp.w %d3,%d2
    jne .L3

That move.w %d0,-2(%a1) inside the loop: 12 wasted cycles per iteration. The fix is __restrict (or restrict in C99), a promise that this pointer never aliases any other pointer you have access to:

void matrix_mul(short *__restrict a, short *__restrict b,
                short *__restrict r, unsigned short n) {

And the inner loop turns to this:

.L12:
    moveq #0,%d1
    move.w %d2,%d1
    add.l %d1,%d1
    move.w (%a2,%d1.l),%d1    | load a[i*n+j]
    muls.w (%a0)+,%d1          | multiply by b[j]
    add.w %d1,%d0              | accumulate in register
    addq.w #1,%d2
    cmp.w %a1,%d2
    jne .L12
    move.w %d0,(%a3)+          | write r[i] ONCE after inner loop

The write happens once after the loop instead of N times inside it. This is a practical takeaway regardless of which compiler you use: if two pointers of the same type never overlap, mark them as restricted. The compiler cannot know unless you tell it. But be careful: just as with assume hints, if you get it wrong the compiler will generate incorrect code.

But As Assembly Programmers, We Make Our Own Rules

In assembly, conventions are whatever we decide. Your polygon function takes a list of vertex pointers in a0, or maybe a global pointer. Maybe a zero value terminates the list, or maybe the vertex count lives in d7 because it happened to be free. An SNDH music player expects the subsong number in d0 for init, as defined by the format. Which registers are clobbered? Document it, and callers will save what they need.

Except that not everything is documented. The SNDH format defines where the subsong number is, but not which registers the init, replay, or teardown routines clobber. For a TTraK-exported song, it may clobber none of them. For an old Mad Max tune, it might be all of them. If you replay from a timer interrupt, you had better save every register, just in case. The freedom of assembly becomes a liability: without an enforced contract, the only safe option is the most conservative one.

With great power comes great responsibility.

ABIs: The Contract a Compiler Cannot Break

This is exactly why you need an ABI: a strict set of rules that every compiled function must follow. It must work across compilation units, across libraries, and even across compiler versions and programming languages.

If you have ever pushed arguments onto the stack before a TOS trap, you already understand the principle:

    clr.l -(%sp)              | push NULL (4 bytes, a long)
    move.w #32,-(%sp)          | push Super function number (2 bytes, a word)
    trap #1                    | call GEMDOS
    addq.l #6,%sp              | clean up the stack

This is equivalent to calling long Super(short func, long param). The caller pushes arguments right to left onto the stack, the callee reads them, and the caller cleans up. It is slow (every argument costs two memory accesses: one push, one read), but straightforward.

The Default: Motorola Unix ABI

By default, GCC uses the Motorola Unix ABI, which works similarly, but at least does not require a trap.

Every argument passes through memory, pushed from right to left onto the stack. For put_pixel with its four arguments, that means four pushes by the caller and four stack reads by the callee, as you can see in the Part 1 listing, where every argument is loaded from a stack offset.

-mfastcall: Register-Passing ABI

Thorsten Otto's GCC fork adds a fastcall ABI via -mfastcall, a register-passing convention similar to what PureC and PurePascal used. HiSoft C let you configure register passing to some extent, but the principle is the same. We do it informally in assembly because it is easier to type and easier to keep in your head; the fastcall ABI just formalizes it. Arguments that fit in registers go in registers; the rest spills to the stack.

                Motorola Unix (default)    Fastcall
Integer args    stack                      d0-2, then stack
Pointer args    stack                      a0-1, then stack
Return value    d0 and a0                  d0 or a0
Clobbered       d0-1/a0-1                  d0-2/a0-1
Preserved       d2-7/a2-6                  d3-7/a2-6

Structs, unions, and floats that fit in 32 bits are passed in a register and returned in d0. Types larger than 32 bits are passed on the stack. For return values larger than 32 bits, the caller passes a hidden pointer in a1 to where the result should be stored.

This hidden pointer has an important consequence for C++. A non-static member function has an implicit this pointer, passed in a0. If that function also returns a struct larger than 32 bits, the hidden return pointer goes in a1. That is both address argument registers consumed before you even get to your actual parameters — everything else goes on the stack. Register pressure is a real constraint, and one we will explore in depth in a future post when we get to the register allocator.
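A sketch of the squeeze, with hypothetical types; the register assignments follow the fastcall rules above:

struct Vec3 { short x, y, z; };    /* 6 bytes: too large to return in d0 */

struct Object {
    Vec3 pos;
    /* Under -mfastcall: 'this' arrives in a0, the hidden pointer to the
       returned Vec3 in a1. Both address registers are spoken for, so a
       pointer parameter here would already spill to the stack. */
    Vec3 scaled(short s) const {
        Vec3 r = { (short)(pos.x * s), (short)(pos.y * s), (short)(pos.z * s) };
        return r;
    }
};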

-flto: Link-Time Optimization

The recurring qualifier throughout this post has been "unless the compiler can prove." It cannot assume two pointers do not alias, that a loop count fits in a word, or that a function has no side effects. Proof requires visibility, and when compiling render.c, the compiler has no idea what update_palette() in palette.c actually does. Three instructions? Clobbers a global? It must assume the worst.

-flto changes this. Link-Time Optimization compiles every source file into an intermediate representation instead of machine code. At link time, the compiler sees the entire program as one unit. Now it can prove that update_palette() not only never writes to the screen buffer, but is also trivial, so inline it. Or that two pointers always point to different arrays, so reorder freely.
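A sketch of what that visibility buys, with hypothetical files and names:

/* palette.c: trivial, and provably does not touch the caller's buffer */
void update_palette(void) {
    *(volatile unsigned short *)0xffff8240 = 0x0777;    /* ST palette register 0 */
}

/* render.c: without LTO, buf[0] must be reloaded after the call;
   with -flto the call is inlined and buf[1] becomes a constant 2 */
extern void update_palette(void);
void render(short *buf) {
    buf[0] = 1;
    update_palette();
    buf[1] = buf[0] + 1;
}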

For C++ with many small methods, constructors, accessors, iterators, LTO is transformative. The cost is compile time: the final link reprocesses everything, roughly doubling a full build. A practical workflow: develop with -O2, ship with -O2 -flto.

How Much Does It Help?

For a library of many small functions like libcmini, the prologue and epilogue dominate and the win for the fastcall ABI is large. For a big C++ project with aggressive inlining and link-time optimization, the function body dominates and the win is smaller. Across the projects I have worked with (Chroma Grid, libcmini, CoreMark, etc.) the improvement is in the 5–10% range. No guarantees, but an easy win. And as a bonus, it makes writing assembly callable from C much easier: your arguments are right there in registers, just as I like them.

For my game in progress, the executable is 87,003 bytes without -flto and 72,766 bytes with it, plus a double-digit performance improvement to the per-frame update.

Putting It All Together

Let us revisit put_pixel. Here is the Part 1 version again, stock GCC-15.2, -O2, 32-bit integers, default stack ABI, and divide by 16:

put_pixel:                             | stock GCC-15.2, -O2
    movem.l %d2-%d4,-(%sp)
    move.l 20(%sp),%d0                 | x — from stack
    move.l 24(%sp),%a0                 | y — in an address register?
    moveq #15,%d1
    and.l %d0,%d1
    move.l #32768,%d2
    lsr.l %d1,%d2                      | mask
    moveq #0,%d3
    move.b 31(%sp),%d3                 | col — from stack
    move.l %a0,%d1                     | y * 80 done as y*5*16
    add.l %a0,%d1
    add.l %d1,%d1
    add.l %a0,%d1
    add.l %d1,%d1
    add.l %d1,%d1
    tst.l %d0                          | is x negative?
    jlt .L9                            | signed division fixup path
    asr.l #4,%d0
    ...                                | address calculation, ~mask setup
    move.w %d1,%a1                     | ~mask stashed in address register
.L5:
    move.w (%a0)+,%d1
    btst %d0,%d3
    jeq .L3
    or.w %d2,%d1
    move.w %d1,-2(%a0)                 | write back with negative offset
    addq.l #1,%d0
    moveq #4,%d1
    cmp.l %d0,%d1
    jne .L5
    movem.l (%sp)+,%d2-%d4
    rts
.L3:                                   | same as above but with ~mask
    ...
.L9:                                   | entire duplicate path for x < 0
    ...

Now here is the same function, with one code change, x / 16 → x >> 4, and compiled with the added -mshort -mfastcall options:

put_pixel:                             | stock GCC-15.2, -O2 -mshort -mfastcall
    move.l %d4,-(%sp)
    move.l %d3,-(%sp)
    move.w %d0,%d3                     | x already in d0
    and.w #15,%d3
    move.w #-32768,%d4
    lsr.w %d3,%d4                      | mask — word shift, not long
    clr.w %d3
    move.b %d2,%d3                     | col already in d2
    muls.w #20,%d1                     | y * 20 — one instruction
    asr.w #4,%d0                       | x >> 4 — no sign fixup needed
    ...                                | address calculation, ~mask setup
    add.l %d1,%a0                      | screen already in a0
.L4:
    move.w (%a0)+,%d1
    move.w %d3,%d2
    asr.w %d0,%d2
    btst #0,%d2
    jeq .L2
    or.w %d4,%d1
    move.w %d1,-2(%a0)
    addq.w #1,%d0
    cmp.w #4,%d0
    jne .L4
    move.l (%sp)+,%d3
    move.l (%sp)+,%d4
    rts
.L2:                                   | same as above but with ~mask
    ...

From 39 instructions down to 28. One less callee-saved register, no stack loads, no signed-division fixup, word operations throughout. On the 68000, the fastest path drops from 584 to 532 cycles.

Not bad for zero backend changes — just compiler flags and one source-level tweak. Almost 30% less code, 10% faster, and it can get better...


Takeaway: No matter how good a compiler is, it cannot rewrite your bubble sort to use a better-suited algorithm. Step one is always to choose the right algorithm for the task. Step two is to provide as much information as possible for the compiler to work with.

  • Compile with -mshort -mfastcall -flto
  • Use explicit shifts for power-of-two divides, or __builtin_assume to promise non-negative values
  • Mark unreachable code paths with __builtin_unreachable
  • Prefer signed types over unsigned to avoid costly zero-extension
  • Force small functions to be inlined with [[gnu::always_inline]]
  • Mark non-aliasing pointers __restrict / restrict
  • Mark pure functions [[gnu::pure]] or [[gnu::const]]
  • Mark cold branches [[unlikely]] or use __builtin_expect

Now that we know the rules GCC has to abide by, in the next post we shall explore how GCC actually transforms your code, from C text through ~300 transformation passes to the assembly that lands in your executable.

Comments

Troed
Monday, 23 March 2026 21:40
These posts are pure gold. Thanks for letting us read!
mikro
Tuesday, 24 March 2026 00:31
Just a small correction (unless I overlooked something):

> and both mintlib and libcmini are available in short integer-ready versions from Thorsten Otto's download page.

His website explicitly says:

Quote:
By default, sizeof(int) == 4, but if you compile with -mshort then sizeof(int) == 2 (unsupported by the current MiNTLib).
libcmini does support -mshort, though.
PeyloW
Tuesday, 24 March 2026 02:14
Quoting mikro:
Quote:
By default, sizeof(int) == 4, but if you compile with -mshort then sizeof(int) == 2 (unsupported by the current MiNTLib).
Curses. Yes I believe you are right.
mintlib compiles with no warnings with -mshort for me, but when I try using it instead of libcmini my test executable does not launch.
Shame. I guess I know what my next project can be ;).
42Bastian
Tuesday, 24 March 2026 04:28
Great post. Any C/C++ should at least be aware what is going on under the hood!
42Bastian
Tuesday, 24 March 2026 04:29
C/C++ programmer I meant ;-)
zerkman
Wednesday, 25 March 2026 14:27
After using GCC for decades (not on atari though), you may sort of be able to make an educated guess on what's going on behind the scenes.
But this is nothing like opening the hood and actually seeing it working.
Thanks for those posts PeyloW, really looking forward to reading the upcoming ones ;)

