http://users.ece.utexas.edu/~valvano/assmbly/stack.htm

Stack Alignment

“Stack alignment” just means the stack pointer (SP, ESP, or RSP) holds an address that is a multiple of the machine word size (so always divisible by 8 in 64-bit mode, 4 in 32-bit, 2 in 16-bit). Generally this means you assume your BIOS / bootloader / OS / runtime initializes it that way, and you only change the stack pointer by multiples of the word size. Writing 64-bit software and want to push/pop a single byte on the stack? To keep the stack aligned, you should change RSP by 8 (the number of bytes in 64 bits). You don’t have to do anything with the extra 7 bytes, but you need to allocate them to maintain the invariant that RSP is always a multiple of 8.
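The "only move the stack pointer by whole words" rule can be sketched in C. This is a minimal illustration, not taken from any particular runtime; the helper names `round_up_to_word` and `is_word_aligned` are made up for this note, and both assume the word size is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round `size` up to the next multiple of `word` (word must be a power of two).
   This is how many bytes the stack pointer should move to store `size` bytes. */
size_t round_up_to_word(size_t size, size_t word)
{
    assert(word != 0 && (word & (word - 1)) == 0);
    return (size + word - 1) & ~(word - 1);
}

/* True when `addr` is a multiple of `word`, i.e. its low bits are all zero. */
int is_word_aligned(uintptr_t addr, size_t word)
{
    assert(word != 0 && (word & (word - 1)) == 0);
    return (addr & (word - 1)) == 0;
}
```

For example, `round_up_to_word(1, 8)` is 8: pushing a single byte in 64-bit code still costs a full 8-byte slot.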

Why is this a thing? It’s because the CPU, cache, and memory generally talk to each other only about aligned addresses (the low address bits are always zero, which allows fewer address pins and less hashmap pressure). But it complicates what the hardware must do when software asks for an unaligned read/write: an access that crosses an alignment boundary has to be split into two. Some CPUs handle this by making unaligned memory access an error.
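To make the "split it into two" case concrete, here is a small sketch that counts how many aligned blocks a single access touches. The function name `blocks_touched` and the idea of a fixed power-of-two block size are assumptions for illustration; real hardware differs in block size and in how it handles the split.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of aligned `block`-byte units that an access of `size` bytes
   starting at `addr` overlaps. A result above 1 means the access
   crosses an alignment boundary and must be split. */
size_t blocks_touched(uintptr_t addr, size_t size, size_t block)
{
    assert(size > 0 && block != 0 && (block & (block - 1)) == 0);
    uintptr_t first = addr / block;              /* block holding the first byte */
    uintptr_t last  = (addr + size - 1) / block; /* block holding the last byte  */
    return (size_t)(last - first + 1);
}
```

A 4-byte read at 0x1000 stays inside one 4-byte block, while the same read at 0x1002 touches two.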

What about x86? Unaligned memory access is allowed on x86, but it’s slower, so it’s better for software to avoid it where possible [1]. Low-level libraries, OSes, compilers, etc. are therefore all written to keep the stack pointer aligned, and the same is advised for any hand-written assembly code.

[1] Look up “structure padding” for your favorite C compiler and you’ll find out how it puts empty bytes into a struct’s memory layout so that individual fields are aligned; you can often configure or disable this behavior. Eric S. Raymond discusses this in “The Lost Art of Structure Packing” and explains how to re-order your structure fields to minimize the memory wasted on this padding.
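A small sketch of the padding effect described in [1]. The struct names and the `wasted_bytes` helper are invented for this note, and exact sizes are implementation-defined; on typical x86-64 ABIs with a 4-byte `int`, the first struct is usually 12 bytes and the reordered one 8.

```c
#include <assert.h>
#include <stddef.h>

/* char, int, char: the compiler usually pads after each char
   so that the int (and the next array element) stays aligned. */
struct padded {
    char a;
    int  b;
    char c;
};

/* Same fields, largest first: the two chars can sit next to each other. */
struct reordered {
    int  b;
    char a;
    char c;
};

/* Bytes of a struct of `struct_size` bytes that are not occupied by
   the three fields above, i.e. the padding. */
size_t wasted_bytes(size_t struct_size)
{
    size_t payload = sizeof(int) + 2 * sizeof(char);
    assert(struct_size >= payload);
    return struct_size - payload;
}
```

Comparing `wasted_bytes(sizeof(struct padded))` with `wasted_bytes(sizeof(struct reordered))` shows how much the reordering saves on a given compiler.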

  • A sacrifice of memory to gain access speed.

To align a stack on a 16-byte boundary, the number of bytes allocated must leave the top of the stack at an address that is a multiple of 16. On entry to a function, %RSP modulo 16 is 8, because the call pushed an 8-byte return address onto a 16-byte-aligned stack. A function that allocates N bytes directly with ‘sub rsp, N’ therefore needs N modulo 16 = 8. Example:

  • According to the System V AMD64 ABI, the stack must be 16-byte aligned at the point of every call
N modulo 16 = 8 <--- ALIGNED
 
-------------------------------------------------
 
44 modulo 16 = 12 <--- MISALIGNED
+4 Bytes
48 modulo 16 = 0 <--- MISALIGNED
+8 Bytes
56 modulo 16 = 8 <--- ALIGNED
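The reason the target is 8 rather than 0 can be checked with a little arithmetic on the stack pointer. The sketch below models RSP as a plain integer; the function names and the sample address are assumptions for illustration, and it covers only the case where the function allocates with a single ‘sub rsp, N’ and pushes nothing else.

```c
#include <assert.h>
#include <stdint.h>

/* RSP as the callee sees it on entry: the call instruction pushed
   an 8-byte return address onto a 16-byte-aligned stack. */
uint64_t rsp_at_entry(uint64_t rsp_before_call)
{
    assert(rsp_before_call % 16 == 0); /* alignment required at the call site */
    return rsp_before_call - 8;
}

/* RSP after the callee reserves `n` bytes with `sub rsp, n`. */
uint64_t rsp_after_alloc(uint64_t rsp_entry, uint64_t n)
{
    return rsp_entry - n;
}
```

With a hypothetical starting value of 0x7fff0000, allocating 56 bytes leaves RSP on a multiple of 16 again, while 44 or 48 bytes do not.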
 

Example code:

#include <stdio.h>

void do_math(void)
{
    int x = 10;
    int y = 44;
    int z = 36;
    int w = 109;
    int a[4] = { 1, 2, 3, 4 };
    a[0] = x * a[0];
    a[1] = y * x;
    a[2] = a[1] * z;
    a[3] = w * a[2];
    printf("%d\n", a[3]);
}

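In the example we have a total of 8 variables of type ‘int’ (x, y, z, w and the four elements of a) and the 4 characters used by ‘printf()’ (the format string "%d\n" plus its terminating NUL).

We need to allocate a total of 32 + 4 = 36 bytes. (This count follows the notes; in practice compilers usually place string literals in read-only data rather than in the stack frame.)

36 MODULO 16 IS 4, NOT 8, AND THAT IS NOT GOOD. SO:

  1. Adding 8 gives 44 (44 modulo 16 = 12, not 8)
  2. Adding 4 instead gives 40 (40 modulo 16 = 8), ALIGNED!

So we need to allocate 4 extra (unused) bytes to achieve the alignment, for a total of 40.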
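The padding step can be written as a small function. This is only a sketch of the "N modulo 16 = 8" rule used in these notes; `padded_frame_size` is a made-up name, and a real compiler also accounts for saved registers, frame pointers, and variable alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest n >= bytes_needed such that n % 16 == 8. */
size_t padded_frame_size(size_t bytes_needed)
{
    assert(bytes_needed <= SIZE_MAX - 16); /* guard against wrap-around */
    size_t n = bytes_needed;
    while (n % 16 != 8)
        n++;
    return n;
}
```

For the 36 bytes counted above this returns 40, and for 44 bytes it returns 56, matching the earlier modulo table.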

mov qword ptr [rsp+24], r8
mov qword ptr [rsp+16], rdx
mov dword ptr [rsp+8], ecx
sub rsp, 40
call do_math
xor eax, eax
add rsp, 40
ret 0
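
; second listing: a function prologue (the offsets here appear to be hexadecimal)
push rbp
mov rbp, rsp
sub rsp, 60 ; <-- allocating for alignment (hex 60 = 96 bytes, a multiple of 16 after push rbp)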
mov rax, qword ptr ss:[rbp+30]
mov qword ptr ss:[rbp-40], rax
mov qword ptr ss:[rbp+18], r9
mov qword ptr ss:[rbp+28], r8
mov qword ptr ss:[rbp+10], rdx
mov qword ptr ss:[rbp+20], rcx

But why is alignment needed? Basically so that no time is wasted reading unnecessary bytes. Here we want to read W, but the stack is misaligned, so we would have to read ‘0000’ and ‘0001’, then ‘0002’ and ‘0003’, and keep only ‘0001’ and ‘0002’.

+----+
|0000| B
|0001| W
+----+
|0002| W
|0003|
+----+

Of course this takes time, so to gain speed we spend space to get alignment:

+----+
|0000| B
|0001| -
+----+
|0002| W
|0003| W
+----+

Notice that one slot is now unused, but ‘W’ is aligned and no time is lost reading unneeded bytes.
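The two diagrams can be imitated with a toy memory that is only fetched in aligned 2-byte units. Everything here is hypothetical (the `fetch_unit` and `read_w` helpers, the 2-byte unit size); it just counts fetches to show the cost difference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fetch the aligned 2-byte unit that contains `addr`. */
static void fetch_unit(const uint8_t *mem, size_t addr, uint8_t out[2])
{
    size_t base = addr & ~(size_t)1; /* clear the low bit: unit start */
    out[0] = mem[base];
    out[1] = mem[base + 1];
}

/* Read the 2-byte value W stored at `addr` into `w` (memory order)
   and return how many aligned fetches were needed. */
unsigned read_w(const uint8_t *mem, size_t addr, uint8_t w[2])
{
    uint8_t unit[2];
    assert(mem != NULL && w != NULL);

    fetch_unit(mem, addr, unit);
    if ((addr & 1) == 0) {   /* aligned: W is exactly one unit */
        w[0] = unit[0];
        w[1] = unit[1];
        return 1;
    }
    w[0] = unit[1];          /* misaligned: keep the second byte ... */
    fetch_unit(mem, addr + 1, unit);
    w[1] = unit[0];          /* ... and the first byte of the next unit */
    return 2;
}
```

With W at address 1 (first diagram) `read_w` needs two fetches; with W at address 2 (second diagram) it needs one.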

STACK ARGUMENTS (Caller & Callee)

  • %ESP - Extended Stack Pointer register, whose purpose is to tell you where on the stack you are. As the stack grows downward in memory (from higher address values to lower address values), %ESP points to the lowest address currently in use, the top of the stack.

  • %EBP - Extended Base Pointer, pointing to the base address of the current stack frame. Local variables are accessed by subtracting offsets from %EBP and function parameters are accessed by adding offsets to it.

Note: Positive offsets from RBP access arguments passed on the stack. Negative offsets from RBP access local variables.

Callee:

Push the value of EBP onto the stack, and then copy the value of ESP into EBP using the following instructions:

push ebp
mov  ebp, esp

“This initial action maintains the base pointer, EBP. The base pointer is used by convention as a point of reference for finding parameters and local variables on the stack. When a subroutine is executing, the base pointer holds a copy of the stack pointer value from when the subroutine started executing. Parameters and local variables will always be located at known, constant offsets away from the base pointer value. We push the old base pointer value at the beginning of the subroutine so that we can later restore the appropriate base pointer value for the caller when the subroutine returns. Remember, the caller is not expecting the subroutine to change the value of the base pointer. We then move the stack pointer into EBP to obtain our point of reference for accessing parameters and local variables”.

“Remember that offsets from rbp are into our stack frame and offsets from rsp are placing arguments on the stack for the function call”

https://www.cs.virginia.edu/~evans/cs216/guides/x86.html
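The "known, constant offsets" from the quote can be spelled out for the 32-bit case. After ‘push ebp’ / ‘mov ebp, esp’, [ebp] holds the saved EBP and [ebp+4] the return address, so the first stack argument sits at [ebp+8]. The helper names below are made up, and the sketch assumes 4-byte arguments and 4-byte local slots placed directly below the saved EBP, which is a common layout but ultimately up to the compiler.

```c
#include <assert.h>

/* Offset from EBP of the i-th stack argument (0-based):
   [ebp+0] saved EBP, [ebp+4] return address, [ebp+8] first argument. */
int arg_offset_from_ebp(int i)
{
    assert(i >= 0);
    return 8 + 4 * i;
}

/* Offset from EBP of the i-th 4-byte local slot (0-based),
   assuming locals start directly below the saved EBP. */
int local_offset_from_ebp(int i)
{
    assert(i >= 0);
    return -4 * (i + 1);
}
```

So the third argument would be read from [ebp+16], and the first local from [ebp-4].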

Caller clean-up (cdecl)

Consider the following C source code snippet:

int callee(int, int, int);
 
int caller(void)
{
	return callee(1, 2, 3) + 5;
}

On x86, it might produce the following assembly code (Intel syntax):

caller:
    ; make new call frame
    ; (some compilers may produce an 'enter' instruction instead)
    push    ebp       ; save old call frame
    mov     ebp, esp  ; initialize new call frame
    ; push call arguments, in reverse
    ; (some compilers may subtract the required space from the stack pointer,
    ; then write each argument directly, see below.
    ; The 'enter' instruction can also do something similar)
    ; sub esp, 12      : 'enter' instruction could do this for us
    ; mov [ebp-4], 3   : or mov [esp+8], 3
    ; mov [ebp-8], 2   : or mov [esp+4], 2
    ; mov [ebp-12], 1  : or mov [esp], 1
    push    3
    push    2
    push    1
    call    callee    ; call subroutine 'callee'
    add     esp, 12   ; remove call arguments from frame
    add     eax, 5    ; modify subroutine result
                      ; (eax is the return value of our callee,
                      ; so we don't have to move it into a local variable)
    ; restore old call frame
    ; (some compilers may produce a 'leave' instruction instead)
    mov     esp, ebp  ; most calling conventions dictate ebp be callee-saved,
                      ; i.e. it's preserved after calling the callee.
                      ; it therefore still points to the start of our stack frame.
                      ; we do need to make sure
                      ; callee doesn't modify (or restores) ebp, though,
                      ; so we need to make sure
                      ; it uses a calling convention which does this
    pop     ebp       ; restore old call frame
    ret               ; return
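The caller-side bookkeeping in the listing above can be mimicked with a toy stack in C. This is a rough model, not how a compiler works: the `machine` struct, `stack_push`, and `callee_sum` are invented for this note, the stack is indexed in 4-byte words, the return address pushed by ‘call’ is not modeled, and the callee simply sums its arguments since the original does not define it.

```c
#include <assert.h>
#include <stdint.h>

enum { STACK_WORDS = 64 };

/* Toy machine: a stack of 4-byte words that grows downward. */
struct machine {
    int32_t stack[STACK_WORDS];
    int esp; /* index of the current top of stack */
};

void stack_push(struct machine *m, int32_t v)
{
    assert(m->esp > 0);
    m->stack[--m->esp] = v;
}

/* Hypothetical callee: reads three arguments from the stack and sums them. */
int32_t callee_sum(const struct machine *m)
{
    return m->stack[m->esp] + m->stack[m->esp + 1] + m->stack[m->esp + 2];
}

/* Mirrors the listing: push 3, 2, 1; call; add esp, 12; add eax, 5. */
int32_t caller(struct machine *m)
{
    stack_push(m, 3);
    stack_push(m, 2);
    stack_push(m, 1);
    int32_t eax = callee_sum(m);
    m->esp += 3; /* caller clean-up: drop three 4-byte arguments */
    return eax + 5;
}
```

After `caller` returns, `esp` is back where it started, which is the point of caller clean-up in cdecl.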
