Most of “raspvisor” is based on the s-matyukevich/raspberry-pi-os series, which is essentially a “how to develop a simple OS for the Raspberry Pi from scratch” tutorial. It gives a deeper, more detailed understanding of the OS concepts and of the implementation behind this hypervisor.
The first file we take from “raspvisor” is boot.S:
The .section and .globl directives are notes to the assembler and linker. The first one says where this code belongs in the compiled binary; we will specify where that is in a little bit. The second one makes _start a name that is visible from outside of the assembly file.
#include "arm/mmu.h"
#include "arm/sysregs.h"
#include "mm.h"
#include "peripherals/base.h"
.section ".text.boot" // Make sure the linker puts this at the start of the kernel image
.globl _start // Execution starts here
_start:
// Check processor ID is zero (executing on main core), else hang
mrs x0, mpidr_el1
and x0, x0, #0xFF // Check processor id
cbz x0, master // Primary CPU (id 0) branches to master
b proc_hang
// We're not on the main core, so hang in an infinite wait loop
proc_hang:
b proc_hang

An ARM processor resets into the highest implemented exception level, which here is EL3. The usual flow is to drop from EL3 to EL1 (the exception level of an ordinary OS kernel), but in this case all the “applications” (the guest OSes) run on top of the hypervisor, so the hypervisor is started at EL2 first: the boot code goes from EL3 to EL2, and only the guests later run at EL1.
The constants SCTLR_VALUE_MMU_DISABLED, HCR_VALUE, SCR_VALUE and SPSR_VALUE are all defined in sysregs.h (included at the top of boot.S as arm/sysregs.h).
master:
// Initial EL is 3. Change EL from 3 to 2
ldr x0, =SCTLR_VALUE_MMU_DISABLED
msr sctlr_el2, x0
ldr x0, =HCR_VALUE
msr hcr_el2, x0
ldr x0, =SCR_VALUE
msr scr_el3, x0
ldr x0, =SPSR_VALUE
msr spsr_el3, x0
adr x0, el2_entry
msr elr_el3, x0
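eret

WHERE:
INSTRUCTIONS:
- mrs / msr: read a system register into a general-purpose register / write a general-purpose register into a system register.
- ldr xN, =VALUE: assembler pseudo-instruction that loads a constant (or an address) into a register, using a literal pool when needed.
- adr: computes a PC-relative address, here the address of el2_entry.
- eret: exception return; restores PSTATE from spsr_el3 and jumps to the address held in elr_el3, which lands the CPU in EL2 at el2_entry.
REGISTERS: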
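sctlr_el2 (System Control Register, EL2) - Provides top-level control of the system at EL2, including its memory system. SCTLR_VALUE_MMU_DISABLED keeps the MMU off until the page tables are ready.
hcr_el2 (Hypervisor Configuration Register) - Provides configuration controls for virtualization, including which operations are trapped to EL2 and the execution state of EL1.
scr_el3 (Secure Configuration Register) - Defines the configuration of the current Security state. It controls:
- The Security state of EL0, EL1, and EL2. The Security state is either Secure or Non-secure.
- The Execution state at lower Exception levels.
- Whether IRQ, FIQ, SError interrupts, and External abort exceptions are taken to EL3.
- Whether various operations are trapped to EL3.
spsr_el3 (Saved Program Status Register, EL3) - Holds the saved process state when an exception is taken to EL3; on eret it becomes the process state of the level we return to (here EL2).
elr_el3 (Exception Link Register, EL3) - When taking an exception to EL3, holds the address to return to; eret jumps there (here el2_entry).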
At the EL2 entry point, the BSS section is zeroed, the stack pointer is set up, and the EL2 page tables are created so that the MMU can be enabled. At the end, the code jumps to the hypervisor_main() function:
el2_entry:
// Clean the BSS section.
adr x0, bss_begin // Start address of section
adr x1, bss_end // End address of section
sub x1, x1, x0 // Size of the section
bl memzero // from "utils.S"
bl __create_page_tables
mov x0, #VA_START
add sp, x0, #LOW_MEMORY
// Set the translation table base address
adrp x0, pg_dir
msr ttbr0_el2, x0
// Set the control register for stage 1 of the EL2
ldr x0, =(TCR_VALUE)
msr tcr_el2, x0
// Set the Virtualization Translation Control Register (EL2)
ldr x0, =(VTCR_VALUE)
msr vtcr_el2, x0
// Set the Memory Attribute Indirection Register
ldr x0, =(MAIR_VALUE)
msr mair_el2, x0
// clear TLB
tlbi alle1
ldr x2, =hypervisor_main
mov x0, #SCTLR_MMU_ENABLED
// See "WHERE" below to understand the barrier instructions that follow
dsb ish
isb
msr sctlr_el2, x0 // enable MMU
isb
br x2
/* Macros to facilitate Page Tables Entries creation */
.macro create_pgd_entry, tbl, virt, tmp1, tmp2
create_table_entry \tbl, \virt, PGD_SHIFT, \tmp1, \tmp2
create_table_entry \tbl, \virt, PUD_SHIFT, \tmp1, \tmp2
.endm
.macro create_table_entry, tbl, virt, shift, tmp1, tmp2
lsr \tmp1, \virt, #\shift
and \tmp1, \tmp1, #PTRS_PER_TABLE - 1 // table index
add \tmp2, \tbl, #PAGE_SIZE
orr \tmp2, \tmp2, #MM_TYPE_PAGE_TABLE
str \tmp2, [\tbl, \tmp1, lsl #3]
add \tbl, \tbl, #PAGE_SIZE // next level table page
.endm
.macro create_block_map, tbl, phys, start, end, flags, tmp1
lsr \start, \start, #SECTION_SHIFT
and \start, \start, #PTRS_PER_TABLE - 1 // table index
lsr \end, \end, #SECTION_SHIFT
and \end, \end, #PTRS_PER_TABLE - 1 // table end index
lsr \phys, \phys, #SECTION_SHIFT
mov \tmp1, #\flags
orr \phys, \tmp1, \phys, lsl #SECTION_SHIFT // table entry
9999: str \phys, [\tbl, \start, lsl #3] // store the entry
add \start, \start, #1 // next entry
add \phys, \phys, #SECTION_SIZE // next block
cmp \start, \end
b.ls 9999b
.endm
/* Create Page Tables for MMU */
__create_page_tables:
mov x29, x30 // save return address
adrp x0, pg_dir
mov x1, #PG_DIR_SIZE
bl memzero
adrp x0, pg_dir
mov x1, #VA_START
create_pgd_entry x0, x1, x2, x3
/* Mapping kernel and init stack */
mov x1, xzr // start mapping from physical offset 0
mov x2, #VA_START // first virtual address
ldr x3, =(VA_START + DEVICE_BASE - SECTION_SIZE) // last virtual address
create_block_map x0, x1, x2, x3, MMU_FLAGS, x4
/* Mapping device memory */
mov x1, #DEVICE_BASE // start mapping from device base address
ldr x2, =(VA_START + DEVICE_BASE) // first virtual address
ldr x3, =(VA_START + PHYS_MEMORY_SIZE - SECTION_SIZE) // last virtual address
create_block_map x0, x1, x2, x3, MMU_DEVICE_FLAGS, x4
mov x30, x29 // restore return address
ret

From utils.S:
.globl memzero
memzero:
str xzr, [x0], #8 // xzr = special register to store 0 value
subs x1, x1, #8
b.gt memzero
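ret

WHERE:
- In the code above you will find some unusual instructions (dsb, isb) and the ish operand. They are memory barrier instructions, and they exist because modern processors fetch instructions, execute them, and access memory out of order (not in the strictly sequential way you might intuitively expect). dsb ish (Data Synchronization Barrier, Inner Shareable domain) waits until all earlier memory accesses and the TLB invalidation have completed; isb (Instruction Synchronization Barrier) flushes the pipeline so that later instructions see the new system register state, such as the MMU being enabled. For more details check this stackoverflow post, this blog post and this article.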
void hypervisor_main() {
uart_init();
init_printf(NULL, putc);
printf("=== raspvisor ===\n");
init_task_console(current);
init_initial_task();
irq_vector_init();
timer_init();
disable_irq();
enable_interrupt_controller();
(...)
}

After setting up the UART and printf, hypervisor_main() first initializes the console of the current task and then the primordial task in the task list.
- The *task[NR_TASKS] array of task_struct pointers is used by the scheduler and is defined as a global variable in sched.c:
static struct task_struct init_task = INIT_TASK;
struct task_struct *current = &(init_task);
struct task_struct *task[NR_TASKS] = {
&(init_task),
};

The task_struct is defined as:
struct task_struct {
struct cpu_context cpu_context;
long state;
long counter;
long priority;
long preempt_count;
long pid; // used as VMID
unsigned long flags;
const char *name;
const struct board_ops *board_ops;
void *board_data;
struct mm_struct mm;
struct cpu_sysregs cpu_sysregs;
struct task_stat stat;
struct task_console console;
};

The function init_task_console allocates the buffers and structures for the console FIFOs (not relevant to understanding the hypervisor):
void init_task_console(struct task_struct *tsk) {
tsk->console.in_fifo = create_fifo();
tsk->console.out_fifo = create_fifo();
}

init_initial_task just sets the name of the primordial task to “IDLE”:
void init_initial_task() {
task[0]->name = "IDLE";
}

irq_vector_init is written in assembly and loads the VBAR_EL1 register with the address of the vector table:
.globl irq_vector_init
irq_vector_init:
adr x0, vectors // load VBAR_EL1 with virtual
msr vbar_el1, x0 // vector table address
ret
The vectors table is defined as:
.align 11
.globl vectors
vectors:
ventry sync_invalid_el1t // Synchronous EL1t
ventry irq_invalid_el1t // IRQ EL1t
ventry fiq_invalid_el1t // FIQ EL1t
ventry error_invalid_el1t // Error EL1t
ventry sync_invalid_el1h // Synchronous EL1h
ventry el1_irq // IRQ EL1h
ventry fiq_invalid_el1h // FIQ EL1h
ventry error_invalid_el1h // Error EL1h
ventry el0_sync // Synchronous 64-bit EL0
ventry el0_irq // IRQ 64-bit EL0
ventry fiq_invalid_el0_64 // FIQ 64-bit EL0
ventry error_invalid_el0_64 // Error 64-bit EL0
ventry sync_invalid_el0_32 // Synchronous 32-bit EL0
ventry irq_invalid_el0_32 // IRQ 32-bit EL0
ventry fiq_invalid_el0_32 // FIQ 32-bit EL0
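ventry error_invalid_el0_32 // Error 32-bit EL0

All the entries in the vector table are declared using the ventry macro. Each entry is aligned to 128 bytes (.align 7), the slot size the architecture reserves for every vector, and simply branches to its handler: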
.macro ventry label
.align 7
b \label
.endm

Most of the entries are placeholders for exceptions that should never happen, for example:
sync_invalid_el1t:
handle_invalid_entry 1, SYNC_INVALID_EL1t
irq_invalid_el1t:
handle_invalid_entry 1, IRQ_INVALID_EL1t
fiq_invalid_el1t:
handle_invalid_entry 1, FIQ_INVALID_EL1t

The handle_invalid_entry macro prints a message based on the values in the EL1 Exception Syndrome Register (ESR_EL1) and the EL1 Exception Link Register (ELR_EL1):
.macro handle_invalid_entry el, type
kernel_entry \el
mov x0, #\type
mrs x1, esr_el1
mrs x2, elr_el1
bl show_invalid_entry_message
b err_hang
.endm

In the first line, you can see that another macro is used: kernel_entry. We will discuss it shortly. Then we prepare 3 arguments and call show_invalid_entry_message. The first argument is the exception type, which can take one of these values; it tells us exactly which exception handler has been executed. The second parameter is the most important one: it is called ESR, which stands for Exception Syndrome Register.
The kernel_entry and kernel_exit macros are also used. They mirror each other: kernel_entry saves the context before the exception is handled, and kernel_exit restores it afterwards. All the important register values are stored on the stack. This could be done with push/pop sequences (like the ones we are used to seeing on x86 and x64), but stp/ldp with explicit offsets are used instead for performance reasons, as explained in this commit:
The push/pop instructions can be suboptimal when saving/restoring large
amounts of data to/from the stack, for example on entry/exit from the
kernel. This is because:
(1) They act on descending addresses (i.e. the newly decremented sp),
which may defeat some hardware prefetchers
(2) They introduce an implicit dependency between each instruction, as
the sp has to be updated in order to resolve the address of the
next access.
This patch removes the push/pop instructions from our kernel entry/exit
macros in favour of ldp/stp plus offset.

.macro kernel_entry, el
sub sp, sp, #S_FRAME_SIZE
stp x0, x1, [sp, #16 * 0]
stp x2, x3, [sp, #16 * 1]
stp x4, x5, [sp, #16 * 2]
stp x6, x7, [sp, #16 * 3]
stp x8, x9, [sp, #16 * 4]
stp x10, x11, [sp, #16 * 5]
stp x12, x13, [sp, #16 * 6]
stp x14, x15, [sp, #16 * 7]
stp x16, x17, [sp, #16 * 8]
stp x18, x19, [sp, #16 * 9]
stp x20, x21, [sp, #16 * 10]
stp x22, x23, [sp, #16 * 11]
stp x24, x25, [sp, #16 * 12]
stp x26, x27, [sp, #16 * 13]
stp x28, x29, [sp, #16 * 14]
add x21, sp, #S_FRAME_SIZE
mrs x22, elr_el2 // Save the exception return address
mrs x23, spsr_el2
stp x30, x21, [sp, #16 * 15]
stp x22, x23, [sp, #16 * 16]
bl vm_leaving_work
.endm
.macro kernel_exit, el
bl vm_entering_work
ldp x30, x21, [sp, #16 * 15]
ldp x22, x23, [sp, #16 * 16]
msr elr_el2, x22 // Restore the exception return address
msr spsr_el2, x23
ldp x0, x1, [sp, #16 * 0]
ldp x2, x3, [sp, #16 * 1]
ldp x4, x5, [sp, #16 * 2]
ldp x6, x7, [sp, #16 * 3]
ldp x8, x9, [sp, #16 * 4]
ldp x10, x11, [sp, #16 * 5]
ldp x12, x13, [sp, #16 * 6]
ldp x14, x15, [sp, #16 * 7]
ldp x16, x17, [sp, #16 * 8]
ldp x18, x19, [sp, #16 * 9]
ldp x20, x21, [sp, #16 * 10]
ldp x22, x23, [sp, #16 * 11]
ldp x24, x25, [sp, #16 * 12]
ldp x26, x27, [sp, #16 * 13]
ldp x28, x29, [sp, #16 * 14]
add sp, sp, #S_FRAME_SIZE
eret
.endm

Back to show_invalid_entry_message: its second argument, ESR, is taken from the esr_el1 register, which is described on page 2431 of the AArch64 Reference Manual. This register contains detailed information about what caused an exception. The third argument matters mostly for synchronous exceptions. Its value is taken from the elr_el1 register, which contains the address of the instruction that was being executed when the exception was generated.
For synchronous exceptions, this is also the instruction that caused the exception. After show_invalid_entry_message prints all this information to the screen, we put the processor in an infinite loop (err_hang) because there is not much else we can do.
After an exception handler finishes execution, we want all general-purpose registers to have the same values they had before the exception was generated. Without this, an interrupt that has nothing to do with the currently executing code could change that code's behavior unpredictably. That's why the first thing we must do after an exception is generated is to save the processor state. This is done in the kernel_entry macro: it stores registers x0 - x30 on the stack, and this version also saves the original stack pointer, elr_el2 and spsr_el2 before calling vm_leaving_work.
There is also a corresponding macro kernel_exit, which is called after an exception handler finishes execution. kernel_exit restores processor state by copying back the values of x0 - x30 registers. It also executes eret instruction, which returns us back to normal execution flow. By the way, general purpose registers are not the only thing that needs to be saved before executing an exception handler, but it is enough for a simple kernel for now.
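The important entries are el1_irq, el0_irq and el0_sync. The first two branch to handle_irq, a handler written in C; el0_sync reads the exception class from the syndrome register and dispatches system calls (el0_svc) and data aborts (el0_da).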
el1_irq:
kernel_entry 1
bl handle_irq
kernel_exit 1
el0_irq:
kernel_entry 0
bl handle_irq
kernel_exit 0
el0_sync:
kernel_entry 0
mrs x25, esr_el1 // read the syndrome register
lsr x24, x25, #ESR_ELx_EC_SHIFT // exception class
cmp x24, #ESR_ELx_EC_SVC64 // SVC in 64-bit state
b.eq el0_svc
cmp x24, #ESR_ELx_EC_DABT_LOW // data abort in EL0
b.eq el0_da
handle_invalid_entry 0, SYNC_ERROR

void handle_irq(void)
{
unsigned int irq = get32(IRQ_PENDING_1);
switch (irq) {
case (SYSTEM_TIMER_IRQ_1):
handle_timer_irq();
break;
default:
printf("Unknown pending irq: %x\r\n", irq);
}
}

The next function called in hypervisor_main() is timer_init(), defined in timer.c:
- The Raspberry Pi system timer is a very simple device. It has a counter that increases by 1 on each clock tick. It also has 4 interrupt lines connected to the interrupt controller (so it can generate 4 different interrupts) and 4 corresponding compare registers (TIMER_C0, TIMER_C1, TIMER_C2, TIMER_C3).
- When the counter value becomes equal to the value stored in one of the compare registers, the corresponding interrupt is fired.
- That's why, before we can use system timer interrupts, we need to initialize one of the compare registers with a non-zero value; the larger the value, the later the interrupt is generated. This is done in the timer_init function:
void timer_init(void) {
put32(TIMER_C1, get32(TIMER_CLO) + interval);
}

Next, hypervisor_main() calls disable_irq(). Let me explain what “masking” an interrupt means. Sometimes a particular piece of code must never be interrupted by an asynchronous interrupt. Imagine, for example, what would happen if an interrupt occurred right in the middle of the kernel_entry macro: the processor state would be overwritten and lost. That's why, whenever an exception handler is executed, the processor automatically masks all types of interrupts. This is called “masking”, and it can also be done manually: disable_irq sets the I bit of the DAIF mask (daifset, #2), which masks IRQs.
.globl disable_irq
disable_irq:
msr daifset, #2
ret

Devices usually don't interrupt the processor directly; instead, they rely on the interrupt controller to do the job. The interrupt controller can be used to enable or disable interrupts sent by the hardware. The Raspberry Pi interrupt controller has 3 registers that hold the enabled/disabled status for all types of interrupts. For now, we are only interested in timer interrupts, and those can be enabled using the ENABLE_IRQS_1 register.
So here is the function that enables system timer IRQ number 1.
void enable_interrupt_controller()
{
put32(ENABLE_IRQS_1, SYSTEM_TIMER_IRQ_1);
}