Most of “raspvisor” is based on the s-matyukevich/raspberry-pi-os series, which is essentially a “how to develop a simple OS for the Raspberry Pi from scratch” guide. That series gives a deeper and more detailed treatment of the OS concepts and implementations that this hypervisor builds on.

The walkthrough starts with the boot.S file from “raspvisor”:

The .section and .globl directives are notes to the linker. The first says where this code belongs in the compiled binary: the linker script places the .text.boot section at the start of the kernel image. The second makes _start a name that is visible from outside the assembly file.

#include "arm/mmu.h"
#include "arm/sysregs.h"
#include "mm.h"
#include "peripherals/base.h"
 
.section ".text.boot" // Make sure the linker puts this at the start of the kernel image
 
.globl _start         // Execution starts here
 
_start:
  // Only the primary core (ID 0) continues; every other core hangs
  mrs x0, mpidr_el1
  and x0, x0, #0xFF    // Extract the core ID (Aff0 field)
  cbz x0, master       // Core 0 continues at master
  b proc_hang          // Any other core falls into the infinite wait loop
 
 
proc_hang:
  b proc_hang
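
As a rough C illustration of the check above (the helper names are hypothetical and not part of raspvisor), the low 8 bits of MPIDR_EL1 carry the core number, and only core 0 is allowed to continue:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: mirrors `and x0, x0, #0xFF` on an MPIDR_EL1 value. */
static inline uint64_t core_id(uint64_t mpidr_el1)
{
    return mpidr_el1 & 0xFF;
}

/* Mirrors `cbz x0, master`: only core 0 proceeds with the boot. */
static inline bool is_primary_core(uint64_t mpidr_el1)
{
    return core_id(mpidr_el1) == 0;
}
```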

An ARM processor resets into the highest implemented exception level, which here is EL3. The usual flow is to drop from EL3 to EL1, the level an OS kernel normally runs at (applications run at EL0). In this case, though, the guests run on top of the hypervisor, so the boot code first switches from EL3 to EL2 and starts the hypervisor there; EL1 is entered only later.

The constants SCTLR_VALUE_MMU_DISABLED, HCR_VALUE, SCR_VALUE and SPSR_VALUE are all defined in sysregs.h.

master:
  // Initial EL is 3. Change EL from 3 to 2
  ldr x0, =SCTLR_VALUE_MMU_DISABLED
  msr sctlr_el2, x0
 
  ldr x0, =HCR_VALUE
  msr hcr_el2, x0
 
  ldr x0, =SCR_VALUE
  msr scr_el3, x0
 
  ldr x0, =SPSR_VALUE
  msr spsr_el3, x0
 
  adr x0, el2_entry
  msr elr_el3, x0
 
  eret

WHERE:

  • INSTRUCTIONS:

  • REGISTERS:

sctlr_el2 (System Control Register) - Provides top level control of the system, including its memory system.

scr_el3 (Secure Configuration Register) - Defines the configuration of the current Security state. It specifies:

  • The Security state of EL0, EL1, and EL2. The Security state is either Secure or Non-secure.
  • The Execution state at lower Exception levels.
  • Whether IRQ, FIQ, SError interrupts, and External abort exceptions are taken to EL3.
  • Whether various operations are trapped to EL3.

spsr_el3 (Saved Program Status Register) - Holds the saved process state when an exception is taken to EL3.

elr_el3 (Exception Link Register) - When taking an exception to EL3, holds the address to return to.
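
To make the role of spsr_el3 more concrete, here is a minimal sketch of how a value that lands in EL2 after eret could be composed. The macro names are assumptions for illustration; the values raspvisor really uses are in sysregs.h and may differ.

```c
#include <stdint.h>

/* Assumed names; the real definitions live in sysregs.h. */
#define SPSR_MASK_AIF  (7u << 6)  /* A, I and F mask bits (bits 8..6)   */
#define SPSR_MODE_EL2H (9u << 0)  /* M[3:0] = 0b1001: EL2 using SP_EL2  */

/* Value to place in spsr_el3 so that eret enters EL2 with interrupts masked. */
static inline uint32_t spsr_for_el2h(void)
{
    return SPSR_MASK_AIF | SPSR_MODE_EL2H;
}

/* Exception level encoded in M[3:2] of an AArch64 SPSR value. */
static inline unsigned spsr_target_el(uint32_t spsr)
{
    return (spsr >> 2) & 0x3;
}
```

When eret executes, the processor restores its state from spsr_el3 and jumps to the address in elr_el3, which is why both are written just before the instruction.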

At the EL2 entry point, the BSS section is zeroed and the EL2 page tables are created so that the MMU can be enabled. At the end, the code jumps to the hypervisor_main() function:

el2_entry:
 
	// Clean the BSS section.
  adr x0, bss_begin     // Start address of section
  adr x1, bss_end       // End address of section
  sub x1, x1, x0        // Size of the section
  bl  memzero           // from "utils.S"
 
  bl  __create_page_tables
 
  mov x0, #VA_START
  add sp, x0, #LOW_MEMORY
	
	// Set the translation table base address
  adrp  x0, pg_dir
  msr ttbr0_el2, x0
	
	// Set the control register for stage 1 of the EL2
  ldr x0, =(TCR_VALUE)
  msr tcr_el2, x0
	
	// Set the Virtualization Translation Control Register (EL2)
  ldr x0, =(VTCR_VALUE)
  msr vtcr_el2, x0
	
	// Set the Memory Attribute Indirection Register
  ldr x0, =(MAIR_VALUE)
  msr mair_el2, x0
 
  // clear TLB
  tlbi alle1
 
  ldr x2, =hypervisor_main
  mov x0, #SCTLR_MMU_ENABLED
 
	
	// See the "WHERE" notes further down for the barrier instructions used here
  dsb ish
  isb
  msr sctlr_el2, x0   // enable MMU
  isb
 
  br  x2
 
/* Macros to facilitate Page Tables Entries creation */
 
.macro  create_pgd_entry, tbl, virt, tmp1, tmp2
  create_table_entry \tbl, \virt, PGD_SHIFT, \tmp1, \tmp2
  create_table_entry \tbl, \virt, PUD_SHIFT, \tmp1, \tmp2
.endm
 
.macro  create_table_entry, tbl, virt, shift, tmp1, tmp2
  lsr \tmp1, \virt, #\shift
  and \tmp1, \tmp1, #PTRS_PER_TABLE - 1           // table index
  add \tmp2, \tbl, #PAGE_SIZE
  orr \tmp2, \tmp2, #MM_TYPE_PAGE_TABLE
  str \tmp2, [\tbl, \tmp1, lsl #3]
  add \tbl, \tbl, #PAGE_SIZE                      // next level table page
.endm
 
.macro  create_block_map, tbl, phys, start, end, flags, tmp1
  lsr \start, \start, #SECTION_SHIFT
  and \start, \start, #PTRS_PER_TABLE - 1         // table index
  lsr \end, \end, #SECTION_SHIFT
  and \end, \end, #PTRS_PER_TABLE - 1             // table end index
  lsr \phys, \phys, #SECTION_SHIFT
  mov \tmp1, #\flags
  orr \phys, \tmp1, \phys, lsl #SECTION_SHIFT     // table entry
9999: str \phys, [\tbl, \start, lsl #3]           // store the entry
  add \start, \start, #1                          // next entry
  add \phys, \phys, #SECTION_SIZE                 // next block
  cmp \start, \end
  b.ls  9999b
.endm
 
/* Create Page Tables for MMU */
 
__create_page_tables:
  mov x29, x30                                   // save return address
 
  adrp  x0, pg_dir
  mov x1, #PG_DIR_SIZE
  bl  memzero
 
  adrp  x0, pg_dir
  mov x1, #VA_START
  create_pgd_entry x0, x1, x2, x3
 
  /* Mapping kernel and init stack */
 
  mov x1, xzr                                         // start mapping from physical offset 0
  mov x2, #VA_START                                   // first virtual address
  ldr x3, =(VA_START + DEVICE_BASE - SECTION_SIZE)    // last virtual address
  create_block_map x0, x1, x2, x3, MMU_FLAGS, x4
 
  /* Mapping device memory */
 
  mov   x1, #DEVICE_BASE                                       // start mapping from device base address
  ldr   x2, =(VA_START + DEVICE_BASE)                          // first virtual address
  ldr x3, =(VA_START + PHYS_MEMORY_SIZE - SECTION_SIZE)        // last virtual address
  create_block_map x0, x1, x2, x3, MMU_DEVICE_FLAGS, x4
 
  mov x30, x29     // restore return address
  ret
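
The index arithmetic in the macros above can be easier to follow in C. The sketch below assumes 4 KB pages and 2 MB sections (the real constants are in mm.h) and uses hypothetical function names; it mirrors the lsr/and pair and the create_block_map loop.

```c
#include <stdint.h>

/* Assumed values (4 KB pages, 2 MB sections); the real ones are in mm.h. */
#define PAGE_SHIFT     12
#define TABLE_SHIFT    9
#define SECTION_SHIFT  (PAGE_SHIFT + TABLE_SHIFT)
#define SECTION_SIZE   (1ULL << SECTION_SHIFT)
#define PTRS_PER_TABLE (1ULL << TABLE_SHIFT)

/* Same computation as the lsr/and pair in the macros. */
static inline uint64_t table_index(uint64_t virt, unsigned shift)
{
    return (virt >> shift) & (PTRS_PER_TABLE - 1);
}

/* C rendering of create_block_map: one 2 MB block entry per table slot. */
static void block_map(uint64_t *table, uint64_t phys,
                      uint64_t va_first, uint64_t va_last, uint64_t flags)
{
    uint64_t i = table_index(va_first, SECTION_SHIFT);
    uint64_t last = table_index(va_last, SECTION_SHIFT);
    uint64_t entry = ((phys >> SECTION_SHIFT) << SECTION_SHIFT) | flags;

    do {                        /* like the assembly, stores at least once */
        table[i++] = entry;
        entry += SECTION_SIZE;  /* next 2 MB physical block */
    } while (i <= last);
}
```

Each entry is simply the block's physical address with the attribute flags in the low bits, which is why the macro clears the low SECTION_SHIFT bits of the physical address before OR-ing the flags in.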

from utils.S:

.globl memzero
memzero:
  str xzr, [x0], #8    // xzr = special register to store 0 value
  subs x1, x1, #8
  b.gt memzero
  ret
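
For readers less used to AArch64 assembly, this is roughly what memzero does, written as a C sketch (the function name is made up for illustration):

```c
#include <stdint.h>

/* Sketch of memzero: n is the byte count, cleared 8 bytes at a time. */
static void memzero_sketch(uint64_t *dst, int64_t n)
{
    do {
        *dst++ = 0;   /* str xzr, [x0], #8 : store zero, then advance */
        n -= 8;       /* subs x1, x1, #8                              */
    } while (n > 0);  /* b.gt memzero                                 */
}
```

Because the store happens before the test, the routine always writes at least 8 bytes, and sizes are effectively rounded up to a multiple of 8.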

WHERE:

  • The code above contains some unusual instructions (‘dsb ish’ and ‘isb’). These are memory barrier instructions. They exist because modern processors do not necessarily fetch instructions, execute them and access memory in strict program order, as you might instinctively expect. For more details check this stackoverflow post, this blog post and this article.

Execution then continues in hypervisor_main():

void hypervisor_main() {
 
  uart_init();
  init_printf(NULL, putc);
  printf("=== raspvisor ===\n");
 
  init_task_console(current);
  init_initial_task();
  irq_vector_init();
  timer_init();
  disable_irq();
  enable_interrupt_controller();
 
	(...)
 
}

After setting up the UART and printf, hypervisor_main() creates the console for the current task and then initializes the primordial task in the task list.

  • The *task[NR_TASKS] array of task_struct pointers is used by the scheduler and is defined as a global variable in sched.c:
static struct task_struct init_task = INIT_TASK;
struct task_struct *current = &(init_task);
struct task_struct *task[NR_TASKS] = {
    &(init_task),
};

The task_struct is defined as:

struct task_struct {
  struct cpu_context cpu_context;
  long state;
  long counter;
  long priority;
  long preempt_count;
  long pid; // used as VMID
  unsigned long flags;
  const char *name;
  const struct board_ops *board_ops;
  void *board_data;
  struct mm_struct mm;
  struct cpu_sysregs cpu_sysregs;
  struct task_stat stat;
  struct task_console console;
};

The function init_task_console allocates the input and output FIFOs for the task's console (not relevant for understanding the hypervisor):

void init_task_console(struct task_struct *tsk) {
  tsk->console.in_fifo = create_fifo();
  tsk->console.out_fifo = create_fifo();
}

init_initial_task just sets the name of the primordial task to “IDLE”:

void init_initial_task() {
  task[0]->name = "IDLE";
}

irq_vector_init is written in assembly and loads the VBAR_EL1 register with the address of the vector table:

.globl irq_vector_init
irq_vector_init:
	adr	x0, vectors				  // load VBAR_EL1 with virtual
	msr	vbar_el1, x0				// vector table address
	ret
 

The vectors table is defined as:

.align	11
.globl vectors 
vectors:
	ventry	sync_invalid_el1t			// Synchronous EL1t
	ventry	irq_invalid_el1t			// IRQ EL1t
	ventry	fiq_invalid_el1t			// FIQ EL1t
	ventry	error_invalid_el1t		// Error EL1t
 
	ventry	sync_invalid_el1h			// Synchronous EL1h
	ventry	el1_irq					      // IRQ EL1h
	ventry	fiq_invalid_el1h			// FIQ EL1h
	ventry	error_invalid_el1h		// Error EL1h
 
	ventry	el0_sync				      // Synchronous 64-bit EL0
	ventry	el0_irq					      // IRQ 64-bit EL0
	ventry	fiq_invalid_el0_64		// FIQ 64-bit EL0
	ventry	error_invalid_el0_64	// Error 64-bit EL0
 
	ventry	sync_invalid_el0_32		// Synchronous 32-bit EL0
	ventry	irq_invalid_el0_32		// IRQ 32-bit EL0
	ventry	fiq_invalid_el0_32		// FIQ 32-bit EL0
	ventry	error_invalid_el0_32	// Error 32-bit EL0

All the entries on the vector table are declared using the macro ventry:

.macro	ventry	label
.align	7
	b	\label
.endm
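
The two .align directives fix the layout of the table: 16 slots of 128 bytes each, in a table aligned to 2 KB. As a small sketch (the enum and function names are hypothetical), the offset of a given slot from the table base can be computed like this:

```c
#include <stdint.h>

/* Row order matches the four groups of entries in the vectors table. */
enum vec_group { CUR_EL_SP0, CUR_EL_SPX, LOWER_EL_A64, LOWER_EL_A32 };
enum vec_type  { VEC_SYNC, VEC_IRQ, VEC_FIQ, VEC_SERROR };

#define VENTRY_SIZE 0x80u   /* .align 7  -> 128-byte slots        */
#define VTABLE_SIZE 0x800u  /* .align 11 -> table aligned to 2 KB */

/* Byte offset of a vector slot from the table base. */
static inline uint32_t vector_offset(enum vec_group g, enum vec_type t)
{
    return ((uint32_t)g * 4u + (uint32_t)t) * VENTRY_SIZE;
}
```

Since each slot has room for only 32 instructions, ventry just branches to the real handler.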

Most of the entries are placeholders, for example:

sync_invalid_el1t:
	handle_invalid_entry 1, SYNC_INVALID_EL1t
 
irq_invalid_el1t:
	handle_invalid_entry 1, IRQ_INVALID_EL1t
 
fiq_invalid_el1t:
	handle_invalid_entry 1, FIQ_INVALID_EL1t

The macro handle_invalid_entry prints a message based on the values in the EL1 Exception Syndrome Register (ESR_EL1) and the EL1 Exception Link Register (ELR_EL1):

.macro handle_invalid_entry el, type
	kernel_entry \el
	mov	x0, #\type
	mrs	x1, esr_el1
	mrs	x2, elr_el1
	bl	show_invalid_entry_message
	b	err_hang
.endm

In the first line, you can see that another macro is used: kernel_entry. We will discuss it shortly. Then we prepare three arguments and call show_invalid_entry_message. The first argument is the exception type, which can take one of these values. It tells us exactly which exception handler has been executed. The second argument is the most important one: it is the ESR, which stands for Exception Syndrome Register.

The macros kernel_entry and kernel_exit are also used. They mirror each other: kernel_entry saves the context before the exception is handled, and kernel_exit restores it afterwards. All the important register values are stored on the stack. This could be done with push/pop operations (as is common on x86 and x64), but the code uses stp/ldp with explicit offsets instead, for performance reasons, as explained in this commit:

The push/pop instructions can be suboptimal when saving/restoring large
amounts of data to/from the stack, for example on entry/exit from the
kernel. This is because:
 
  (1) They act on descending addresses (i.e. the newly decremented sp),
      which may defeat some hardware prefetchers
 
  (2) They introduce an implicit dependency between each instruction, as
      the sp has to be updated in order to resolve the address of the
      next access.
 
This patch removes the push/pop instructions from our kernel entry/exit
macros in favour of ldp/stp plus offset.

The two macros look like this:

.macro  kernel_entry
  sub sp, sp, #S_FRAME_SIZE
  stp x0, x1, [sp, #16 * 0]
  stp x2, x3, [sp, #16 * 1]
  stp x4, x5, [sp, #16 * 2]
  stp x6, x7, [sp, #16 * 3]
  stp x8, x9, [sp, #16 * 4]
  stp x10, x11, [sp, #16 * 5]
  stp x12, x13, [sp, #16 * 6]
  stp x14, x15, [sp, #16 * 7]
  stp x16, x17, [sp, #16 * 8]
  stp x18, x19, [sp, #16 * 9]
  stp x20, x21, [sp, #16 * 10]
  stp x22, x23, [sp, #16 * 11]
  stp x24, x25, [sp, #16 * 12]
  stp x26, x27, [sp, #16 * 13]
  stp x28, x29, [sp, #16 * 14]
 
  add x21, sp, #S_FRAME_SIZE
 
  mrs x22, elr_el2  // Save the address of the instruction that fire the exception
  mrs x23, spsr_el2
 
  stp x30, x21, [sp, #16 * 15]
  stp x22, x23, [sp, #16 * 16]
 
  bl vm_leaving_work
.endm
 
.macro  kernel_exit
  bl vm_entering_work
 
  ldp x30, x21, [sp, #16 * 15]
  ldp x22, x23, [sp, #16 * 16]
 
  msr elr_el2, x22 // Restore the address of the instruction that fire the exception
  msr spsr_el2, x23
 
  ldp x0, x1, [sp, #16 * 0]
  ldp x2, x3, [sp, #16 * 1]
  ldp x4, x5, [sp, #16 * 2]
  ldp x6, x7, [sp, #16 * 3]
  ldp x8, x9, [sp, #16 * 4]
  ldp x10, x11, [sp, #16 * 5]
  ldp x12, x13, [sp, #16 * 6]
  ldp x14, x15, [sp, #16 * 7]
  ldp x16, x17, [sp, #16 * 8]
  ldp x18, x19, [sp, #16 * 9]
  ldp x20, x21, [sp, #16 * 10]
  ldp x22, x23, [sp, #16 * 11]
  ldp x24, x25, [sp, #16 * 12]
  ldp x26, x27, [sp, #16 * 13]
  ldp x28, x29, [sp, #16 * 14]
  add sp, sp, #S_FRAME_SIZE
  eret
.endm

The second argument passed to show_invalid_entry_message is taken from the esr_el2 register, which is described on page 2431 of the AArch64-Reference-Manual. This register contains detailed information about what caused the exception. The third argument is important mostly for synchronous exceptions. Its value is taken from the already familiar elr_el2 register, which contains the address of the instruction that was being executed when the exception was generated.

For synchronous exceptions, this is also the instruction that causes the exception. After show_invalid_entry_message function prints all this information to the screen we put the processor in an infinite loop because there is not much else we can do.
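
The most useful part of the syndrome value is the exception class in its top bits. A minimal decoding sketch follows; the constant names follow the ones used in the assembly, while the helper functions are hypothetical.

```c
#include <stdint.h>

#define ESR_ELx_EC_SHIFT    26
#define ESR_ELx_EC_SVC64    0x15  /* SVC executed in AArch64 state     */
#define ESR_ELx_EC_DABT_LOW 0x24  /* data abort from a lower EL        */

/* Exception class: bits [31:26] of the syndrome register. */
static inline uint32_t esr_exception_class(uint64_t esr)
{
    return (uint32_t)(esr >> ESR_ELx_EC_SHIFT) & 0x3F;
}

/* Instruction specific syndrome: bits [24:0]. */
static inline uint32_t esr_iss(uint64_t esr)
{
    return (uint32_t)esr & 0x1FFFFFF;
}
```

A handler would typically switch on the exception class first and only then interpret the instruction specific bits, since their meaning depends on the class.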

After an exception handler finishes execution, we want all general purpose registers to have the same values they had before the exception was generated. If we don’t implement such functionality, an interrupt that has nothing to do with currently executing code, can influence the behavior of this code unpredictably. That’s why the first thing we must do after an exception is generated is to save the processor state. This is done in the kernel_entry macro. This macro is very simple: it just stores registers x0 - x30 to the stack.

There is also a corresponding macro kernel_exit, which is called after an exception handler finishes execution. kernel_exit restores processor state by copying back the values of x0 - x30 registers. It also executes eret instruction, which returns us back to normal execution flow. By the way, general purpose registers are not the only thing that needs to be saved before executing an exception handler, but it is enough for a simple kernel for now.
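
One way to picture what the kernel_entry listing builds on the stack is as a C struct. This is a hypothetical view inferred from the stp offsets in the macro, not a type taken from raspvisor:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the frame built by kernel_entry. */
struct exception_frame {
    uint64_t regs[31];  /* x0..x30; x30 sits at offset 16 * 15           */
    uint64_t sp;        /* stack pointer value before the frame was made */
    uint64_t elr;       /* saved elr_el2                                 */
    uint64_t spsr;      /* saved spsr_el2                                */
};

/* Smallest S_FRAME_SIZE that fits every stp in the macro (an inference). */
#define FRAME_SIZE_SKETCH sizeof(struct exception_frame)
```

Under this layout the frame is 272 bytes, and a C handler given the stack pointer could read the interrupted program counter from the elr field.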

The important entries are el1_irq, el0_irq and el0_sync. The first two branch to a handler written in C.

el1_irq:
	kernel_entry 1 
	bl	handle_irq
	kernel_exit 1 
 
el0_irq:
	kernel_entry 0 
	bl	handle_irq
	kernel_exit 0 
 
el0_sync:
	kernel_entry 0
	mrs	x25, esr_el1				          // read the syndrome register
	lsr	x24, x25, #ESR_ELx_EC_SHIFT		// exception class
	cmp	x24, #ESR_ELx_EC_SVC64			  // SVC in 64-bit state
	b.eq	el0_svc
	cmp	x24, #ESR_ELx_EC_DABT_LOW		  // data abort in EL0
	b.eq	el0_da
	handle_invalid_entry 0, SYNC_ERROR

The handle_irq function, written in C, checks which interrupt is pending:

void handle_irq(void)
{
	unsigned int irq = get32(IRQ_PENDING_1);
	switch (irq) {
		case (SYSTEM_TIMER_IRQ_1):
			handle_timer_irq();
			break;
		default:
			printf("Unknown pending irq: %x\r\n", irq);
	}
}

The next function called in hypervisor_main is timer_init(), defined in timer.c:

  • The Raspberry Pi system timer is a very simple device. It has a counter that increases its value by 1 after each clock tick. It also has 4 interrupt lines that connect to the interrupt controller (so it can generate 4 different interrupts) and 4 corresponding compare registers (TIMER_C0, TIMER_C1, TIMER_C2, TIMER_C3).
  • When the value of the counter becomes equal to the value stored in one of the compare registers, the corresponding interrupt is fired.
  • That’s why, before we can use system timer interrupts, we need to initialize one of the compare registers with a non-zero value. The larger the value, the later the interrupt will be generated. This is done in the timer_init function:
void timer_init(void) {
  put32(TIMER_C1, get32(TIMER_CLO) + interval);
}
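
The arithmetic in timer_init can be tried out without hardware. The sketch below uses a hypothetical helper that takes the counter value as a parameter instead of reading TIMER_CLO:

```c
#include <stdint.h>

/* Value to write to a compare register so the interrupt fires
 * `interval` ticks after the current counter value. */
static inline uint32_t next_compare(uint32_t counter_lo, uint32_t interval)
{
    return counter_lo + interval;  /* unsigned arithmetic wraps modulo 2^32 */
}
```

The wraparound is harmless here: the low counter word is 32 bits wide and wraps the same way, so the comparison still matches after `interval` ticks.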

Another thing we need to deal with is interrupt masking. Sometimes a particular piece of code must never be interrupted by an asynchronous interrupt. Imagine, for example, what happens if an interrupt occurs right in the middle of the kernel_entry macro: the processor state would be overwritten and lost. That’s why, whenever an exception handler is executed, the processor automatically disables all types of interrupts. This is called “masking”, and it can also be done manually when needed. hypervisor_main does exactly that by calling disable_irq, which sets the I bit of DAIF to mask IRQs:

.globl disable_irq
disable_irq:
	msr	daifset, #2
	ret
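
The immediate in `msr daifset, #2` is a 4-bit mask over the D, A, I and F flags, and #2 selects the I (IRQ) flag. The following sketch models the effect on a plain integer; the helper names are hypothetical.

```c
#include <stdint.h>

#define DAIF_SHIFT   6    /* D, A, I, F live in bits 9..6 of DAIF */
#define DAIF_IMM_IRQ 0x2  /* immediate bit that selects the I flag */

/* Model of `msr daifset, #imm4`: sets (masks) the selected flags. */
static inline uint64_t daif_set(uint64_t daif, unsigned imm4)
{
    return daif | ((uint64_t)(imm4 & 0xF) << DAIF_SHIFT);
}

/* Model of `msr daifclr, #imm4`: clears (unmasks) the selected flags. */
static inline uint64_t daif_clr(uint64_t daif, unsigned imm4)
{
    return daif & ~((uint64_t)(imm4 & 0xF) << DAIF_SHIFT);
}

/* Non-zero when IRQs are masked (I flag, bit 7). */
static inline int irq_masked(uint64_t daif)
{
    return (int)((daif >> 7) & 1);
}
```

An enable_irq counterpart would use daifclr with the same immediate.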

Devices usually don’t interrupt the processor directly: instead, they rely on an interrupt controller to do the job. The interrupt controller can be used to enable or disable interrupts sent by the hardware. The Raspberry Pi interrupt controller has 3 registers that hold the enabled/disabled status for all types of interrupts. For now, we are only interested in timer interrupts, and those can be enabled using the ENABLE_IRQS_1 register.

So here is the function that enables system timer IRQ number 1.

void enable_interrupt_controller()
{
	put32(ENABLE_IRQS_1, SYSTEM_TIMER_IRQ_1);
}
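
As a hardware-free sketch of the enable and pending registers, the model below uses a made-up struct in place of the memory-mapped registers and assumes SYSTEM_TIMER_IRQ_1 is bit 1. It also tests the pending bit with a mask rather than the switch used in handle_irq, a variation that keeps working when several interrupts are pending at once.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYSTEM_TIMER_IRQ_1 (1u << 1)  /* assumed bit position */

/* Stand-in for the controller registers; not the real MMIO layout. */
struct fake_intc {
    uint32_t enabled_1;  /* models the state behind ENABLE_IRQS_1 */
    uint32_t pending_1;  /* models IRQ_PENDING_1                  */
};

/* Writing a 1 bit enables that interrupt; 0 bits leave the others alone. */
static inline void intc_enable(struct fake_intc *c, uint32_t mask)
{
    c->enabled_1 |= mask;
}

/* True when the given interrupt is among the pending ones. */
static inline bool intc_is_pending(const struct fake_intc *c, uint32_t mask)
{
    return (c->pending_1 & mask) != 0;
}
```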
