I recently bought myself an STM32 Nucleo microcontroller board to play around with. What fascinated me was how much more flexible things are at this level, how much more you can do yourself. With an ESP32 that’s not really the case, you’re always tied to ESP-IDF or some other framework.
I started with a first simple example, the kind everyone knows and has done before: the famous Hello World. It’s simple, and that’s exactly why it’s useful. You don’t learn a language’s syntax with it, it’s too small for that. You learn how to actually use the language. What file format, how to compile, how to link, how to run the result.
That’s why I always reach for Hello World first whenever I pick up a new language or a new environment. It forces me to run the whole build system once before I do anything else. Two things I learned this way that weren’t obvious to me at the start. First: you can learn a surprising amount from a simple example if you take it seriously. Second: simple is almost never really simple. Most things that look easy are easy because someone else hid the complexity for you.
The embedded world has a Hello World too: the blinking LED. Most microcontroller boards have an onboard LED that you turn on and off at some frequency. Sounds trivial. And it is, if you use a Hardware Abstraction Layer (HAL) and some ready-made project template.
I’ve been working in automotive software for years, and before that on physics simulations at university. In my head I’d always been “close to the hardware”, I write embedded software after all, not frontend. A while back I wrote about the roles of C, C++, and Rust in automotive and quietly took for granted that “embedded equals close to the hardware”. At some point it hit me that this was a delusion. MCAL, AUTOSAR OS, RTE: there are more layers between me and the silicon than between a web app and the kernel. I wanted to actually get down to the bottom for once. No HAL, no framework, no vendor black box. Just the reference manual and the compiler.
A Blinky in Rust
In the Rust ecosystem the example quickly ends up looking like this:
#![no_std]
#![no_main]

use cortex_m_rt::entry;
use panic_halt as _;
use stm32f4xx_hal::{pac, prelude::*};

#[entry]
fn main() -> ! {
    let dp = pac::Peripherals::take().unwrap();
    let rcc = dp.RCC.constrain();
    let clocks = rcc.cfgr.sysclk(48.MHz()).freeze();
    let gpiob = dp.GPIOB.split();
    let mut led1 = gpiob.pb0.into_push_pull_output();
    let mut led2 = gpiob.pb7.into_push_pull_output();
    let mut delay = dp.TIM1.delay_ms(&clocks);

    loop {
        led1.toggle();
        delay.delay_ms(400u32);
        led2.toggle();
        delay.delay_ms(100u32);
    }
}
It’s short, type-safe, and it works. The clock config goes through a builder pattern. Pin types carry their configuration in the type system, so toggle() on an input pin is a compile error. The delay runs on a hardware timer (TIM1) that the HAL sets up from the frozen clock configuration. And everything you don’t see, the vector table, the reset handler, the copy loop for .data, the zeroing of .bss, all of that comes from the cortex-m-rt crate. The linker just gets a small memory.x that tells it where flash and RAM are.
That’s exactly the problem I wanted to dig into. Not in the “the HAL is bad” sense (I actually came away appreciating it more), but: I wanted to see what the HAL does for me. So the same thing again, but in C, no HAL, no CMSIS, just register addresses straight out of the reference manual.
Hello World, embedded
The board I picked was the Nucleo-F446ZE, and I started reading the docs (Reference Manual RM0390, chapter 6 for RCC and chapter 8 for GPIO).
The blinky itself is quickly explained. Enable the GPIOB clock, configure PB0 as an output, then toggle the output register in a loop. In C, with no abstraction, it looks like this. First the registers as macros, then main():
#include <stdbool.h>
#include <stdint.h>

/* Peripheral base addresses and register offsets from the reference manual */
#define RCC_BASE 0x40023800UL
#define GPIOB_BASE 0x40020400UL
#define RCC_AHB1ENR (*(volatile uint32_t*)(RCC_BASE + 0x30UL))
#define GPIOB_MODER (*(volatile uint32_t*)(GPIOB_BASE + 0x00UL))
#define GPIOB_ODR (*(volatile uint32_t*)(GPIOB_BASE + 0x14UL))
#define RCC_AHB1ENR_GPIOBEN (1UL << 1)
#define LED_PIN 0U

/* Crude busy-wait; the duration is not calibrated */
static void delay(volatile uint32_t n) {
    while (n--) {
        __asm__ volatile("nop");
    }
}

int main(void) {
    RCC_AHB1ENR |= RCC_AHB1ENR_GPIOBEN;      /* enable the GPIOB clock */
    GPIOB_MODER &= ~(3UL << (LED_PIN * 2));  /* clear the PB0 mode bits */
    GPIOB_MODER |= (1UL << (LED_PIN * 2));   /* 01 = general purpose output */

    while (true) {
        GPIOB_ODR ^= (1UL << LED_PIN);       /* toggle PB0 */
        delay(500000);
    }
}
Three things about this code need explaining.
First: *(volatile uint32_t*)(...). That’s memory-mapped I/O in its purest form. The hardware exposes certain addresses that don’t point to ordinary RAM cells, but to registers of the peripherals. Writing to RCC_AHB1ENR doesn’t mean “write into a memory cell”, it means “tell the RCC block which clocks to enable”. The volatile cast isn’t a style choice, it’s mandatory. Without volatile, the compiler is free to merge, reorder, or drop accesses it considers redundant, for example by treating repeated writes as dead stores, and the blinky may silently do nothing. volatile is the contract with the compiler: “hands off, every access has a side effect you can’t see.”
Second: initializing GPIOB_MODER. I clear the two mode bits for PB0 first, then set them to 01 (General Purpose Output). Read-modify-write with &= and |=, so that other pins in the same register stay untouched. On Cortex-M, by the way, this is not atomic, that’s three instructions (LDR, ORR/BIC, STR), and an ISR could fire in between. It works here because no interrupts are active during init. If you actually need atomicity, you use the bit-band region (where available, it’s gone on the Cortex-M7) or LDREX/STREX. For pure set-or-clear on GPIO output pins there’s also the BSRR register, which is specifically designed to let you set or reset individual bits atomically in one write, no read-modify-write required.
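The BSRR route mentioned above can be sketched like this. The offset 0x18 and the split into a set half (bits 0 to 15) and a reset half (bits 16 to 31) follow the GPIO register description in the reference manual. The helper names gpio_set and gpio_reset are made up for this example, and the register is passed in as a pointer so the same code can also be tried against an ordinary variable on a PC.

```c
#include <stdint.h>

#define GPIOB_BASE 0x40020400UL
/* BSRR sits at offset 0x18; on the target, pass GPIOB_BSRR to the helpers. */
#define GPIOB_BSRR ((volatile uint32_t*)(GPIOB_BASE + 0x18UL))

/* Drive a pin high: writing 1 to bit `pin` (0..15) sets the matching ODR bit. */
void gpio_set(volatile uint32_t* bsrr, unsigned pin) {
    *bsrr = (uint32_t)1 << pin;
}

/* Drive a pin low: writing 1 to bit `pin + 16` resets the matching ODR bit. */
void gpio_reset(volatile uint32_t* bsrr, unsigned pin) {
    *bsrr = (uint32_t)1 << (pin + 16U);
}
```

Zero bits in the written value leave their pins alone, which is why a single store is enough and no read is needed. A toggle still has to go through ODR, because BSRR can only set or reset.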
Third: delay(). The combination of volatile on the parameter and the explicit nop isn’t decoration. Without volatile, the optimizer can see that the loop has no observable effect and, depending on the optimization level, is free to remove it entirely. The nop gives the loop body something the compiler has to emit on every iteration. Together they force the loop to actually run. Reading delay(500000) as “500 ms at 16 MHz” would be wishful thinking, since the real duration depends on the optimizer, flash wait states, and the pipeline. For a blinky that’s fine, in production you’d use SysTick.
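To give an idea of what the SysTick route involves: SysTick is a 24-bit down counter clocked from the core, so the main piece of arithmetic is the reload value for a given period. The sketch below only does that calculation. The function name systick_reload is hypothetical, and the 16 MHz in the usage note assumes the chip is still running on its default internal oscillator after reset.

```c
#include <stdint.h>

/* SysTick is a 24-bit down counter, so the reload value must fit in 24 bits. */
#define SYSTICK_MAX_RELOAD 0x00FFFFFFUL

/*
 * Reload value for one period of `ms` milliseconds at a core clock of
 * `core_hz`. The counter runs from the reload value down to 0, which takes
 * reload + 1 cycles, hence the "- 1".
 * Returns 0 if the period is too short to be useful or does not fit
 * into 24 bits.
 */
uint32_t systick_reload(uint32_t core_hz, uint32_t ms) {
    uint64_t cycles = (uint64_t)core_hz / 1000U * ms;
    if (cycles < 2U || cycles - 1U > SYSTICK_MAX_RELOAD) {
        return 0U;
    }
    return (uint32_t)(cycles - 1U);
}
```

At 16 MHz a 1 ms tick gives a reload value of 15999, and 500 ms still fits into the counter. Two seconds would not, so longer delays are usually built by counting 1 ms ticks.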
So much for the functionality. The really interesting question isn’t what’s in main(), it’s: how does main() ever get called in the first place? On a PC the operating system does that. On a microcontroller there is no operating system, no loader, no process, nothing that reads in code, allocates memory, or prepares a runtime. Someone has to do all of this by hand. That’s where it got interesting for me.
The hardware doesn’t know about main()
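When the ARM Cortex-M4 in the STM32 powers on, it does something very concrete. It reads the first 4 bytes of the vector table and loads them as the initial stack pointer. Then it reads the next 4 bytes, interprets them as an address, and jumps there. Architecturally the table starts at address 0x00000000 after reset; with the default boot configuration the STM32 aliases main flash to that address, so the two words physically come from 0x08000000 and 0x08000004. That’s not a software instruction, that’s circuit logic, set in silicon. Everything that happens after that is software.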
One detail that can cost you hours if you don’t know it: bit 0 of the reset vector address has to be set. The Cortex-M4 only knows the Thumb instruction set, and the CPU uses bit 0 of the jump address as a mode bit. If it’s zero, you get a HardFault right after reset. The linker usually takes care of this for you, but anyone who builds the vector table by hand and has to cast a function pointer symbol themselves will learn this one the hard way.
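The reset fetch and the Thumb bit can be modeled in a few lines of host-side C. This is a simplified model for illustration, not how the core is implemented: the struct reset_state and the function names are invented for this sketch, and the image is just a byte array standing in for the start of flash.

```c
#include <stdbool.h>
#include <stdint.h>

/* What the core holds right after reset, in this simplified model. */
struct reset_state {
    uint32_t sp;    /* initial stack pointer, from vector table word 0 */
    uint32_t pc;    /* first instruction address: word 1 with bit 0 cleared */
    bool     thumb; /* bit 0 of word 1; has to be 1 on a Cortex-M */
};

/* Read one little-endian 32-bit word from a byte image. */
uint32_t read_le32(const uint8_t* p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Model of the reset fetch; `image` is the start of flash (at least 8 bytes). */
struct reset_state simulate_reset(const uint8_t* image) {
    struct reset_state s;
    uint32_t reset_vector = read_le32(image + 4);
    s.sp    = read_le32(image);
    s.thumb = (reset_vector & 1U) != 0U;
    s.pc    = reset_vector & ~(uint32_t)1U;
    return s;
}
```

With an image that starts with the words 0x20020000 and 0x08000189, the model yields a stack pointer at the top of a 128K RAM, a program counter of 0x08000188, and thumb set. A reset vector of 0x08000188 would leave thumb cleared, which is the case that faults on real hardware.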
Which gives us a clear requirement: at address 0x08000000 exactly the right thing has to be sitting there. This structure is called the vector table, and it’s really just an array of function pointers. First entry is the stack pointer (cast as a function pointer, the hardware doesn’t care about the type, it just reads 4 bytes). Second entry is the address of the reset handler. After that come NMI, HardFault, and the other handlers. On an interrupt, the hardware looks into this table, reads the address, jumps there. It’s a hardware jump table, not a software dispatch.
In code, heavily shortened, it looks like this. The full table also has MemManage, BusFault, UsageFault, SVCall, PendSV, SysTick, and then the roughly 90 STM32-specific IRQs:
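extern uint32_t _estack;   /* linker symbol: initial top of stack */
void Reset_Handler(void);
void Default_Handler(void);

__attribute__((section(".isr_vector")))
void (*const vector_table[])(void) = {
    (void (*)(void))(&_estack), /* initial stack pointer */
    Reset_Handler,
    Default_Handler, /* NMI */
    Default_Handler, /* HardFault */
};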
The section(".isr_vector") attribute matters. It tells the compiler: this data belongs in a specially named section. Where that section ends up in memory, though, isn’t decided here. That was the first moment I realized the compiler and the hardware don’t talk to each other directly. Something’s missing in between.
Since the Cortex-M4 is a licensed ARM core, the mechanism isn’t ST-specific. It works the same way on Cortex-M parts from NXP, Microchip, or TI; only the memory map and the vendor interrupts differ. Once you’ve understood it once, you can dive right in on a different board.
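The part of the table that is the same on every Cortex-M3/M4 can be written down as a list of slot numbers. The positions follow the ARMv7-M exception model; the enum and function names are made up for this sketch and are not the CMSIS names.

```c
#include <stdint.h>

/* Word index of each entry in an ARMv7-M vector table. */
enum vector_slot {
    VEC_INITIAL_SP = 0,
    VEC_RESET      = 1,
    VEC_NMI        = 2,
    VEC_HARDFAULT  = 3,
    VEC_MEMMANAGE  = 4,
    VEC_BUSFAULT   = 5,
    VEC_USAGEFAULT = 6,
    /* 7..10 are reserved */
    VEC_SVCALL     = 11,
    VEC_DEBUGMON   = 12,
    /* 13 is reserved */
    VEC_PENDSV     = 14,
    VEC_SYSTICK    = 15,
    VEC_IRQ0       = 16  /* first vendor-specific interrupt */
};

/* Byte offset of a slot from the start of the table (4 bytes per entry). */
uint32_t vector_offset(unsigned slot) {
    return (uint32_t)slot * 4U;
}

/* Slot of vendor interrupt number `irq` (IRQ0 is the first one). */
unsigned irq_slot(unsigned irq) {
    return (unsigned)VEC_IRQ0 + irq;
}
```

Everything up to slot 15 is fixed by ARM. What each IRQ from slot 16 onward means is the vendor’s business and is listed in the interrupt chapter of the respective reference manual.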
Sections floating in nothing
The STM32 has two memory regions that matter here. Flash, non-volatile, starting at 0x08000000. RAM, volatile, starting at 0x20000000. Both live in the same 32-bit address space. From the CPU’s point of view both regions are equally addressable; which addresses point to flash and which to RAM is decided by how the chip is wired.
The C compiler knows none of this. It takes main.c, produces machine code, puts it into a section called .text. Constants go into .rodata, initialized variables into .data, uninitialized variables into .bss. These are all just names. The compiler has no idea that .text is supposed to end up in flash later and .bss in RAM. It doesn’t even know that flash and RAM exist. The sections have no absolute addresses. They’re just floating in nothing.
So someone has to decide which section ends up at which physical address. That’s the job of the linker script.
The linker script is the floor plan
A linker script is a text file with a .ld extension. It describes two things: which memory regions exist, and which section goes where.
The line
ENTRY(Reset_Handler)
tells the linker where the entry point is.
The MEMORY block lists the physical regions. The numbers come straight from the chip’s datasheet:
MEMORY
{
    FLASH (rx)  : ORIGIN = 0x08000000, LENGTH = 512K
    RAM   (xrw) : ORIGIN = 0x20000000, LENGTH = 128K
}
In the SECTIONS block every section gets assigned to a region. .isr_vector goes at the beginning of flash, because that’s where the hardware reads its first 4 bytes. The KEEP(*(.isr_vector)) keeps the linker from throwing the vector table away, since nothing in the C code explicitly references the vector_table symbol.
.text and .rodata go into flash, because they’re supposed to be non-volatile. .bss goes into RAM, because it gets filled with zeros at runtime.
My favorite part is .data. These are variables with an initial value: int baud_rate = 9600;. At runtime they need to live in RAM, otherwise they aren’t writable. But the initial value has to be stored somewhere before the board ever gets power. So the initial value has to live in flash and get copied into RAM at startup.
The linker script solves this by giving .data two addresses. A virtual address (VMA) in RAM, that’s the address the code expects the variable at. And a load address (LMA) in flash, that’s where the initial values physically sit. Who actually does the copying, the linker script doesn’t say. It only emits boundary markers as symbols: _sidata (start of the initial values in flash), _sdata and _edata (start and end in RAM), _sbss and _ebss for the .bss section.
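A SECTIONS block matching this description could look roughly like the following. It is a sketch, not necessarily the script used for this blinky: the symbol names are the ones from the text and from the vector table, while the input-section patterns, the ALIGN(4) statements, and placing _estack at the end of RAM are assumptions based on common GNU ld usage.

```ld
SECTIONS
{
    /* Vector table first, so it ends up at the very start of flash */
    .isr_vector :
    {
        KEEP(*(.isr_vector))
    } > FLASH

    /* Code and constants stay in flash */
    .text :
    {
        *(.text*)
        *(.rodata*)
        . = ALIGN(4);
    } > FLASH

    /* Load address (LMA) of .data: where the initial values sit in flash */
    _sidata = LOADADDR(.data);

    /* .data runs from RAM (VMA) but is stored in flash (LMA) */
    .data :
    {
        _sdata = .;
        *(.data*)
        . = ALIGN(4);
        _edata = .;
    } > RAM AT > FLASH

    /* .bss only needs RAM; there is nothing to store in flash */
    .bss :
    {
        _sbss = .;
        *(.bss*)
        *(COMMON)
        . = ALIGN(4);
        _ebss = .;
    } > RAM

    /* Initial stack pointer: one past the end of RAM, stack grows down */
    _estack = ORIGIN(RAM) + LENGTH(RAM);
}
```

The "> RAM AT > FLASH" on .data is the line that creates the two addresses: the first region sets the VMA, the second the LMA.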
These symbols aren’t variables in the usual sense. They don’t occupy any memory. They’re just numbers that the linker stamps in at the end. In C you access them by declaring them extern and then taking their address:
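extern uint32_t _sidata; /* start of the .data initial values in flash */
extern uint32_t _sdata;  /* start of .data in RAM */
extern uint32_t _edata;  /* end of .data in RAM */
extern uint32_t _sbss;   /* start of .bss in RAM */
extern uint32_t _ebss;   /* end of .bss in RAM */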
That confused me for a minute, because they look like variables but aren’t. You never read the value of _sdata, you always use &_sdata. The “variable” is its own address.
The startup code is a mini OS
The linker script sets the boundaries. Filling everything in is the job of the startup code. Traditionally a file called startup.s in assembly, nowadays often startup.c or inline in the same file. I kept mine inline in the blinky, for maximum transparency.
The startup code is the reset handler. It does three things. It copies .data from flash to RAM so that initialized variables have their initial values. It fills .bss with zeros so that the C guarantee of zero-initialized static variables holds. Then it calls main(). With C++ there’s a fourth step: calling global constructors, which are listed in a section called .init_array (bounded by the symbols __init_array_start and __init_array_end). In an AUTOSAR project this exact work lives inside the supplier’s startup code and runs before EcuM_Init ever sees a register.
In the blinky it looks like this:
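int main(void);

void Reset_Handler(void) {
    /* Copy the initial values of .data from flash to RAM */
    const uint32_t* src = &_sidata;
    uint32_t* dst = &_sdata;
    while (dst < &_edata) {
        *dst++ = *src++;
    }

    /* Fill .bss with zeros */
    dst = &_sbss;
    while (dst < &_ebss) {
        *dst++ = 0;
    }

    main();

    /* Trap here if main() ever returns */
    while (true) { }
}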
Two things stood out to me when I wrote this for the first time.
First: the startup code uses the linker symbols directly as loop bounds. That’s the contract between the two files. The linker script promises that the symbols are there and point to the right addresses. The startup code just trusts it blindly. If you rename a symbol on one side only, the link at least fails with an undefined reference. If a symbol exists but marks the wrong spot, the startup code breaks silently. Nothing warns you. As an aside: the copy loop runs over uint32_t*, not uint8_t*. That’s faster, one word per bus transaction instead of four separate byte accesses, and it works because the linker script aligns the sections to 4 bytes. Unaligned 32-bit accesses can trigger a fault on Cortex-M depending on the configuration.
Second: the while (true) {} after main(). On a PC, main() returns to the operating system. Here there is no operating system. If main() accidentally returns, the processor has to go somewhere. The infinite loop is insurance against it running wild through memory.
The only reason the startup code can run at all, before the C environment is set up, is that it uses only stack-local variables. And the stack already works because the hardware loaded the stack pointer from the vector table at reset. The whole sequence is a domino chain. Each step has exactly the one precondition the previous step just created.
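The two loops of the reset handler can be tried out on a PC if the section bounds are passed in as parameters instead of coming from linker symbols. The function name init_ram and its parameters are hypothetical; the logic is the same copy-then-zero sequence as in Reset_Handler above.

```c
#include <stdint.h>

/*
 * Same two loops as in Reset_Handler, with the section bounds as parameters.
 * On the target the arguments would be &_sidata, &_sdata, &_edata,
 * &_sbss and &_ebss.
 */
void init_ram(const uint32_t* src,
              uint32_t* data, uint32_t* data_end,
              uint32_t* bss, uint32_t* bss_end) {
    while (data < data_end) {
        *data++ = *src++;   /* copy initial values, one word at a time */
    }
    while (bss < bss_end) {
        *bss++ = 0U;        /* zero-fill */
    }
}
```

Called with a two-word "flash" array and a RAM array pre-filled with garbage, the first two RAM words end up with the initial values, the following .bss words are zero, and everything past the end marker stays untouched. The end pointers are exclusive, exactly like _edata and _ebss.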
From .c to bytes in flash
What happens between source and a blinking LED is also not one step, but several. The preprocessor resolves includes and macros. The compiler translates each .c file individually into assembly, architecture-specific via -mcpu=cortex-m4 -mthumb. The assembler produces object files in ELF format, with relative addresses and unresolved symbols. The linker collects all sections of the same name from all object files, assigns them absolute addresses via the linker script, and resolves the symbols. What used to say “jump to delay” now carries the concrete flash address of delay.
The final product is an ELF file. It contains the machine code, but also program headers (which bytes go where in memory), the symbol table (function and variable names with their addresses, for the debugger), and optionally debug information (the mapping of machine code to source lines). A .bin file, which you produce via objcopy -O binary, is just the raw bytes without any metadata.
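How much structure the ELF carries compared to the raw .bin shows up in its first 52 bytes already. The sketch below checks whether a buffer looks like a 32-bit little-endian ARM ELF header and reads the entry point from it. The field offsets come from the ELF32 header layout; the function names are invented for this example, and a real tool would go on to walk the program headers.

```c
#include <stdbool.h>
#include <stdint.h>

#define ELF32_HEADER_SIZE 52U
#define EM_ARM            40U  /* e_machine value for ARM */

/* Little-endian readers for header fields. */
uint16_t elf_rd16(const uint8_t* p) {
    return (uint16_t)((uint16_t)p[0] | ((uint16_t)p[1] << 8));
}

uint32_t elf_rd32(const uint8_t* p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* True if `h` (at least ELF32_HEADER_SIZE bytes) is a 32-bit LE ARM ELF header. */
bool elf32_is_arm(const uint8_t* h) {
    return h[0] == 0x7F && h[1] == 'E' && h[2] == 'L' && h[3] == 'F' &&
           h[4] == 1 &&                   /* EI_CLASS: ELFCLASS32 */
           h[5] == 1 &&                   /* EI_DATA: little-endian */
           elf_rd16(h + 18) == EM_ARM;    /* e_machine at offset 18 */
}

/* e_entry, the entry point address, sits at byte offset 24. */
uint32_t elf32_entry(const uint8_t* h) {
    return elf_rd32(h + 24);
}
```

A .bin file has no such header to check. It is only correct in combination with the knowledge of where in flash it has to go.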
To flash it, you use a tool like probe-rs or OpenOCD. The tool talks over a debug adapter (ST-Link, J-Link, or CMSIS-DAP) to the Cortex-M4’s SWD port (Serial Wire Debug). The debug port has direct access to the entire address space of the chip, regardless of whether the CPU is running. The tool reads the ELF file, extracts the program headers, writes the bytes into flash, and triggers a reset. After which the whole cycle starts over. Hardware reads the vector table, jumps to the reset handler, startup initializes, main() runs, LED blinks.
I want to stress again that this is a little learning project of mine, not production code. In production you’d put an abstraction on top, either a vendor HAL (for example ST’s) or the Cortex Microcontroller Software Interface Standard (CMSIS), which is vendor-agnostic.
What I took away
Two things stuck with me.
The first: a Hello World is very much not a waste of time if you take it seriously. I could have written the blinky with a HAL in five lines, done. Instead I wrote the vector table myself, the reset handler, read the linker script, understood the boundary markers. And I learned more about the system than I would have in weeks of framework tutorials.
The second: simple is almost never actually simple. The blinky with a HAL isn’t any easier than the blinky with registers, it’s just further away from what’s actually happening. Between make flash and a blinking LED sit the linker script, the startup code, the ELF, the flashing, the hardware reset. All of this is there, even when you don’t see it.
For those of us in automotive this is particularly interesting. In a classical AUTOSAR project we never see the vector table, never see the reset handler, never see the linker script. The MCAL, the supplier’s startup code, the OS init, BswM scheduling: all of that arrives as a black box, and we write runnables that get called from the RTE. Between power-on and Rte_MainFunction_* there’s the exact same chain as here. Hardware reads 4 bytes, jumps to the reset handler, someone initializes .data and .bss, someone calls the OS init, someone starts the tasks. It’s just that each of those steps is buried inside an AUTOSAR configuration we normally don’t open. Once you’ve built this foundation yourself, you read an MCAL doc differently. It also sharpens the language question. In the earlier post I argued that C, C++, and Rust each have their layer, but the layer we usually work on in AUTOSAR is several steps above the one where that choice really matters. The reset handler and the .data copy are typically C or assembly, and the linker script is a language of its own, regardless of what we write on top. The supplier picks the language down there, we don’t.
Not seeing it is comfortable as long as everything works. The moment it doesn’t, you have to go a level deeper. And then it’s good to have been there before.