Notes for Linux Kernel
Introduction
Features
- Preemptive multitasking
- Virtual memory
- Shared libraries
- Demand loading, dynamic kernel modules
- TCP/IP networking
- Symmetrical Multi-Processing support
- Open source
Two modes in Linux
- User mode: application software (including different libraries)
- Kernel mode: system calls, linux kernel, hardware
What’s a kernel
- executive system monitor
- controls and mediates access to hardware
- implements and supports fundamental abstractions (processes, files, devices)
- schedules/allocates system resources (memory, cpu, disk, descriptors)
- security and protection
- responds to user requests for service (system calls)
Kernel design goal
performance, stability, capability, security and protection, portability, extensibility
Linux source tree (directories)
- /root : the home directory for the root user
- /home : the user’s home directories along with directories for services (ftp, http…)
- /bin : commands needed during booting up that may be needed by normal users
- /sbin : similar to /bin, but not intended for normal users; these commands are typically run by the system or by root
- /proc : a virtual file system that exists only in the kernel's imagination (i.e., in memory), not on a disk
- a directory with info about process number
- each process has a directory below /proc
- /usr : contain all commands, libraries, man pages, games and static files for normal operations
- /boot : files used by the bootstrap loader. kernel images are often kept here.
- /lib : shared libraries needed by the programs on the root file system
- /modules : loadable kernel modules, especially those needed to boot the system after disasters
- /dev : device files
- /etc : configuration files specific to the machine
- /skel : when a home directory is created, it is initialized with files from this directory
- /sysconfig : files that configure the linux system for devices
- /var : files that change: mail, news, printer spool files, log files, man pages, temp files
- /mnt : mount points for temporary mounts by the system administrator
- /tmp : temporary files. programs running after bootup should use /var/tmp
linux/arch
- subdirectories for each current port
- each contains kernel, lib, mm, boot and other directories whose contents override code stubs in architecture independent code
linux/drivers
- largest amount of code in the kernel tree
- device, bus, platform and general directories
linux/fs
- contains virtual file system framework (VFS) and subdirectories for actual file systems
linux/include
- include/asm-* : architecture-dependent include subdirectories
- include/linux :
- header info needed both by the kernel and user apps
- usually linked to /usr/include/linux
- kernel-only portions guarded by
#ifdef __KERNEL__
/* kernel stuff */
#endif
linux/init
- version.c : contains the version banner that prints at boot
- main.c : architecture-independent boot code (start_kernel is the main entry point)
linux/kernel
- the core kernel code
- sched.c : the main kernel file (scheduler, wait queues, timers, alarms, task queues)
- process control : fork.c, exec.c, signal.c, exit.c etc…
- kernel module support : kmod.c, ksyms.c, module.c
- other operations : time.c, resource.c, dma.c, printk.c, info.c, sys.c …
linux/lib
- kernel code cannot call standard C library routines
linux/mm
- paging and swapping
- allocation and deallocation
- memory mapping
linux/scripts
scripts for:
- menu-based kernel configuration
- kernel patching
- generating kernel documentation
Summary
- Linux is a modular, UNIX-like monolithic kernel
- kernel is the heart of the OS that executes with special hardware permission (kernel mode)
- “core kernel” provides framework, data structures, support for drivers, modules, subsystems
- architecture dependent source subtrees live in /arch
Booting
How does a computer start up?
- booting is a bootstrapping process that starts operating systems when the user turns on a computer system
- a boot sequence is the set of operations the computer performs when it is switched on that load an operating system
Booting sequence
- Turn on
- The CPU jumps to the BIOS entry address (0xFFFF0)
- BIOS runs POST (Power-On Self Test)
- Find bootable devices
- Load and execute boot sector from MBR
- Load OS
BIOS (Basic Input/Output System)
- BIOS refers to the software code run by a computer when first powered on
- its primary function is to initialize and control the various devices that make up the computer; this code is embedded on a chip on the motherboard
MBR (Master Boot Record)
- OS is booted from a hard disk, where MBR contains the primary boot loader
- The MBR is a 512-byte sector, located in the first sector in the disk (sector 1 of cylinder 0, head 0)
- After the MBR is loaded into RAM, the BIOS yields control to it
- The first 446 bytes are the primary boot loader, which contains both executable code and error message
- The next 64 bytes are the partition table, which contains a record for each of four partitions
- The MBR ends with two bytes defined as the magic number (the bytes 0x55 0xAA, read as the little-endian word 0xAA55). The magic number serves as a validation check of the MBR.
- To see the contents of MBR, use:
sudo dd if=/dev/sda of=mbr.bin bs=512 count=1
od -xa mbr.bin
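As a sketch of the layout described above, the 512-byte MBR can be modeled with a C struct (the field names are illustrative, not taken from any kernel header):

```c
#include <stdint.h>

/* Illustrative layout of the 512-byte MBR: 446 bytes of boot code,
 * a 64-byte partition table (4 x 16-byte entries), and the 2-byte magic. */
struct mbr {
    uint8_t bootloader[446];          /* primary boot loader code and messages */
    struct mbr_partition {
        uint8_t  status;              /* 0x80 = bootable */
        uint8_t  chs_first[3];        /* CHS address of the first sector */
        uint8_t  type;                /* partition type ID */
        uint8_t  chs_last[3];         /* CHS address of the last sector */
        uint32_t lba_first;           /* LBA of the first sector */
        uint32_t sector_count;        /* number of sectors in the partition */
    } __attribute__((packed)) partitions[4];
    uint16_t signature;               /* 0xAA55 when read as a little-endian word */
} __attribute__((packed));
```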
Boot loader
- more aptly called the kernel loader. the task at this stage is to load the linux kernel
- optionally, an initial RAM disk (initrd) is loaded as well
- GRUB and LILO are the most popular Linux boot loaders
- GRUB: GRand Unified Bootloader
- an operating system independent bootloader
- a multiboot software package from GNU
- flexible command line interface
- file system access
- supports multiple executable formats
- support diskless system
- download OS from network
- LILO: LInux LOader
- not dependent on a specific file system
- can boot from hard disk and floppy
- up to 16 different images
- LILO must be re-run whenever the kernel image file or the config file is changed
Kernel Image
- The kernel is the central part of most operating systems because of its task: managing the system's resources and the communication between hardware and software components.
- the kernel stays in memory until the computer is turned off.
- the kernel image on disk is not an executable kernel, but a compressed kernel image
- zImage: used when the compressed kernel is smaller than about 512 KB (loaded into low memory)
- bzImage ("big zImage"): used when it is larger than 512 KB (loaded into high memory)
Task of kernel
- process management
- memory management
- device management
- system call
Init process
- the first thing the kernel does is to execute init process
- init is the root/parent of all processes executing in linux, and is responsible for starting all other processes
- the first process that init starts is a script: /etc/rc.d/rc.sysinit
- based on the appropriate run-level, scripts are executed to start various processes to run the system and make it functional
- the init process id is “1”
- init is responsible for starting system processes defined in /etc/inittab file
- init typically will start multiple instances of “getty”, which wait for console logins and spawn the user's shell process
- upon shutdown, init controls the sequence and processes for shutdown
Runlevels
- a runlevel is a software configuration of the system which allows only a selected group of processes to exist
- init can be in one of seven runlevels: 0-6
rc#.d files
- rc#.d files are the scripts for a given runlevel that run during boot and shutdown
- the scripts are found in /etc/rc.d/rc#.d, where the symbol # represents the run level
init.d
- a daemon is a background process
- init.d is the directory holding the service control scripts; an admin can start/stop individual daemons by invoking the corresponding script
- admin can issue the command with either the start, stop, status, restart or reload option
- e.g. to stop the web server
cd /etc/rc.d/init.d/
./httpd stop
System initialization in real mode
- setup.S is loaded by bootsect.S together with the kernel image. setup.S obtains the system parameters from the BIOS, stores them in a memory parameter area, and still runs in real mode
- the helper program setup prepares for execution of the kernel image, then jumps to 0x10000 (small kernel) or 0x100000 (big kernel) to start the kernel itself; after that comes the kernel initialization process
- setup.S:
  - version check and parameter setup
  - prepare for entering protected mode
System initialization in protected mode
- initialize registers and data areas
- decompress the kernel code: call decompress_kernel in misc.c
- initialize the page tables
- initialize the IDT, GDT and LDT
Starting the core kernel
- the previous steps initialize the CPU and enable protected mode
- the task now is to initialize the kernel's core data structures, mainly involving:
  - interrupt management
  - process management
  - memory management
  - device management
- after entering protected mode, the system starts executing from start_kernel; the start_kernel function becomes process 0 and never returns
- start_kernel displays the version information, calls setup_arch(), and initializes the core data structures
- finally, it calls kernel_thread() to create process 1, the init process
- after the parent creates the init child process, it goes on to run cpu_idle
Q&A
- Q: On i386, can the starting address of the kernel's executable code in memory be chosen arbitrarily?
  A: In principle, yes, but in practice many factors must be considered. For example, the low 1 MB of a PC's memory address space is not contiguous, so placing the kernel at the low end depends on whether it fits; if the kernel code is too large, it cannot be placed at the low end.
- Q: Where on the hard disk is the master boot sector located? If a disk's master boot sector is faulty, can the disk still be used?
  A: The master boot sector is located at head 0, cylinder 0, sector 1 of the disk. Generally, if a disk's master boot sector is faulty, the disk can still be used, but it can no longer serve as a boot disk, because the contents of its master boot sector cannot be read.
/proc File System and Kernel Module Programming
What is a kernel module?
- (wiki) an object file that contains code to extend the running kernel
- (RedHat) modules are pieces of code that can be loaded and unloaded into the kernel upon demand
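A minimal sketch of such a module, assuming the 2.6-era module API used throughout these notes (the hello_init/hello_exit names are made up for illustration):

```c
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("minimal example module");

static int __init hello_init(void)
{
    printk(KERN_INFO "hello: module loaded\n");
    return 0;                 /* 0 = success; a negative errno aborts loading */
}

static void __exit hello_exit(void)
{
    printk(KERN_INFO "hello: module unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);
```

Built with a kbuild makefile (obj-m += hello.o), it can be inserted with insmod hello.ko and removed with rmmod hello; the printk output shows up in dmesg.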
advantages and disadvantages
- advantages:
- allowing the dynamic insertion and removal of code from the kernel at run-time
- save memory cost
- disadvantages: a fragmentation penalty, which can decrease memory performance
current kernel modules
cd /lib/modules/2.6.32-22-generic/
find . -name "*.ko"
current loaded modules
lsmod
or cat /proc/modules
/proc
- a pseudo file system
- real time, resides in the virtual memory
- tracks the processes running on the machine and the state of the system
- a new /proc file system is created every time your linux machine reboots
- highly dynamic. the size of the proc directory is 0 and the last time of modification is the last bootup time
- /proc file system doesn’t exist on any particular media
- the contents of the /proc file system can be read by anyone who has the requisite permissions
- certain parts of the /proc file system can be read only by the owner of the process and of course root (some not even by root)
- the contents of the /proc are used by many utilities which grab the data from the particular /proc directory and display it. e.g. top, ps, lspci, dmesg etc
Tweak kernel parameters
- /proc/sys: making changes to this directory enables you to make real time changes to certain kernel parameters
- e.g. /proc/sys/net/ipv4/ip_forward has the value “0”, which can be seen using cat. It can be changed in real time with echo 1 > /proc/sys/net/ipv4/ip_forward, thus allowing IP forwarding.
Details of some files in /proc
- buddyinfo: contains the number of free areas of each order for the kernel buddy system
- cmdline: kernel command line
- cpuinfo: human-readable information about the processors
- devices: list of device drivers configured into the currently running kernel (block and character)
- dma: shows which DMA channels are being used at the moment
- execdomains: lists the execution domains (personalities) currently supported by the kernel
- fb: frame buffer devices
- filesystems: filesystems configured/supported into/by the kernel
- interrupts: number of interrupts per IRQ on the x86 architecture
- iomem: shows the current map of the system’s memory for its various devices
- ioports: provides a list of currently registered port regions used for input or output communication with a device
- kcore:
- this file represents the physical memory of the system and is stored in the core file format
- unlike most /proc files, kcore does display a size. This value is given in bytes and is equal to the size of physical memory (RAM) used plus 4 KB.
- its contents are designed to be examined by a debugger, such as gdb, the GNU Debugger
- only root user has the rights to view this file
- kmsg: used to hold messages generated by the kernel, which are then picked up by other programs, such as klogd
- loadavg:
- provides a look at load average
- the first three columns give the load average over the last 1, 5 and 15 minute periods.
- the fourth column shows the number of currently running processes and the total number of processes
- the last column displays the last process ID used
- locks: displays the files currently locked by the kernel
- mdstat: contains the current information for multiple-disk, RAID configurations
- meminfo:
- one of the most commonly used /proc files
- it reports plenty of valuable information about the current utilization of RAM on the system
- misc: this files lists miscellaneous drivers registered on the miscellaneous major device, which is number 10
- modules: displays a list of all modules that have been loaded by the system
- mounts: provides a quick list of all mounts in use by the system
- mtrr: refers to the current Memory Type Range Registers (MTRRs) in use with the system
- partitions: detailed information on the various partitions currently available to the system
- pci: full list of every PCI device on your system
- slabinfo: information about the memory usage on the slab level
- stat: keeps track of a variety of different statistics about the system since it was last restarted
- swaps: measures swap space and its utilization
- uptime: contains information about how long the system has been up since its last restart
- version: tells the versions of the Linux kernel and gcc, as well as the version of Red Hat Linux installed on the system
The numerical named directories
- the processes that were running at the instant a snapshot of the /proc file system was taken
- the contents of all the directories are the same, as these directories contain the various parameters and status of the corresponding process
- you have full access only to the processes that you have started
A typical process directory
- cmdline: contains the whole command line used to invoke the process. the contents of this file are the command-line arguments with all the parameters (shown without formatting/spaces, as the arguments are separated by NUL bytes)
- cwd: symbolic link to the current working directory
- environ: contains all the process-specific environment variables
- exe: symbolic link of the executable
- maps: parts of the process’ address space mapped to a file
- fd: this directory contains the list of file descriptors as opened by the particular process
- root: symbolic link pointing to the directory which is the root file system for the particular process
- status: information about the process
Other subdirectories in /proc
- /proc/self: link to the currently running process
- /proc/bus:
- contains information specific to the various buses available on the system
- for ISA, PCI and USB buses, current data on each is available in /proc/bus/<bus type directory>
- individual bus directories, signified with numbers, contain binary files that refer to the various devices available on that bus
- /proc/driver: specific drivers in use by kernel
- /proc/fs: specific file system, file handle, inode, dentry and quota information
- /proc/ide: information about IDE devices
- /proc/irq: used to set IRQ to CPU affinity
- /proc/net: networking parameters and statistics
- arp: kernel’s ARP table. useful for connecting hardware address to an IP address on a system
- dev: lists the network devices along with transmit and receive statistics
- route: displays the kernel’s routing table
- /proc/scsi: like /proc/ide, it gives info about scsi devices
- /proc/sys:
  - allows you to make configuration changes to a running kernel, via the echo command
  - any configuration changes made will disappear when the system is restarted
/proc/sys subdirectories
- /proc/sys/dev: provides parameters for particular devices on the system, e.g. cdrom/info lists many important CD-ROM parameters
- /proc/sys/kernel:
- acct: controls the suspension of process accounting based on the percentage of free space available on the filesystem containing the log
- ctrl-alt-del: controls whether [Ctrl]-[Alt]-[Delete] will gracefully restart the computer using init (value 0) or force an immediate reboot without syncing the dirty buffers to disk (value 1).
- domainname: allows you to configure the system’s domain name, such as domain.com.
- hostname: allows you to configure the system’s host name, such as host.domain.com
- threads-max: sets the maximum number of threads to be used in the kernel, with a default value of 4096
- panic: defines the number of seconds the kernel will postpone rebooting the system when a kernel panic is experienced. By default, the value is set to 0, which disables automatic rebooting after a panic.
- /proc/sys/vm: facilitates the configuration of the Linux kernel’s virtual memory subsystem
Advantages and disadvantages of the /proc file system
- advantages:
- coherent, intuitive interface to the kernel
- great for tweaking and collecting status info
- easy to use and program
- disadvantages:
- certain amount of overhead, must use fs calls, alleviated somewhat by sysctl() interface
- user can possibly cause system instability
/proc file system entries
- to use any of the procfs functions, you have to include the correct header file
#include <linux/proc_fs.h>
struct proc_dir_entry* create_proc_entry(const char* name, mode_t mode, struct proc_dir_entry* parent);
  - this function creates a regular file with the name name and the mode mode in the directory parent
  - to create a file in the root of the procfs, use NULL as the parent parameter
  - when successful, the function will return a pointer to the freshly created struct proc_dir_entry
  - e.g. foo_file = create_proc_entry("foo", 0644, example_dir);
struct proc_dir_entry* proc_mkdir(const char* name, struct proc_dir_entry* parent);
  - creates a directory name in the procfs directory parent
struct proc_dir_entry* proc_symlink(const char* name, struct proc_dir_entry* parent, const char* dest);
  - this creates a symlink in the procfs directory parent that points from name to dest. this translates in userland to ln -s dest name
void remove_proc_entry(const char* name, struct proc_dir_entry* parent);
  - removes the entry name in the directory parent from the procfs
  - be sure to free the data entry from the struct proc_dir_entry before the function is called
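Putting these calls together, a sketch of a module that creates /proc/example/foo (assuming the legacy read_proc callback that accompanies create_proc_entry in the 2.6 kernels these notes target; the example/foo names are illustrative):

```c
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/proc_fs.h>

MODULE_LICENSE("GPL");

static struct proc_dir_entry *example_dir, *foo_file;

/* Legacy read_proc callback: copy a short text into the supplied page. */
static int foo_read(char *page, char **start, off_t off,
                    int count, int *eof, void *data)
{
    int len = sprintf(page, "hello from /proc/example/foo\n");
    *eof = 1;
    return len;
}

static int __init example_init(void)
{
    example_dir = proc_mkdir("example", NULL);
    if (!example_dir)
        return -ENOMEM;

    foo_file = create_proc_entry("foo", 0644, example_dir);
    if (!foo_file) {
        remove_proc_entry("example", NULL);
        return -ENOMEM;
    }
    foo_file->read_proc = foo_read;
    return 0;
}

static void __exit example_exit(void)
{
    remove_proc_entry("foo", example_dir);
    remove_proc_entry("example", NULL);
}

module_init(example_init);
module_exit(example_exit);
```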
Tips
- modules can only use APIs exported by kernel and other modules
- no libcs
- kernel exports some common APIs
- modules are part of kernel
- modules can control the whole system
- as a result, can damage the whole system
- modules basically can’t be written in C++
- determine which part should be in kernel
Process Management
Processes, Lightweight Processes and Threads
- Process: an instance of a program in execution
- (User) Thread: an execution flow of the process
- Pthread (POSIX thread) library
- Lightweight process (LWP): used to offer better support for multithreaded applications
- LWP may share resources: address space, open files…
- a straightforward approach: associate a lightweight process with each thread
Process descriptor
task_struct data structure:
- state: process state
- thread_info: low-level information for the process
- mm: pointers to memory area descriptors
- tty: tty associated with the process
- fs: current directory
- files: pointers to file descriptors
- signal: signals received
Process state
- TASK_RUNNING: the process is runnable; it is either executing on a CPU or waiting to run in a runqueue
- TASK_INTERRUPTIBLE: the process is in an interruptible sleep; it is sleeping but can be woken up by any signal, and once woken it responds to the signal
- TASK_UNINTERRUPTIBLE: the process is in an uninterruptible sleep; it is sleeping, waiting exclusively for a particular event (usually an I/O event), and does not want to be woken up by signals
- TASK_STOPPED
- TASK_TRACED
- EXIT_ZOMBIE: the process has become a zombie, i.e. its parent has not yet handled its termination signal
- EXIT_DEAD
Identifying a process
- process descriptor pointers: 32 bits
- process ID (PID): 16 bits (~32767 for compatibility)
- linux associates different PID with each process or LWP
- programmers expect threads in the same group to have a common PID
- thread group: a collection of LWPs
  - the tgid field in the process descriptor holds the PID of the first LWP in the group; it is the value returned by the getpid() system call
The process list
- the tasks field in the task_struct structure
  - type list_head
  - its prev, next fields point to the previous and the next task_struct
- process 0 (swapper): init_task
Parenthood relationships among processes
- process 0 and 1: created by the kernel
  - process 1 (init): the ancestor of all processes
- fields in process descriptor for parenthood
- real_parent
- parent
- children
- sibling
Pidhash table and chained lists
- to support the search for the process descriptor of a PID, since sequential search in the process list is inefficient
- the pid_hash array contains four hash tables, with corresponding fields in the process descriptor:
  - pid: PIDTYPE_PID
  - tgid: PIDTYPE_TGID (thread group leader)
  - pgrp: PIDTYPE_PGID (process group leader)
  - session: PIDTYPE_SID (session leader)
- chaining is used to handle PID collisions
- the size of each pidhash table is dependent on the available memory
PID
- the pids field of the process descriptor: the pid data structure
  - nr: the PID number
  - pid_chain: links to the previous and the next elements in the hash chain list
  - pid_list: head of the per-PID list (in the thread group)
How processes are organized
- processes in TASK_STOPPED, EXIT_ZOMBIE, EXIT_DEAD: not linked in lists
- processes in TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE: wait queues
- two kinds of sleeping processes:
- exclusive process
- nonexclusive process: always woken up by the kernel when the event occurs
Process switch, task switch, context switch
- hardware context switch: a far jmp (in older Linux)
- software context switch: a sequence of mov instructions
- allows better control over the validity of data being loaded
- the amount of time required is about the same
Performing the process switch
- switch the page global directory
- switch the kernel mode stack and the hardware context
Task State Segment
TSS: a specific segment type in x86 architecture to store hardware contexts
Creating processes
- in traditional UNIX, resources owned by parent process are duplicated
- very slow and inefficient
- mechanisms to solve this problem
- Copy on Write: parent and child read the same physical page
- Lightweight process (LWP): parent and child share per-process kernel data structures
- vfork() system call: parent and child share the memory address space
clone(), fork() and vfork() system calls
- clone(fn, arg, flags, child_stack, tls, ptid, ctid): creates a lightweight process (see the sketch below)
  - a wrapper function in the C library
  - uses the clone() system call
- fork() and vfork() system calls: implemented by clone() with different parameters
  - each invokes the do_fork() function
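A user-space sketch of the idea: calling the glibc clone() wrapper with CLONE_VM | CLONE_FS | CLONE_FILES makes parent and child share the address space, filesystem information and open files, i.e. a lightweight process (the stack size and child function here are arbitrary choices for illustration):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int child_fn(void *arg)
{
    printf("LWP pid=%d sharing the parent's memory\n", getpid());
    return 0;
}

int main(void)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack) { perror("malloc"); return 1; }

    /* share VM, filesystem info and file descriptors with the parent */
    pid_t pid = clone(child_fn, stack + stack_size,   /* stack grows downward */
                      CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); return 1; }

    waitpid(pid, NULL, 0);    /* reap the child */
    free(stack);
    return 0;
}
```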
Kernel threads
- kernel threads run only in kernel mode
- they use only linear addresses greater than PAGE_OFFSET
- kernel_thread(): creates a kernel thread
- example kernel threads:
- process 0 (swapper process), the ancestor of all processes
- process 1 (init process)
- others: keventd, kapm…
Destroying processes
- exit() library function
- two system calls in Linux 2.6:
  - _exit() system call: handled by the do_exit() function
  - exit_group() system call: handled by the do_group_exit() function
- process removal: the process descriptor of a zombie process is released by release_task()
Scheduling policy
- based on time-sharing:
- time slice
- based on priority ranking:
- dynamic
- classification of processes:
- interactive processes: e.g. shells, text editors, GUI applications
- batch processes: e.g. compilers, database search engine, web server
- real-time processes: e.g. audio/video applications, data collection from physical sensors, robot controllers
Process preemption
- Linux (user) processes are preemptive
  - when a new process has higher priority than the current one
  - when its time quantum expires, the TIF_NEED_RESCHED flag in thread_info will be set
- a preempted process is not suspended since it remains in the TASK_RUNNING state
- the Linux kernel before 2.6 was nonpreemptive, which is simpler
How long should a quantum last
- neither too long nor too short:
- too short: overhead for process switch
- too long: processes no longer appear to be executed concurrently
- always a compromise: the rule of thumb, to choose a duration as long as possible, while keeping good system response
Scheduling algorithm
- earlier version of Linux: simple and straightforward
- at every process switch, scan the runnable processes, compute the priorities and select the best one to run
- much more complex in Linux 2.6
- scales well with the number of processes and processors
- constant time scheduling
- 3 scheduling classes
- SCHED_FIFO: first-in first-out real-time
- SCHED_RR: round-robin real-time
- SCHED_NORMAL: conventional time-shared processes
Scheduling of conventional processes
Static priority
- static priority
  - for conventional processes: 100 - 139
  - can be changed by the nice() and setpriority() system calls
- base time quantum: the time quantum assigned by the scheduler when the process has exhausted its previous time quantum; it depends on the static priority
Dynamic priority and average sleep time
- dynamic priority: 100 - 139
  - dynamic priority = max(100, min(static_priority - bonus + 5, 139))
- bonus: 0 - 10
  - < 5: penalty
  - > 5: premium
  - depends on the average sleep time: the average number of nanoseconds that the process spent while sleeping
  - the more average sleep time, the larger the bonus
Determine the status of a process
To determine whether a process is considered to be interactive or batch
- interactive if: dynamic_priority <= 3 * static_priority / 4 + 28
  - i.e. bonus - 5 >= static_priority / 4 - 28
  - the quantity static_priority / 4 - 28 is called the interactive delta
Active and expired processes
- active processes: runnable processes that have not exhausted their time quantum
- expired processes: runnable processes that have exhausted their time quantum
- periodically, the role of processes changes
Scheduling of real-time processes
- real-time priority: 1 - 99
- real-time processes are always considered active
  - can be changed by the sched_setparam() and sched_setscheduler() system calls (see the example after this list)
- a real-time process is replaced only when:
- another process has higher real-time priority
- put to sleep by blocking operation
- stopped or killed
  - voluntarily relinquishes the CPU via the sched_yield() system call
  - for round-robin real-time processes (SCHED_RR), the time quantum is exhausted
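A small user-space sketch of switching a process into the real-time class with sched_setscheduler() (priority 10 is an arbitrary choice; this requires root privileges):

```c
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 10 };

    /* pid 0 means "the calling process" */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("now running with policy %d, priority %d\n",
           sched_getscheduler(0), sp.sched_priority);
    return 0;
}
```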
Implementation Support of Scheduling
Data structures used by the scheduler
- a runqueue data structure for each CPU
  - two sets of runnable processes in the runqueue structure
- prio_array_t data structure
  - nr_active: number of process descriptors in the list
  - bitmap: priority bitmap
  - queue: the 140 list_heads
Functions used by the scheduler
- scheduler_tick(): keeps the time_slice counter up to date
- try_to_wake_up(): awakens a sleeping process
- recalc_task_prio(): updates the dynamic priority
- load_balance(): keeps the runqueues of a multiprocessor system balanced
- schedule(): selects a new process to run
  - direct invocation:
    - when the process must be blocked to wait for a resource
    - when long iterative tasks are executed in device drivers
  - lazy invocation: by setting the TIF_NEED_RESCHED flag
    - when the current process has used up its quantum, by scheduler_tick()
    - when a process is woken up, by try_to_wake_up()
    - when a sched_setscheduler() system call is issued
Runqueue balancing in multiprocessor systems
- 3 types of multiprocessor systems
- Classic multiprocessor architecture
- Hyper-threading
- NUMA
- A CPU can execute only the runnable processes in the corresponding runqueue
- The kernel periodically checks if the workloads of the runqueues are balanced
Scheduling domains
A set of CPUs whose workloads are kept balanced by the kernel
- hierarchically organized
- partitioned into groups: a subset of CPUs
System calls related to scheduling
- nice(): for backward compatibility only
  - allows processes to change their base priority
  - replaced by setpriority()
- getpriority() and setpriority():
  - act on the base priorities of all processes in a given group
  - which: PRIO_PROCESS, PRIO_PGRP, PRIO_USER
  - who: the value of the pid, pgrp, or uid field used to select the processes
  - niceval: the new base priority value: -20 to +19
- sched_getscheduler(), sched_setscheduler(): query/set the scheduling policy
- sched_getparam(), sched_setparam(): query/set the scheduling parameters
- sched_yield(): relinquish the CPU voluntarily without being suspended
- sched_get_priority_min(), sched_get_priority_max(): return the minimum/maximum real-time static priority
- sched_rr_get_interval(): write into user space the round-robin time quantum for a real-time process
Completely Fair Scheduler (CFS)
- Motivation of CFS: on real hardware only one task runs at a time, so the running task gets 100% of the CPU while all other tasks get 0%; CFS models an ideal, fair multitasking CPU instead
- New scheduling algorithm in the Linux kernel since 2.6.23
- Design:
- the same virtual runtime for each task
- increasing the priority for sleeping task
- Implementation
- stores the records about the planned tasks in a red-black tree
- pick efficiently the process that has used the least amount of time (the leftmost node of the tree)
- the entry of the picked process is then removed from the tree, its spent execution time is updated, and the entry is returned to the tree, where it normally takes some other location (see the toy sketch below)
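A toy sketch of this bookkeeping (not kernel code): each task accumulates a weighted virtual runtime, and the scheduler always picks the task with the smallest vruntime; the real kernel keeps the tasks in a red-black tree instead of scanning an array.

```c
#include <stdio.h>

struct toy_task {
    const char   *name;
    unsigned long vruntime;   /* weighted CPU time consumed so far */
    unsigned long weight;     /* larger weight = higher priority */
};

/* charge 'delta' ns of real runtime, scaled inversely by the task's weight */
static void update_vruntime(struct toy_task *t, unsigned long delta)
{
    t->vruntime += delta * 1024 / t->weight;
}

/* the kernel picks the leftmost node of the red-black tree;
 * a linear scan is enough for this illustration */
static struct toy_task *pick_next(struct toy_task *tasks, int n)
{
    struct toy_task *best = &tasks[0];
    for (int i = 1; i < n; i++)
        if (tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    return best;
}

int main(void)
{
    struct toy_task tasks[] = {
        { "editor", 0, 2048 },   /* higher weight: vruntime grows more slowly */
        { "build",  0, 1024 },
    };
    for (int tick = 0; tick < 4; tick++) {
        struct toy_task *t = pick_next(tasks, 2);
        printf("tick %d: run %s\n", tick, t->name);
        update_vruntime(t, 1000000);   /* pretend it ran for 1 ms */
    }
    return 0;
}
```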
Kernel synchronization
Kernel control paths
- Linux kernel: like a server that answers requests
- parts of the kernel run in an interleaved way
- Kernel control path: a sequence of instructions executed in kernel mode on behalf of current process
- interrupts and exceptions
- lighter than a process (less context)
- Three CPU states are considered:
- User: running a process in user mode
- Excp: running an exception or a system call handler
- Intr: running an interrupt handler
Kernel preemption
- Preemptive kernel: a process running in kernel mode can be replaced by another process while in the middle of a kernel function
- The main motivation for making a kernel preemptive is to reduce the dispatch latency of the user mode process
- dispatch latency: delay between the time they become runnable and the time they actually begin running
- The kernel can be preempted only when it is executing an exception handler (in particular a system call) and kernel preemption has not been explicitly disabled
When synchronization is necessary
- A race condition can occur when the outcome of a computation depends on how two or more interleaved kernel control paths are nested
- To identify and protect the critical regions in exception handlers, interrupt handlers, deferrable functions, and kernel threads
- On single CPU, critical region can be implemented by disabling interrupts while accessing shared data
- If the same data is shared only by the service routines of system calls, critical region can be implemented by disabling kernel preemption while accessing shared data
- Things are more complicated on multiprocessor systems
- different synchronization techniques are necessary
When synchronization is not necessary
- The same interrupt cannot occur until the handler terminates
- Interrupt handlers and softirqs are nonpreemptable, non-blocking
- A kernel control path performing interrupt handling cannot be interrupted by a kernel control path executing a deferrable function or a system call service routine
- Softirqs cannot be interleaved
Per-CPU variables
- The simplest and most efficient synchronization technique consists of declaring kernel variables as per-cpu variables
- an array of data structures, one element per CPU in the system
- a CPU should not access the elements of the array corresponding to the other CPUs
- While per-cpu variables provide protection against concurrent accesses from several CPUs, they do not provide protection against accesses from asynchronous functions (interrupt handlers and deferrable functions)
- per-cpu variables are prone to race conditions caused by kernel preemption, both in uniprocessor and multiprocessor systems
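A minimal kernel-side sketch: get_cpu_var() disables preemption while the per-CPU element is touched, which is exactly the protection against preemption-induced races mentioned above (the counter name is made up):

```c
#include <linux/percpu.h>

/* one counter per CPU; no locking needed between CPUs */
static DEFINE_PER_CPU(unsigned long, my_event_count);

static void count_event(void)
{
    /* get_cpu_var() disables preemption so the task cannot migrate to
     * another CPU while it is updating its own element */
    get_cpu_var(my_event_count)++;
    put_cpu_var(my_event_count);
}
```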
Memory Barriers
- when dealing with synchronization, instruction reordering must be avoided
- a memory barrier primitive ensures that the operations before the primitive are finished before the operations after the primitive
- all instructions that operate on I/O ports
- all instructions prefixed by lock byte
- all instructions that write into control registers, system registers or debug registers
- a few special instructions, e.g. iret
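A sketch of the usual publish/consume use of the kernel's barrier primitives (on the 2.6-era x86 kernels these notes describe, smp_wmb()/smp_rmb() come in via <asm/system.h>; newer kernels use <asm/barrier.h>):

```c
#include <asm/system.h>   /* smp_wmb(), smp_rmb() on 2.6-era kernels */

static int shared_data;
static int data_ready;

static void producer(void)
{
    shared_data = 42;
    smp_wmb();            /* make the data visible before the flag */
    data_ready = 1;
}

static int consumer(void)
{
    if (data_ready) {
        smp_rmb();        /* read the flag before reading the data */
        return shared_data;
    }
    return -1;
}
```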
Spin locks
- Spin locks are a special kind of lock designed to work in a multiprocessor environment
- busy waiting
- very convenient
Read/Write Spin Locks
- To increase the amount of concurrency in the kernel
- multiple reads, one write
- rwlock_t structure
- lock field: 32 bit
Seqlock
- Introduced in Linux 2.6
- similar to read/write spin locks
- except that they give a much higher priority to writers
- a writer is allowed to proceed even when readers are active
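A sketch of the seqlock read/write pattern (SEQLOCK_UNLOCKED is the 2.6-era static initializer; the protected variable is made up for illustration):

```c
#include <linux/seqlock.h>

static seqlock_t event_lock = SEQLOCK_UNLOCKED;
static unsigned long long last_event_time;

/* writer: always allowed to proceed, even if readers are active */
static void record_event(unsigned long long now)
{
    write_seqlock(&event_lock);
    last_event_time = now;
    write_sequnlock(&event_lock);
}

/* reader: retries the read if a writer slipped in meanwhile */
static unsigned long long read_event_time(void)
{
    unsigned long long t;
    unsigned int seq;

    do {
        seq = read_seqbegin(&event_lock);
        t = last_event_time;
    } while (read_seqretry(&event_lock, seq));

    return t;
}
```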
Read-Copy Update
- Read-Copy update (RCU): another synchronization technique designed to protect data structures that are mostly accessed for reading by several CPUs
- RCU allows many readers and many writers to proceed concurrently
- RCU is lock free
- Key ideas
- only data structures that are dynamically allocated and referenced via pointers can be protected by RCU
- No kernel control path can sleep inside a critical section protected by RCU
Semaphores
- two kinds of semaphores
- kernel semaphores: by kernel control paths
- System V IPC semaphores: by user processes
- Kernel semaphores
  - struct semaphore
    - count
    - wait
    - sleepers
  - up(): releases a kernel semaphore (similar to signal)
  - down(): acquires a kernel semaphore (similar to wait)
Read/Write semaphores
- similar to read/write spin locks
- except that waiting processes are suspended instead of spinning
- struct rw_semaphore
- count
- wait_list
- wait_lock
- init_rwsem()
- down_read(), down_write(): acquire a read/write semaphore
- up_read(), up_write(): release a read/write semaphore
Completions
- to solve a subtle race condition in multiprocessor systems
- similar to semaphores
- struct completion
- done
- wait
- complete(): corresponds to up()
- wait_for_completion(): corresponds to down()
Local interrupt disabling
- interrupts can be disabled on a CPU with:
  - the cli instruction
  - the local_irq_disable() macro
- interrupts can be enabled with:
  - the sti instruction
  - the local_irq_enable() macro
Disabling/Enabling deferrable functions
- softirqs
- the kernel sometimes needs to disable deferrable functions without disabling interrupts
  - local_bh_disable() macro
  - local_bh_enable() macro
Synchronizing accesses to kernel data structures
- rule of thumb for kernel developers:
- always keep the concurrency level as high as possible in the system
- two factors:
- the number of I/O devices that operate concurrently
- the number of CPUs that do productive work
- a shared data structure consisting of a single integer value can be updated by declaring it as an atomic_t type and by using atomic operations
- inserting an element into a shared linked list is never atomic, since it consists of at least two pointer assignments (see the sketch below)
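A kernel-style sketch of both cases (the data structures are hypothetical): the single counter can use atomic_t, while the list insertion, being two pointer assignments, is wrapped in a spin lock:

```c
#include <linux/list.h>
#include <linux/spinlock.h>
#include <asm/atomic.h>     /* <linux/atomic.h> on newer kernels */

static atomic_t nr_items = ATOMIC_INIT(0);   /* single integer: atomic ops suffice */
static LIST_HEAD(item_list);                 /* linked list: needs a lock */
static DEFINE_SPINLOCK(item_lock);

struct item {
    struct list_head node;
    int value;
};

static void add_item(struct item *it)
{
    atomic_inc(&nr_items);        /* safe without any lock */

    spin_lock(&item_lock);        /* list_add() is not atomic */
    list_add(&it->node, &item_list);
    spin_unlock(&item_lock);
}
```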
Examples of race condition
- when a program uses two or more semaphores, the potential for deadlock is present because two different paths could wait for each other
- Linux has few problems with deadlocks on semaphore requests, since each kernel control path usually acquires just one semaphore
- in cases such as the rmdir() and rename() system calls, two semaphores are requested
- to avoid such deadlocks, semaphore requests are performed in a predefined address order
Symmetric multiprocessing (SMP)
Traditional View
- Traditionally, the computer has been viewed as a sequential machine
- a processor executes instructions one at a time in sequence
- each instruction is a sequence of operations
- two popular approaches to providing parallelism
  - symmetric multiprocessors
- clusters
Categories of computer systems
- single instruction single data (SISD) stream
- single processor executes a single instruction stream to operate on data stored in a single memory
- Single instruction Multiple data (SIMD) stream
- each instruction is executed on a different set of data by the different processors
- multiple instruction single data (MISD) stream
- a sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence
- multiple instruction multiple data (MIMD)
- a set of processors simultaneously execute different instruction sequences on different data sets
Symmetric multiprocessing
- kernel can execute on any processor
- allowing portions of the kernel to execute in parallel
- typically each processor does self-scheduling from the pool of available processes or threads
SMP
- Symmetric multiprocessing (SMP) involves a multiprocessor computer hardware and software where two or more identical processors connect to a single shared main memory, have full access to all I/O devices and are controlled by a single OS instance that treats all processors equally, reserving none for special purposes
- Most multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors
Linux Support for SMP
- Boot process: the BSP (bootstrap processor) is responsible for booting the operating system; in the final stage of booting, the BSP activates each AP (Application Processor) via IPIs. During normal operation there is essentially no difference between the BSP and the APs
- Process scheduling: the main difference from a uniprocessor system is that after a process switch, the process that was switched out may continue running on another CPU. When computing priority, if the CPU a process last ran on is also the current CPU, the priority is boosted a little, which makes more effective use of the cache
- Interrupt system: supporting SMP requires the APIC interrupt controller hardware. Linux defines the interrupt vectors for the various IPIs as well as the functions for sending IPIs
NUMA
- Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor
- Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors)
- Benefits: the benefits of NUMA are limited to particular workloads, notably on servers where the data are often associated strongly with certain tasks or users
SMP system boot process
- BIOS initialization: mask the APs (Application Processors) and build the system configuration tables
- the boot program in the MBR (LILO, GRUB, etc.) loads the kernel into memory
- the startup_32 function in head.S is executed, which finally calls start_kernel
- start_kernel is executed; this function is the equivalent of main in an application program
函数 start_kernel
进行一系列的初始化工作,最后将执行:smp_init
:关键的一步,启动各APrest_init
,调用init
创建1号init
进程,自身调用cpu_idle
成为0号进程
- process 1 (init) completes the remaining work
How reschedule_idle works
- first check whether the CPU that process p last ran on is idle; if it is, that is the best CPU, so return directly.
- otherwise, find a suitable CPU: look at the process running on each CPU in the SMP system and compute its preemption goodness relative to p; record the process with the highest preemption value in target_task, and the CPU it is running on is the most suitable CPU.
- if target_task is empty, no suitable CPU was found, so return directly.
- if target_task is not empty, a suitable CPU was found: set target_task->need_resched to 1, and if the CPU running target_task is not the current CPU, send an IPI to that CPU to make it reschedule.
Memory Management - Addressing
Memory Addresses
- three kinds of addresses in 80x86 microprocessors
- logical address: included in the machine language instructions
- linear address (virtual address): a single 32-bit unsigned integer that can be used to address up to 4GB
- physical address (32-bit unsigned integers): used to address memory cells in memory chips
- logical address => segmentation unit => linear address => paging unit => physical address
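As a small worked example of the paging step: with regular 4 KB pages on i386, the paging unit splits a 32-bit linear address into a 10-bit page directory index, a 10-bit page table index and a 12-bit offset (the address value below is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t linear = 0x08049f10;              /* hypothetical linear address */

    uint32_t dir    = (linear >> 22) & 0x3ff;  /* bits 31-22: page directory index */
    uint32_t table  = (linear >> 12) & 0x3ff;  /* bits 21-12: page table index */
    uint32_t offset = linear & 0xfff;          /* bits 11-0 : offset within the page */

    printf("dir=%u table=%u offset=0x%03x\n", dir, table, offset);
    return 0;
}
```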
Virtual File System (VFS)
Role of VFS
- a common interface to several kinds of file systems
- File systems supported by the VFS
- Disk-based filesystems
- Network filesystems
- special filesystems (e.g. /proc)
Common file system interface
- enables system calls such as open(), read() and write() to work regardless of the file system or storage media
Terminology
- file system: storage of data adhering to a specific structure
- file: ordered string of bytes
- directory: analogous to a folder
- special type of file
- instead of normal data, it contains pointers to other files
- directories are hooked together to create the hierarchical name space
- metadata: information describing a file
Physical file representation
- Inode:
- unique index
- holds file attributes and data block locations pertaining to a file
- Data blocks
- contain file data
- may not be physically contiguous
- File name:
- human-readable identifier for each file
The common file model
- defines common file model conceptual interfaces and data structures
- e.g. user space (write()) => VFS (sys_write()) => file system (the file system's write method) => physical media
- object-oriented: data structures and associated operations
- Superblock object: a mounted file system
- Inode object: information about a file, represents a specific file
- File object: interaction between an open file and a process
- Dentry object: directory entry, single component of a path name
VFS operations
- each object contains operations object with methods:
- super_operations: invoked on a specific file system
- inode_operations: invoked on a specific inode (which points to a file)
- dentry_operations: invoked on a specific directory entry
- file_operations: invoked on a file
- lower file system can implement own version of methods to be called by VFS
- if an operation is not defined by a lower file system (NULL), VFS will often call a generic version of the method
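A sketch of how a lower file system plugs its own methods into the common file model via a file_operations table (the myfs_* names are invented; fields left NULL fall back to VFS defaults or generic helpers where available):

```c
#include <linux/fs.h>
#include <linux/module.h>

/* hypothetical read method called by the VFS when read() reaches this fs */
static ssize_t myfs_read(struct file *file, char __user *buf,
                         size_t count, loff_t *ppos)
{
    /* ... copy file data into the user buffer, advance *ppos ... */
    return 0;
}

static const struct file_operations myfs_file_ops = {
    .owner  = THIS_MODULE,
    .read   = myfs_read,              /* file-system specific method */
    .llseek = generic_file_llseek,    /* reuse a generic VFS helper */
    /* .write left NULL: the VFS rejects write() on these files */
};
```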
VFS Data Structures
Superblock object
- implemented by each file system
- used to store information describing that specific file system
- often physically written at the beginning of the partition (file system control block)
- found in linux/fs.h
Inode object
- represent all the information needed to manipulate a file or directory
- constructed in memory, regardless of how file system stores metadata information
- three doubly linked lists
- List of valid unused inodes
- List of in-use inodes
- List of dirty inodes
Power Management - From Linux Kernel to Android
Linux Power Management
- Two popular power management standards in Linux
- APM (Advanced Power Management)
- Power management happens in two ways:
- function calls from APM driver to the BIOS requesting power state change
- automatically based on device activities
- Power states in APM:
- Full on
- APM enabled
- APM standby
- APM suspend
- Sleep/Hibernation
- Off
- Weakness of APM:
- each BIOS has its own policy, different computers have different activities
- unable to know why the system was suspended
- BIOS does not know user activities
- does not know add-on devices (USB)
- Power management happens in two ways:
- ACPI (Advanced Configuration and Power Interface)
- control divided between OS and BIOS, decisions managed through OS
- Enables sophisticated power policies for general-purpose computers with standard usage patterns and hardware
- APM (Advanced Power Management)
Android Power Management
- Built on top of Linux Power Management
- Designed for devices which have a “default-off” behavior
- the phone is not supposed to be on when we do not want to use it
- powered on only when requested to be run, off by default
- unlike PC, which has a “default-on” behavior
- Apps and services must request CPU resources with “wake locks” through the Android application framework (the PowerManager class) and native Linux libraries in order to keep power on; otherwise Android will shut down the CPU
- Android PM uses wake locks and a timeout mechanism to switch the state of system power, so that system power consumption decreases
- If there are no wake locks, CPU will be turned off
- If there are partial wake locks, screen and keyboard will be turned off
Android PM State Machine
- When a user application acquires a full wake lock, or a screen/keyboard touch activity event occurs, the machine will enter the ‘AWAKE’ state
- If timeout happens or power key is pressed, the machine will enter “notification” state:
- if a partial wake lock is acquired, the machine will remain in the “notification” state
- if all partial wake locks are released, the machine will enter “sleep” state
- There is one main wake lock called ‘main’ in the kernel to keep the kernel awake
- It is the last wake lock to be released when system goes to suspend
Acquiring wake locks
- request sent to PowerManager to acquire a wake lock
- PowerManagerService (an Android system service) takes a wake lock
- Add wake lock to the list
- set the power state:
- for a full wake lock, state should be set to “ON”
- For taking Partial wake lock, if it is the first partial wake lock, a kernel wake lock is taken. This will protect all the partial wake locks. For subsequent requests, kernel wake lock is not taken, but just added to the list
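A user-space sketch of holding a named wake lock, assuming a kernel that exposes the wakelock sysfs interface (/sys/power/wake_lock and /sys/power/wake_unlock); Android applications would normally go through the PowerManager framework class instead:

```c
#include <stdio.h>

static int write_str(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fputs(value, f);
    return fclose(f);
}

int main(void)
{
    write_str("/sys/power/wake_lock", "my_app_lock");   /* keep the CPU on */
    /* ... do work that must not be interrupted by suspend ... */
    write_str("/sys/power/wake_unlock", "my_app_lock"); /* allow suspend again */
    return 0;
}
```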
System Sleep
- early_suspend:
- Used by drivers that need to handle power mode settings to the device before kernel is suspended
- Used to turn off screen and non-wakeup source input devices
- E.g., consider a display driver
- In early suspend, the screen can be turned off
- In the suspend, other things like closing the driver can be done
- late_resume:
- When the system is resumed, resume is called first, followed by late_resume
Battery Service
- The BatteryService monitors the battery status, level, temperature etc.
- A Battery Driver in the kernel interacts with the physical battery via an ADC (to read the battery voltage) and I2C
- Whenever BatteryService receives information from the BatteryDriver, it will act accordingly
- E.g. if battery level is low, it will ask system to shutdown
- Battery Service will monitor the battery status based on received uevent from the kernel
- uevent: an asynchronous communication channel from the kernel to user space