Operating Systems Notes
Basics
Processors vs threads
less resources associated to them, therefore cheaper, can share resources
threads are often used for I/O overlapping
threads can still utilise multiprocessors
user level threads are handled by user applications (processor) - OS does less book-keeping, security
Address space
PC
Global variables
Registers
Open files
Stack
Child processors
State
Pending alarms
Signals & signal handlers
Accounting information
Concurrency and synchronization
Remember, the scheduler's job is to switch between processors/threads - this is usually done with timer interrupts, hence why synchronization often turns interrupts on/off.
Test-and-Set
Basically:
load lock value
if
lock == 0
:set
lock = 1
return 0
- we acquire the lock
if
lock == 1
:return 1
- another processor has the lock
hardware guarantees atomic execution
enter_region:
TSL REGISTER, LOCK # copy lock to register, set lock to 1
# if lock == 0, else leave alone
CMP REGISTER, $0 # compare if register is == 0
JNE enter_region # if register == 1, loop
RET 0 # return to caller; critical region entered
leave_region:
MOVE LOCK, $0 # set lock to 0
RET
Issues:
prone to starvation - one threads always checking, other always continues
consumes CPU cycles - spin lock
poor for priority threads/processors - low priority thread may always have the lock
Semaphores
Dijkstra 1965 legend
P()
: wait/acquireV()
: signal/releaseprimitives are atomic
simple and awesome but if you program them incorrectly (e.g. without matching
wait()
,signal()
), good luck debugging
typedef struct {
int count;
struct process *L; // list of sleeping processors
} semaphore;
// P(S)
wait(S):
S.count--;
if (S.count < 0) {
// add this process to S.L
sleep();
}
// V(S)
signal(S):
S.count++;
if (S.count <= 0) {
// remove a process P from S.L
wakeup(P);
}
Producer-consumer problem
producer() {
while (true) {
item = produce()
if (count == N)
sleep();
insert_item(); // into some shared data structure
count++; // shared var
if (count == 1)
wakeup(consumer);
}
}
consumer() {
while (true) {
if (count == 0)
sleep();
remove_item(); // remove item from shared data structure
count--; // shared var
if (count == N-1)
wakeup(producer);
}
}
Issues:
count
has race conditionsitem buffer isn't synchronized at all
dead locks
// the deadlock scenario
// assume count == 0
if (count == 0) // consumer
// scheduler context switches
item = produce() // producer
if (count == N) sleep();
...
// repeat until above condition is true
// producer sleeps
// scheduler context switches
sleep(); // consumer
// both threads are now sleeping
// no progress is made since count == 0 was checked
// before the context switch!
Solution
we use a semaphore to control the sleep/wake behavior for the size of the item buffer
we use a mutex semaphore (basically a lock) for synchronization of the item buffer
empty = N; // number of empty slots
full = 0; // number of filled slots
mutex = 1; // allow one thread access only
producer() {
while (true) {
item = produce();
wait(empty); // check if buffer is full - sleep
// critical region is entered
wait(mutex); // obtain lock
insert_item(); // into some shared data structure
signal(mutex); // release lock
signal(full); // full++ && wake consumer if full <= 0
}
}
consumer() {
while (true) {
wait(full); // full-- && sleep if full < 0
wait(mutex);
remove_item(); // remove item from shared data structure
signal(mutex);
signal(empty); // empty++ && wake producer if empty <= 0
}
}
File systems
I'm going to over-simplify this because there's just too much to write about.
The OS manages from FD table - device driver on a typical storage stack.
Application
Syscall interface: create, open, read, write etc.
FD table OF table
File (open) descriptor table - keeps track of files opened by user-level processes. Implements semantics of FS syscalls.
VFS
Virtual FS - unified interface to multiple file systems.
FS
Physical location of data on the disk
Directory hierarchy, symbolic file names, random-access files, protection.
Buffer cache Disk scheduler
Keeps recently access disk blocks in memory. Schedule disk accesses from multiple processes for performance and fairness. This all occurs in memory.
Device driver
Hides device specific protocol e.g. USB3
Exposes block-device interface (linear sequence of blocks)
Disk controller
Hides disk geometry, bad sectors
Exposes linear sequence of blocks
Physical disk
Hard disk platters: tracks, sectors
File structure
Three types:
byte sequence (most popular for OS)
record sequence
key-based, tree structure
Byte stream
OS considers it to be unstructured
simplifies file management for OS
applications can impose their own structure
Records
collection of bytes treated as a unit
operations at the level of records e.g.
read_rec
file is a collection of similar records
OS can optimise operations on records
Tree of records
records of variable length
key association and retrieval
databases or some data processing systems
Implementation
A file system must do the following:
map symbolic names to block addresses
track which blocks belong to which files
block order in file
blocks that are free for allocation
store this in some metadata
Allocation strategies
Contiguous allocation
(+) easy bookkeeping (track start block + size)
(+) increases performance for sequential operations
(-) need maximum file size at time of creation
(-) external fragmentation - as files are deleted, disk blocks become fragmented/sparse
Dynamic allocation
(+) allocates space on demand
(+) allocation in fixed-sized blocks
(+) no external fragmentation
(+) doesn't require pre-allocation of file size
(-) internal fragmentation - blocks internally can be sparsely filled
(-) file blocks are scattered across disk - can become semi-random
(-) metadata management becomes complicated - many strategies
Linked list allocation
link blocks together starting with the head
(+) good for sequential access
(-) poor random access
(-) blocks end up scattered across the disk due to freeing lists - block corruption/loss
File allocation table
map the entire FS in a separate table - like a page table
table is copied into memory
random access is fast (table is loaded into memory first)
(-) this starts to use huge amounts of space
(-) freeing a block is slow
e.g. FAT32
i-node based allocation
separate table for each file (i-node)
only keep table for open files in memory
fast random access
i-nodes are stored usually at the start of the disk
fixed size
last i-node entry points to an extension i-node
e.g. ext2, ext3
Terms
Asymmetric Multiprocessor Programming (AMP)
Where tasks are delegated to processors at design time.
All CPUs have the whole program instruction memory (rescheduling possible).
They don't have to be balanced - one CPU might only run OS code, other might run I/O code etc.
Symmetric Multiprocessor Programming (SMP)
A single OS manages all CPUs and a separate processor runs per CPU.
This allows parallel execution of programs.
All CPUs access a shared memory - this may be using shared memory buses and crossbar switches - limited scalability.
Other memory bus alternatives are on-chip mesh networks - this has significant complexities of communication as well.
Shared memory access is serialized, you have cache coherency challenges as well.
Initialization routines
Constructors - initialization routines - functions to initialize data in the program when the program is started ie. before main()
called in reverse order
Destructors - termination routines - that are called when the program terminates
called in forward order
The linker must generate a list of these constructors/destructors
__CTOR_LIST__
and the exec file format traverses these lists at startup/exit.Sections
.ctors
and.dtors
and usually set aside for these listsOn systems that support .init, parts of crtstuff.c are compiled into that section. The program is linked by the gcc driver by:
ld -o output_file crti.o crtegin.o ... -lgcc crtend.o crtn.o
The prologue of a function
__init
appears in the.init
section ofcrti.o
The epilogue
__fini
appears in crtn.oThe objects
crtbegin.o
andcrtend.o
are (for most targets) compiled fromcrtstuff.c.
They contain, among other things, code fragments within the.init
and.fini
sections that branch to routines in the.text
section. The linker will pull all parts of a section together, which results in a complete__init
function that invokes the routines we need at startup.If no
.init
section is available, gcc compiles any function called main, it inserts a procedure call to __main as the first executable code after the function prologue. The__main
function is designed inlibhcc2.c
and runs the global constructors.The last method doesn't use arbitrary sections nor the GNU linker. This is useful for dynamic linking and when using file formats the linker doesn't support eg. ECOFF.
The initialization and termination functions are recognized by their names.
An extra program called
collect2
which pretends to be the linker (as the GNU linked is not called) ie. it does the same as the original linker but also arranges to include the vectors of initialization and terminations functions.These functions are called via
__main
.use_collect2
needs to be defined in the target inconfig.gcc
ctors dtors
Both crt0/crt1
do the same thing, basically do what is needed before calling main() (like initializing stack, setting IRQs, etc.). You should link with one or the other but not both. They are not really libraries but really inline assembly code.
crt1.o
is used on system that support constructors and destructors (functions called before and after main and exit). In this case main is treated like a normal function call.crt0.o
is used on systems that does not support constructors/destructors.
.init_array
Why did
.init_array
turn up?We added
.init_array/.fini_array
in order to blend the SVR4 version of .init, which contained actual code, with the HP-UX version, which contained function pointers and used aDT_INIT_SZ
entry in the dynamic array rather than prologue and epilogue pieces contributed fromcrt*.o
files. The HP-UX version was seen as an improvement, but it wasn't compatible, so we renamed the sections and the dynamic table entries so that the two versions could live side-by-side and implementations could transition slowly from one to the other.On HP-UX, we used
.init/.init_array
for static constructors, and they registered the corresponding static destructors on a special atexit list, rather than adding destructors to.fini_array
, so that we could handle destructors ondlclose()
events properly (subject to your interpretation of "properly" in that context) The order of execution differs between.ctors
and.init_array
Backwarding order of
.ctors
sectionSome programs may implicitly rely on the fact that global constructors in archives linked later are run before constructors in the object linked against those archives. That is, given
g++ foo.o -lbar
where bar is a static archive, not a shared library, then currently the global constructors in objects pulled in fromlibbar.c
will be executed before the global constructors infoo.o
. That was an intentional choice because it is more likely to be correct than the reverse. However, the C++ standard does not guarantee it, so any programs which relies on this ordering is technically invalid. Problem of backwarding order of.ctors
A lot of work was done in both GNU ld and gold to move constructors from .ctors
to .init_array
, all to improve startup latency for Firefox
Using .init_array/.fini_array
instead of .ctors/.dtors
removes the need for:
the associated (relative) relocations
avoids the backwards disk seeks on startup (since while .ctors are processed backwards,
.init_array
is processed forward).
Transition from .ctors to .init_array
The mainline versions of both GNU ld
and gold
now put .ctors
sections into .init_array
sections, and put .dtors
sections into .fini_array
sections.
ARM
ARM EABI has been using .init_array
from day one. Comment: Nevertheless the default linker script contains a .ctors output section.
GCC configuration
One option you have is to configure gcc with --disable-initfini-array
.
PIC: Position Independent Code
Is a body of machine code that, being placed somewhere in the primary memory, executes properly regardless of its absolute address.
PIC is commonly used for shared libraries, so that the same library code can be loaded in a location in each program address space where it will not overlap any other uses of memory (for example, other shared libraries).
PIC was also used on older computer systems lacking an MMU, so that the operating system could keep applications away from each other even within the single address space of an MMU-less system.
Position-independent code can be executed at any memory address without modification.
This differs from relocatable code, in which a link editor or program loader modifies a program before execution so it can be run only from a particular memory location.
Generating position-independent code is often the default behavior for compilers, but they may place restrictions on the use of some language features, such as disallowing use of absolute addresses (PIC has to use relative addressing).
Instructions that refer directly to specific memory addresses sometimes execute faster, and replacing them with equivalent relative-addressing instructions may result in slightly slower execution, although modern processors make the difference practically negligible.
More in-depth
Procedure calls inside a shared library are typically made through small procedure linkage table stubs, which then call the definitive function. This notably allows a shared library to inherit certain function calls from previously loaded libraries rather than using its own versions.
Data references from position-independent code are usually made indirectly, through global offset tables (GOTs), which store the addresses of all accessed global variables. There is one GOT per compilation unit or object module, and it is located at a fixed offset from the code (although this offset is not known until the library is linked). When a linker links modules to create a shared library, it merges the GOTs and sets the final offsets in code. It is not necessary to adjust the offsets when loading the shared library later.
Position independent functions accessing global data start by determining the absolute address of the GOT given their own current program counter value. This often takes the form of a fake function call in order to obtain the return value on stack (x86) or in a special register (PowerPC, SPARC, MIPS etc.), which can then be stored in a predefined standard register. Some processor architectures, such as the Motorola 68000, ARM and x86-64 allow referencing data by offset from the program counter. This is specifically targeted at making position-independent code smaller, less register demanding and hence more efficient.
GOT: Global Offset Table
To be populated.
Static, shared, dynamic libraries
.so
files are dynamic libraries. The suffix stands for "shared object", because all the applications that are linked with the library use the same file, rather than making a copy in the resulting executable.
.a
files are static libraries. The suffix stands for "archive", because they're actually just an archive (made with the ar command -- a predecessor of tar that's now just used for making libraries) of the original .o object files.
.la
files are static libraries used by the GNU "libtools" package. You can find more information about them in this question: What is libtool's .la file for?
ABI - Application Binary Interface
Is the interface between two program modules, one of which is often a library or operating system, at the level of machine code.
An ABI determines such details as how functions are called and in which binary format information should be passed from one program component to the next, or to the OS eg. system call.
Adhering to ABIs (which may or may not be officially standardized) is usually the job of the compiler, OS or library writer, but application programmers may have to deal with ABIs directly when writing programs in a mix of programming languages, using foreign function call interfaces between them.
ABIs differ from application programming interfaces (APIs), which similarly define interfaces between program components, but at the source code level.
Responsible for:
the sizes, layout, and alignment of data types
the calling convention, which controls how functions' arguments are passed and return values retrieved;
e.g. whether all parameters are passed on the stack or some are passed in registers, which registers are used for which function parameters
whether the first function parameter passed on the stack is pushed first or last onto the stack
how an application should make system calls to the operating system and, if the ABI specifies direct system calls rather than procedure calls to system call stubs, the system call numbers
in the case of a complete operating system ABI, the binary format of object files, program libraries and so on.
ELF
Consists of these main sections:
ELF Header
Specifics the arch, machine, endianness, 32/64 bit
Specifies where the SHT, PHT are located
Program Header Table
The program header table tells the system how to create a process image.
It is found at file offset
e_phoff
, and consists ofe_phnum
entries, each with sizee_phentsize
All the sections -
.text
,.data
,.bss
,.ctors
etc.The ELF binary contains all the necessary data to execute
Libraries can be compiled within the ELF with the
-static
flag.bss
is not actually included in the ELFIt only specifies the size and assumes the space at the address is initialized to zero
If you check the flags, there is no
ALLOC
Section Header Table
This lists all the different sections and where they are found
Also lists the virtual addresses they are mapped to, type of section eg.
ALLOC
Last updated