Gaining insight into the Linux kernel with Kprobes

by William Cohen

Introduction

Many times kernel developers have resorted to using the "diagnostic print statements" approach to understand what is occurring in the Linux kernel. This technique can be painful because a new kernel must be built and installed on the machine. The machine must then be rebooted with the new kernel. Each new experiment requires another reboot of the machine, which could take minutes on some machines.

Developers have found the ability to inspect the operation of unmodified executables to be very useful. In the case of userspace applications, developers can use debuggers to set breakpoints at specific locations in the unmodified executable. When the processor encounters a breakpoint, the developer uses the debugger to inspect program state to gain insight into how the program is operating (or failing). This method of examining program operation has several advantages over the traditional technique of compiling "diagnostic print statements" into the program:

The developer does not change the source code of the original program.
The developer avoids unintended changes caused by rebuilding the executable.
The developer can avoid the expense of recompiling the program and restarting the program each time something else is examined. In some cases it may not be feasible for the developer to rebuild the application.

Due to interrupt handling it is not feasible to completely stop the Linux kernel and wait for the developer to type in commands. However, it is possible to place snippets of instrumentation code in the kernel to collect information at specific locations, for example to determine whether a specific function is being executed and to examine the state of variables. The recent 2.6 Linux kernels, including the x86 kernel in the upcoming Fedora Core 4, have support to allow developers to gather information about the Linux kernel's operation without compiling or booting a new kernel. This is implemented with Kprobes, a dynamic instrumentation system. This article describes how Kprobes operate and provides kernel instrumentation examples.

Kprobes

Kprobes is a dynamic instrumentation system in the mainline 2.6 Linux kernel and will be enabled in the soon-to-be-released x86 Fedora Core 4 kernels. Kprobes allows one to gather additional information about kernel operation without recompiling or rebooting a kernel. Kprobes enables locations in the kernel to be instrumented with code, and the instrumentation code runs when the processor encounters that probe point. Once the instrumentation code completes execution, the kernel continues normal execution.

The Kprobes instrumentation is built as a kernel module. Thus, rather than having to recompile and reboot the system with an instrumented kernel, a kprobe instrumentation module can be written, compiled, and loaded on the system. There is no need to reboot the system. Once the instrumentation module has served its purpose, it can be unloaded, and the kernel returned to its normal operation.

There are two types of kernel probes available: kprobes and jprobes. A kprobe inserts a probe at a specific instruction. Because the instrumentation provided by a kprobe could be inserted anywhere in a function, the kprobe code cannot make assumptions about local variables or arguments passed into the function being probed. A jprobe instruments the entry of a function and allows the probe to examine the arguments passed into the probed function.

The kprobe support in the kernel provides simple data structures and a set of functions to allow the insertion and removal of kernel probes. A data structure is filled out and registered with a call to either the register_kprobe or register_jprobe function. The data structure passed to the register function must remain allocated until the kernel probe is unregistered with either a matching unregister_kprobe or unregister_jprobe. Table 1, Kernel probes management functions, summarizes the functions used to register and unregister the probes. The register functions return zero if the operation was successful and a negative value if the operation was unsuccessful.

int register_kprobe(struct kprobe *p);

int register_jprobe(struct jprobe *p);

void unregister_kprobe(struct kprobe *p);

void unregister_jprobe(struct jprobe *p);

Table 1. Kernel probes management functions

Listing 1, kprobe data structure shows the fields of struct kprobe. The addr field is the linear address of the instruction being probed. The developer needs to determine the appropriate address for addr. In the examples in this article the address is that of an exported function, so it can be written directly in the code. In other cases you may have to examine the System.map file or the disassembled kernel code to find the appropriate value for the address. The pre_handler field is a pointer to the function run before the execution of the probed instruction. The post_handler field is a pointer to the function executed following the execution of the instruction. The fault_handler field is a pointer to the function to run if there is a fault during the execution of the probe code.

struct kprobe {
        /* elided fields for internal state information */

        kprobe_opcode_t *addr;
        kprobe_pre_handler_t pre_handler;
        kprobe_post_handler_t post_handler;
        kprobe_fault_handler_t fault_handler;

        /* elided fields for internal state information */
};

Listing 1. kprobe data structure

The jprobe is built on top of the basic kprobe. Jprobes simplify the instrumentation of function entries and allow one to inspect the arguments passed to the function. The struct jprobe contains a struct kprobe for the kprobe information related to the jprobe. Two pieces of information need to be entered into the struct jprobe: the entry field, which points to an instrumentation function that has the same argument list as the instrumented function, and the addr field in kp. The other fields in the struct kprobe are filled out when the jprobe is registered.

struct jprobe {
        struct kprobe kp;
        kprobe_opcode_t *entry; /* probe handling code to jump to */
};

Listing 2. jprobe data structure

The execution of a kprobe has similarities to the execution of a breakpoint set by a debugger. The instruction at the kernel probe location is saved in a buffer, and the instruction at that location is replaced by a breakpoint instruction. When the processor encounters the breakpoint, the trap handler is invoked. A check is made to determine whether there is a kprobe registered at this location. If there is no probe registered for that location, the breakpoint is passed on to the normal handler. If a probe is found, the pre_handler function is executed, the probed instruction is executed, then the post_handler function is executed. Execution then resumes at the instruction following the probed instruction.
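
The sequence described above can be modeled in ordinary userspace C. The sketch below is not kernel code and does not use the real Kprobes API. Every name in it (model_probe, model_register_probe, model_breakpoint_hit, and the counting handlers) is hypothetical, a plain function call stands in for the saved instruction, and the negative error values are arbitrary. It only illustrates the order of events: look up the probe for the address, run pre_handler, run the original instruction, run post_handler, or pass the breakpoint on if no probe is registered.

```c
/* Userspace model of the kprobe execution flow. NOT kernel code:
 * all names here are hypothetical stand-ins for illustration. */
#include <assert.h>
#include <stddef.h>

#define MAX_PROBES 8

struct model_probe;
typedef int  (*model_pre_handler_t)(struct model_probe *p);
typedef void (*model_post_handler_t)(struct model_probe *p);

/* Simplified stand-in for struct kprobe. */
struct model_probe {
        unsigned long addr;                /* location being probed */
        model_pre_handler_t pre_handler;   /* runs before the probed instruction */
        model_post_handler_t post_handler; /* runs after it; may be NULL */
};

/* The table stores pointers, so a registered structure must stay
 * allocated until it is unregistered. */
static struct model_probe *probe_table[MAX_PROBES];

/* Returns zero on success and a negative value on failure. */
int model_register_probe(struct model_probe *p)
{
        size_t i;

        assert(p != NULL);
        for (i = 0; i < MAX_PROBES; i++)
                if (probe_table[i] != NULL && probe_table[i]->addr == p->addr)
                        return -1;      /* address already probed (model's rule) */
        for (i = 0; i < MAX_PROBES; i++) {
                if (probe_table[i] == NULL) {
                        probe_table[i] = p;
                        return 0;
                }
        }
        return -2;                      /* table full */
}

void model_unregister_probe(struct model_probe *p)
{
        size_t i;

        for (i = 0; i < MAX_PROBES; i++)
                if (probe_table[i] == p)
                        probe_table[i] = NULL;
}

/* Models the trap handler. Returns 1 if a registered probe handled the
 * breakpoint, or 0 if it would be passed on to the normal handler. */
int model_breakpoint_hit(unsigned long addr, void (*original_insn)(void))
{
        size_t i;

        for (i = 0; i < MAX_PROBES; i++) {
                struct model_probe *p = probe_table[i];

                if (p == NULL || p->addr != addr)
                        continue;
                if (p->pre_handler)
                        p->pre_handler(p);
                original_insn();        /* the saved instruction executes */
                if (p->post_handler)
                        p->post_handler(p);
                return 1;
        }
        return 0;
}

/* Example handlers that only count how often each stage runs. */
static int pre_calls, insn_runs, post_calls;

static int count_pre(struct model_probe *p)   { (void)p; pre_calls++; return 0; }
static void count_post(struct model_probe *p) { (void)p; post_calls++; }
static void fake_insn(void)                   { insn_runs++; }
```

For example, after registering a model_probe whose addr is 0x1000 with count_pre and count_post as handlers, calling model_breakpoint_hit(0x1000, fake_insn) runs count_pre, then fake_insn, then count_post, and returns 1. The same call for an address with no registered probe returns 0 and runs none of them.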

Examples

This article contains two examples: one using a kprobe and the other using a jprobe. Almost all block device I/O goes through the function generic_make_request, so it is useful to instrument generic_make_request to observe its operation. Both examples instrument the generic_make_request function.

You need to have the kernel-devel RPM matching the running kernel installed to build these examples. Listing 3, Makefile shows the simple makefile used to build the instrumentation modules after the kernel-devel RPM has been installed. There are two source files in the directory: kprobebio.c and jprobebio.c. In conjunction with the makefile supplied by kernel-devel, this makefile creates kprobebio.ko and jprobebio.ko, the kernel modules.

With the kernel-devel RPM matching the running kernel installed, you can create the modules with the following command:

make -C /lib/modules/`uname -r`/build M=`pwd` modules

Kprobe example

The kprobe example in Listing 4, kprobebio.c demonstrates how to count the number of times the generic_make_request function is called. Since the kprobe instrumentation is built as a module, the instrumentation is inserted when the module is loaded. When the instrumentation is removed, the results are written to /var/log/messages by a printk in this example. Other means of extracting the data are possible.

The include for linux/kprobes.h contains the needed data structures for kprobes and jprobes. The include for linux/blkdev.h declares the function generic_make_request, which is needed to put the probe in the correct location.

The inst_generic_make_request function is the instrumentation function that is called each time the generic_make_request function is called. Normally, as in this case, the instrumentation function returns a value of 0 to indicate that the instrumented instruction should be handled normally.

The function init_module sets up the kprobe data structure and starts the instrumentation. Only one instrumentation function is needed, and it runs before the probed instruction executes: pre_handler. Thus, the post_handler and fault_handler are set to NULL. The address of the instrumented function is set in kp.addr. The data structure is registered via register_kprobe. Once register_kprobe returns, the instrumentation is operating and counting the number of times that generic_make_request is called. The cleanup_module function unregisters the probe and then writes the data to /var/log/messages via a printk.

The instrumentation is started as root with the following command:

/sbin/insmod kprobebio.ko

The instrumentation is shut down as root with the following command:

/sbin/rmmod kprobebio

When the module is unloaded, the data is written to /var/log/messages. Listing 5, Output of kprobebio module in /var/log/messages shows the output from this particular example.

Feb 23 12:09:20 slingshot kernel: kprobe registered
Feb 23 12:09:31 slingshot kernel: kprobe unregistered
Feb 23 12:09:31 slingshot kernel: generic_make_request() called 52 times.

Listing 5. Output of kprobebio module in /var/log/messages

Jprobe example

Another useful mechanism provided by Kprobes support is Jprobes. Jprobes allow instrumentation of the function entry and access to the arguments passed into the instrumented function. Listing 6, jprobebio.c shows the code to generate instrumentation that counts the number of times that generic_make_request is called. The example in Listing 6, jprobebio.c also accumulates the number of sectors moved in the requests and keeps track of the calls and sectors on a per-device basis.

The linux/bio.h header is included to describe the data structure used by generic_make_request. This is required because the instrumentation function inst_generic_make_request now has the same arguments as the original generic_make_request function. These arguments can be accessed inside the instrumentation function. For this example the bio pointer is examined to determine the device for which the request is being made and the number of sectors being transferred. A simple hash table is implemented to separate the data for the different devices.
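
The per-device bookkeeping can be illustrated in userspace C. The following is not the code from Listing 6, only a minimal sketch of the idea under stated assumptions: devices are keyed by major and minor number rather than by the bdev pointer, the table has a fixed size and uses linear probing, and all names (dev_stats, lookup_dev, record_request) are hypothetical. It also ignores locking, which instrumentation running inside the kernel may have to consider.

```c
/* Userspace sketch of per-device request accounting. NOT the article's
 * jprobebio.c: names, table size, and hash function are assumptions. */
#include <stddef.h>

#define TABLE_SIZE 16

struct dev_stats {
        int in_use;               /* nonzero once a device owns this slot */
        unsigned int major, minor;
        unsigned long calls;      /* number of requests seen */
        unsigned long sectors;    /* total sectors in those requests */
};

static struct dev_stats table[TABLE_SIZE];

static size_t hash_dev(unsigned int major, unsigned int minor)
{
        return (major * 31u + minor) % TABLE_SIZE;
}

/* Find the slot for a device, claiming an empty slot on first use.
 * Uses linear probing and returns NULL if the table is full. */
struct dev_stats *lookup_dev(unsigned int major, unsigned int minor)
{
        size_t start = hash_dev(major, minor);
        size_t i;

        for (i = 0; i < TABLE_SIZE; i++) {
                struct dev_stats *s = &table[(start + i) % TABLE_SIZE];

                if (!s->in_use) {
                        s->in_use = 1;
                        s->major = major;
                        s->minor = minor;
                        s->calls = 0;
                        s->sectors = 0;
                        return s;
                }
                if (s->major == major && s->minor == minor)
                        return s;
        }
        return NULL;
}

/* The work an instrumentation function would do for each request.
 * Returns 0 on success, or -1 if no slot is available. */
int record_request(unsigned int major, unsigned int minor,
                   unsigned long nr_sectors)
{
        struct dev_stats *s = lookup_dev(major, minor);

        if (s == NULL)
                return -1;
        s->calls++;
        s->sectors += nr_sectors;
        return 0;
}
```

With this sketch, 26 calls of record_request(3, 5, 8) leave the entry for device (3,5) at 26 calls and 208 sectors. A reporting step would then walk the table and print each slot that is in use.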

Another significant difference between kprobes and jprobes is how the instrumentation function is exited. A jprobe instrumentation function must end with an explicit call to jprobe_return rather than the return 0 statement used in a kprobe handler.

A jprobe uses a struct jprobe to describe the instrumentation point. In this example the entry field is made to point to the inst_generic_make_request function. In init_module the addr field of the embedded struct kprobe, kp, is initialized to point at the function being instrumented, generic_make_request. The other fields in kp are set up appropriately for the jprobe when the register_jprobe function is called.

When the module is removed from the kernel, cleanup_module is executed. This unregisters the probe and prints out the recorded data in much the same way as the earlier kprobe example. Like the kprobebio example, the jprobebio module starts its instrumentation when it is loaded into the kernel with an insmod command and writes out the data when the module is removed with an rmmod command. Listing 7, Output of the jprobebio module in /var/log shows the output of the module in /var/log/messages.

Feb 23 13:55:01 slingshot kernel: plant jprobe at c024f900, handler addr e09e4000
Feb 23 13:55:02 slingshot crond(pam_unix)[5969]: session closed for user root
Feb 23 13:55:21 slingshot kernel: jprobe unregistered
Feb 23 13:55:21 slingshot kernel: generic_make_request() called 119 times for 952 sectors.
Feb 23 13:55:21 slingshot kernel: bdev 0xcb199da8 (3,5) 26 208 sectors.
Feb 23 13:55:21 slingshot kernel: bdev 0xdf00eda8 (3,2) 93 744 sectors.

Listing 7. Output of the jprobebio module in /var/log

The future

The examples in this article show how to write simple instrumentation using the Kprobes support in the Fedora Core 4 kernels. However, one might notice that the instrumentation is written in raw C code, and it is quite possible to crash the machine if the instrumentation code has a flaw in it. The Kprobes mechanism is also a very low-level interface that simply places individual probes where directed. There is no predefined library that selects groups of probe points to measure things that a regular user might be interested in. Thus, Kprobes currently requires a good understanding of the kernel, both to know which locations to instrument and to analyze the collected data to produce a meaningful result.

An effort has started to address these deficiencies in the current kprobe instrumentation: SystemTap. SystemTap will provide a safer language for writing the instrumentation and a library of useful instrumentation.

About the author

William Cohen is a performance tools engineer at Red Hat, Inc. Will received his BS in electrical engineering from the University of Kansas. He earned an MSEE and a PhD from Purdue University. In his spare time he bicycles and takes pictures with his digital cameras.
