http://paulmck.livejournal.com/7314.html
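Paul, the author of RCU, discusses this problem on his blog. He states explicitly that a module's exit path must call rcu_barrier() to wait until every callback registered through call_rcu() has finished running, and only then may the module actually be unloaded. This prevents a callback from running after a quick unload and hitting an invalid pointer, which would cause a kernel panic.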
The rcu_barrier() function was described some time back in an article on Linux Weekly News. This rcu_barrier() function solves the problem where a given module invokes call_rcu() using a function in that module, but the module is removed before the corresponding grace period elapses, or at least before the callback can be invoked. This results in an attempt to call a function whose code has been removed from the Linux kernel. Oops!!!
Since the above article was written, rcu_barrier_bh() and rcu_barrier_sched() have been accepted into the Linux kernel, for use with call_rcu_bh() and call_rcu_sched(), respectively. These functions have seen relatively little use, which is no surprise, given that they are quite specialized. However, Jesper Dangaard recently discovered that they need to be used a bit more heavily. This led to the question of exactly when they needed to be used, to which I responded as follows:
Unless there is some other mechanism to ensure that all the RCU callbacks have been invoked before the module exit, there needs to be code in the module-exit function that does the following:
- Prevents any new RCU callbacks from being posted. In other words, make sure that no future call_rcu() invocations happen from this module unless those call_rcu() invocations touch only functions and data that outlive this module.
- Invokes rcu_barrier().
- Of course, if the module uses call_rcu_sched() instead of call_rcu(), then it should invoke rcu_barrier_sched() instead of rcu_barrier(). Similarly, if it uses call_rcu_bh() instead of call_rcu(), then it should invoke rcu_barrier_bh() instead of rcu_barrier(). If the module uses more than one of call_rcu(), call_rcu_sched(), and call_rcu_bh(), then it must invoke more than one of rcu_barrier(), rcu_barrier_sched(), and rcu_barrier_bh().
What other mechanism could be used? I cannot think of one that is safe. For example, a module that tried to count the number of RCU callbacks in flight would be vulnerable to races as follows:
- CPU 0: RCU callback decrements the counter.
- CPU 1: module-exit function notices that the counter is zero, so removes the module.
- CPU 0: attempts to execute the code returning from the RCU callback, and dies horribly due to that code no longer being in memory.
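The three-step race above can be replayed deterministically in a small user-space C model. This is only an illustrative sketch, not kernel code: the struct race_model type and the cpu0_* and cpu1_* helpers are hypothetical names, and "executing unloaded code" is modeled by setting a flag rather than by crashing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names exist in the kernel. */
struct race_model {
    int  in_flight;              /* module's own count of posted callbacks */
    bool module_loaded;          /* is the module's code still in memory?  */
    bool executed_unloaded_code; /* set where a real kernel would oops     */
};

/* CPU 0, first part of the RCU callback: decrement the counter. */
static void cpu0_callback_decrement(struct race_model *m)
{
    m->in_flight--;
}

/* CPU 1, module exit relying only on the counter (the unsafe scheme). */
static bool cpu1_exit_if_counter_zero(struct race_model *m)
{
    if (m->in_flight != 0)
        return false;            /* callbacks still counted, keep waiting */
    m->module_loaded = false;    /* counter is zero, so remove the module */
    return true;
}

/*
 * CPU 0, remainder of the callback: the instructions that return from
 * the callback still live in the module's code.
 */
static void cpu0_callback_return(struct race_model *m)
{
    if (!m->module_loaded)
        m->executed_unloaded_code = true;
}
```

Calling cpu0_callback_decrement(), cpu1_exit_if_counter_zero(), and cpu0_callback_return() in that order on a model that starts with in_flight equal to 1 leaves executed_unloaded_code set. The counter reaches zero before the callback has finished running, which is exactly the window the counting scheme cannot close.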
If there was an easy solution (or even a hard solution) to this problem, then I do not believe that Nikita Danilov would have asked Dipankar Sarma for rcu_barrier(). Therefore, I do not expect anyone to be able to come up with an alternative to rcu_barrier() and friends. Always happy to learn something by being proven wrong, of course!!!
So unless someone can show me some other safe mechanism, every unloadable module that uses call_rcu(), call_rcu_sched(), or call_rcu_bh() must use rcu_barrier(), rcu_barrier_sched(), and/or rcu_barrier_bh() in its module-exit function.
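The required module-exit ordering (first stop posting callbacks, then wait for the ones already posted) can be sketched as a toy user-space model in C. This is not kernel code. model_call_rcu() and model_rcu_barrier() are hypothetical stand-ins for call_rcu() and rcu_barrier(), and the stand-in barrier simply runs the queued callbacks itself, whereas the real rcu_barrier() waits for previously posted callbacks to be invoked. All other names are likewise assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct rcu_head. */
struct model_rcu_head {
    struct model_rcu_head *next;
    void (*func)(struct model_rcu_head *head);
};

static struct model_rcu_head *pending_head;            /* queued callbacks */
static struct model_rcu_head **pending_tail = &pending_head;

static bool module_text_present = true;  /* module code still in memory?  */
static bool module_stopping;             /* step 1 flag: no new callbacks */
static int callbacks_run;
static int callbacks_run_after_unload;   /* a real kernel would oops here */

/* Stand-in for call_rcu(): queue a callback for later invocation. */
static void model_call_rcu(struct model_rcu_head *head,
                           void (*func)(struct model_rcu_head *head))
{
    head->func = func;
    head->next = NULL;
    *pending_tail = head;
    pending_tail = &head->next;
}

/* Stand-in for rcu_barrier(): return only once every queued callback ran. */
static void model_rcu_barrier(void)
{
    while (pending_head) {
        struct model_rcu_head *head = pending_head;

        /* Unlink before invoking, because the callback frees the node. */
        pending_head = head->next;
        if (!pending_head)
            pending_tail = &pending_head;
        head->func(head);
    }
}

/* The "module": an item type whose callback lives in module code. */
struct item {
    int value;
    struct model_rcu_head rcu;
};

static void item_free_cb(struct model_rcu_head *head)
{
    struct item *it =
        (struct item *)((char *)head - offsetof(struct item, rcu));

    if (!module_text_present)
        callbacks_run_after_unload++;
    callbacks_run++;
    free(it);
}

/* Posts a callback unless the module has begun exiting. */
static bool module_retire_item(int value)
{
    struct item *it;

    if (module_stopping)
        return false;
    it = malloc(sizeof(*it));
    if (!it)
        return false;
    it->value = value;
    model_call_rcu(&it->rcu, item_free_cb);
    return true;
}

static void module_exit_model(bool use_barrier)
{
    module_stopping = true;          /* 1. prevent new callbacks      */
    if (use_barrier)
        model_rcu_barrier();         /* 2. wait for posted callbacks  */
    module_text_present = false;     /* only now may the code go away */
}
```

With module_exit_model(true), every posted callback has run by the time the module's code is marked as gone. Passing false models skipping the barrier: a later model_rcu_barrier() call, standing in for the grace period finally ending, would then run item_free_cb() after unload and increment callbacks_run_after_unload.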
So if you have a module that uses one of the call_rcu() functions, please use the corresponding rcu_barrier() function in the module-exit code!
Update: Peter Zijlstra rightly points out that the issue is not whether your module invokes call_rcu(), but rather whether the corresponding RCU callback invokes a function that is in a module. So, if there is a call_rcu(), call_rcu_sched(), or call_rcu_bh() anywhere in the kernel whose RCU callback either directly or indirectly invokes a function in your module, then your module's exit function needs to invoke rcu_barrier(), rcu_barrier_sched(), and/or rcu_barrier_bh(). Thanks to Peter for pointing this out!