I have a linked list of numbers of length N. N is very large and I don’t know in advance the exact value of N.
How can I most efficiently write a function that will return k completely random numbers from the list?
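At first I had no idea how to approach this. After some searching, it turned out not to be a new problem: Problem 10 in Column 12 of Programming Pearls mentions it, described as follows: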
How could you select one of n objects at random, where you see the objects sequentially but you do not know the value of n beforehand? For concreteness, how would you read a text file, and select and print one random line, when you don’t know the number of lines in advance?
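The first question to ask is whether we have solved a similar problem before. We have: when the number of lines in the file is known, we can use the rand function from the C runtime library to pick a random line number and then fetch that line. Here, however, the line count is unknown, so how do we proceed? We need a concept that lets every line be selected with equal probability, which is what "random" means here. That concept is reservoir sampling.
Wikipedia explains it in detail:
Reservoir sampling is a family of randomized algorithms for choosing k samples from a set S of n items, where n is very large or unknown. It is especially suited to cases where all n items cannot be held in main memory. The most common example is Algorithm R, described by Jeffrey Vitter in his paper [1].
Following the O(n) algorithm given in the Dictionary of Algorithms and Data Structures [2], the steps are as follows (assuming the array S is indexed from 0):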
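Put the first k items of S into the "reservoir".
For each item S[j] (j ≥ k):
    Generate a random integer r in the range 0 to j (inclusive).
    If r < k, replace the rth item in the reservoir with S[j].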
array R[k];    // result
integer i, j;

// fill the reservoir array
for each i in 1 to k do
    R[i] := S[i]
done;

// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
    j := random(1, i);   // important: inclusive range
    if j <= k then
        R[j] := S[i]
    fi
done
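#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;

int main()
{
    int S[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    const int n = sizeof(S) / sizeof(S[0]);
    const int k = 4;
    int R[k];

    srand(time(NULL));              // seed once, not inside the loop

    // fill the reservoir with the first k items
    for (int i = 0; i < k; i++)
        R[i] = S[i];

    // S[i] replaces a random slot with probability k/(i+1)
    for (int i = k; i < n; i++) {
        int j = rand() % (i + 1);   // j in [0, i], inclusive
        if (j < k)
            R[j] = S[i];
    }

    for (int i = 0; i < k; i++)
        cout << R[i] << ' ';
    cout << endl;
}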
Why is it called reservoir sampling? Because the array R[k] acts like a reservoir (a holding pool):
The algorithm creates a "reservoir" array of size k and populates it with the first k items of S. It then iterates through the remaining elements of S until S is exhausted. At the ith element of S, the algorithm generates a random number j between 1 and i, inclusive. If j is less than or equal to k, the jth element of the reservoir array is replaced with the ith element of S. In effect, for all i > k, the ith element of S is chosen to be included in the reservoir with probability k/i. Similarly, at each iteration the jth element of the reservoir array is chosen to be replaced with probability 1/k * k/i, which simplifies to 1/i. It can be shown that when the algorithm has finished executing, each item in S has equal probability (i.e. k/length(S)) of being chosen for the reservoir.
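The equal-probability claim can be checked empirically. The following is a minimal sketch, not taken from the quoted sources: the helper names reservoir_sample and inclusion_counts are made up for illustration, the items are simply the integers 0..n-1, and <random> is used in place of rand().

```cpp
#include <random>
#include <vector>

// Algorithm R over the values 0..n-1 (0-based indexing); returns k sampled values.
// Assumes 0 < k <= n. Hypothetical helper, for illustration only.
std::vector<int> reservoir_sample(int n, int k, std::mt19937& rng) {
    std::vector<int> reservoir;
    reservoir.reserve(k);
    for (int i = 0; i < n; ++i) {
        if (i < k) {
            reservoir.push_back(i);      // fill the reservoir with the first k items
        } else {
            // j is uniform on [0, i], so item i is kept with probability k/(i+1)
            std::uniform_int_distribution<int> pick(0, i);
            int j = pick(rng);
            if (j < k) reservoir[j] = i;
        }
    }
    return reservoir;
}

// Runs the sampler `trials` times and counts how often each value
// ends up in the reservoir.
std::vector<int> inclusion_counts(int n, int k, int trials, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<int> counts(n, 0);
    for (int t = 0; t < trials; ++t) {
        for (int v : reservoir_sample(n, k, rng)) ++counts[v];
    }
    return counts;
}
```

Calling inclusion_counts(10, 4, 100000, 12345u) should give ten counts that are all close to 40,000, i.e. a frequency near k/n = 0.4 for every item, whether it was among the first k or arrived later.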
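With this concept in hand, let's return to the original question: without knowing the total number of lines, how do we pick one line from a file at random? Here is a solution. Let choice denote the selected line. The first line becomes choice unconditionally; the second line then replaces choice with probability 1/2; the third line replaces choice with probability 1/3; and so on. In pseudocode: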
i = 0
while more input lines
    with probability 1.0/++i
        choice = this input line
print choice
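#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;

int main()
{
    const int n = 10;       // stands in for the unknown number of lines
    int choice = 1;         // the first line is always the initial choice

    srand(time(NULL));      // seed once, not inside the loop
    for (int i = 2; i <= n; i++) {
        // rand() % i is 0 with probability 1/i (ignoring modulo bias)
        if (rand() % i == 0)
            choice = i;
    }
    cout << choice << endl;
}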
Init: a reservoir of size k, filled with the first k items
for i = k+1 to N
    M = random(1, i);    // inclusive range
    if (M <= k)
        replace the Mth value in the reservoir with the ith item
end for
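The pseudocode above carries over directly to the linked list in the opening question, since it only needs one forward pass and never uses N. A minimal sketch, assuming the list is a std::forward_list<int> and using the made-up name sample_k:

```cpp
#include <cstddef>
#include <forward_list>
#include <random>
#include <vector>

// One pass over a singly linked list whose length is not known up front.
// Returns k values chosen uniformly at random; if the list has fewer than
// k nodes, returns all of them. Hypothetical helper, for illustration only.
std::vector<int> sample_k(const std::forward_list<int>& list,
                          std::size_t k, std::mt19937& rng) {
    std::vector<int> reservoir;
    reservoir.reserve(k);
    std::size_t i = 0;                    // 1-based index of the current node
    for (int value : list) {
        ++i;
        if (reservoir.size() < k) {       // first k nodes fill the reservoir
            reservoir.push_back(value);
            continue;
        }
        std::uniform_int_distribution<std::size_t> pick(1, i);
        std::size_t m = pick(rng);        // m in [1, i], inclusive
        if (m <= k) reservoir[m - 1] = value;
    }
    return reservoir;
}
```

This takes O(N) time and O(k) extra space, and it visits each node exactly once, which is the best one can do when the list can only be traversed forward.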
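Inside the loop, the nth line is selected into the reservoir with probability k/n, written P_n. If the file has N lines in total, the probability that any given nth line (note that n here is the line's index, not the total count) is in the final sample is:

$$P_n \times \left(1 - \frac{P_{n+1}}{k}\right) \times \left(1 - \frac{P_{n+2}}{k}\right) \times \cdots \times \left(1 - \frac{P_N}{k}\right) = \frac{k}{n} \times \frac{n}{n+1} \times \frac{n+1}{n+2} \times \cdots \times \frac{N-1}{N} = \frac{k}{N}$$

Here 1 - (P_j/k) = 1 - 1/j is the probability that the line is not replaced at step j.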