  • Zipf's law, Zipfian distribution

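    Zipf's law (IPA: /ˈzɪf/) is an empirical law published in 1949 by the Harvard University linguist George Kingsley Zipf.

    It can be stated as follows:

    In a natural-language corpus, the frequency of a word is inversely proportional to its rank in the frequency table.

    Thus the most frequent word occurs about twice as often as the second most frequent word, which in turn occurs about twice as often as the fourth most frequent word.

    The law serves as a reference for anything related to power-law probability distributions.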

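    Examples

    The simplest example of Zipf's law is the "1/f function". Given a set of Zipf-distributed frequencies sorted from most common to least common, the second most common item occurs ½ as often as the most common one, the third most common occurs 1/3 as often, and the nth most common occurs 1/n as often. This cannot hold exactly, because every item must occur an integer number of times; a word cannot occur 2.5 times.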

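    In the Brown Corpus, "the", "of", and "and" are the three most frequent words. They occur 69971, 36411, and 28852 times respectively, or about 7%, 3.6%, and 2.9% of the roughly one million words in the corpus. The ratio is roughly 6:3:2, in line with Zipf's law. The 135 most frequent vocabulary items alone account for half of the Brown Corpus.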
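    As a rough check of the figures above, the following sketch compares the three quoted Brown Corpus counts with the idealized 1/rank prediction anchored at the count of "the". The function name `compare_with_zipf` and the rounded corpus size of 1,000,000 are assumptions made for illustration.

```python
# Counts for the three most frequent Brown Corpus words, as quoted in the text.
BROWN_TOP3 = {"the": 69971, "of": 36411, "and": 28852}
CORPUS_SIZE = 1_000_000  # approximate corpus size quoted in the text


def compare_with_zipf(counts, corpus_size):
    """Return (rank, word, observed, 1/rank prediction, corpus share) rows."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    top_count = ranked[0][1]
    rows = []
    for rank, (word, observed) in enumerate(ranked, start=1):
        predicted = top_count / rank  # idealized Zipf count for this rank
        rows.append((rank, word, observed, predicted, observed / corpus_size))
    return rows


for rank, word, observed, predicted, share in compare_with_zipf(BROWN_TOP3, CORPUS_SIZE):
    print(f"{rank}  {word:<4} observed={observed:6d}  "
          f"1/rank prediction={predicted:8.0f}  share={share:.1%}")
```

    The second word sits close to the prediction, while the third is noticeably above it, which illustrates that the law is only approximate.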

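    Zipf's law is an empirical law rather than a theoretical one. It can be observed in many non-linguistic rankings, such as city populations in different countries, company sizes, and income rankings. Its cause, however, remains a matter of debate. Zipf's law is easy to see on a scatter plot whose axes are the natural logarithms (log) of rank and frequency. For example, "the" from the description above is the point x = log(1), y = log(69971). If all the points lie close to a straight line, the data follow Zipf's law.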
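    The straight-line test on log-log axes can be sketched with an ordinary least-squares fit. The helper `loglog_fit` is a hypothetical name, and the input here is synthetic, exactly Zipfian data anchored at 69971, not real corpus counts; on such data the fitted slope should come out at -1.

```python
import math


def loglog_fit(frequencies):
    """Least-squares line through (log rank, log frequency).

    `frequencies` must be sorted from most to least common.
    Returns (slope, intercept).
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic counts that follow the 1/rank rule exactly (an assumption for the demo).
ideal_counts = [69971 / rank for rank in range(1, 101)]
slope, intercept = loglog_fit(ideal_counts)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

    With real word counts the slope would only be close to -1, and the scatter of the points around the fitted line indicates how well the data conform.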

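    Phenomena that follow the law

    • Word frequencies: the law holds not only for a corpus as a whole but also for a single document
    • Web page access frequencies
    • City populations
    • Incomes of the top 3% of earners
    • Earthquake magnitudes
    • Fragment sizes when a solid shatters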

    ====================================

    Zipf Distribution

    The Zipf distribution, sometimes referred to as the zeta distribution, is a discrete distribution commonly used in linguistics, insurance, and the modelling of rare events. It has probability mass function

     $$P(x) = \frac{x^{-(\rho+1)}}{\zeta(\rho+1)},$$

    where $\rho$ is a positive parameter and $\zeta(z)$ is the Riemann zeta function, and distribution function

     $$D(x) = \frac{H_{x,\rho+1}}{\zeta(\rho+1)},$$

    where $H_{n,r}$ is a generalized harmonic number.
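    The two functions above can be evaluated numerically with the standard library alone. The following is a minimal sketch, assuming the zeta function is approximated by a partial sum plus an Euler-Maclaurin tail estimate; the names `zeta`, `zipf_pmf`, and `zipf_cdf` are illustrative and not taken from any library.

```python
import math


def zeta(s, terms=1000):
    """Approximate the Riemann zeta function for real s > 1.

    Partial sum over k < terms plus an Euler-Maclaurin estimate of the tail.
    """
    if s <= 1:
        raise ValueError("this approximation requires s > 1")
    n = terms
    head = math.fsum(k ** -s for k in range(1, n))
    tail = n ** (1 - s) / (s - 1) + 0.5 * n ** -s + s * n ** (-s - 1) / 12
    return head + tail


def zipf_pmf(x, rho):
    """P(x) = x^-(rho+1) / zeta(rho+1) for integer x >= 1 and rho > 0."""
    return x ** -(rho + 1) / zeta(rho + 1)


def zipf_cdf(x, rho):
    """D(x) = H_{x, rho+1} / zeta(rho+1), with H a generalized harmonic number."""
    harmonic = math.fsum(k ** -(rho + 1) for k in range(1, x + 1))
    return harmonic / zeta(rho + 1)


rho = 1.0
for x in (1, 2, 3, 10):
    print(x, round(zipf_pmf(x, rho), 6), round(zipf_cdf(x, rho), 6))
```

    For rho = 1 the normalizing constant is zeta(2) = pi^2/6, which gives a convenient way to sanity-check the approximation.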

    The Zipf distribution is implemented in the Wolfram Language as ZipfDistribution[rho].

    The nth raw moment is

     $$\mu_n' = \frac{\zeta(1-n+\rho)}{\zeta(\rho+1)},$$

    giving the mean and variance as

    $$\mu = \frac{\zeta(\rho)}{\zeta(\rho+1)}$$

    $$\sigma^2 = \frac{\zeta(\rho-1)}{\zeta(\rho+1)} - \frac{[\zeta(\rho)]^2}{[\zeta(\rho+1)]^2}.$$
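    The moment formulas can be cross-checked against a brute-force sum over the distribution. This is a sketch under the assumption that rho is large enough for the moments to be finite (rho > n for the nth raw moment); `zeta` is the same illustrative stdlib approximation, repeated so the block is self-contained, and `direct_moment` is a hypothetical helper used only for the check.

```python
import math


def zeta(s, terms=1000):
    """Approximate zeta(s) for s > 1: partial sum plus Euler-Maclaurin tail."""
    if s <= 1:
        raise ValueError("this approximation requires s > 1")
    n = terms
    head = math.fsum(k ** -s for k in range(1, n))
    tail = n ** (1 - s) / (s - 1) + 0.5 * n ** -s + s * n ** (-s - 1) / 12
    return head + tail


def raw_moment(n, rho):
    """n-th raw moment zeta(1 - n + rho) / zeta(rho + 1); finite only for rho > n."""
    if rho <= n:
        raise ValueError("the n-th raw moment is finite only for rho > n")
    return zeta(1 - n + rho) / zeta(rho + 1)


def mean(rho):
    return raw_moment(1, rho)


def variance(rho):
    return raw_moment(2, rho) - raw_moment(1, rho) ** 2


def direct_moment(n, rho, upto=20000):
    """Brute-force check: truncated sum of x^n * P(x) for x = 1..upto."""
    total = math.fsum(x ** (n - rho - 1) for x in range(1, upto + 1))
    return total / zeta(rho + 1)


rho = 4.0
print("mean    ", mean(rho), direct_moment(1, rho))
print("variance", variance(rho), direct_moment(2, rho) - direct_moment(1, rho) ** 2)
```

    The closed forms and the truncated sums should agree to several decimal places; the agreement degrades as rho approaches n because the tail of the sum decays more slowly.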
     

    The distribution has mean deviation

     $$\mathrm{MD} = \frac{2\left[\zeta(\rho+1)\,\zeta(\rho, \lfloor\mu\rfloor+1) - \zeta(\rho)\,\zeta(\rho+1, \lfloor\mu\rfloor+1)\right]}{\zeta^2(\rho+1)},$$
     

    where $\zeta(s,a)$ is the Hurwitz zeta function and $\mu$ is the mean given above.
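    The mean deviation formula can be checked the same way. In this sketch the Hurwitz zeta function is only needed at a positive integer second argument, so it is computed as the Riemann zeta value minus a finite sum; that shortcut, and the names `hurwitz_zeta_int` and `mean_deviation_direct`, are assumptions made for illustration.

```python
import math


def zeta(s, terms=1000):
    """Approximate zeta(s) for s > 1: partial sum plus Euler-Maclaurin tail."""
    if s <= 1:
        raise ValueError("this approximation requires s > 1")
    n = terms
    head = math.fsum(k ** -s for k in range(1, n))
    tail = n ** (1 - s) / (s - 1) + 0.5 * n ** -s + s * n ** (-s - 1) / 12
    return head + tail


def hurwitz_zeta_int(s, a):
    """Hurwitz zeta(s, a) = sum over k >= 0 of (k + a)^-s, positive integer a only."""
    return zeta(s) - math.fsum(k ** -s for k in range(1, a))


def mean_deviation(rho):
    """Closed-form mean deviation; needs a finite mean, i.e. rho > 1."""
    if rho <= 1:
        raise ValueError("the mean is finite only for rho > 1")
    z0, z1 = zeta(rho), zeta(rho + 1)
    a = math.floor(z0 / z1) + 1  # floor(mu) + 1
    numerator = 2 * (z1 * hurwitz_zeta_int(rho, a) - z0 * hurwitz_zeta_int(rho + 1, a))
    return numerator / z1 ** 2


def mean_deviation_direct(rho, upto=20000):
    """Brute-force check: truncated sum of |x - mu| * P(x)."""
    z1 = zeta(rho + 1)
    mu = zeta(rho) / z1
    total = math.fsum(abs(x - mu) * x ** -(rho + 1) for x in range(1, upto + 1))
    return total / z1


rho = 3.0
print(mean_deviation(rho), mean_deviation_direct(rho))
```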

    CITE THIS AS: Weisstein, Eric W. "Zipf Distribution." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/ZipfDistribution.html

    Zipf's Law

    In the English language, the probability of encountering the $r$th most common word is given roughly by $P(r) = 0.1/r$ for $r$ up to 1000 or so. The law breaks down for less frequent words, since the harmonic series diverges. Pierce's (1980, p. 87) statement that $\sum P(r) > 1$ for $r = 8727$ is incorrect. Goetz states the law as follows: the frequency of a word is inversely proportional to its statistical rank $r$ such that

     $$P(r) \approx \frac{1}{r \ln(1.78R)},$$

    where R is the number of different words.
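    Both statements above are easy to probe numerically. The sketch below sums the 0.1/r rule until the total passes 1, checks the total at r = 8727, and sums Goetz's form over all R ranks; the function names are hypothetical.

```python
import math


def first_rank_exceeding_one(scale=0.1):
    """Smallest R for which the sum of scale/r over r = 1..R exceeds 1."""
    harmonic = 0.0
    r = 0
    while scale * harmonic <= 1.0:
        r += 1
        harmonic += 1.0 / r
    return r


def cumulative_probability(R, scale=0.1):
    """Sum of scale/r for r = 1..R."""
    return math.fsum(scale / r for r in range(1, R + 1))


def goetz_total(R):
    """Sum of 1 / (r * ln(1.78 * R)) for r = 1..R."""
    norm = math.log(1.78 * R)
    return math.fsum(1.0 / (r * norm) for r in range(1, R + 1))


print(first_rank_exceeding_one())
print(cumulative_probability(8727))
print(goetz_total(10_000))
```

    Under the 0.1/r rule the running total first exceeds 1 at r = 12367 and is still below 1 at r = 8727, consistent with the remark about Pierce. Goetz's normalization stays close to 1 because ln(1.78R) is approximately ln R plus the Euler-Mascheroni constant, which is the leading behavior of the Rth harmonic number.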

    Theoretical review

    Zipf's law is most easily observed by plotting the data on a log-log graph, with the axes being log (rank order) and log (frequency). For example, the word "the" (as described above) would appear at x = log(1), y = log(69971). It is also possible to plot reciprocal rank against frequency or reciprocal frequency or interword interval against rank.[1] The data conform to Zipf's law to the extent that the plot is linear.

    Formally, let:

    • N be the number of elements;
    • k be their rank;
    • s be the value of the exponent characterizing the distribution.

    Zipf's law then predicts that out of a population of N elements, the frequency of elements of rank k, f(k;s,N), is:

      $$f(k;s,N) = \frac{1/k^{s}}{\sum_{n=1}^{N} (1/n^{s})}$$
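    The normalized frequency can be computed directly from this definition. A minimal sketch, with `zipf_frequency` as an illustrative name and N = 1000, s = 1 chosen arbitrarily for the demonstration:

```python
import math


def zipf_frequency(k, s, N):
    """f(k; s, N) = (1 / k^s) / sum over n = 1..N of (1 / n^s)."""
    if not 1 <= k <= N:
        raise ValueError("rank k must satisfy 1 <= k <= N")
    normalizer = math.fsum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / normalizer


N, s = 1000, 1.0
frequencies = [zipf_frequency(k, s, N) for k in range(1, N + 1)]
print("f(1) =", frequencies[0])
print("f(1) / f(2) =", frequencies[0] / frequencies[1])
print("total =", math.fsum(frequencies))
```

    The frequencies sum to 1 by construction, and for s = 1 the ratio f(1)/f(2) is 2, matching the informal statement of the law. Recomputing the normalizer on every call keeps the sketch short; caching it would be the obvious change for repeated use.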
  • Original article: https://www.cnblogs.com/sddai/p/6081447.html