  • MTLD: a lexical complexity metric

    Paper:

    MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches
    to lexical diversity assessment 

    Link:

    https://link.springer.com/content/pdf/10.3758%2FBRM.42.2.381.pdf

    LD Lexical diversity 

    TTR type–token ratio 

    Its drawback is that it is sensitive to variation in text length.
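
    To make the length sensitivity concrete, here is a minimal sketch of TTR in Python. The function name `ttr` and the whitespace tokenization are assumptions made for illustration; they are not taken from the paper.

```python
def ttr(tokens):
    """Type-token ratio: number of distinct tokens divided by total tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty token list")
    return len(set(tokens)) / len(tokens)


# Assumption: naive whitespace tokenization, lowercase input.
phrase = "of the people by the people for the people".split()

short_ttr = ttr(phrase[:4])  # "of the people by": 4 types over 4 tokens
long_ttr = ttr(phrase)       # whole phrase: 5 types over 9 tokens

print(short_ttr, long_ttr)
```

    The same wording scores lower once more of it is counted, simply because repeated words accumulate as the text grows.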

    vocd-D: also a function of text length.

    CONSIDERATIONS IN THE ASSESSMENT OF LEXICAL DIVERSITY 

    Text Length 

    The first weakness of LD is its sensitivity to text length. The gradual decrease in type count can be an indication of the thematic saturation of a text or corpus. That is, when a text reaches the point at which no new types are being encountered, we can say that the text is (fully) representative of the word types that are indicative of that text’s theme. The value of this notion is that it allows researchers greater confidence that their corpora comprise texts of a sufficient length to represent suitably their linguistic function. MTLD is a notion closely related to thematic saturation.
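
    One way to picture thematic saturation is to track how many distinct types have been seen after each token. The sketch below is only an illustration; `type_growth` is a hypothetical helper name, not something defined in the paper.

```python
def type_growth(tokens):
    """Return the cumulative number of distinct types after each token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


# Assumption: naive whitespace tokenization of a toy phrase.
tokens = "of the people by the people for the people".split()
curve = type_growth(tokens)
print(curve)
```

    A curve that stops rising means no new types are being encountered, which is the condition the passage describes as saturation.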

    Textual Homogeneity

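    The second weakness of LD is that LD indices can be read as resting on an assumption of textual homogeneity. The homogeneity assumption concerns how types are distributed within a text: different rhetorical devices and strategies give different parts of a text different levels of diversity. Every text has a structure, and every structure serves a rhetorical purpose. That purpose can be expressed through many rhetorical forms within the text, but no single one of them represents the text as a whole.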

    Sequential and Nonsequential Analysis Processing 

    Nonsequential analysis, for example, has the advantage of avoiding local clustering of content words, which Malvern et al. (2004) argued may lead to a distorted view of the overall text. Landauer, Laham, Rehder, and Schreiner (1997) went even further, claiming that there may be little benefit to word order when it comes to deriving meaning from texts.

    INDICES OF LEXICAL DIVERSITY 

    vocd-D 

    The calculation of vocd-D is the result of a series of random text samplings. The approach begins its calculation by taking from the text 100 random samples of 35 tokens. The TTR for each of these samples is calculated, and the mean TTR is stored. The same procedure is then repeated for samples from 36 to 50 tokens. An empirical TTR curve is then created from the means of each of these samples.
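
    The sampling stage described above can be sketched as follows. This stops at the empirical curve of mean TTRs and does not attempt to derive the final D value from it, since these notes do not describe that step. The function name, the fixed seed, and sampling without replacement within each sample are assumptions.

```python
import random


def mean_ttr_curve(tokens, sizes=range(35, 51), n_samples=100, seed=0):
    """Map each sample size to the mean TTR over n_samples random samples."""
    if len(tokens) < max(sizes):
        raise ValueError("text is shorter than the largest sample size")
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    curve = {}
    for size in sizes:
        ttrs = []
        for _ in range(n_samples):
            # Assumption: tokens are drawn without replacement within a sample.
            sample = rng.sample(tokens, size)
            ttrs.append(len(set(sample)) / size)
        curve[size] = sum(ttrs) / n_samples
    return curve


# Toy text: 200 tokens cycling through 40 distinct word forms.
toy_tokens = [f"w{i % 40}" for i in range(200)]
toy_curve = mean_ttr_curve(toy_tokens)
for size, value in toy_curve.items():
    print(size, round(value, 3))
```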

    HD-D

    The hypergeometric distribution represents the probability of drawing (without replacement) a certain number of tokens of a particular type from a sample of a particular size. The way we have used this distribution for our own HD-D index is to calculate, for each lexical type in a text, the probability of encountering any of its tokens in a random sample of 42 words drawn from the text. The probabilities for all lexical types in the text are then added together, and the sum is used as an index of the text’s LD.
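
    A minimal sketch of that summation, assuming the probability of "encountering any of its tokens" is computed as one minus the hypergeometric probability of drawing none of them. The function name `hdd_sum` is hypothetical, and any further scaling the authors may apply to the sum is not reproduced here.

```python
from collections import Counter
from math import comb


def hdd_sum(tokens, sample_size=42):
    """Sum, over all types, of P(type appears at least once in a random sample)."""
    n = len(tokens)
    if n < sample_size:
        raise ValueError("text is shorter than the sample size")
    all_samples = comb(n, sample_size)
    total = 0.0
    for freq in Counter(tokens).values():
        # Samples that avoid every token of this type: choose from the n - freq others.
        p_absent = comb(n - freq, sample_size) / all_samples
        total += 1.0 - p_absent
    return total


all_unique = [str(i) for i in range(50)]  # 50 tokens, 50 types
all_same = ["the"] * 50                   # 50 tokens, 1 type
print(hdd_sum(all_unique), hdd_sum(all_same))
```

    The sum equals the expected number of distinct types in a 42-word sample, so dividing it by 42 would give the expected TTR of such a sample.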

    Other LD Indices Used in This Study

    Log correction

    Because the text length problem of LD is related to frequency, log values have long been used as an LD corrective factor.
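
    As an illustration of the idea, one well-known log-based index divides the log of the type count by the log of the token count (often called Herdan's C or log TTR). It is shown here only as an example of a log correction; these notes do not say which log-based index the study used.

```python
from math import log


def log_ttr(tokens):
    """log(types) / log(tokens), a log-corrected variant of TTR."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    if n_tokens < 2:
        raise ValueError("need at least two tokens, since log(1) is zero")
    return log(n_types) / log(n_tokens)


# Assumption: naive whitespace tokenization of a toy phrase.
sample_tokens = "of the people by the people for the people".split()
print(round(log_ttr(sample_tokens), 3))
```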

    Frequency correction

    A second approach to correcting for the text length effect is to use the frequency distribution of types.

    For example, consider the sentence "The friendly man liked both the big dog and the little dog", which contains nine types and 12 tokens, and then consider the sentence "The friendly man, whom the big dog liked, liked a little dog", which also contains nine types and 12 tokens. Note that the first sentence contains 3 tokens of the type "the", whereas the second sentence contains only 2 tokens of the type "the"; however, for the second sentence, the word "liked" has a frequency of 2, whereas it is just 1 in the first sentence.
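
    The counts in this example are easy to check. The sketch below assumes a simple tokenizer that lowercases the text and keeps only runs of letters; `tokenize` is a hypothetical helper.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


s1 = "The friendly man liked both the big dog and the little dog"
s2 = "The friendly man, whom the big dog liked, liked a little dog"

c1 = Counter(tokenize(s1))
c2 = Counter(tokenize(s2))

for name, counts in (("first", c1), ("second", c2)):
    print(name, "types:", len(counts), "tokens:", sum(counts.values()),
          "the:", counts["the"], "liked:", counts["liked"])
```

    Both sentences have the same TTR (9/12), yet their frequency distributions differ, which is the information a frequency-based correction draws on.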

    Whereas vocd-D is determined by the sums of probabilities of encountering each type in the text in sample sizes from 35 to 50 tokens, K is determined by the sums of probabilities of encountering each type in the text when the sample size is set to just 2 words.
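
    To give a feel for what a sample size of 2 measures, the sketch below computes the chance that two tokens drawn at random without replacement belong to the same type. This is an illustrative quantity chosen for these notes, not the paper's formula for K, and `repeat_probability` is a hypothetical name.

```python
from collections import Counter


def repeat_probability(tokens):
    """P(two tokens drawn without replacement share the same type)."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    counts = Counter(tokens)
    # For a type with frequency f there are f * (f - 1) ordered same-type pairs.
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))


first = "the friendly man liked both the big dog and the little dog".split()
second = "the friendly man whom the big dog liked liked a little dog".split()

p_first = repeat_probability(first)
p_second = repeat_probability(second)
print(round(p_first, 4), round(p_second, 4))
```

    The two sentences share the same type and token counts, but the first concentrates its repetition in a single word and therefore scores a higher repeat probability.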

    MTLD 

    Processing MTLD 

    MTLD is an index of a text’s LD, evaluated sequentially. It is calculated as the mean length of sequential word strings in a text that maintain a given TTR value (here, .720). During the calculation process, each word of the text is evaluated sequentially for its TTR. For example, . . . of (1.00) the (1.00) people (1.00) by (1.00) the (.800) people (.667) for (.714) the (.625) people (.556) . . . and so forth. However, when the default TTR factor size value (here, .720) is reached, the factor count increases by a value of 1, and the TTR evaluations are reset. Thus, given the previous example, MTLD would execute . . . of (1.00) the (1.00) people (1.00) by (1.00) the (.800) people (.667) |||FACTORS = FACTORS + 1||| for (1.00) the (1.00) people (1.00) . . . and so forth.
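
    The running evaluation with resets can be traced with a short sketch. It assumes that a factor is counted when the running TTR drops below .720; how a value of exactly .720 should be treated is an assumption here, and the function name is hypothetical.

```python
def running_ttr_trace(tokens, threshold=0.720):
    """Return ([(token, running TTR)], number of completed factors)."""
    trace = []
    factors = 0
    seen = set()
    count = 0
    for tok in tokens:
        seen.add(tok)
        count += 1
        value = len(seen) / count
        trace.append((tok, round(value, 3)))
        if value < threshold:  # assumption: strictly below the threshold
            factors += 1       # FACTORS = FACTORS + 1
            seen.clear()       # reset the TTR evaluation
            count = 0
    return trace, factors


words = "of the people by the people for the people".split()
trace, factors = running_ttr_trace(words)
for tok, value in trace:
    print(f"{tok} ({value:.3f})")
print("factors:", factors)
```

    On this phrase the trace matches the second example in the passage: the sixth word brings the TTR to .667, one factor is counted, and evaluation restarts at "for".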

    Forward and Reverse Processing 

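    Both a forward pass and a reverse pass are computed because, if the text is only processed from front to back, differences in segmentation sizes can cause large variation in the result.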

    Calculation of MTLD Value 

    The total number of words in the text is divided by the total factor count. For example, if the text has 340 words and the factor count is 4.404, then the MTLD value is 77.203 (340 / 4.404). Two such MTLD values are calculated, one for forward processing and one for reverse processing. The mean of the two values is the final MTLD value.
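
    Putting the pieces together, a rough sketch of the whole calculation might look like this. The non-integer factor count in the example implies that leftover words at the end of a pass contribute a partial factor; the rule used below, (1 - leftover TTR) / (1 - .720), is how I recall the paper handling it and should be treated as an assumption, as should the function names and the strict "below threshold" test.

```python
def factor_count(tokens, threshold=0.720):
    """Count full factors plus a partial factor for the leftover words."""
    factors = 0.0
    seen = set()
    count = 0
    for tok in tokens:
        seen.add(tok)
        count += 1
        if len(seen) / count < threshold:
            factors += 1.0
            seen.clear()
            count = 0
    if count > 0:
        # Assumption: the remainder is credited in proportion to how far
        # its TTR has fallen from 1.0 toward the threshold.
        leftover_ttr = len(seen) / count
        factors += (1.0 - leftover_ttr) / (1.0 - threshold)
    return factors


def mtld(tokens, threshold=0.720):
    """Mean of the forward and reverse MTLD values."""
    if not tokens:
        raise ValueError("MTLD is undefined for an empty token list")
    values = []
    for sequence in (tokens, tokens[::-1]):
        factors = factor_count(sequence, threshold)
        if factors == 0:
            raise ValueError("no repetition in the text, so no factor was formed")
        values.append(len(sequence) / factors)
    return sum(values) / 2


text = "of the people by the people for the people".split()
print(mtld(text))
print(round(340 / 4.404, 3))  # the worked example from the passage
```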

  • Original post: https://www.cnblogs.com/rosyYY/p/10418800.html