Python Natural Language Processing Study Notes (47): 5.8 Summary

5.8 Summary

Words can be grouped into classes, such as nouns, verbs, adjectives, and adverbs. These classes are known as lexical categories or parts-of-speech. Parts-of-speech are assigned short labels, or tags, such as NN and VB.

The process of automatically assigning parts-of-speech to words in text is called part-of-speech tagging, POS tagging, or just tagging.
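As a quick illustration (my own addition, using NLTK's ready-made tagger rather than the taggers built in this chapter; it assumes the punkt and averaged_perceptron_tagger data have been downloaded):

    import nltk

    text = "They refuse to permit us to obtain the refuse permit"
    tokens = nltk.word_tokenize(text)   # split the raw string into tokens
    print(nltk.pos_tag(tokens))         # assign a POS tag to each token
    # e.g. [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ...]

Note how the two occurrences of "refuse" and "permit" should receive different tags depending on their role in the sentence.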

Automatic tagging is an important step in the NLP pipeline, and is useful in a variety of situations, including predicting the behavior of previously unseen words, analyzing word usage in corpora, and text-to-speech systems.

Some linguistic corpora, such as the Brown Corpus, have been POS tagged.
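For example, the tagged words and sentences can be read directly (this assumes the Brown Corpus has been fetched with nltk.download('brown')):

    from nltk.corpus import brown

    print(brown.tagged_words(categories='news')[:5])
    # e.g. [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
    print(brown.tagged_sents(categories='news')[0])   # one whole tagged sentence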

A variety of tagging methods are possible, e.g., default tagger, regular expression tagger, unigram tagger, and n-gram taggers. These can be combined using a technique known as backoff.
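A minimal sketch of the two simplest of these, a default tagger that labels every token NN and a regular-expression tagger (the pattern list here is an abbreviated, illustrative one):

    import nltk

    default_tagger = nltk.DefaultTagger('NN')       # every token gets NN

    patterns = [
        (r'.*ing$', 'VBG'),                         # gerunds
        (r'.*ed$', 'VBD'),                          # simple past
        (r'.*es$', 'VBZ'),                          # 3rd person singular present
        (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),           # cardinal numbers
        (r'.*s$', 'NNS'),                           # plural nouns
        (r'.*', 'NN'),                              # everything else: noun
    ]
    regexp_tagger = nltk.RegexpTagger(patterns)

    tokens = 'The cat was chasing 3 mice'.split()
    print(default_tagger.tag(tokens))
    print(regexp_tagger.tag(tokens))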

Taggers can be trained and evaluated using tagged corpora.
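A sketch of that workflow: train a unigram tagger on 90% of the Brown news sentences and score it on the held-out 10% (the split ratio is arbitrary, and newer NLTK releases prefer .accuracy() over .evaluate()):

    import nltk
    from nltk.corpus import brown

    tagged_sents = brown.tagged_sents(categories='news')
    size = int(len(tagged_sents) * 0.9)
    train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

    unigram_tagger = nltk.UnigramTagger(train_sents)    # train on the tagged corpus
    print(unigram_tagger.evaluate(test_sents))          # accuracy on unseen sentences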

Backoff is a method for combining models: when a more specialized model (such as a bigram tagger) cannot assign a tag in a given context, we back off to a more general model (such as a unigram tagger).
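A sketch of such a backoff chain on the same Brown news split: a bigram tagger backs off to a unigram tagger, which in turn backs off to a default tagger.

    import nltk
    from nltk.corpus import brown

    tagged_sents = brown.tagged_sents(categories='news')
    size = int(len(tagged_sents) * 0.9)
    train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

    t0 = nltk.DefaultTagger('NN')                       # last resort: everything is NN
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)    # falls back to t0 for unseen words
    t2 = nltk.BigramTagger(train_sents, backoff=t1)     # falls back to t1 for unseen contexts
    print(t2.evaluate(test_sents))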

Part-of-speech tagging is an important, early example of a sequence classification task in NLP: a classification decision at any one point in the sequence makes use of words and tags in the local context.
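The sketch below (the function and feature names are mine, not the book's) shows the kind of local context such a sequence classifier looks at: the current word plus the tags already assigned to its left.

    def pos_features(sentence, i, history):
        """sentence: list of words; i: index of the word to tag;
        history: the tags already assigned to sentence[:i]."""
        return {
            'word': sentence[i],
            'suffix(2)': sentence[i][-2:],
            'prev-tag': history[i - 1] if i > 0 else '<START>',
        }

    print(pos_features(['the', 'cat', 'sat'], 2, ['AT', 'NN']))
    # {'word': 'sat', 'suffix(2)': 'at', 'prev-tag': 'NN'}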

A dictionary is used to map between arbitrary types of information, such as a string and a number: freq['cat'] = 12. We create dictionaries using the brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.
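Both forms from the sentence above are runnable as written; a couple of lookups for good measure:

    freq = {}
    freq['cat'] = 12                 # map a string key to a number

    pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
    print(pos['ideas'])              # 'n'
    print(sorted(pos))               # the keys: ['colorless', 'furiously', 'ideas']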

N-gram taggers can be defined for large values of n, but once n is larger than 3, we usually encounter the sparse data problem; even with a large quantity of training data, we see only a tiny fraction of possible contexts.
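One way to see the effect (a rough sketch on the Brown news split, deliberately omitting backoff) is that higher-order taggers score far worse on held-out data, because most test contexts never occurred in training:

    import nltk
    from nltk.corpus import brown

    tagged_sents = brown.tagged_sents(categories='news')
    size = int(len(tagged_sents) * 0.9)
    train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

    for cls in (nltk.UnigramTagger, nltk.BigramTagger, nltk.TrigramTagger):
        tagger = cls(train_sents)                 # no backoff, on purpose
        print(cls.__name__, tagger.evaluate(test_sents))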

Transformation-based tagging involves learning a series of repair rules of the form "change tag s to tag t in context c," where each rule fixes mistakes and possibly introduces a (smaller) number of errors.
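NLTK's implementation lives in nltk.tag.brill and nltk.tag.brill_trainer. The sketch below is one way to drive it (the fntbl37 template set, the unigram initial tagger, and the small max_rules are my choices, not the book's):

    import nltk
    from nltk.corpus import brown
    from nltk.tag import brill, brill_trainer

    tagged_sents = brown.tagged_sents(categories='news')
    size = int(len(tagged_sents) * 0.9)
    train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

    initial = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
    trainer = brill_trainer.BrillTaggerTrainer(initial, brill.fntbl37(), trace=0)
    brill_tagger = trainer.train(train_sents, max_rules=10)

    print(brill_tagger.evaluate(test_sents))
    for rule in brill_tagger.rules()[:3]:   # a few learned "change s to t in context c" rules
        print(rule)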

Original post: https://www.cnblogs.com/yuxc/p/2160125.html