  • Natural Language Processing with Python study notes (27): 3.11 Further Reading

    3.11 Further Reading

    Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web. Remember to consult the Python reference materials at http://docs.python.org/. (For example, this documentation covers “universal newline support,” explaining how to work with the different newline conventions used by various operating systems.)
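
    As a small illustration of the universal newline support mentioned above, the following sketch assumes Python 3, where opening a file in text mode with `newline=None` translates all three common line endings to `\n`. The helper name `read_lines_universal` and the temporary demo file are hypothetical, chosen only for this example.

```python
import os
import tempfile

def read_lines_universal(path):
    """Read a text file, letting Python translate CRLF and CR line endings to LF."""
    # newline=None (the default) turns on universal newlines mode for reading.
    with open(path, "r", newline=None, encoding="utf-8") as f:
        return f.read().split("\n")

# Write a demo file in binary mode so the mixed line endings are preserved exactly.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"unix\nwindows\r\nold mac\rend")

demo_lines = read_lines_universal(demo_path)
os.remove(demo_path)
print(demo_lines)  # ['unix', 'windows', 'old mac', 'end']
```

    Passing `newline=""` instead would leave the original line endings untouched, which is useful when they must be preserved.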

    For more examples of processing words with NLTK, see the tokenization, stemming, and corpus HOWTOs at http://www.nltk.org/howto. Chapters 2 and 3 of (Jurafsky & Martin, 2008) contain more advanced material on regular expressions and morphology.

    For more extensive discussion of text processing with Python, see (Mertz, 2003). For information about normalizing non-standard words, see (Sproat et al., 2001).

    There are many references for regular expressions, both practical and theoretical. For an introductory tutorial on using regular expressions in Python, see Kuchling’s Regular Expression HOWTO, http://www.amk.ca/python/howto/regex/. For a comprehensive and detailed manual on using regular expressions, covering their syntax in most major programming languages, including Python, see (Friedl, 2002). Other presentations include Section 2.1 of (Jurafsky & Martin, 2008), and Chapter 3 of (Mertz, 2003).

    There are many online resources for Unicode. Useful discussions of Python’s facilities for handling Unicode are:

    PEP-100, http://www.python.org/dev/peps/pep-0100/

    Jason Orendorff, Unicode for Programmers, http://www.jorendorff.com/articles/unicode/

    A. M. Kuchling, Unicode HOWTO, http://www.amk.ca/python/howto/unicode

    Fredrik Lundh, Python Unicode Objects, http://effbot.org/zone/unicode-objects.htm

    Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), http://www.joelonsoftware.com/articles/Unicode.html

     

    The problem of tokenizing Chinese text is a major focus of SIGHAN, the ACL Special Interest Group on Chinese Language Processing (http://sighan.org/). Our method for segmenting English text follows (Brent & Cartwright, 1995); this work falls in the area of language acquisition (Niyogi, 2006).

    Collocations are a special case of multiword expressions. A multiword expression is a small phrase whose meaning and other properties, e.g., part-of-speech, cannot be predicted from its words alone (Baldwin & Kim, 2010).

    Simulated annealing is a heuristic for finding a good approximation to the optimum value of a function in a large, discrete search space, based on an analogy with annealing in metallurgy. The technique is described in many Artificial Intelligence texts.
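
    The idea can be sketched in a few lines of Python. This is a generic, minimal version under stated assumptions, not the code used in the chapter: the function names (`anneal`, `mismatches`, `flip_one`), the toy bit-string objective, and all parameter values are made up for illustration. A worse candidate is accepted with probability exp(-delta / T), and the temperature T is lowered gradually so that such moves become rarer over time.

```python
import math
import random

def anneal(objective, start, neighbor, temperature=10.0, cooling=0.95,
           steps_per_temp=50, min_temperature=0.01, seed=0):
    """Minimize `objective` over a discrete space by simulated annealing."""
    rng = random.Random(seed)
    current, current_score = start, objective(start)
    best, best_score = current, current_score
    while temperature > min_temperature:
        for _ in range(steps_per_temp):
            candidate = neighbor(current, rng)
            score = objective(candidate)
            delta = score - current_score
            # Always accept improvements; accept a worse candidate with
            # probability exp(-delta / T), which shrinks as T cools.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_score = candidate, score
                if current_score < best_score:
                    best, best_score = current, current_score
        temperature *= cooling  # geometric cooling schedule
    return best, best_score

# Toy problem (hypothetical): recover a hidden bit string by minimizing
# the number of positions where the guess disagrees with it.
TARGET = "0100100100"

def mismatches(bits):
    return sum(a != b for a, b in zip(bits, TARGET))

def flip_one(bits, rng):
    """Return a neighbor of `bits` that differs in one randomly chosen position."""
    i = rng.randrange(len(bits))
    flipped = "1" if bits[i] == "0" else "0"
    return bits[:i] + flipped + bits[i + 1:]

best_bits, best_cost = anneal(mismatches, "0" * len(TARGET), flip_one)
print(best_bits, best_cost)
```

    Because the search is randomized, the result depends on the seed and the cooling schedule; the function returns the best state seen rather than the final one.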

    The approach to discovering hyponyms in text using search patterns like "x and other ys" is described in (Hearst, 1992).
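
    A rough sketch of such a search pattern, using only Python's `re` module, is shown here. It makes the simplifying assumption that both x and ys are single alphabetic words; the names `HEARST_AND_OTHER` and `find_hyponym_pairs` and the sample sentence are hypothetical. Real text would need handling of noun phrases, which this does not attempt.

```python
import re

# Simplifying assumption: x and ys are each a single alphabetic word.
HEARST_AND_OTHER = re.compile(r"\b([A-Za-z]+) and other ([A-Za-z]+)\b")

def find_hyponym_pairs(text):
    """Return (x, ys) pairs where the text says 'x and other ys'."""
    return HEARST_AND_OTHER.findall(text)

sample = "He studies bruises and other injuries. She grows roses and other flowers."
pairs = find_hyponym_pairs(sample)
print(pairs)  # [('bruises', 'injuries'), ('roses', 'flowers')]
```

    Each pair suggests that the first word names a kind of the second, e.g. that roses are a kind of flower.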

  • Original source: https://www.cnblogs.com/yuxc/p/2135548.html