  • Natural Language Processing with Python study notes (26): 3.10 Summary

    3.10 Summary

     

    In this book we view a text as a list of words. A “raw text” is a potentially long string containing words and whitespace formatting, and is how we typically store and visualize a text.

    A string is specified in Python using single or double quotes: 'Monty Python', "Monty Python".

    The characters of a string are accessed using indexes, counting from zero: 'Monty Python'[0] gives the value M. The length of a string is found using len().

    Substrings are accessed using slice notation: 'Monty Python'[1:5] gives the value onty. If the start index is omitted, the substring begins at the start of the string; if the end index is omitted, the slice continues to the end of the string.
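
The indexing and slicing rules above can be tried directly on the same string. This is a minimal sketch; the variable names are chosen here for illustration only.

```python
s = 'Monty Python'

first = s[0]      # indexing counts from zero
length = len(s)   # number of characters, including the space
middle = s[1:5]   # characters at positions 1 to 4; position 5 is excluded
head = s[:5]      # start omitted: the slice begins at the start of the string
tail = s[6:]      # end omitted: the slice continues to the end of the string

print(first, length, middle, head, tail)  # M 12 onty Monty Python
```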

    Strings can be split into lists: 'Monty Python'.split() gives ['Monty', 'Python']. Lists can be joined into strings: '/'.join(['Monty', 'Python']) gives 'Monty/Python'.
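
A short sketch of splitting and joining, using the same example string. Note that join() inserts only the separator itself, with no extra spaces.

```python
words = 'Monty Python'.split()   # split on whitespace -> ['Monty', 'Python']
joined = '/'.join(words)         # -> 'Monty/Python'
roundtrip = ' '.join(words)      # joining with a space restores the original string

print(words, joined, roundtrip)
```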

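    We can read text from a file f using text = open(f).read(). We can read text from a URL u using text = urlopen(u).read(); in Python 3, urlopen is imported from urllib.request and read() returns bytes, which must be decoded (for example with .decode('utf8')) to get a string. We can iterate over the lines of a text file using for line in open(f).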
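
The file-reading idioms can be sketched as follows. To keep the example self-contained, it first writes a small temporary file; the file contents and variable names are made up for illustration. The URL case is left out because it needs network access.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on existing data.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('first line\nsecond line\n')
    f = tmp.name

# Read the whole file into a single string.
with open(f, encoding='utf-8') as handle:
    text = handle.read()

# Iterate line by line; each line keeps its trailing newline, so strip it.
with open(f, encoding='utf-8') as handle:
    lines = [line.rstrip('\n') for line in handle]

os.remove(f)  # clean up the temporary file
print(lines)
```

Using with blocks closes the file automatically, which the one-line open(f).read() form leaves to the interpreter.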

    Texts found on the Web may contain unwanted material (such as headers, footers, and markup) that needs to be removed before we do any linguistic processing.

    Tokenization is the segmentation of a text into basic units, or tokens, such as words and punctuation. Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words. NLTK provides an off-the-shelf tokenizer, nltk.word_tokenize().
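
To see why whitespace splitting falls short, compare it with a rough regular-expression tokenizer. The pattern below is an illustrative stand-in written for this note, not the algorithm behind nltk.word_tokenize(), which needs NLTK and its tokenizer data installed.

```python
import re

raw = 'Hello, world. Goodbye!'

# Whitespace tokenization leaves punctuation attached to the words.
whitespace_tokens = raw.split()

# Rough alternative: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r'\w+|[^\w\s]', raw)

print(whitespace_tokens)  # ['Hello,', 'world.', 'Goodbye!']
print(regex_tokens)       # ['Hello', ',', 'world', '.', 'Goodbye', '!']
```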

    Lemmatization is a process that maps the various forms of a word (such as appeared, appears) to the canonical or citation form of the word, also known as the lexeme or lemma (e.g., appear).

    Regular expressions are a powerful and flexible method of specifying patterns. Once we have imported the re module, we can use re.findall() to find all substrings in a string that match a pattern.
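
A small example of re.findall(), assuming we want every word that ends in "ed" or "s". The sentence and the pattern are invented for illustration.

```python
import re

text = 'The quick brown fox jumped over the lazy dogs'

# \b marks a word boundary; (?:...) groups without capturing,
# so findall() returns the whole matched word rather than just the suffix.
matches = re.findall(r'\b\w+(?:ed|s)\b', text)

print(matches)  # ['jumped', 'dogs']
```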

    If a regular expression string includes a backslash, you should tell Python not to preprocess the string, by using a raw string with an r prefix: r'regexp'.

    When backslash is used before certain characters, e.g., \n, this takes on a special meaning (newline character); however, when backslash is used before regular expression wildcards and operators, e.g., \., \|, \$, these characters lose their special meaning and are matched literally.
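
The two roles of the backslash can be checked with a few short expressions. The sample strings are made up for illustration.

```python
import re

# In an ordinary string, \n is preprocessed into one newline character.
plain_len = len('\n')    # 1
# In a raw string, it stays as two characters: a backslash and the letter n.
raw_len = len(r'\n')     # 2

text = 'a.b axb'
any_char = re.findall(r'a.b', text)       # '.' is a wildcard: matches both
literal_dot = re.findall(r'a\.b', text)   # '\.' matches only a literal dot

# '$' normally anchors the end of the string; '\$' matches a dollar sign.
dollars = re.findall(r'\$\d+', 'pay $5 or $10')

print(plain_len, raw_len, any_char, literal_dot, dollars)
```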

    A string formatting expression template % arg_tuple consists of a format string template, which contains conversion specifiers like %-6s and %0.2d, followed by the % operator and a tuple of arguments to substitute for those specifiers.
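
A minimal sketch of the two conversion specifiers mentioned above; the values and the "|" separator are arbitrary choices for illustration.

```python
# %-6s : a string, left-justified in a field six characters wide
# %0.2d: an integer written with at least two digits, zero-padded
row = '%-6s|%0.2d' % ('dog', 5)

print(repr(row))  # 'dog   |05'
```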

  • Original article: https://www.cnblogs.com/yuxc/p/2135547.html