CHAPTER 3
Processing Raw Text 处理原始文本
The most important source of texts is undoubtedly the Web. It’s convenient to have existing text collections to explore, such as the corpora we saw in the previous chapters. However, you probably have your own text sources in mind, and need to learn how to access them.
The goal of this chapter is to answer the following questions:
1. How can we write programs to access text from local files and from the Web, in order to get hold of an unlimited range of language material?
我们如何编写程序去访问来自本地文件和Web的文本,从而得到无限范围的语言材料?
2.How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
我们如何把文档分割成单独的文字和标点符号,这样我们可以执行与上一章中文本预料库相同的分析?
3. How can we write programs to produce formatted output and save it in a file?我们应该如何编程程序来产生有格式的数和并把它保存在文件中?
In order to address(处理) these questions, we will be covering key concepts in NLP, including tokenization and stemming(包括分词和提取词干). Along the way you will consolidate your Python knowledge and learn about strings, files, and regular expressions(本章学习:字符串,文件和正则表达式). Since so much text on the Web is in HTML format, we will also see how to dispense(去除) with markup.
Important: From this chapter onwards(向前的,从这一章开始), our program samples will assume you begin your interactive session or your program with the following import statements:
>>> import nltk, re, pprint
3.1 Accessing Text from the Web and from Disk 从Web和Disk上访问文本
Electronic Books 电子书
A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing other texts from Project Gutenberg. You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/ , and obtain a URL to an ASCII text file. Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese, and Spanish (with more than 100 texts each).
Text number 2554 is an English translation of Crime and Punishment(罪与罚), and we can access it as follows.
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176831
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
手动设置代理:The read() process will take a few seconds as it downloads this large book. If you’re using an Internet proxy(代理) that is not correctly detected by Python, you may need to specify the proxy manually as follows:
>>> raw = urlopen(url, proxies=proxies).read()
The variable raw contains a string with 1,176,831 characters. (We can see that it is a string, using type(raw).) This is the raw content of the book, including many details we are not interested in, such as whitespace, line breaks(换行), and blank lines. Notice the \r and \n in the opening line of the file, which is how Python displays the special carriage return(回车) and line-feed(换行) characters (the file must have been created on a Windows machine,注1给出了解释). For our language processing, we want to break up the string into words and punctuation, as we saw in Chapter 1. This step is called tokenization, and it produces our familiar structure, a list of words and punctuation.(英文的分词真是容易那...)
注1关于carriage return(回车) and line-feed(换行)
参考自:http://blog.csdn.net/zhezhelin/article/details/2703382
在计算机还没有出现之前,有一种叫做电传打字机(Teletype Model 33)的玩意,每秒钟可以打10个字符。但是它有一个问题,就是打完一行换行的时候,要用去0.2秒,正好可以打两个字符。要是在这0.2秒里面,又有新的字符传过来,那么这个字符将丢失。
于是,研制人员想了个办法解决这个问题,就是在每行后面加两个表示结束的字符。一个叫做“回车”,告诉打字机把打印头定位在左边界;另一个叫做“换行”,告诉打字机把纸向下移一行。
这就是“换行”和“回车”的来历,从它们的英语名字上也可以看出一二。后来,计算机发明了,这两个概念也就被般到了计算机上。那时,存储器很贵,一些科学家认为在每行结尾加两个字符太浪费了,加一个就可以。于是,就出现了分歧。Unix系统里,每行结尾只有“<换行>”,即“/n”;Windows系统里面,每行结尾是“<换行><回车>”,即“/r/n”;Mac系统里,每行结尾是“<回车>”/r。一个直接后果是,Unix/Mac系统下的文件在Windows里打开的话,所有文字会变成一行;而Windows里的文件在Unix/Mac下打开的话,在每行的结尾可能会多出一个^M符号。
>>> type(tokens)
<type 'list'>
>>> len(tokens)
255809 (我这是241137)
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
Notice that NLTK was needed for tokenization, but not for any of the earlier tasks of opening a URL and reading it into a string. If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we saw in Chapter 1, along with the regular list operations, such as slicing:
>>> type(text)
<type 'nltk.text.Text'>
>>> text[1020:1060]
['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',
'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',
'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly',
',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K', '.', 'bridge', '.']
>>> text.collocations()
Notice that Project Gutenberg appears as a collocation. This is because each text downloaded from Project Gutenberg contains a header with the name of the text, the author, the names of people who scanned and corrected the text, a license, and so on. Sometimes this information appears in a footer(页脚) at the end of the file. We cannot reliably(可靠地) detect where the content begins and ends, and so(因此) have to resort(凭借) to manual inspection of the file, to discover unique strings that mark the beginning and the end, before trimming(修剪) raw to be just the content and nothing else:
5303
>>> raw.rfind("End of Project Gutenberg's Crime")
1157681
>>> raw = raw[5303:1157681] ①
>>> raw.find("PART I")
0
The find() and rfind() (“reverse find 反转寻找”) methods help us get the right index values to use for slicing the string ①. We overwrite raw with this slice, so now it begins with “PART I” and goes up to (but not including) the phrase that marks the end of the content.
This was our first brush with the reality of the Web: texts found on the Web may contain unwanted material, and there may not be an automatic way to remove it.(这可能是我们第一次接触到实际的Web:Web上的文本可能包含了不想要的资料,并且可能没有自动的方法来移除它) But with a small amount of extra work we can extract the material we need.
Dealing with HTML 处理HTML
Much of the text on the Web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, then access this as described in the later section on files. However, if you’re going to do this often, it’s easiest to get Python to do the work directly. The first step is the same as before, using urlopen. For fun we’ll pick a BBC News story called “Blondes to die out in 200 years,” an urban legend (都市传奇)passed along by the BBC as established scientific fact:
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
You can type print html to see the HTML content in all its glory, including meta tags(元标签), an image map, JavaScript, forms, and tables.
Getting text out of HTML is a sufficiently common task that NLTK provides a helper function nltk.clean_html(), which takes an HTML string and returns raw text(将HTML字符串转变为原始文本). We can then tokenize this to get our familiar text structure:
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]
This still contains unwanted material concerning site navigation and related stories. With some trial and error you can find the start and end indexes of the content and select the tokens of interest, and initialize a text as before.
>>> text = nltk.Text(tokens)
>>> text.concordance('gene')
they say too few people now carry the gene for blondes to last beyond the next tw
t blonde hair is caused by a recessive gene . In order for a child to have blonde
to have blonde hair , it must have the gene on both sides of the family in the gra
there is a disadvantage of having that gene or by chance . They don ' t disappear
ondes would disappear is if having the gene was a disadvantage and I do not think
For more sophisticated processing of HTML, use the Beautiful Soup package, available at http://www.crummy.com/software/BeautifulSoup/ .
(官网上有中文的说明文档,链接:http://www.crummy.com/software/BeautifulSoup/documentation.zh.html#Quick%20Start )
Processing Search Engine Results 处理搜索引擎的结果
The Web can be thought of as a huge corpus of unannotated(未注释的) text. Web search engines provide an efficient means of searching this large quantity of text for relevant linguistic examples. The main advantage of search engines is size(数目): since you are searching such a large set of documents, you are more likely to find any linguistic pattern you are interested in. Furthermore, you can make use of(利用) very specific patterns, which would match only one or two examples on a smaller example, but which might match tens of thousands of examples when run on the Web. A second advantage of web search engines is that they are very easy to use. Thus, they provide a very convenient tool for quickly checking a theory, to see if it is reasonable. See Table 3-1 for an example.
Table 3-1. Google hits for collocations: The number of hits for collocations involving the words absolutely or definitely, followed by one of adore, love, like, or prefer. (Liberman, in LanguageLog, 2005)
Google hits |
adore |
love |
Like |
prefer |
absolutely |
289,000 |
905,000 |
16,200 |
644 |
definitely |
1,460 |
51,000 |
158,000 |
62,600 |
ratio |
198:1 |
18:1 |
1:10 |
1:97 |
Unfortunately, search engines have some significant shortcomings. First, the allowable range of search patterns is severely restricted(搜索方式的可允许范围受到严格地限制). Unlike local corpora, where you write programs to search for arbitrarily complex patterns, search engines generally only allow you to search for individual words or strings of words, sometimes with wildcards(通配符). Second, search engines give inconsistent results, and can give widely different figures when used at different times or in different geographical regions(地理区). When content has been duplicated across multiple sites, search results may be boosted(提高). Finally, the markup in the result returned by a search engine may change unpredictably(不可预料的), breaking any pattern-based(基于模式的) method of locating particular content (a problem which is ameliorated(改善)by the use of search engine APIs).
Your Turn: Search the Web for "the of" (inside quotes). Based on the large count, can we conclude that the of is a frequent collocation in English?
Processing RSS Feeds 处理RSS Feeds
The blogosphere(博客圈,网络上博客的总称) is an important source of text, in both formal and informal registers. With the help of a third-party Python library called the Universal Feed Parser, freely downloadable from http://feedparser.org/ , we can access the content of a blog, as shown here:
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
u"He's My BF"
>>> content = post.content[0].value
>>> content[:70]
u'<p>Today I was chatting with three of our visiting graduate students f'
>>> nltk.word_tokenize(nltk.html_clean(content)) (html_clean貌似打印错误?)
>>> nltk.word_tokenize(nltk.clean_html(llog.entries[2].content[0].value))
[u'Today', u'I', u'was', u'chatting', u'with', u'three', u'of', u'our', u'visiting',
u'graduate', u'students', u'from', u'the', u'PRC', u'.', u'Thinking', u'that', u'I',
u'was', u'being', u'au', u'courant', u',', u'I', u'mentioned', u'the', u'expression',
u'DUI4XIANG4', u'\u5c0d\u8c61', u'("', u'boy', u'/', u'girl', u'friend', u'"', ...]
Note that the resulting strings have a u prefix to indicate that they are Unicode strings (see Section 3.3). With some further work, we can write programs to create a small corpus of blog posts(文章), and use this as the basis for our NLP work.
Reading Local Files 读取本地文件
In order to read a local file, we need to use Python’s built-in open() function, followed by the read() method. Supposing you have a file document.txt, you can load its contents like this:
>>> raw = f.read()
Your Turn: Create a file called document.txt using a text editor, and type in a few lines of text, and save it as plain text(纯文本). If you are using IDLE, select the New Window command in the File menu, typing the required text into this window, and then saving the file as document.txt inside the directory that IDLE offers in the pop-up(弹出式) dialogue box. Next, in the Python interpreter, open the file using f = open('document.txt'), then inspect its contents using print f.read().
Various things might have gone wrong when you tried this. If the interpreter couldn’t find your file, you would have seen an error like this:
Traceback (most recent call last):
File "<pyshell#7>", line 1, in -toplevel-
f = open('document.txt')
IOError: [Errno 2] No such file or directory: 'document.txt'
To check that the file that you are trying to open is really in the right directory, use IDLE’s Open command in the File menu; this will display a list of all the files in the
directory where IDLE is running. An alternative is to examine the current directory from within Python:
>>> os.listdir('.')
Another possible problem you might have encountered when accessing a text file is the newline conventions, which are different for different operating systems. The built-in open() function has a second parameter for controlling how the file is opened: open('document.txt', 'rU'). 'r' means to open the file for reading (the default), and 'U' stands for “Universal”, which lets us ignore the different conventions used for marking new-lines.
Assuming that you can open the file, there are several methods for reading it. The read() method creates a string with the contents of the entire file:
'Time flies like an arrow.\nFruit flies like a banana.\n'
Recall that the '\n' characters are newlines(换行符); this is equivalent to pressing Enter on a keyboard and starting a new line.
We can also read a file one line at a time using a for loop:
>>> for line in f:
... print line.strip()
Time flies like an arrow.
Fruit flies like a banana.
Here we use the strip() method to remove the newline character at the end of the input line.
NLTK’s corpus files can also be accessed using these methods. We simply have to use nltk.data.find() to get the filename for any corpus item. Then we can open and read it in the way we just demonstrated:
>>> raw = open(path, 'rU').read()
Extracting Text from PDF, MSWord, and Other Binary Formats
从PDF,MSWord和其他二进制格式中提取文本
ASCII text and HTML text are human-readable formats. Text often comes in binary formats—such as PDF and MSWord—that can only be opened using specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats. Extracting text from multicolumn(多列) documents is particularly challenging. For one-off(一次性的) conversion of a few documents, it is simpler to open the document with a suitable application, then save it as text to your local drive, and access it as described below. If the document is already on the Web, you can enter its URL in Google’s search box. The search result often includes a link to an HTML version of the document, which you can save as text.
Capturing User Input 捕捉用户输入
Sometimes we want to capture the text that a user inputs when she is interacting with our program. To prompt the user to type a line of input, call the Python function raw_input(). After saving the input to a variable, we can manipulate it just as we have done for other strings.
Enter some text: On an exceptionally hot evening early in July
>>> print "You typed", len(nltk.word_tokenize(s)), "words."
You typed 8 words.
The NLP Pipeline NLP流水线
Figure 3-1 summarizes what we have covered in this section, including the process of building a vocabulary that we saw in Chapter 1. (One step, normalization, will be discussed in Section 3.6.)
Figure 3-1. The processing pipeline: We open a URL and read its HTML content, remove the markup and select a slice of characters; this is then tokenized and optionally converted into an nltk.Text object; we can also lowercase all the words and extract the vocabulary.
There’s a lot going on in this pipeline. To understand it properly, it helps to be clear about the type of each variable that it mentions. We find out the type of any Python object x using type(x); e.g., type(1) is <int> since 1 is an integer.
When we load the contents of a URL or file, and when we strip out(删掉) HTML markup, we are dealing with strings, Python’s <str> data type (we will learn more about strings in Section 3.2):
>>> type(raw)
<type 'str'>
When we tokenize a string we produce a list (of words), and this is Python’s <list> type. Normalizing and sorting lists produces other lists:
>>> type(tokens)
<type 'list'>
>>> words = [w.lower() for w in tokens]
>>> type(words)
<type 'list'>
>>> vocab = sorted(set(words))
>>> type(vocab)
<type 'list'>
The type of an object determines what operations you can perform on it. So, for example, we can append to a list but not to a string:
>>> raw.append('blog')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'append'
Similarly, we can concatenate strings with strings, and lists with lists, but we cannot concatenate strings with lists:
>>> beatles = ['john', 'paul', 'george', 'ringo']
>>> query + beatles
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'list' objects
In the next section, we examine strings more closely and further explore the relationship between strings and lists.