Python challenge 3

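URL of the third challenge: http://www.pythonchallenge.com/pc/def/ocr.html
Hint1: recognize the characters. maybe they are in the book, but MAYBE they are in the page source.
Hint2: a comment in the page source says: find rare characters in the mess below; it is followed by a large pile of characters.
Clearly the task is to find the characters that occur least often in that pile, ignoring whitespace. Characters that occur equally often are kept in the order in which they appear.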

    import re
    import urllib

    # urllib to open the website
    response = urllib.urlopen("http://www.pythonchallenge.com/pc/def/ocr.html")
    source = response.read()
    response.close()

    # The whole HTML source has now been fetched
    print source

    # Get the contents of every HTML comment in the page
    data = re.findall(r'<!--(.*?)-->', source, re.S)

    # Extract the letters from the second comment (the "mess")
    charList = re.findall(r'([a-zA-Z])', data[1], re.S)
    print charList
    print ''.join(charList)

The final result is:

    ['e', 'q', 'u', 'a', 'l', 'i', 't', 'y']
    equality
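The script above gets the answer by picking out letters, which works because the mess happens to contain only symbols apart from the rare letters. The rule as stated (count everything, ignore whitespace, keep the rarest characters in order of appearance) can also be sketched directly. This is a minimal Python 3 sketch; the function name `rarest_characters` and the short `sample` string are made up for illustration and stand in for the real comment block.

```python
from collections import Counter


def rarest_characters(text):
    """Return the least frequent non-whitespace characters, in order of appearance."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return ""
    counts = Counter(visible)
    lowest = min(counts.values())
    return "".join(ch for ch in visible if counts[ch] == lowest)


# Hypothetical stand-in for the comment block; the real one is far longer.
sample = "%$%$e%$%$\n@@q%$%$u%$@"
print(rarest_characters(sample))  # equ
```

Counting first and filtering second keeps the original order of the rare characters, which matters because the answer is a word.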

####################################################################################################################################

Python's urllib library can fetch the data of a web page from a given URL so that it can then be analyzed.

    import urllib

    google = urllib.urlopen('http://www.google.com')
    print 'http header: ', google.info()
    print 'http status:', google.getcode()
    print 'url:', google.geturl()

    # result
    http header:  Date: Tue, 21 Oct 2014 19:30:35 GMT
    Expires: -1
    Cache-Control: private, max-age=0
    Content-Type: text/html; charset=ISO-8859-1
    Set-Cookie: PREF=ID=521bc5021bb6e976:FF=0:TM=1413919835:LM=1413919835:S=7cbCQWnhLCPJFOiw; expires=Thu, 20-Oct-2016 19:30:35 GMT; path=/; domain=.google.com
    Set-Cookie: NID=67=mzfYCxoBC3d9VaQC6-cXKIcbxt4eekorvE6lon1ZHQhLeVxasD2oeRKEG2In90zRAqNPQ1xLfzR_ha1ife0JqdJankdexWaFjZiQN2mLGjavWCfMBYETbFfIst08iNtR; expires=Wed, 22-Apr-2015 19:30:35 GMT; path=/; domain=.google.com; HttpOnly
    P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
    Server: gws
    X-XSS-Protection: 1; mode=block
    X-Frame-Options: SAMEORIGIN
    Alternate-Protocol: 80:quic,p=0.01

    http status: 200
    url: http://www.google.com

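We can fetch a page with urlopen and then call its read() method to get the full content.

info() returns the HTTP headers as an httplib.HTTPMessage object, which represents the header information sent back by the remote server.

getcode() returns the HTTP status. For an HTTP request, 200 means success and 404 means the URL was not found.

geturl() returns the URL the data actually came from.

Environment variables are handled by a different module: os.getenv reads an environment variable and os.putenv sets one.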
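For readers on Python 3, where urlopen lives in urllib.request, the same four calls can be sketched as follows. The helper name `fetch` is an assumption made for this example, and a `data:` URL is used only so that the sketch runs without network access; no HTTP exchange takes place for such a URL, so the status it reports is not a real HTTP status. Pass an http URL instead to see 200 or 404.

```python
from urllib.request import urlopen


def fetch(url):
    """Open url and collect the values discussed above into a dict."""
    with urlopen(url) as response:
        return {
            "status": response.getcode(),   # HTTP status for http(s) URLs
            "url": response.geturl(),       # URL the data came from
            "headers": response.info(),     # header object
            "body": response.read(),        # raw bytes of the content
        }


# Illustrative offline URL; replace with an http:// address for a real request.
result = fetch("data:text/plain;charset=utf-8,hello")
print(result["url"])
print(result["headers"])
print(result["body"])
```

In Python 3, read() returns bytes, so the body usually needs a decode() call before it can be searched with a text pattern.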

    print help(urllib.urlopen)

    # result
    Help on function urlopen in module urllib:

    urlopen(url, data=None, proxies=None)
        Create a file-like object for the specified URL to read from.

From the help text we can see that urlopen creates a file-like object for reading from the given URL.

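The url parameter is the path to the remote data, usually an http or ftp URL.

The data parameter is the data to submit to the URL. When it is supplied the request is sent as a POST; otherwise it is sent as a GET.

The proxies parameter holds the proxy settings.

urlopen returns a file-like object.

It has read(), readline(), readlines(), fileno(), close() and other methods, just like a file object.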
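The effect of the data parameter can be checked without sending anything. The following is a Python 3 sketch (urllib.request and urllib.parse rather than the Python 2 urllib used above); the helper `build_request`, the example.com address, and the form field are all made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, form=None):
    """Build (but do not send) a request; form is an optional dict of fields."""
    data = urlencode(form).encode("ascii") if form is not None else None
    return Request(url, data=data)


get_req = build_request("http://www.example.com/search")
post_req = build_request("http://www.example.com/search", {"q": "python"})

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(post_req.data)          # b'q=python'
```

Passing either Request object to urlopen would actually send it. In Python 3 the data must be bytes, which is why the encoded form is passed through encode().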

####################################################################################################################################

The re regular expression module in Python

re.match: matching a string against a pattern

    import re

    line = "Cats are smarter than dogs"

    matchObj = re.match(r'(.*) are (.*?) .*', line, re.M|re.I)

    if matchObj:
        print "matchObj.group() : ", matchObj.group()
        print "matchObj.group(1) : ", matchObj.group(1)
        print "matchObj.group(2) : ", matchObj.group(2)
    else:
        print "No match!!"

The result of the code above is:

    matchObj.group() :  Cats are smarter than dogs
    matchObj.group(1) :  Cats
    matchObj.group(2) :  smarter

As we can see, group() returns the whole match, while group(n) returns the n-th submatch. The pattern above has two capturing groups.

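The main function signature is re.match(pattern, string, flags)

pattern is the regular expression to match against.

string is the input string to be matched.

flags is optional. Several flags can be combined with |.

re.I or re.IGNORECASE: matching ignores the difference between upper and lower case.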

（Performs case-insensitive matching.）

re.S or re.DOTALL: "dot matches all" mode. It changes the behavior of '.', which then matches any character, including a newline.

（Makes a period (dot) match any character, including a newline.）

re.M or re.MULTILINE: multi-line mode. It changes the behavior of '^' and '$'.

（Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).）

re.L or re.LOCALE: makes the predefined character classes \w, \W, \b, \B, \s, \S depend on the current locale setting.

（Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B).）

re.U or re.UNICODE: makes the predefined character classes \w, \W, \b, \B, \s, \S depend on the character properties defined by Unicode.

（Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B.）

re.X or re.VERBOSE: verbose mode. In this mode a regular expression can span several lines, whitespace is ignored, and comments can be added.

（Permits "cuter" regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.）
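The effect of the most commonly used flags can be seen side by side in a short Python 3 sketch. The sample text and the phone-like pattern are invented for illustration.

```python
import re

text = "first line\nSecond line"

# re.I: case is ignored, so "FIRST" matches "first".
print(re.match(r"FIRST", text, re.I) is not None)           # True

# re.S: '.' also matches the newline, so the pattern can span both lines.
print(re.match(r"first.*Second", text) is None)             # True
print(re.match(r"first.*Second", text, re.S) is not None)   # True

# re.M: '^' matches at the start of every line, not only the start of the string.
print(re.findall(r"^\w+", text))        # ['first']
print(re.findall(r"^\w+", text, re.M))  # ['first', 'Second']

# re.X: whitespace and '#' comments inside the pattern are ignored.
phone = re.compile(r"""
    (\d{3})   # first group: three digits
    -         # a literal dash
    (\d{4})   # second group: four digits
""", re.X)
print(phone.match("555-1234").groups())  # ('555', '1234')
```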

re.search vs. re.match

    import re

    line = "Cats are smarter than dogs"

    matchObj = re.match(r'dogs', line, re.M|re.I)
    if matchObj:
        print "match --> matchObj.group() : ", matchObj.group()
    else:
        print "No match!!"

    searchObj = re.search(r'dogs', line, re.M|re.I)
    if searchObj:
        print "search --> searchObj.group() : ", searchObj.group()
    else:
        print "Nothing found!!"

    # result
    No match!!
    search --> searchObj.group() :  dogs

We can see that match only tries the pattern at the very beginning of the string; if it does not match there, there is no match at all.
search, on the other hand, scans through the whole string from start to end and returns the first place where the pattern matches.

re.sub

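The full signature is as follows:

re.sub(pattern, repl, string, count=0)

It replaces the parts of string that match pattern with repl. All matches are replaced unless count is given, in which case at most count replacements are made.
It then returns the modified string.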

    import re

    phone = "2004-959-559 # This is Phone Number"

    # Delete Python-style comments
    num = re.sub(r'#.*$', "", phone)
    print "Phone Num : ", num

    # Remove anything other than digits
    num = re.sub(r'\D', "", phone)
    print "Phone Num : ", num

    # result
    Phone Num :  2004-959-559
    Phone Num :  2004959559
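The example above always replaces every match. To show what the count limit does, here is a small Python 3 sketch; the input string is made up, and re.subn is included because it also reports how many replacements were made.

```python
import re

text = "a1b2c3"

replaced_all = re.sub(r"\d", "#", text)             # every digit is replaced
replaced_two = re.sub(r"\d", "#", text, count=2)    # only the first two digits
with_total = re.subn(r"\d", "#", text)              # (new string, number of replacements)

print(replaced_all)   # a#b#c#
print(replaced_two)   # a#b#c3
print(with_total)     # ('a#b#c#', 3)
```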

re.split(pattern, string, maxsplit=0)

re.split can be used to split a string. maxsplit is the maximum number of splits: maxsplit=1 means split once. The default is 0, which places no limit on the number of splits.

    import re

    print re.split(r'\W+', 'Words, words, words.')
    print re.split(r'(\W+)', 'Words, words, words.')
    print re.split(r'\W+', 'Words, words, words.', 1)

    # result
    ['Words', 'words', 'words', '']
    ['Words', ', ', 'words', ', ', 'words', '.', '']
    ['Words', 'words, words.']

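If the pattern matches at the start or the end of the string, the returned list begins or ends with an empty string.

    import re

    print re.split(r'(\W+)', '...words, words...')

    # result
    ['', '...', 'words', ', ', 'words', '...', '']

If the pattern does not match anywhere, a list containing the whole string is returned.

    import re

    print re.split('a', '...words, words...')

    # result
    ['...words, words...']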

####

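str.split('\s') and re.split('\s', str) both split a string and return a list, but there is a difference.

1. str.split('\s') splits the string on the literal text '\s', that is, a backslash followed by the letter s.

2. re.split('\s', str) splits on whitespace, because in a regular expression '\s' means a whitespace character.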
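The difference is easy to see on a string that contains whitespace but no literal backslash. This is a Python 3 sketch with a made-up input string.

```python
import re

text = "one two\tthree"

# str.split looks for the two literal characters backslash and s,
# which do not occur in text, so nothing is split.
literal = text.split("\\s")

# re.split treats \s as "any whitespace character".
by_regex = re.split(r"\s", text)

# str.split() with no argument also splits on runs of whitespace.
by_default = text.split()

print(literal)     # ['one two\tthree']
print(by_regex)    # ['one', 'two', 'three']
print(by_default)  # ['one', 'two', 'three']
```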

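re.findall(pattern, string, flags=0)

Finds all substrings matched by the regular expression and returns them as a list. The matches are returned in order, scanning from left to right. If nothing matches, an empty list is returned.

    import re

    print re.findall('a', 'bcdef')
    print re.findall(r'\d+', '12a34b56c789e')

    # result
    []
    ['12', '34', '56', '789']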
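One detail worth knowing, since the challenge script above uses a pattern with parentheses: when the pattern contains capturing groups, findall returns the groups rather than the whole match. A Python 3 sketch with an invented input string:

```python
import re

text = "a=1, b=22, c=333"

no_group = re.findall(r"\w=\d+", text)        # whole matches
one_group = re.findall(r"\w=(\d+)", text)     # only the captured part
two_groups = re.findall(r"(\w)=(\d+)", text)  # one tuple per match

print(no_group)    # ['a=1', 'b=22', 'c=333']
print(one_group)   # ['1', '22', '333']
print(two_groups)  # [('a', '1'), ('b', '22'), ('c', '333')]
```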

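re.compile(pattern, flags=0)
Compiles a regular expression and returns a RegexObject, on which you can then call the match or search method.

    prog = re.compile(pattern)
    result = prog.match(string)

is equivalent to

    result = re.match(pattern, string)

The first form allows the compiled regular expression to be reused.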
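Reuse pays off when the same pattern is applied many times, for example once per line of input. A minimal Python 3 sketch; the date pattern, the helper `extract_years`, and the sample lines are all made up for illustration.

```python
import re

# Compiled once, then used for every line.
DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def extract_years(lines):
    """Return the year of the first date found in each line that has one."""
    years = []
    for line in lines:
        found = DATE.search(line)
        if found:
            years.append(found.group(1))
    return years


logs = ["2014-10-21 request served", "no date here", "2015-12-03 page saved"]
print(extract_years(logs))  # ['2014', '2015']
```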

Original article: https://www.cnblogs.com/mengfanrong/p/5017231.html