  • Python Practice: Google’s Python Class

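    First, an introduction to regular expressions:

    1) Python provides the re module for regular expression support, so the first step is: import re

    2) A few commonly used functions: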

    match = re.search(pat, str)  
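    Notes: 1. match is a match object; use match.group() to get the matched text. If the search fails, re.search returns None, so check the result before calling group()
           2. search scans str from the beginning and stops at the first match
           3. The entire pattern must match, but it does not have to cover the entire string
           4. It first finds the leftmost position where the pattern can match, then extends the match as far to the right as possible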
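
The notes above can be checked with a small sketch. The sample strings and patterns here are made up for illustration only.

```python
import re

text = 'an example word:cat!!'          # hypothetical sample text
match = re.search(r'word:\w\w\w', text)

# search() returns None on failure, so test the result before calling group().
if match:
    found = match.group()               # the matched text
else:
    found = None

# The whole pattern must match, but it may match anywhere inside the string.
# The search settles on the leftmost match, then takes as many i's as it can.
leftmost = re.search(r'i+', 'piiig aiii').group()

# Here the pattern cannot be matched in full, so the result is None.
missing = re.search(r'igs', 'piiig')
```

With these inputs, found is 'word:cat', leftmost is 'iii' (the first run of i's, not the later one), and missing is None.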
     
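    list = re.findall(pat, str)  finds all matches and returns them as a list
    Notes: 1. The whole text of a file can be handed to findall via f.read()
           2. If the pattern contains two or more () groups, the result is a list of tuples; with a single group it is a list of that group's strings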
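
A minimal sketch of how the return shape of findall depends on the number of groups. The sample text and the e-mail-like patterns are assumptions chosen for illustration.

```python
import re

# Made-up sample text containing two e-mail-like strings.
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# No groups: a list of the full matched strings.
emails = re.findall(r'[\w.-]+@[\w.-]+', text)

# One group: a list of strings holding just that group for each match.
users = re.findall(r'([\w.-]+)@[\w.-]+', text)

# Two groups: a list of tuples, one tuple per match.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
```

Here emails is ['alice@google.com', 'bob@abc.com'], users is ['alice', 'bob'], and pairs is [('alice', 'google.com'), ('bob', 'abc.com')].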
     
    re.sub(pat, replacement, str)  finds all matches and replaces them; the replacement string can include \1, \2 to refer to the text of group(1), group(2)
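
A short sketch of back-references in the replacement string. The sample text and the replacement host name are hypothetical.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# \1 is the user name captured by group(1); the host part is swapped out.
# A raw string keeps the backslash in \1 from being treated as an escape.
rehosted = re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@yo-yo-dyne.com', text)

# \2 and \1 can also be reordered in the replacement.
swapped = re.sub(r'(\w+)@(\w+)', r'\2:\1', 'bob@abc')
```

Every match in the string is replaced, so both addresses in rehosted end in @yo-yo-dyne.com, and swapped becomes 'abc:bob'.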
     
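    3) Basic patterns
    Ordinary characters match themselves; metacharacters are treated specially: . ^ $ * + ? { [ ] \ | ( )
    .  matches any single character except \n
    \w matches a single word character [a-zA-Z0-9_]
    \W matches any single non-word character
    \b boundary between a word character and a non-word character
    \s matches a single whitespace character [ \n\r\t\f]
    \S matches any single non-whitespace character
    \t, \n, \r   tab, newline, return
    \d decimal digit [0-9]
    ^ start of the string, $ end of the string
    \ escapes the character that follows
    [] denotes a set of characters; inside the brackets . just means a literal dot, and [^...] inverts the set
    () group extraction; groups let you pull out parts of the matched text
    Repetition:
    + one or more occurrences
    * zero or more occurrences
    ? zero or one occurrence; placing ? after a quantifier (for example *? or +?) turns off greedy matching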
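
A few of these patterns in action, as a small sketch. All sample strings are made up for illustration.

```python
import re

html = '<b>foo</b> and <i>so on</i>'   # hypothetical sample

# Greedy: .* runs as far right as it can, up to the last '>'.
greedy = re.search(r'<.*>', html).group()

# Non-greedy: .*? stops at the first '>' that lets the pattern match.
lazy = re.search(r'<.*?>', html).group()

digits = re.findall(r'\d+', 'xx 123 y 45')     # runs of decimal digits
words = re.findall(r'\b\w+\b', 'hi, you_2!')   # word characters between boundaries
in_set = re.findall(r'[.x]', 'a.b x')          # inside [], the dot is a literal dot
negated = re.findall(r'[^ab]', 'abcab d')      # [^...] matches anything NOT in the set
```

The greedy search returns the whole sample string, while the non-greedy one returns only '<b>'.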

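    Bug fixed:

    WIN7 + MINGW:

    commands.getstatusoutput() wraps cmd in { ...; }, which causes the command to be misinterpreted on this platform, so the function needs to be corrected: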

    def getstatusoutput(cmd):
        """Return (status, output) of executing cmd in a shell."""
    
        import sys
        mswindows = (sys.platform == "win32")
    
        import os
        if not mswindows:
          cmd = '{ ' + cmd + '; }'
    
        pipe = os.popen(cmd + ' 2>&1', 'r')
        text = pipe.read()
        sts = pipe.close()
        if sts is None: sts = 0
        if text[-1:] == '\n': text = text[:-1]
        return sts, text
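
The commands module exists only in Python 2. On Python 3, subprocess.getstatusoutput offers the same (status, output) interface, so a patch like the one above may not be needed there. A minimal sketch, assuming Python 3 and a shell in which echo is available:

```python
import subprocess

# Runs the command through the shell and captures stdout and stderr together.
# Assumption: 'echo' is available in the default shell on this machine.
status, output = subprocess.getstatusoutput('echo hello')

# status is the exit code (0 on success); one trailing newline is stripped.
succeeded = (status == 0)
```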

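    Google’s Class covers the basics: strings, lists, sorting, dictionaries and files, regular expressions, and some utilities.

    The exercises cover strings and lists; regular expressions and files; and the utilities. Reference solutions are provided.

    The last exercise in particular extracts image URLs from a file, downloads the images, and generates an HTML file. With small changes it can be used to follow a website's content, which makes it good practice for beginners.

    Here is some code (it scrapes a specific section of the Sina photo page):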

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    # Copyright 2010 Google Inc.
    # Licensed under the Apache License, Version 2.0
    # http://www.apache.org/licenses/LICENSE-2.0

    # Google's Python Class
    # http://code.google.com/edu/languages/google-python-class/

    # Note: this is Python 2 code (print statement, urllib.urlretrieve).
    import os
    import re
    import sys
    import urllib


    """Logpuzzle exercise
    Given an apache logfile, find the puzzle urls and download the images.

    Here's what a puzzle url looks like:
    10.254.254.28 - - [06/Aug/2007:00:13:48 -0700] "GET /~foo/puzzle-bar-aaab.jpg HTTP/1.0" 302 528 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"
    """
    def grep_file(url):
      """Saves the page at url as test.html in the current directory."""
      filename = 'test.html'
      abspath = os.path.abspath(filename)
      urllib.urlretrieve(url, abspath)

    def read_urls(filename):
      """Returns a list of the puzzle urls from the given log file,
      extracting the hostname from the filename itself.
      Screens out duplicate urls and returns the urls sorted into
      increasing order."""
      # Adapted here: instead of parsing a log file, this pulls the image
      # urls out of the marked section of the saved HTML page.
      url = []
      piclist = []
      # Original logpuzzle code, kept for reference:
      # firstpart=re.search(r'(.*)_(.*)',filename)
      # if firstpart:
      #   first=firstpart.group(2)

      try:
        f = open(filename, 'rU')
        # Original logpuzzle code, kept for reference:
        # for line in f:
        #   urline=re.search(r'GET\s(.*\.jpg)\sHTTP/1.0',line)
        #   if urline:
        #     urlpart=urline.group(1)
        #     str='http://'+first+urlpart
        #     if str not in url:
        #       url.append(str)
        # The page is GBK-encoded; convert it to UTF-8 so that it can match
        # the UTF-8 markers in the pattern below.
        url = re.findall(r'<!--写真 start-->([\w\W]*?)<!--写真 end-->', f.read().decode('gbk').encode('utf-8'))
        f.close()
      except IOError as e:
        sys.stderr.write("I/O error({0}): {1}\n".format(e.errno, e.strerror))
      # Original logpuzzle sorting code, kept for reference:
      # def MyFn(name):
      #   base=os.path.basename(name)
      #   set=re.findall(r'(.*?)[-.]',base)
      #   if set:
      #     #print set[0],set[1],set[2]
      #     return set[2]
      #   else:
      #     return base
      # url=sorted(url,key=MyFn)
      # #url.sorted()
      for i in url:
        # extend() keeps the images of every section; plain assignment
        # would keep only those of the last section.
        piclist.extend(re.findall(r'<img src="(.*?)"', i))
      return piclist


    def download_images(img_urls, dest_dir):
      """Given the urls already in the correct order, downloads
      each image into the given directory.
      Gives the images local filenames img0, img1, and so on.
      Creates an index.html in the directory
      with an img tag to show each local image file.
      Creates the directory if necessary.
      """
      abspath = os.path.abspath(dest_dir)
      if not os.path.exists(abspath):
        os.mkdir(abspath)

      count = 0
      for i in img_urls:
        # os.path.join instead of a hard-coded '\' so the path also works
        # outside Windows.
        fn = os.path.join(abspath, 'img' + str(count))
        print 'Retrieving...' + fn
        urllib.urlretrieve(i, fn)
        count += 1

      # create html
      toshow = ''
      htmlpath = os.path.join(abspath, 'index.html')

      f = open(htmlpath, 'w')
      for i in range(count):
        toshow += '<img src="img' + str(i) + '">'
      f.write(toshow)
      f.close()  # the original had f.close without (), which never closed the file


    def main():
      args = sys.argv[1:]

      if len(args) > 0 and args[0] == '-h':
        print 'usage: [--todir dir]'
        sys.exit(1)

      todir = ''
      if len(args) > 0 and args[0] == '--todir':
        todir = args[1]
        del args[0:2]

      url = 'http://ent.sina.com.cn/photo/'
      grep_file(url)

      img_urls = read_urls('test.html')

      if todir:
        download_images(img_urls, todir)
      else:
        print '\n'.join(img_urls)

    if __name__ == '__main__':
      main()
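
The two-step extraction used in read_urls (first cut out the marked section, then collect the img tags inside it) and the index.html generation can be tried without any network access. This is a minimal sketch: the page fragment, the example.com URLs, and the helper names extract_section_images and build_index are hypothetical, and the real Sina markup may differ.

```python
import re

# Hypothetical page fragment; only the images between the markers are wanted.
page = (
    '<img src="http://example.com/banner.jpg">'
    '<!--写真 start-->'
    '<a href="#"><img src="http://example.com/a.jpg"></a>\n'
    '<img src="http://example.com/b.jpg">'
    '<!--写真 end-->'
)

def extract_section_images(html):
    """Return the img src values found between the start and end markers."""
    # [\w\W]*? matches any character, including newlines, non-greedily.
    sections = re.findall(r'<!--写真 start-->([\w\W]*?)<!--写真 end-->', html)
    images = []
    for section in sections:
        images.extend(re.findall(r'<img src="(.*?)"', section))
    return images

def build_index(count):
    """Build the index.html body that shows img0 .. img(count-1)."""
    return ''.join('<img src="img%d">' % i for i in range(count))

images = extract_section_images(page)
index_html = build_index(len(images))
```

The banner image outside the markers is skipped, so images holds only the a.jpg and b.jpg URLs, and index_html contains one img tag per downloaded file.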
  • Original article: https://www.cnblogs.com/westwind/p/2520569.html