zoukankan      html  css  js  c++  java
  • 使用python进行re拆分网页内容

    这里简短的总结一下而不是完全的罗列python的re模块,python的re具有强大的功能,如下是一个从我们学校抓取数据然后拆分的程序,代码如下:

    import httplib
    import urllib
    import re
    import sys
    reload(sys)
    
    sys.setdefaultencoding("utf-8")
    
    parameters = "__EVENTTARGET=&__EVENTARGUMENT=&__LASTFOCUS=&__VIEWSTATE=%2FwEPDwUKLTYwNjgwNDAyOQ8WBB4Jcm9vbXRhYmxlBQ9qZGRhdGFfcm9vbXZpZXceCWRhdGF0YWJsZQULamRkYXRhX3ZpZXcWAgIDD2QWBgIDDxBkZBYBAgFkAgUPEA8WBB4NRGF0YVRleHRGaWVsZAUIUk9PTU5BTUUeC18hRGF0YUJvdW5kZ2QQFRIPMDflj7flhazlr5MgICAgDzA45Y%2B35YWs5a%2BTICAgIA8wOeWPt%2BWFrOWvkyAgICAPMTDlj7flhazlr5MgICAgDzEy5Y%2B35YWs5a%2BTICAgIA8xM%2BWPt%2BWFrOWvkyAgICAPMTTlj7flhazlr5MgICAgDzE15Y%2B35YWs5a%2BTICAgIA8xNuWPt%2BWFrOWvkyAgICAPMTflj7flhazlr5MgICAgDzE45Y%2B35YWs5a%2BTICAgIA4xOeWPt%2BalvCAgICAgIA4yMOWPt%2BalvCAgICAgIA7mnKznp5E0ICAgICAgIA7mnKznp5E1ICAgICAgIA7mnKznp5E2ICAgICAgIA7noJTnqbYyICAgICAgIA7noJTnqbYzICAgICAgIBUSDzA35Y%2B35YWs5a%2BTICAgIA8wOOWPt%2BWFrOWvkyAgICAPMDnlj7flhazlr5MgICAgDzEw5Y%2B35YWs5a%2BTICAgIA8xMuWPt%2BWFrOWvkyAgICAPMTPlj7flhazlr5MgICAgDzE05Y%2B35YWs5a%2BTICAgIA8xNeWPt%2BWFrOWvkyAgICAPMTblj7flhazlr5MgICAgDzE35Y%2B35YWs5a%2BTICAgIA8xOOWPt%2BWFrOWvkyAgICAOMTnlj7fmpbwgICAgICAOMjDlj7fmpbwgICAgICAO5pys56eRNCAgICAgICAO5pys56eRNSAgICAgICAO5pys56eRNiAgICAgICAO56CU56m2MiAgICAgICAO56CU56m2MyAgICAgICAUKwMSZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZGQCFQ88KwANAGQYAQUJR3JpZFZpZXcxD2dks72pwRhFZXy7shqK0FV%2BHo%2BV6xw%3D&__EVENTVALIDATION=%2FwEWIwLCqrCECgKehO%2FXDgKS2sqQDQKbhO%2FXDgLvo6%2FWAQKchO%2FXDgKco5mFBAKo7ZuOCQKQtOGrAwLGtc2eAwKUkP3jDgKphpG2AgL3ot33AgL3ov2mCALP9anUDQLO9e2UAQLO9fEwAsHtjeQDAsHtlaACAsHtmdwCAsHtnfwCAs7toZgNAs7tpbgNAs7tqdQNAsHt7ZQBApnz9msChpiS3QMCtcKkWgL%2BhMCpBAK7ovXVAwLVvLqTBQKewdn%2BDgLeuZHECgK8w4S2BAKjm5WMBhrpaK%2FPVR7L%2BngMlHOw%2B5OLj989&DistrictDown=%E5%98%89%E5%AE%9A%E6%A0%A1%E5%8C%BA&BuildingDown=12%E5%8F%B7%E5%85%AC%E5%AF%93++++&RoomnameText="+sys.argv[1]+"&Submit=%E6%9F%A5%E8%AF%A2"
    
    headers =  {"Content-type": "application/x-www-form-urlencoded","Accept": "text/plain"}
    
    conn = httplib.HTTPConnection("nyglzx.tongji.edu.cn")
    
    conn.request("POST","/web/datastat.aspx",parameters,headers)
    
    response = conn.getresponse()
    
    print response.status,response.reason
    
    result = response.read()
    
    pattern = r'<td><font color="Black">d+-d+-d+</font></td><td><font color="Black">d+,d+.d+</font></td><td><font color="Black">d+,d+.d+</font></td><td><font color="Black">d+.d+</font></td>'
    
    matchs = re.findall(pattern,result)
    
    pattern = r'<td><font color="Black">(d+-d+-d+)</font></td><td><font color="Black">(d+,d+.d+)</font></td><td><font color="Black">(d+,d+.d+)</font></td><td><font color="Black">(d+.d+)</font></td>'
    
    for i in matchs:
    	tm = re.match(pattern, i)
    	print tm.group(1),tm.group(2),tm.group(3),tm.group(4)
    

      这里面的re模块主要用到了两个,一个是result = re.match(pattern,content), 通过result.group(1:n)来访问pattern中以()括起来的内容。另一个是result = re.findall(pattern,content),它的结果用for来访问或者result[index]来访问即可了。

  • 相关阅读:
    洛谷 P3040 [USACO12JAN]贝尔分享Bale Share
    洛谷 P1994 有机物燃烧
    洛谷 P3692 [PUB1]夏幻的考试
    洛谷 P2117 小Z的矩阵
    洛谷 P1154 奶牛分厩
    洛谷 P1718 图形复原
    洛谷 P1900 自我数
    洛谷 P1964 【mc生存】卖东西
    洛谷 P1123 取数游戏
    hdu_2844_Coins(多重背包)
  • 原文地址:https://www.cnblogs.com/luomingchuan/p/3776049.html
Copyright © 2011-2022 走看看