zoukankan      html  css  js  c++  java
  • python提取页面信息beautifulsoup正则lxml

    beautifulsoup正则lxml

    # -*- coding: utf-8 -*-
    import re
    from urllib.request import urlopen
    from urllib.request import Request
    from bs4 import BeautifulSoup
    from lxml import etree
    
    #添加模拟浏览器协议头
    headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    url = "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC&kw=python&sm=0&p=1"
    req_timeout = 5
    req = Request(url=url,headers=headers)
    f = urlopen(req,None,req_timeout)
    s = f.read()
    s = s.decode('utf-8')
    
    ss = str(s)
    
    #lxml提取
    selector = etree.HTML(ss)
    links = selector.xpath('//tr/td[@class="zwmc"]/div/a/@href|//tr/td[@class="zwmc"]/div/a/text()')
    for link in links:
    	print(link)
    '''
    #beautifulsoup提取
    soup = BeautifulSoup(ss,'html.parser')
    aList = soup.find_all("tr")
    for item in aList:
    	aList1 = item.find_all("a")
    	for item1 in aList1:
    		print(item1.get('href'))
    		print(item1.get_text())
    		break
    	#print(item)
    	#print(item.get('href'))
    	#print(item.get_text())
    '''
    
    #正则提取
    '''
    mm = re.findall('<div style=" 224px;* 218px; _200px; float: left"><a style="font-weight: bold" par="(.*)" href="(.*)" target="_blank">(.*)</a>',ss)
    
    print(mm)
    '''
    

      

  • 相关阅读:
    spark streaming 整合kafka(二)
    JAVA设计模式之动态代理
    使用org.apache.commons.cli包来设计JAVA命令行工具
    HTML教程
    Java InputStream和Reader
    Java IO
    程序员怎么把自己的招牌打出去?
    Java设计模式之单例模式
    JAVA NIO
    Java文件流字节流和字符流的区别
  • 原文地址:https://www.cnblogs.com/yongxinboy/p/7783954.html
Copyright © 2011-2022 走看看