  • python3: Web scraping with urllib and BeautifulSoup

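    I have been studying web scraping in the evenings, starting with the basics.

    Python 3 merged Python 2's urllib and urllib2 into a single urllib package. urllib can request and download a given web page, and BeautifulSoup can pick out the parts we need from messy HTML.

    Note: BeautifulSoup is a Python library for extracting data from HTML or XML files.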
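
As a minimal sketch of the extraction step, the snippet below parses an inline HTML string instead of a downloaded page, so it runs without network access. The markup, the example.com URLs in it, and the helper name `extract_links` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<div><h3><a href="https://example.com/post/1">First post</a></h3>
<p>unrelated text</p>
<h3><a href="https://example.com/post/2">Second post</a></h3>
<h3>Heading without a link</h3></div>
"""


def extract_links(html):
    """Return (href, title) pairs for every <h3> that wraps a link."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for heading in soup.find_all("h3"):
        anchor = heading.a  # first <a> inside the heading, or None
        if anchor is not None and anchor.get("href"):
            pairs.append((anchor["href"], anchor.get_text(strip=True)))
    return pairs


if __name__ == "__main__":
    for href, title in extract_links(SAMPLE_HTML):
        print(href, title)
```

Headings without a link are skipped rather than raising an error, which keeps the loop usable on pages whose structure is not fully known.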

    Example 1:

    from urllib import request
    from bs4 import BeautifulSoup as bs
    import re
    
    # Browser-like User-Agent sent with every request.
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
    }
    
    def download():
        """
        Request the cnblogs front page while imitating a browser, then save
        every post linked from an <h3> heading as a local HTML file.
        """
        for pageIdx in range(1, 3, 1):
            # Note: the part after '#' is a URL fragment and is not sent to
            # the server, so each iteration requests the same front page.
            url = "https://www.cnblogs.com/#p%s" % str(pageIdx)
            print(url)
            req = request.Request(url, headers=header)
            rep = request.urlopen(req).read()
            data = rep.decode('utf-8')
            #print(data)
            content = bs(data, 'html.parser')  # name the parser explicitly
            for link in content.find_all('h3'):
                if link.a is None or not link.a.get('href'):
                    continue  # heading without a usable link
                title = link.a.get_text(strip=True)
                print(link.a['href'], title)
                curhtmlcontent = request.urlopen(request.Request(link.a['href'], headers=header)).read()
                # Replace characters that are not allowed in file names.
                filename = re.sub(r'[\\/:*?"<>|]', '_', title)
                with open('%s.html' % filename, 'w', encoding='utf-8') as fp:
                    fp.write(curhtmlcontent.decode('utf-8'))
    
    if __name__ == "__main__":
        download()
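
One detail of the page URL in Example 1 is worth checking: the page number sits after `#`, which makes it a URL fragment. Fragments are handled on the client side and are not part of the HTTP request, so a plain urllib download of `#p1` and `#p2` asks the server for the same path. The sketch below uses `urllib.parse.urlsplit` to show this; the helper name `requested_part` is hypothetical, and how the site actually serves later pages is not covered here.

```python
from urllib.parse import urlsplit, urlunsplit


def requested_part(url):
    """Return the URL without its fragment, i.e. what the server is asked for."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


if __name__ == "__main__":
    for page in range(1, 3):
        url = "https://www.cnblogs.com/#p%s" % page
        print(url, "->", requested_part(url), "| fragment:", urlsplit(url).fragment)
```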
    

    Example 2:

    # -*- coding: utf-8 -*-
    import unittest
    import lxml  # not used directly; fails early if the lxml parser is missing
    import requests
    from bs4 import BeautifulSoup as bs
    
    def school():
        for index in range(2, 34, 1):
            try:
                url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-%s.html" % str(index)
                r = requests.get(url=url)
                soup = bs(r.content, 'lxml')
                # The text of the first cell with colspan="7" names the output file.
                city = soup.find_all(name="td", attrs={"colspan": "7"})[0].string
                # The rows of interest are the <tr> elements with height="29".
                content1 = soup.find_all(name="tr", attrs={"height": "29"})
                with open("%s.txt" % (city), "w", encoding="utf-8") as fp:
                    for content2 in content1:
                        try:
                            # Take the second <td> of each row.
                            soup_content = content2.find_all(name="td")[1].string
                            if soup_content is None:
                                continue  # cell has nested tags, so no single string
                            fp.write(soup_content + "\n")
                            print(soup_content)
                        except IndexError:
                            pass
            except IndexError:
                pass
    
    
    class MyTestCase(unittest.TestCase):
        def test_something(self):
            school()
    
    
    if __name__ == '__main__':
        unittest.main()
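
Example 2 reads each cell with `.string`. In BeautifulSoup, `.string` is `None` when a tag has more than one child, so concatenating it with a newline can raise `TypeError`, while `get_text()` always returns a string. A small sketch on made-up table rows (the markup and the name `second_cells` are assumptions, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical rows shaped like the ones Example 2 searches for.
SAMPLE_TABLE = """
<table>
<tr height="29"><td>1</td><td>Alpha University</td></tr>
<tr height="29"><td>2</td><td>Beta <b>College</b> (new)</td></tr>
<tr height="29"><td>only one cell</td></tr>
</table>
"""


def second_cells(html):
    """Return the text of the second <td> in every row with height="29"."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for row in soup.find_all("tr", attrs={"height": "29"}):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # row too short; skip instead of raising IndexError
        # get_text joins all nested strings, so mixed content is handled too.
        values.append(cells[1].get_text(" ", strip=True))
    return values


if __name__ == "__main__":
    print(second_cells(SAMPLE_TABLE))
```

Checking the number of cells up front replaces the `try/except IndexError` pattern; either approach works, this one just makes the skipped case explicit.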
    

    BeautifulSoup supports several HTML parsers; the main ones are listed below:

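    Parser | Usage | Advantages | Disadvantages
    Python standard library | BeautifulSoup(markup, "html.parser") | (1) Built into Python (2) Moderate speed (3) Tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 / 3.2.2
    lxml HTML parser | BeautifulSoup(markup, "lxml") | (1) Fast (2) Tolerant of malformed documents | Requires an external C library
    lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | (1) Fast (2) The only XML parser supported | Requires an external C library
    html5lib | BeautifulSoup(markup, "html5lib") | (1) Most tolerant of malformed documents (2) Parses pages the way a browser does (3) Produces valid HTML5 | (1) Slow (2) Requires an external Python package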
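
To see how these parsers differ in practice, one option is to feed the same malformed fragment to each of them. The sketch below does that for `<a></p>` and skips any parser that is not installed (BeautifulSoup raises `bs4.FeatureNotFound` in that case). The helper name `parse_with` is hypothetical, and the output for lxml and html5lib may vary with the installed versions.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Parse markup with the named parser; return None if it is unavailable."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


if __name__ == "__main__":
    for parser in ("html.parser", "lxml", "html5lib"):
        result = parse_with("<a></p>", parser)
        print(parser, "->", result if result is not None else "not installed")
```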
  • Original article: https://www.cnblogs.com/yinwei-space/p/9320640.html