  • 36. Scraping the Collins Dictionary

    Scraping the Collins Dictionary:

    # On using threads and processes
    # https://www.cnblogs.com/dylan9/p/9207366.html
    # On using process pools
    # https://www.cnblogs.com/huchong/p/7459324.html#_lab2_1_0
    import time
    
    import requests
    from lxml import etree
    from multiprocessing.dummy import Pool
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
    }
    
    # url = "https://www.collinsdictionary.com/zh/browse/english/"
    #
    # page_text = requests.get(url=url, headers=headers).text
    #
    # tree = etree.HTML(page_text)
    #
    # li_list = tree.xpath("//ul[@class='bLtr']/li/a/@href")[1:]
    pool = Pool(20)
    
    # One browse page per initial letter, a through z.
    base_url = "https://www.collinsdictionary.com/zh/browse/english/words-starting-with-"
    li_list = [base_url + letter for letter in "abcdefghijklmnopqrstuvwxyz"]
    
    # li_list = ["https://www.collinsdictionary.com/zh/browse/english/words-starting-with-a"]
    
    deep_url_list = []
    
    start = time.time()
    
    def get_urls(url):
        page_text2 = requests.get(url=url, headers=headers).text
        tree2 = etree.HTML(page_text2)
        url_list = tree2.xpath("//ul[@class='columns2 bL']/li/a/@href")
        deep_url_list.extend(url_list)
    
    
    def get_data(url):
        page_text3 = requests.get(url=url, headers=headers).text
        tree3 = etree.HTML(page_text3)
        data_li_list = tree3.xpath("//ul[@class='columns2 bL']/li")
        for li in data_li_list:
            data = li.xpath('./a/text()')[0]
            # Append one word per line.
            with open("word2.txt", "a", encoding="utf-8") as f:
                f.write(data + '\n')
    
    
    pool.map(get_urls, li_list)
    result = pool.map_async(get_data, deep_url_list)
    result.wait()
    print("Done")
    print("Elapsed:", time.time()-start)
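
The script above has every worker thread append to a shared list and reopen word2.txt for each word. One possible alternative, shown here only as a sketch, is to let each worker return its results and gather them from the return value of pool.map, then write the file once. The helper names collect and save_words are hypothetical and not part of the original post, and the sketch assumes each worker function returns a list.

```python
from itertools import chain
from multiprocessing.dummy import Pool  # thread-backed pool with the Pool API


def collect(worker, urls, threads=20):
    """Call worker(url) for every url in a thread pool.

    Assumes worker returns a list; the per-url lists are
    flattened into a single list in the order of `urls`.
    """
    with Pool(threads) as pool:
        return list(chain.from_iterable(pool.map(worker, urls)))


def save_words(words, path):
    """Write all words in a single pass, one word per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(word + "\n" for word in words)
```

With this shape, get_urls would return url_list instead of extending deep_url_list, so that deep_url_list = collect(get_urls, li_list). A variant of get_data that returns the words found on its page could then be passed to collect in the same way, and the combined result handed to save_words. Because pool.map preserves input order, the output file would follow the order of the browse pages rather than the order in which threads happen to finish.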
    
  • Original article: https://www.cnblogs.com/liuzhanghao/p/12700889.html