  • Scraping Baidu and Google search results with Python


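    The requests module can be used to scrape search results from Baidu or Google. The sample code below targets Baidu; to use it with Google, look up Google's search URL format and swap it in.
    Write the terms to search for into a file, one per line. The script takes the path to that file as its first argument and saves the results, in the format defined in the code, to a file in the current directory.
    The code is as follows: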

    # coding=utf-8

    import os
    import random
    import sys
    import time
    import json
    import logging
    import datetime
    from urllib.parse import quote_plus

    import requests


    logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s - %(levelname)s - %(message)s')

    # Pool of User-Agent strings; one is picked at random per run
    USER_AGENT = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ '
        '(KHTML, like Gecko) Element Browser 5.0',
        'IBM WebExplorer /v0.94',
        'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
        'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
        'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
        'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) '
        'Version/6.0 Mobile/10A5355d Safari/8536.25',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/28.0.1468.0 Safari/537.36',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)',
    ]


    class GoogleSpider:

        def __init__(self, query_file):

            self.query_file = query_file
            # Baidu search URL; replace with Google's URL format to query Google
            self.temp_url = "https://www.baidu.com/s?wd={}"
            self.query_list = []
            self.url_list = []
            user_agent = random.choice(USER_AGENT)
            self.headers = {"User-Agent": user_agent}
            self.save_list = []

        def get_query(self):
            """
            Read the search terms from the query file
            :return:
            """

            if not os.path.exists(self.query_file):
                logging.error("Please check the query file path")
                # Stop here instead of falling through to open()
                raise FileNotFoundError(self.query_file)

            with open(self.query_file, "r", encoding="utf-8") as file:
                for word in file:
                    word = word.strip()
                    # Skip blank lines so no empty search is sent
                    if word:
                        self.query_list.append(word)

        def get_url_list(self):
            """
            Build the URL for every search term
            :return:
            """

            # Percent-encode each term so spaces and '&' do not break the URL
            self.url_list = [self.temp_url.format(quote_plus(query))
                             for query in self.query_list]

        def parse_url(self):
            """
            Request each URL, pausing one second between requests
            to reduce the chance of being flagged as a crawler
            :return:
            """

            # zip keeps each term paired with its URL, even if a term is repeated
            for word, url in zip(self.query_list, self.url_list):

                response = requests.get(url, headers=self.headers, timeout=10)

                if response.status_code != 200:
                    logging.error("Search request failed for {}".format(word))
                    time.sleep(1)
                    # Do not save the body of a failed request
                    continue

                save_format = dict()
                save_format["query"] = word
                save_format["html"] = response.content.decode()
                save_format["datetime"] = datetime.datetime.now().strftime('%Y%m%d')

                self.save_list.append(save_format)
                time.sleep(1)

        def write_to_file(self):
            """
            Save the collected results to a file, one JSON object per line
            :return:
            """

            with open("success_query.txt", "w", encoding="utf-8") as file:
                for content in self.save_list:
                    file.write(json.dumps(content, ensure_ascii=False))
                    file.write("\n")

            logging.info("See success_query.txt in the current directory")

        def run(self):

            # Read the search terms from the query file
            self.get_query()

            # Build the URL list
            self.get_url_list()

            # Send the requests and collect the data
            self.parse_url()

            # Write the data to the output file
            self.write_to_file()


    if __name__ == '__main__':
        try:
            query = sys.argv[1]
            google = GoogleSpider(query)
            google.run()

        except IndexError:
            logging.error("No query file given; pass its path as the first argument")

        except Exception as e:
            logging.error(e)
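
As a rough illustration of the "swap in Google's URL format" step, the sketch below keeps one URL template per engine and percent-encodes each search term before filling it in. The Baidu template is the one used in the script; the Google template, the `SEARCH_TEMPLATES` dict, and the helpers `build_search_url` and `load_queries` are assumptions made for this example and are not part of the original code. Google may also need extra parameters or block plain requests, so treat this only as a starting point.

```python
from urllib.parse import quote_plus

# URL templates keyed by engine name. The Baidu template comes from the
# script; the Google one is an assumption based on Google's public search
# URL and may need extra parameters in practice.
SEARCH_TEMPLATES = {
    "baidu": "https://www.baidu.com/s?wd={}",
    "google": "https://www.google.com/search?q={}",
}


def build_search_url(query, engine="baidu"):
    """Return the search URL for `query`, percent-encoding the query text."""
    return SEARCH_TEMPLATES[engine].format(quote_plus(query))


def load_queries(lines):
    """Strip each line and drop blank ones (the query file has one term per line)."""
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    # Hypothetical in-memory stand-in for the lines of a query file
    sample_lines = ["python requests\n", "\n", "web crawler\n"]
    for term in load_queries(sample_lines):
        print(build_search_url(term, engine="google"))
```

Encoding the term matters as soon as a query contains spaces or characters such as `&`, which would otherwise be read as part of the URL structure rather than the search text.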
  • Original article: https://www.cnblogs.com/skaarl/p/13624106.html