  • Resolving real URLs from Baidu search results: Python code

    Code

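    from urllib.parse import quote  # URL-encodes the search keyword

    # requests, BeautifulSoup and headers are defined in the example usage below.
    def parseBaidu(keyword, pagenum):
        """Yield the real URL behind each Baidu result link, or None if a page fails."""
        encoded = quote(keyword)
        keywordsBaseURL = 'https://www.baidu.com/s?wd=' + encoded + '&oq=' + encoded + '&ie=utf-8' + '&pn='
        # pn is a result offset, so page n starts at n * 10; fetch exactly pagenum pages.
        for pnum in range(int(pagenum)):
            baseURL = keywordsBaseURL + str(pnum * 10)
            try:
                response = requests.get(baseURL, headers=headers)
                soup = BeautifulSoup(response.text, "html.parser")
                for a in soup.select('div.c-container > h3 > a'):
                    # Each href is a Baidu redirect link; following it exposes the final URL.
                    url = requests.get(a['href'], headers=headers).url
                    yield url
            except requests.RequestException:
                # A request failed: signal it to the caller and move on to the next page.
                yield None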
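The function above resolves each result link with a full GET, which also downloads the target page. A lighter variant is sketched below, assuming the Baidu link answers with a redirect whose `Location` header carries the real address (this is an assumption about Baidu's behavior, not something the original code relies on). The name `resolve_real_url` and its `head` parameter are hypothetical and exist only for illustration.

```python
def resolve_real_url(href, head=None, timeout=10):
    """Return the redirect target of a result link without downloading the page.

    `head` is a hypothetical injection point for testing; by default it is
    requests.head. If the response has no Location header, the original
    href is returned unchanged.
    """
    if head is None:
        import requests  # imported lazily so the helper can be exercised offline
        head = requests.head
    # allow_redirects=False stops at the first response, so only headers are fetched.
    response = head(href, allow_redirects=False, timeout=timeout)
    return response.headers.get("Location", href)
```

If the assumption does not hold for a given link, the helper simply returns the link it was given, so the GET-based approach remains the fallback.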
    

    Example usage

    import requests
    from bs4 import BeautifulSoup
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0"
    }
    
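    # Paste the parseBaidu(keyword, pagenum) definition from above here.
    
    def main():
        for url in parseBaidu("keyword", 10):
            # parseBaidu yields None when a page could not be fetched; skip those.
            if url:
                print(url)
    
    if __name__ == "__main__":
        main()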
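The loop above only filters out the `None` markers. If the same address turns up more than once across pages, a small wrapper can drop the repeats as well. This is a minimal sketch; `unique_urls` is a hypothetical helper name, not part of the original post.

```python
def unique_urls(results):
    """Yield each non-empty URL once, in first-seen order."""
    seen = set()
    for url in results:
        # Skip the None markers that signal a failed page, and any repeats.
        if not url or url in seen:
            continue
        seen.add(url)
        yield url
```

It wraps the generator directly, for example `for url in unique_urls(parseBaidu("keyword", 10)): print(url)`, and stays lazy, so pages are still fetched one at a time.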
    
  • Original article: https://www.cnblogs.com/Akkuman/p/6963141.html