  • Python Web Scraping (1): Fetching Static Web Pages

    Getting the response content:

    import requests
    r=requests.get('http://www.santostang.com/')
    print(r.encoding)
    print(r.status_code)
    print(r.text)
    

    This retrieves the encoding, the status code (200 means success, 4xx a client error, 5xx a server error), the response text, and so on.
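
As a small illustration of the status classes listed above, the sketch below maps a status code to its class using only the standard library. The helper name `describe_status` is hypothetical and not part of requests; in real code you would pass it `r.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Classify an HTTP status code by its range (hypothetical helper)."""
    if 200 <= code < 300:
        kind = 'success'
    elif 400 <= code < 500:
        kind = 'client error'
    elif 500 <= code < 600:
        kind = 'server error'
    else:
        kind = 'other'
    try:
        phrase = HTTPStatus(code).phrase   # e.g. 'Not Found' for 404
    except ValueError:
        phrase = 'Unknown'                 # code is not in the standard table
    return f'{code} {phrase}: {kind}'

print(describe_status(200))   # 200 OK: success
print(describe_status(404))   # 404 Not Found: client error
print(describe_status(503))   # 503 Service Unavailable: server error
```

If you only need to stop on failure, requests also provides `r.raise_for_status()`, which raises `requests.exceptions.HTTPError` for 4xx and 5xx responses.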

    Customizing the request

    Passing URL parameters

    key_dict = {'key1':'value1','key2':'value2'}
    r=requests.get('http://httpbin.org/get',params=key_dict)
    print(r.url)
    print(r.text)
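
To see what `params` does to the URL without sending a request, the standard library's `urlencode` can reproduce the encoding for simple string values. This is only a sketch of the idea, not a description of how requests works internally; `build_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def build_url(base, params):
    """Append URL-encoded parameters to base (hypothetical helper)."""
    return base + '?' + urlencode(params)

key_dict = {'key1': 'value1', 'key2': 'value2'}
print(build_url('http://httpbin.org/get', key_dict))
# http://httpbin.org/get?key1=value1&key2=value2
```

With the same dictionary, `print(r.url)` in the example above should show an address of this form.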
    

    Customizing the request headers

    headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0','Host':'www.santostang.com'}
    r=requests.get('http://www.santostang.com',headers=headers)
    print(r.status_code)
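
The `Host` value in the headers has to agree with the host in the URL, and a server may answer differently if the two disagree. One way to keep them in sync, sketched below with the standard library only, is to derive `Host` from the URL. The helper `build_headers` and the constant `USER_AGENT` are hypothetical names; the HTTP stack normally fills in `Host` on its own, so setting it by hand is optional.

```python
from urllib.parse import urlsplit

# Shortened browser-style user agent, used here only as sample data
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')

def build_headers(url, user_agent=USER_AGENT):
    """Return a headers dict whose Host matches the URL (hypothetical helper)."""
    # netloc is the host part of the URL (it also carries a port if one is given)
    return {'user-agent': user_agent, 'Host': urlsplit(url).netloc}

headers = build_headers('http://www.santostang.com')
print(headers['Host'])   # www.santostang.com
```

The resulting dictionary can be passed as `headers=headers` exactly as in the example above.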

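    Sending a POST request

    A POST request sends form data in the request body, so values such as passwords do not appear in the URL. A dictionary passed as data is automatically form-encoded when it is sent.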

    key_dict = {'key1':'value1','key2':'value2'}
    r=requests.post('http://httpbin.org/post',data=key_dict)
    print(r.url)
    print(r.text)
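
"Form-encoded" here means the `application/x-www-form-urlencoded` format. The sketch below uses the standard library's `urlencode` to show roughly what such a body looks like; requests does this encoding itself when `data` is a dictionary. The field names and values are made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

form = {'user': 'alice', 'password': 'p@ss word'}   # made-up sample values

body = urlencode(form)    # the text that travels in the request body
print(body)               # user=alice&password=p%40ss+word

# The receiving side decodes the body back into the original fields
print(parse_qs(body))     # {'user': ['alice'], 'password': ['p@ss word']}
```

Special characters are percent-encoded and spaces become `+`, which is why the body differs from the raw dictionary values.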
    

    Timing out and raising an exception

    r=requests.get('http://www.santostang.com/',timeout=0.11)
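
With a timeout this short the call may well fail, and the exception stops the script unless it is caught. Below is a minimal sketch of handling it, assuming you would rather get `None` than a crash. `fetch_or_none` and its `get` parameter are hypothetical names introduced here; the parameter exists only so the function can be tried without a network connection.

```python
import requests

def fetch_or_none(url, timeout=0.11, get=requests.get):
    """Return the response, or None if the request times out (hypothetical helper)."""
    try:
        return get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both connection timeouts and read timeouts
        print('request timed out after', timeout, 'seconds')
        return None

def fake_get(url, timeout):
    """Stand-in for requests.get that always times out (for demonstration)."""
    raise requests.exceptions.Timeout('simulated timeout')

result = fetch_or_none('http://www.santostang.com/', get=fake_get)
print(result)   # None
```

Calling `fetch_or_none('http://www.santostang.com/')` without the `get` argument sends a real request with the same 0.11 second limit.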
    

      

    Fetching the Top 250 movie data

    import requests
    from bs4 import BeautifulSoup

    def get_movies():
        # Browser-like headers, sent with every request below
        headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0','Host':'movie.douban.com'}
        movie_list=[]
        link='https://movie.douban.com/top250'
        # 10 pages of 25 movies each: start=0,25,...,225
        for i in range(10):
            key_dict = {'start':i*25,'filter':''}
            r=requests.get(link,params=key_dict,headers=headers)
            print(r.status_code)
            print(r.url)

            # The 'lxml' parser requires the lxml package to be installed
            soup=BeautifulSoup(r.text,'lxml')
            # The title is read from the first <span> of the <a> inside each <div class="hd">
            div_list=soup.find_all('div', class_='hd')
            for each in div_list:
                movie=each.a.span.text.strip()+'\n'
                movie_list.append(movie)
        return movie_list

    def storFile(data,fileName,method='a'):
        # newline='' writes '\n' unchanged; utf-8 keeps the Chinese titles intact
        with open(fileName,method,newline='',encoding='utf-8') as f:
            f.write(data)

    movie_list=get_movies()
    for movie in movie_list:
        storFile(movie, 'movie top250.txt','a')
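
The scraper pages through the list by changing only the `start` parameter. To make the paging explicit, the sketch below lists the ten addresses using the standard library; `page_urls`, `BASE` and `PAGE_SIZE` are hypothetical names, and the page size of 25 is taken from `i*25` in the code above.

```python
from urllib.parse import urlencode

BASE = 'https://movie.douban.com/top250'
PAGE_SIZE = 25   # movies per page, as implied by start=i*25

def page_urls(pages=10):
    """Build one URL per result page (hypothetical helper)."""
    return [BASE + '?' + urlencode({'start': i * PAGE_SIZE, 'filter': ''})
            for i in range(pages)]

urls = page_urls()
print(urls[0])    # https://movie.douban.com/top250?start=0&filter=
print(urls[-1])   # https://movie.douban.com/top250?start=225&filter=
```

The `print(r.url)` calls in the scraper should show addresses of this shape, one per page.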
    

      

  • Original article: https://www.cnblogs.com/bai2018/p/10957787.html