  • Beginner: Scraping a Douban page with BeautifulSoup

    # -*- coding: utf-8 -*-
    import os
    import urllib
    import urllib2
    from bs4 import BeautifulSoup

    # Request headers sent with every page fetch.
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        # Content-Length and Content-Type are left out: this is a GET request with no body.
        # The X-* entries look like response headers copied from the browser and are probably unnecessary.
        'X-Content-Type-Options': 'nosniff',
        'X-DAE-Node': 'daisy2b',
        'X-Douban-Mobileapp': '0',
        'X-Xss-Protection': '1; mode=block',
    }


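    def parse(html, downloader_Function):
        soup = BeautifulSoup(html, 'html.parser')
        # Every tag carrying rel="nofollow", whatever its tag name
        all_a = soup.find_all(rel="nofollow")
        for a in all_a:
            if 'src' not in a.attrs:
                # No src attribute: treat it as a plain link and print its target
                print a['href']
            else:
                # Has src: treat it as an image and download it under its alt text
                path = a['src']
                name = a['alt']
                downloader_Function(path, name)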

    def htmlContent(url):
        req = urllib2.Request(url, headers=headers)
        resp = urllib2.urlopen(req)
        html = resp.read()
        return html


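    def fileDownloader(path, fileName):
        currentDir = os.path.join(os.getcwd(), 'download')
        # urlretrieve fails if the target directory is missing, so create it first
        if not os.path.isdir(currentDir):
            os.makedirs(currentDir)
        filePath = os.path.join(currentDir, '%s.png' % fileName)
        urllib.urlretrieve(path, filePath)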

    def start():
        htmlText = htmlContent('https://movie.douban.com/')
        print htmlText
        parse(htmlText, fileDownloader)

    start()
    print(dir(BeautifulSoup))
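
The script above targets Python 2 (`urllib2`, `print` statements). As a rough sketch only, the two network helpers might look like this under Python 3. The snake_case names, `build_request`, `target_path`, the `timeout` parameter, and the `User-Agent` value are assumptions for illustration and do not come from the original post. `parse` is left out because only its `print` statement needs to become a `print()` call.

```python
import os
import urllib.request

# Assumption: a browser-like User-Agent. The original script sends none,
# and some sites reject urllib's default one.
REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh",
}


def build_request(url):
    """Create the request object without sending it."""
    return urllib.request.Request(url, headers=REQUEST_HEADERS)


def html_content(url, timeout=10):
    """Fetch a page and return its body as bytes (BeautifulSoup accepts bytes)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()


def target_path(file_name, directory="download"):
    """Return the local .png path for file_name, creating the folder if needed."""
    os.makedirs(directory, exist_ok=True)
    # Replace path separators so a title cannot point outside the folder.
    safe_name = file_name.replace("/", "_").replace(os.sep, "_")
    return os.path.join(directory, safe_name + ".png")


def file_downloader(url, file_name):
    """Download url into the download folder, named after file_name."""
    urllib.request.urlretrieve(url, target_path(file_name))
```

`build_request` is split out so the headers can be checked without touching the network. `html_content('https://movie.douban.com/')` would then return bytes that can be passed straight to `BeautifulSoup(html, 'html.parser')`.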

  • Original article: https://www.cnblogs.com/air-liyan/p/8422840.html