  • Python: a novel scraper that beginners can understand at a glance

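    Back home this evening, I spent some time learning web scraping. Keep in mind that many websites are hard for a beginner to scrape, so here is a simple one. Read on: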


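    import urllib.request
    from bs4 import BeautifulSoup  # third-party package (beautifulsoup4); in PyCharm I had to install it manually
    import lxml  # same as above; this is the parser backend BeautifulSoup uses below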



    def getHtml(url, headers):
        # Send a GET request with the given headers and return the raw response bytes.
        req = urllib.request.Request(url=url, headers=headers)
        res = urllib.request.urlopen(req)
        html = res.read()
        return html

    def saveTxt(path, html):
        # Write raw bytes to a file; the with-block closes the file afterwards.
        with open(path, 'wb') as f:
            f.write(html)

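    def praseHtml(currentURL, headers, path):
        # html = html.decode('utf-8')
        chapter = 0
        flag = 1
        while flag:
            chapter = chapter + 1
            if chapter >= 30:  # cap the number of downloads; too much data would overwhelm the computer
                flag = 0  # stop after this chapter
            html = getHtml(currentURL, headers)
            savePath = path + "\\" + str(chapter) + ".txt"  # the backslash must be escaped inside a string
            soup = BeautifulSoup(html, "lxml")  # the parser name is "lxml"; I wrote "html" the first time, an easy mistake to make
            nameText = soup.find('h3', attrs={'class': 'j_chapterName'})
            contentText = soup.find('div', attrs={'class': 'read-content j_readContent'})
            result = nameText.getText() + ' ' + contentText.getText()
            result = result.replace('\xa0', ' ')  # turn non-breaking spaces into ordinary spaces
            # Open the file once, with an explicit encoding, and let the with-block close it.
            with open(savePath, "w", encoding="utf-8") as f:
                f.write(result)

            nextpage = soup.find('a', attrs={'id': 'j_chapterNext'})
            if nextpage:  # test the link that was found, not the built-in next()
                currentURL = "http:" + nextpage['href']
            else:
                currentURL = None
                flag = 0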

    def main():
        url = "https://www.readnovel.com/chapter/22160402000540402/107513768840595159"
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}  # you can look up the request headers in your own browser (F12 -> Network -> refresh)
        path = "D:\\novel"  # escaped backslash, otherwise "\n" is read as a newline; the folder must already exist
        praseHtml(url, headers, path)

    main()
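
    Before running the script against the real site, the three lookups in praseHtml can be tried offline. The sketch below is only an illustration under a few assumptions: the sample page is hand-written rather than copied from readnovel.com, extract_chapter is a hypothetical helper name, and the built-in "html.parser" is swapped in for "lxml" so nothing extra has to be installed. It also uses urljoin in place of the "http:" + href concatenation, on the assumption that the site's next-chapter link is protocol-relative (starts with "//").

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hand-written sample page (an assumption, not copied from the real site);
# it only mimics the three elements that praseHtml looks up.
SAMPLE_HTML = """
<html><body>
  <h3 class="j_chapterName">Chapter 1</h3>
  <div class="read-content j_readContent"><p>First line.</p><p>Second line.</p></div>
  <a id="j_chapterNext" href="//www.readnovel.com/chapter/1/2">Next</a>
</body></html>
"""


def extract_chapter(html, page_url):
    """Return (title, text, next_url); next_url is None when there is no next link."""
    # "html.parser" ships with Python; the script above uses "lxml" instead.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h3", attrs={"class": "j_chapterName"})
    content = soup.find("div", attrs={"class": "read-content j_readContent"})
    if title is None or content is None:
        raise ValueError("chapter title or body not found; the page layout may have changed")
    next_link = soup.find("a", attrs={"id": "j_chapterNext"})
    next_url = None
    if next_link is not None and next_link.get("href"):
        # urljoin resolves "//host/path" and relative links against the current page.
        next_url = urljoin(page_url, next_link["href"])
    text = content.get_text("\n", strip=True)
    return title.get_text(strip=True), text, next_url


title, text, next_url = extract_chapter(SAMPLE_HTML, "https://www.readnovel.com/chapter/1/1")
print(title)
print(text)
print(next_url)
```

    Checking for None before calling get_text gives a readable error when the page layout changes, instead of an AttributeError in the middle of the loop.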
    Learning never ends!
  • Original article: https://www.cnblogs.com/litinghappy/p/9180434.html