  • A Python novel scraper that beginners can follow at a glance

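    Back home this evening to learn some web scraping. Keep in mind that many sites are hard for a beginner to scrape, so let's start with a simple one. Read on: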


    import urllib.request
    from bs4 import BeautifulSoup  # I use PyCharm and had to install this package manually
    import lxml  # Same as above



    def getHtml(url, headers):
        # Send the request with the given headers and return the raw response bytes
        req = urllib.request.Request(url=url, headers=headers)
        res = urllib.request.urlopen(req)
        html = res.read()
        return html

    def saveTxt(path, html):
        # Write raw bytes to a file; the with block closes the file afterwards
        with open(path, 'wb') as f:
            f.write(html)

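    def praseHtml(currentURL, headers, path):
        # html = html.decode('utf-8')
        chapter = 0
        flag = 1
        while flag:
            chapter = chapter + 1
            if chapter >= 30:  # Limit how many chapters are downloaded; too much data would overwhelm the computer.
                flag = 0  # Stop after this chapter
            html = getHtml(currentURL, headers)
            savePath = path + "\\" + str(chapter) + ".txt"  # the backslash must be escaped inside the string
            soup = BeautifulSoup(html, "lxml")  # Note the parser name is "lxml"; the first time I wrote "html" by mistake, an easy slip that costs you
            nameText = soup.find('h3', attrs={'class': 'j_chapterName'})
            contentText = soup.find('div', attrs={'class': 'read-content j_readContent'})
            result = nameText.getText() + ' ' + contentText.getText()
            result = result.replace(' ', ' ')  # whitespace cleanup (the characters in the original post did not survive extraction)
            # Open the file once, with an explicit encoding, and close it when done
            with open(savePath, "w", encoding="utf-8") as f:
                f.write(result)

            nextpage = soup.find('a', attrs={'id': 'j_chapterNext'})
            if nextpage is not None and nextpage.get('href'):  # test the tag we found, not the builtin next
                currentURL = "http:" + nextpage['href']
            else:
                currentURL = None
                flag = 0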

    def main():
        url = "https://www.readnovel.com/chapter/22160402000540402/107513768840595159"
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}  # You can look up the request headers in the browser (F12 -> Network -> refresh)
        path = r"D:\novel"  # raw string, otherwise "\n" is read as a newline; the folder must already exist
        praseHtml(url, headers, path)

    main()
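
    The script builds both the next-chapter URL and the output file path by plain string concatenation. As a minimal sketch of an alternative, the standard library's urljoin and os.path.join can do the same jobs. The helper names next_chapter_url and chapter_path are hypothetical, and the href in the demo is made up for illustration. Note that urljoin keeps the scheme of the current page, so a protocol-relative link on an https page resolves to https rather than http.

```python
import os
from urllib.parse import urljoin


def next_chapter_url(current_url, href):
    """Resolve a (possibly protocol-relative) href against the current page URL.

    Returns None when there is no link to follow.
    """
    if not href:
        return None
    return urljoin(current_url, href)


def chapter_path(folder, chapter):
    """Build '<folder>/<chapter>.txt' with the platform's path separator."""
    return os.path.join(folder, str(chapter) + ".txt")


if __name__ == "__main__":
    current = "https://www.readnovel.com/chapter/22160402000540402/107513768840595159"
    # Hypothetical protocol-relative href, as a next-chapter link might provide
    print(next_chapter_url(current, "//www.readnovel.com/chapter/1/2"))
    print(chapter_path("novel", 3))
```

    In praseHtml, the two concatenation lines could then call these helpers instead.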
    Learning never ends!
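
    A possible refinement of the download loop in praseHtml: the flag variable and the chapter counter can be folded into a bounded for loop. The sketch below is only an illustration under assumptions. follow_chain is a hypothetical name, and the fetch argument stands in for "download a page and find its next link", so the example runs on a fake three-page site with no network access.

```python
def follow_chain(start_url, fetch, max_pages=30):
    """Follow 'next' links from start_url, visiting at most max_pages pages.

    fetch(url) must return a (text, next_url) tuple, where next_url is
    None on the last page.
    """
    pages = []
    url = start_url
    for _ in range(max_pages):
        if url is None:  # previous page had no next link
            break
        text, url = fetch(url)
        pages.append(text)
    return pages


if __name__ == "__main__":
    # Fake site standing in for real HTTP requests: url -> (text, next_url)
    site = {"p1": ("one", "p2"), "p2": ("two", "p3"), "p3": ("three", None)}
    print(follow_chain("p1", site.__getitem__))
    print(follow_chain("p1", site.__getitem__, max_pages=2))
```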
  • Original article: https://www.cnblogs.com/litinghappy/p/9180434.html