  • Scraping the novel 大主宰 (Da Zhu Zai) — first-generation crawler

    The functionality still needs work. It can scrape the whole book, but it is slow: no proxy is used and no request headers are set. Fortunately the site places no restrictions on crawlers.

import requests
from bs4 import BeautifulSoup

def get_url_list(url):
    # Fetch the table-of-contents page and build the full URL of every chapter
    content = requests.get(url).content
    soup = BeautifulSoup(content, 'lxml')
    url_list = []
    for i in soup.select('#list dl dd a'):
        temp = 'http://www.biquge.info/0_921/' + i.get('href')
        url_list.append(temp)
    return url_list

def get_date(url):
    # Fetch one chapter, strip the markup from its body and append it to a file
    content = requests.get(url).content
    soup = BeautifulSoup(content, 'lxml')
    soup1 = str(soup.select('#content'))
    text = soup1.replace('<br/>', '\n').replace('</div>', '\n').replace('<div id="content">', '')
    title = soup.select('.content_read .box_con .bookname h1')[0].get_text()
    f = open(r'F:\栋歌第一代爬虫.txt', 'a+', encoding='utf-8')
    f.write(title + '\n\n' + text)
    print(title)
    f.close()

if __name__ == "__main__":
    url = 'http://www.biquge.info/0_921/'
    url_list = get_url_list(url)
    for i in url_list:
        get_date(i)
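As noted above, the script sends no request headers, so every request carries the default `python-requests` User-Agent. A minimal sketch of adding a browser-like header (the User-Agent string and the `fetch` helper are illustrative assumptions, not part of the original script):

```python
import requests

# Illustrative browser-like headers; the User-Agent value is an assumption --
# any common browser UA string would do.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}

def fetch(url, timeout=10):
    # One helper for every request, so headers and a timeout are applied consistently
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return resp.content
```

Both `get_url_list` and `get_date` could then call `fetch(url)` instead of `requests.get(url).content`.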
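The post also admits the crawl is slow; the chapters are fetched strictly one at a time. One way to speed it up is a small thread pool, sketched below under the assumption that the per-chapter worker is refactored to *return* the chapter text rather than write to the file itself (concurrent appends would interleave chapters):

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(urls, worker, max_workers=8):
    # Run the per-chapter worker in a thread pool. pool.map preserves input
    # order, so the results can be written to the file sequentially afterwards.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, urls))
```

Usage would be `texts = scrape_all(url_list, fetch_chapter)` followed by a single sequential loop that writes each result, keeping the chapters in order.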
  • Original post: https://www.cnblogs.com/kangdong/p/8627354.html