  • A Jianshu crawler: specify how many pages to fetch, and it scrapes each article's title, abstract, and link

    #coding=utf-8
    import requests
    from bs4 import BeautifulSoup

    m = input("How many pages to scrape: ")
    for i in range(1, int(m) + 1):  # range is half-open, so +1 to include the last page
        url = "https://www.jianshu.com/?page=" + str(i)
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
            'Accept': 'text/html, */*; q=0.01',
            'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
            'Accept-Encoding': 'gzip, deflate',
            'Referer': 'https://www.jianshu.com/',
            'X-INFINITESCROLL': 'true',        # Jianshu returns list-item fragments for its infinite scroll
            'X-Requested-With': 'XMLHttpRequest',
            'Connection': 'close',
        }
        html = requests.get(url=url, headers=headers)
        html.encoding = 'utf-8'  # set the encoding directly instead of re-encoding the decoded text
        soup = BeautifulSoup(html.text, 'html.parser')
        # print(soup.prettify())  # pretty-print the HTML for debugging
        titles = soup.find_all('a', 'title')         # <a class="title"> holds the title text and href
        abstracts = soup.find_all('p', 'abstract')   # <p class="abstract"> holds the summary
        with open(r"./文章简介.txt", "a", encoding='utf-8') as file:
            for title, abstract in zip(titles, abstracts):
                file.write(title.get_text(strip=True) + '\n')
                file.write(abstract.get_text(strip=True) + '\n')
                file.write("https://www.jianshu.com" + title.get('href') + '\n\n')

    print("Done. Output saved to: ./文章简介.txt")

    Environment: Python 3

    Modules: requests, bs4
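    The heart of the script is the class-based `find_all` lookup and the `zip` pairing of titles with abstracts. Here is a minimal offline sketch of that same extraction, run against a hand-written HTML fragment; the fragment, its article names, and its hrefs are invented for illustration, and only the `title`/`abstract` class names come from the script above:

    ```python
    from bs4 import BeautifulSoup

    # Hand-written sample mimicking Jianshu's article-list markup (invented data).
    sample = """
    <ul class="note-list">
      <li>
        <a class="title" href="/p/abc123">First post</a>
        <p class="abstract">A short summary of the first post.</p>
      </li>
      <li>
        <a class="title" href="/p/def456">Second post</a>
        <p class="abstract">Summary of the second post.</p>
      </li>
    </ul>
    """

    soup = BeautifulSoup(sample, 'html.parser')
    titles = soup.find_all('a', 'title')        # second positional arg filters by CSS class
    abstracts = soup.find_all('p', 'abstract')

    # Pair each title with its abstract and rebuild the absolute link.
    records = [
        (t.get_text(strip=True),
         a.get_text(strip=True),
         "https://www.jianshu.com" + t.get('href'))
        for t, a in zip(titles, abstracts)
    ]
    for title, abstract, link in records:
        print(title, '|', link)
    ```

    Note that `zip` silently pairs items in document order, so this only works because every list item carries exactly one title and one abstract; if either tag were missing, titles and abstracts would drift out of step.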

  • Original article: https://www.cnblogs.com/0day-li/p/9899842.html