  • Grabbing Zhihu Daily with Python and saving each post as a txt file

    Preface

    A practice project, fairly simple (and it has a bug). Feedback welcome~

    What it does

    Grabs the content of today's Zhihu Daily and saves each post as a separate txt file, all collected in one folder whose name is the current date.

    Libraries used

    re, BeautifulSoup, sys, urllib2

    Notes

    1. The script runs on Linux with Python 2.7.x; to use it on Windows, just adapt the commands inside accordingly.

    2. The bug: when handling the “如何正确吐槽” column, only the first entry is grabbed (laziness kicked in).
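
    A possible fix, as an untested sketch: soup.find only returns the first match, so switching saveText over to find_all would keep every content block of a multi-part column:

    def saveText(text):
        soup = BeautifulSoup(text)
        filename = soup.h2.get_text() + ".txt"
        fp = open(filename, 'w')
        # find_all returns every matching block, not just the first one
        for block in soup.find_all('div', "content"):
            fp.write(block.get_text())
        fp.close()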

    3. Grabbing the content directly (as below) does not work; Zhihu has anti-scraping measures in place:

    urllib2.urlopen(url).read()

    Adding a request header fixes it (see getHtml below).

    4. Because zhihudaily.ahorn.me goes down from time to time, the script will occasionally throw an error (a retry sketch follows the code below).

    def getHtml(url):
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1', 'Referer': '******'}
        request = urllib2.Request(url, None, header)
        response = urllib2.urlopen(request)
        text = response.read()
        return text
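
    For the flaky site, one could wrap getHtml in a small retry loop. A minimal sketch (getHtmlSafe and its retries parameter are my own names, not part of the original script):

    import time

    def getHtmlSafe(url, retries=3):
        # retry a few times in case zhihudaily.ahorn.me is temporarily down
        for attempt in range(retries):
            try:
                return getHtml(url)
            except urllib2.URLError:
                time.sleep(2)  # short pause before the next attempt
        return None  # still down; the caller has to cope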

    5. For parsing the content you can use re directly, or just call BeautifulSoup's own functions (regular expressions scare me, so I went straight to bs), for example:

    def saveText(text):
        soup = BeautifulSoup(text)
        filename = soup.h2.get_text() + ".txt"
        fp = open(filename, 'w')
        content = soup.find('div', "content")
        content = content.get_text()
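
    For comparison, a rough regex equivalent of the soup.h2.get_text() call above, assuming the title sits in an <h2> tag (my sketch; this is exactly the kind of fragile pattern bs spares you from):

    def getTitle(text):
        # regex version of soup.h2.get_text(); breaks easily on messy HTML
        match = re.search(r'<h2[^>]*>(.*?)</h2>', text, re.S)
        if match:
            return match.group(1).strip()
        return "untitled"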

    show me the code

    #Filename: getZhihu.py
    import re
    import urllib2
    from bs4 import BeautifulSoup
    import sys

    reload(sys)
    sys.setdefaultencoding("utf-8")

    # get the html code
    def getHtml(url):
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1', 'Referer': '******'}
        request = urllib2.Request(url, None, header)
        response = urllib2.urlopen(request)
        text = response.read()
        return text

    # save the content of one post in a txt file
    def saveText(text):
        soup = BeautifulSoup(text)
        filename = soup.h2.get_text() + ".txt"
        fp = open(filename, 'w')
        content = soup.find('div', "content")  # find() keeps only the first block: see bug note above
        content = content.get_text()
    #   print content  # test
        fp.write(content)
        fp.close()

    # get the post urls from zhihudaily.ahorn.me
    def getUrl(url):
        html = getHtml(url)
    #   print html
        soup = BeautifulSoup(html)
        urls_page = soup.find('div', "post-body")
    #   print urls_page
        urls = re.findall('"((http)://.*?)"', str(urls_page))
        return urls

    # main() function
    def main():
        page = "http://zhihudaily.ahorn.me"
        urls = getUrl(page)
        for url in urls:
            text = getHtml(url[0])  # each findall match is a tuple; [0] is the full url
            saveText(text)

    if __name__ == "__main__":
        main()
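
    Note that the feature description above promises a folder named after the current date, but main() as posted never creates one. A minimal sketch of how it could (my own addition, untested; the "%Y-%m-%d" folder name is an assumption):

    import os
    import time

    def main():
        folder = time.strftime("%Y-%m-%d")  # folder named after today's date
        if not os.path.exists(folder):
            os.mkdir(folder)
        os.chdir(folder)  # saveText then writes its txt files in here
        urls = getUrl("http://zhihudaily.ahorn.me")
        for url in urls:
            saveText(getHtml(url[0]))

    Run it with python getZhihu.py on a machine with bs4 installed.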
  • Original post: https://www.cnblogs.com/wswang/p/4435203.html