  • Fetching Zhihu Daily with Python and saving posts as txt files

    Preface

    A little practice project; it is fairly simple (and has a bug). Feedback is welcome~

    What it does

    Scrapes the current day's Zhihu Daily and saves each post as a separate txt file; the files are collected in a single folder named after the current date.
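
    The script below never actually shows that date-folder step, so here is a minimal sketch of what it might look like (makeDateDir is a hypothetical helper, using only os and time from the standard library):

    import os
    import time

    # hypothetical helper: create a folder named after today's date
    # (e.g. "2015-04-16") and switch into it, so the day's txt files land there
    def makeDateDir():
        dirname = time.strftime("%Y-%m-%d")
        if not os.path.exists(dirname):
            os.mkdir(dirname)
        os.chdir(dirname)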

    Libraries used

    re, BeautifulSoup, sys, urllib2

    Notes

    1. The script runs on Linux with Python 2.7.x; to use it on Windows, just adjust the relevant commands.

    2. Known bug: when handling the "如何正确吐槽" ("How to complain properly") section, only the first entry is fetched (laziness got the better of me).

    3. Fetching the content directly (as below) does not work, because Zhihu has anti-scraping measures in place:

    urllib2.urlopen(url).read()

    So adding a header (as in getHtml below) is all it takes.

    4. Because zhihudaily.ahorn.me goes down from time to time, the script will occasionally fail with errors; see the retry sketch after the next code block.

    def getHtml(url):
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1', 'Referer': '******'}
        request = urllib2.Request(url, None, header)
        response = urllib2.urlopen(request)
        text = response.read()
        return text
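
    To cope with note 4 (the flaky zhihudaily.ahorn.me), one option is a small retry wrapper around getHtml. This is only a sketch with an arbitrary retry count; getHtmlWithRetry is not part of the original script:

    def getHtmlWithRetry(url, retries=3):
        # hypothetical wrapper: retry a flaky fetch a few times before giving up
        for attempt in range(retries):
            try:
                return getHtml(url)
            except urllib2.URLError as e:
                print "fetch failed (%s), attempt %d of %d" % (e, attempt + 1, retries)
        raise IOError("giving up on " + url)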

    5. For the content parsing you can use re directly, or just call BeautifulSoup's functions (regular expressions make me nervous, so I went straight to bs), for example:

    def saveText(text):
        soup = BeautifulSoup(text)
        filename = soup.h2.get_text() + ".txt"
        fp = open(filename, 'w')
        content = soup.find('div', "content")
        content = content.get_text()
        fp.write(content)
        fp.close()
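
    For comparison, the same extraction with re alone might look roughly like this. This is only a sketch: it assumes the page really wraps the title in an <h2> tag and the body in a <div class="content">, and it will break if the markup differs.

    def saveTextRe(text):
        # hypothetical regex version of saveText; fragile if the markup changes
        title = re.search(r'<h2[^>]*>(.*?)</h2>', text, re.S).group(1)
        body = re.search(r'<div class="content"[^>]*>(.*?)</div>', text, re.S).group(1)
        body = re.sub(r'<[^>]+>', '', body)  # crudely strip any remaining tags
        fp = open(title.strip() + ".txt", 'w')
        fp.write(body)
        fp.close()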

    show me the code

    #Filename: getZhihu.py
    import re
    import urllib2
    from bs4 import BeautifulSoup
    import sys

    reload(sys)
    sys.setdefaultencoding("utf-8")

    # get the html code
    def getHtml(url):
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1', 'Referer': '******'}
        request = urllib2.Request(url, None, header)
        response = urllib2.urlopen(request)
        text = response.read()
        return text

    # save the content of one post in a txt file named after its title
    def saveText(text):
        soup = BeautifulSoup(text)
        filename = soup.h2.get_text() + ".txt"
        fp = open(filename, 'w')
        content = soup.find('div', "content")
        content = content.get_text()
    #   print content  # test
        fp.write(content)
        fp.close()

    # get the post urls from zhihudaily.ahorn.me
    def getUrl(url):
        html = getHtml(url)
    #   print html
        soup = BeautifulSoup(html)
        urls_page = soup.find('div', "post-body")
    #   print urls_page
        urls = re.findall('"((http)://.*?)"', str(urls_page))
        return urls

    # main() function
    def main():
        page = "http://zhihudaily.ahorn.me"
        urls = getUrl(page)
        for url in urls:
            # re.findall returns (url, scheme) tuples, so take url[0]
            text = getHtml(url[0])
            saveText(text)

    if __name__ == "__main__":
        main()
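
    To try it, save the script as getZhihu.py, create and cd into the day's folder (or wire in something like the makeDateDir sketch above), and run python getZhihu.py; each post is then written to a txt file named after its title.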
  • Original post: https://www.cnblogs.com/wswang/p/4435203.html