Office Automation 67: Scraping All Articles of a Blog with Python and Saving Them as a Word Document with a Table of Contents

Scraping all articles of a blog with Python and saving them as a Word document with a table of contents

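import requests
from bs4 import BeautifulSoup

# First page of the blog's article list
url = 'http://blog.sina.com.cn/s/articlelist_5119330124_0_1.html'
wb_data = requests.get(url)
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(wb_data.content, 'html.parser')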

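Get the titles and links of all articles on the current page

soup.select('.atc_title')

Get the publication times of all articles on the current page

soup.select('.atc_tm')
'''Inspection shows that each link sits in the href attribute of an a tag,
so we call select on the a tag; the result is a new list that holds that tag.
'''
soup.select('.atc_title')[0].select('a')
'''
Use get("href") to obtain the link and the text attribute to obtain the title.
'''
soup.select('.atc_title')[0].select('a')[0].get("href")

soup.select('.atc_title')[0].select('a')[0].text

soup.select('.atc_tm')[0].text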
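
The chain of select calls above can be tried offline against a small hand-written snippet. The HTML below is made up for illustration and only imitates the class names `atc_title` and `atc_tm` used in this post; the markup of the real page may differ, and the helper name `extract_entries` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates the class names used in this post.
SAMPLE_HTML = """
<div>
  <span class="atc_title"><a href="http://example.com/post1.html">First post</a></span>
  <span class="atc_tm">2020-07-01 08:30</span>
</div>
<div>
  <span class="atc_title"><a href="http://example.com/post2.html">Second post</a></span>
  <span class="atc_tm">2020-07-02 09:45</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def extract_entries(soup):
    """Return (title, link, time) tuples, pairing each title with its timestamp."""
    titles = soup.select(".atc_title")
    times = soup.select(".atc_tm")
    entries = []
    for title_tag, time_tag in zip(titles, times):
        anchor = title_tag.select("a")[0]  # the a tag holds both text and href
        entries.append((anchor.text.strip(), anchor.get("href"), time_tag.text.strip()))
    return entries


entries = extract_entries(soup)
for entry in entries:
    print(entry)
```

Running it on a toy document like this is a cheap way to check the selectors before sending any request to the real site.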

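'''
Because we already know the author's articles span 5 pages, range(1, 6) is used directly.
The collected information is stored in the dictionary all_links,
with the title as the key and the article link and publication time as the value.
len(all_links) shows the number of article links collected: 211 articles in total.
'''

Get the links of all blog articles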

import requests
from bs4 import BeautifulSoup

all_links = {}
for page in range(1, 6):
    url = f'http://blog.sina.com.cn/s/articlelist_5119330124_0_{page}.html'
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.content, 'html.parser')
    links = soup.select('.atc_title')
    times = soup.select('.atc_tm')
    # Pair each title cell with its timestamp cell
    for link, post_time in zip(links, times):
        http_link = link.select('a')[0].get('href')
        title = link.text.strip()
        # The title is the key; the value is [link, publication time]
        all_links[title] = [http_link, post_time.text]
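
One thing to watch in the loop above: because the title is the dictionary key, two articles that share a title would collapse into a single entry. Below is a minimal sketch of an alternative that keys on the link instead. The function name `index_by_link` and the sample rows are hypothetical and not part of the original post.

```python
def index_by_link(rows):
    """Build {link: (title, time)} from (title, link, time) rows.

    Links are unique per article, so posts with the same title are both kept.
    """
    indexed = {}
    for title, link, post_time in rows:
        indexed[link] = (title, post_time)
    return indexed


# Made-up rows: two different posts share the title "Notes".
rows = [
    ("Notes", "http://example.com/a.html", "2020-07-01 08:30"),
    ("Notes", "http://example.com/b.html", "2020-07-02 09:45"),
    ("Other", "http://example.com/c.html", "2020-07-03 10:00"),
]

by_title = {title: [link, post_time] for title, link, post_time in rows}
by_link = index_by_link(rows)

print(len(by_title))  # the two "Notes" posts share one key
print(len(by_link))   # every post keeps its own key
```

If the count reported by len(all_links) is lower than the number of posts listed on the site, duplicate titles are one possible cause worth checking.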

#######################################
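'''
With all article links in hand, take one first to test how to get the text of a page.
Right-click the text and choose "Inspect": the content sits inside class="articalContent newfont_family",
so soup.select(".articalContent.newfont_family") retrieves it
(the element carries two classes, so the space between articalContent and newfont_family must be replaced with ".").
Store the result in the variable article and display it: it is one large list, and the text inside is the content we need.
'''

Get the text of a single article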

url = 'http://blog.sina.com.cn/s/blog_13122c74c0102zbt3.html'
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.content, 'html.parser')
article = soup.select(".articalContent.newfont_family")
article
article[0].text
# Remove non-breaking spaces; the escape is "\xa0", not the literal text "xa0"
article[0].text.replace("\xa0", "")
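
Text scraped from HTML often contains non-breaking spaces (U+00A0), which Python displays as `\xa0`. Here is a small standard-library sketch of that cleanup step. The helper name `clean_text` and the sample string are illustrative assumptions.

```python
def clean_text(raw):
    """Remove non-breaking spaces (U+00A0) and trim surrounding whitespace."""
    return raw.replace("\xa0", "").strip()


sample = "\xa0\xa0Paragraph one.\xa0\n"
cleaned = clean_text(sample)
print(repr(cleaned))

# Without the backslash the search string is the three characters x, a, 0,
# which do not occur in the sample, so nothing is removed.
unchanged = sample.replace("xa0", "")
print(unchanged == sample)
```

Replacing with an empty string suits Chinese text, where the non-breaking spaces are usually padding. For space-separated languages, replacing with a regular space may be the safer choice.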

Get the image link from a single article

url = 'http://blog.sina.com.cn/s/blog_13122c74c0102zbud.html'
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.content, 'html.parser')
# Read the image address from the real_src attribute of the first img tag
img_link = soup.select(".articalContent.newfont_family")[0].find_all("img")[0].get("real_src")

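Image download function

def downloadImg(img_url, file_path):
    # Fetch the image and write its raw bytes to file_path
    req = requests.get(url=img_url)
    with open(file_path, 'wb') as f:
        f.write(req.content)

# Replace the second argument with a path that exists on your machine
downloadImg(r'http://s8.sinaimg.cn/middle/005AsbCIzy7vEfdM1M599',r'F:python_2020python办公自动化实例67_Python爬取博客的所有文章并存为带目录的word文档images2.jpg')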
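
The call above writes to one hard-coded file. When images from many articles are downloaded, each needs its own path, otherwise later downloads overwrite earlier ones. The following is a sketch of one possible naming scheme, not the author's method. The helper `image_path`, the folder name `images`, and the `.jpg` extension are all assumptions, and the image URL shown in this post carries no extension.

```python
from pathlib import Path


def image_path(images_dir, article_no, image_no, ext=".jpg"):
    """Return a path such as images/003_02.jpg, unique per article and image."""
    return Path(images_dir) / f"{article_no:03d}_{image_no:02d}{ext}"


# Paths for three images belonging to the third article.
paths = [image_path("images", 3, n) for n in range(1, 4)]
for p in paths:
    print(p.name)
```

Each resulting path could then be passed as the file_path argument of downloadImg, after creating the folder with `Path("images").mkdir(exist_ok=True)`.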

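'''
Some articles are encrypted and their content cannot be retrieved; in that case the article variable is empty,
so an if statement guards against a crash.
Each time an article is written, the counter is incremented by 1 and a message is printed.
'''

#########################################################

Write the titles and content to a Word file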

```
import docx
from docx.oxml.ns import qn  # needed to set the East Asian (Chinese) font

def to_word(all_links):
    header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"}
    doc = docx.Document()  # create a new Word document
    doc.styles['Normal'].font.name = u'宋体'
    doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')

    counter = 0  # number of articles written to the document
    for title in all_links.keys():
        doc.add_heading(title, 1)
        date = all_links[title][1][:10]  # keep the date, drop the time
        doc.add_paragraph(date)
        wb_data = requests.get(all_links[title][0], headers=header)
        soup = BeautifulSoup(wb_data.content, 'html.parser')
        article = soup.select(".articalContent.newfont_family")
        # Encrypted articles return no content, so article is empty; skip them
        if article:
            text = article[0].text.replace("\xa0", "")
            doc.add_paragraph(text)
            print(f"Wrote article {title}.")
            counter += 1
    print(f"Wrote {counter} articles in total.")
    # python-docx writes the .docx format, so save with a .docx extension
    doc.save("新浪微博文章2.docx")

to_word(all_links)

print("succeeded")
```
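
The slice `[:10]` in `to_word` keeps only the date, which works when a timestamp such as `2020-07-12 10:30` carries the date in its first ten characters. That exact format is an assumption inferred from the slice; the post does not show a raw timestamp. Here is a hedged sketch that parses the value instead, so a malformed timestamp is noticed rather than silently truncated (the helper name `date_only` is hypothetical):

```python
from datetime import datetime


def date_only(timestamp, fmt="%Y-%m-%d %H:%M"):
    """Return the YYYY-MM-DD part of a timestamp, or None if it does not parse."""
    try:
        return datetime.strptime(timestamp.strip(), fmt).date().isoformat()
    except ValueError:
        return None


print(date_only("2020-07-12 10:30"))
print(date_only("not a date"))
```

For well-formed input the result matches the plain slice. The difference only shows up when the site returns something unexpected.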
################### end ########################

Original article: https://www.cnblogs.com/quezesheng/p/13289196.html
