
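Python Web Scraping: Parsing Books from Douban's "New Book Express" (新书速递)

1- Problem description

Scrape the book information (title, author, synopsis, URL) from Douban's "New Book Express" page^[1] and write the results to a txt file.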

2- Approach^[2]

Step1 Fetch the HTML

Step2 Walk the elements and attributes with XPath
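The two steps can be tried offline before touching the real page. The following is only a minimal sketch: it swaps in the standard library's `xml.etree.ElementTree` for lxml so that it runs without installing anything, and it parses a made-up, well-formed snippet instead of a fetched page. ElementTree supports only a subset of XPath and rejects malformed markup, so a real HTML page would still need `lxml.html`. The class names, URLs, and the `extract` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the HTML that Step1 would fetch.
SAMPLE = """
<html><body>
  <ul class="books">
    <li><a href="http://example.com/book/1">cover</a>
      <div class="detail"><h2> First Book </h2><p class="author"> A. Writer </p></div></li>
    <li><a href="http://example.com/book/2">cover</a>
      <div class="detail"><h2> Second Book </h2><p class="author"> B. Writer </p></div></li>
  </ul>
</body></html>
"""


def extract(markup):
    """Return (title, author, link) tuples found in the markup."""
    root = ET.fromstring(markup.strip())
    # Step2: path expressions select elements; attributes are read from the matches.
    titles = [h2.text.strip() for h2 in root.findall(".//div[@class='detail']/h2")]
    authors = [p.text.strip()
               for p in root.findall(".//div[@class='detail']/p[@class='author']")]
    links = [a.get("href") for a in root.findall(".//ul[@class='books']/li/a")]
    return list(zip(titles, authors, links))


books = extract(SAMPLE)
for title, author, link in books:
    print(title, author, link)
```

Collecting each field into its own list and zipping them afterwards only lines up when every book has every field, so it is worth comparing the list lengths when running this kind of extraction against a real page.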

3- Tools

Python, the lxml module, the requests module

4- Implementation

# -*- coding: utf-8 -*-
from lxml import html
import requests


page = requests.get('http://book.douban.com/latest?icn=index-latestbook-all')
tree = html.fromstring(page.text)

# If the HTML was saved to a local file, parse it like this instead
# page = open('/home/freyr/codeHouse/python/512.htm', 'r').read()
# tree = html.fromstring(page)

# Extract the book information
bookname = tree.xpath('//div[@class="detail-frame"]/h2/text()')    # title
author = tree.xpath('//div[@class="detail-frame"]/p[@class="color-gray"]/text()')    # author
info = tree.xpath('//div[@class="detail-frame"]/p[2]/text()')    # synopsis
url = tree.xpath('//ul[@class="cover-col-4 clearfix"]/li/a[@href]')    # <a> elements that carry the URL

# Strip surrounding whitespace from the text nodes and read href from each link
booknames = [x.strip() for x in bookname]
authors = [x.strip() for x in author]
infos = [x.strip() for x in info]
urls = [a.get('href') for a in url]

# Python 3: open the file as UTF-8 text so the strings can be written without manual encoding
with open('/home/freyr/codeHouse/python/dbBook.txt', 'w', encoding='utf-8') as f:
    for book, author, info, url in zip(booknames, authors, infos, urls):
        f.write('%s\n%s\n%s' % (book, author, info))
        f.write('\n%s\n' % url)
        f.write('\n-----------------------------------------\n\n')

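PS: 1. I have not really started learning web scraping yet, so this is just a quick record for now.

    2. The program runs into encoding issues^[3]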
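The post does not say which encoding problem came up. One common source of garbled Chinese text is decoding the response bytes with the wrong codec, and another is writing the file with a platform-dependent default encoding. The sketch below illustrates both using only the standard library; the sample string and the temporary file name are made up for the illustration. With requests, the raw bytes are available as `page.content` and the codec used to build `page.text` is `page.encoding`, so the same check can be applied there.

```python
import os
import tempfile

text = "豆瓣新书速递"

# What a server would send: the text encoded as UTF-8 bytes.
raw = text.encode("utf-8")

# Decoding with the wrong codec does not raise here; it silently yields garbled text.
garbled = raw.decode("latin-1")

# Decoding with the codec that produced the bytes restores the text.
decoded = raw.decode("utf-8")

# An explicit encoding keeps the output file independent of the platform default.
path = os.path.join(tempfile.gettempdir(), "encoding_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(decoded + "\n")

with open(path, encoding="utf-8") as f:
    round_trip = f.read().strip()

print(garbled != text, decoded == text, round_trip == text)
```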

[1] Douban: New Book Express (豆瓣-新书速递)

[2] lxml and Requests

[3] lxml and garbled Chinese text (lxml 中文乱码)


Original post: https://www.cnblogs.com/freyr/p/4500933.html