Python beginner practice: scraping web page data

This one is pretty good, and just right for a beginner to learn from.

1. It uses feedparser:
Tip: Use the Universal Feed Parser to tame RSS
http://www.ibm.com/developerworks/cn/xml/x-tipufp.html
Visit feedparser.org to learn more about the Universal Feed Parser, including downloads and documentation.

Actual download address for feedparser:
http://code.google.com/p/feedparser/downloads/list
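
The script in this post pulls in feedparser only for its encoding detection. As a rough idea of what such detection involves, here is a minimal Python 3 sketch that uses only the standard library. `guess_encoding` is a hypothetical helper, not part of feedparser, and it is much simpler than what the library does: it checks the Content-Type header first, then a `<meta>` declaration, and otherwise falls back to an assumed default.

```python
import re


def guess_encoding(content_type, body, default="utf-8"):
    """Guess a page's charset from its Content-Type header, then a <meta> tag.

    content_type: value of the HTTP Content-Type header (str, may be empty)
    body:         raw page bytes
    """
    # 1. Look for "charset=..." in the HTTP header value.
    match = re.search(r'charset=["\']?([\w.:-]+)', content_type or "", re.I)
    if match:
        return match.group(1).lower()
    # 2. Fall back to a <meta ... charset=...> declaration near the top of the page.
    match = re.search(rb'<meta[^>]+charset=["\']?([\w.:-]+)', body[:2048], re.I)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Nothing declared: use the caller's default (an assumption, not a detection).
    return default


print(guess_encoding("text/html; charset=GBK", b""))
print(guess_encoding("text/html", b'<meta charset="gb2312">'))
```

The header is checked before the page body because it is cheaper to read and, when present, is what the server claims to have sent.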

2. In addition, the output file needs a UTF-8 BOM header, which means writing hex-escaped bytes from Python:
http://linux.byexamples.com/archives/478/python-writing-binary-file/
Writing hex-escaped bytes in Python:
file.write("\x5F\x9D\x3E")
file.close()
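
For the BOM specifically, a minimal Python 3 sketch might look like the following. The function names and file paths are made up for illustration. It assumes Python 3, where bytes and text are separate types, so the BOM is either written as bytes in binary mode or emitted by the `utf-8-sig` codec.

```python
import codecs


def write_with_bom(path, text):
    """Write text as UTF-8, preceded by an explicit BOM."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8)       # the three bytes EF BB BF
        f.write(text.encode("utf-8"))


def write_with_sig_codec(path, text):
    """Produce the same bytes by letting the utf-8-sig codec emit the BOM."""
    # newline="" keeps "\n" as-is so both functions write identical bytes.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)
```

Both functions produce the same file. The second is usually less error-prone because the BOM cannot be forgotten or written twice.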

3. Because I need to debug, it is more convenient to change the file open mode to w.
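
Why w helps while debugging can be seen in a small Python 3 sketch. It assumes the alternative is append mode a, which the post does not state, and the scratch file name is hypothetical. With a, every rerun adds to the old output. With w, each run starts from an empty file.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "mode_demo.txt")

# Two "runs" in append mode: output accumulates.
for run in range(2):
    with open(path, "a") as f:
        f.write("run %d\n" % run)
with open(path) as f:
    appended = f.read()

# Two "runs" in write mode: each run truncates the file first.
for run in range(2):
    with open(path, "w") as f:
        f.write("run %d\n" % run)
with open(path) as f:
    overwritten = f.read()

print(repr(appended))
print(repr(overwritten))
```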
Python code:

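# Python 2 script: it uses the print statement and urllib.urlopen.
import urllib
import sys
import re
# Private feedparser helper, used here only to detect the page encoding.
from feedparser import _getCharacterEncoding as enc


class TagParser:
    """Extract whole elements from an HTML string with regular expressions."""

    def __init__(self, value):
        self.value = value

    def get(self, start, end):
        # Non-greedy body, so two elements on one line are not merged into one match.
        regx = re.compile(r'<' + start + r'.*?>.*?</' + end + r'>')
        return re.findall(regx, self.value)


if __name__ == "__main__":
    baseurl = "http://data.book.163.com/book/section/000BAfLU/000BAfLU"
    f = open("test_01.txt", "w")
    f.write("\xef\xbb\xbf")  # UTF-8 BOM
#    for ndx in range(0, 56):
    for ndx in range(0, 1):
        url = baseurl + str(ndx) + ".html"
        print "get content from " + url
        src = urllib.urlopen(url)
        text = src.read()

        # Dump the raw page to a temporary file for debugging.
        f1 = open("tmp_" + str(ndx) + ".txt", "w")
        f1.write(text)
        f1.close()

        # The first item of the result is the detected encoding.
        encoding = enc(src.headers, text)[0]

        tp = TagParser(text)
        title = tp.get('h1 class="f26s tC"', 'h1')
        article = tp.get('p class="ti2em"', 'p')

        # Strip the closing tag, then the opening tag, from the title.
        t = re.sub(r'</[^>]+>', ' ', title[0])
        t = re.sub(r'<[^>]+>', ' ', t)
        data = t

        c = ""
        for p in article:
            pt = re.sub(r'</p>', ' ', p)
            c += pt
        # [^>]+ stops at the first '>'; a greedy '.+' would swallow the text between tags.
        c = re.sub(r'<[^>]+>', ' ', c)
        data += c
        data = data.decode(encoding)
        f.write(data.encode('utf-8', 'ignore'))

    f.close()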
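
The tag-extraction idea in the script can also be tried offline, without fetching anything. The following is a minimal Python 3 sketch, not the author's code: `get_elements`, `strip_tags`, and the sample HTML are made up for illustration. It uses non-greedy matching with `re.DOTALL`, so adjacent elements stay separate and an element may span several lines.

```python
import re


def get_elements(html, start, end):
    """Return every <start ...>...</end> element found in html."""
    pattern = re.compile(
        r'<' + re.escape(start) + r'[^>]*>.*?</' + re.escape(end) + r'>',
        re.DOTALL,  # let '.' match newlines inside an element
    )
    return pattern.findall(html)


def strip_tags(fragment):
    """Replace each tag with a space, then collapse runs of whitespace."""
    text = re.sub(r'<[^>]+>', ' ', fragment)
    return ' '.join(text.split())


# Made-up sample that reuses the class names from the script.
sample = (
    '<h1 class="f26s tC">Chapter 1</h1>'
    '<p class="ti2em">First\nparagraph.</p><p class="ti2em">Second one.</p>'
)
title = get_elements(sample, 'h1 class="f26s tC"', 'h1')
paragraphs = get_elements(sample, 'p class="ti2em"', 'p')
print(strip_tags(title[0]))
print([strip_tags(p) for p in paragraphs])
```

Regular expressions are fine for a fixed, known page layout like this one. For arbitrary HTML, a real parser such as the standard library's `html.parser` is generally the safer choice.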

Original article: https://www.cnblogs.com/rrxc/p/4027974.html