  • Using Python feedparser

    feedparser bills itself as a "Universal feed parser" that handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds. Official page:

    https://pypi.python.org/pypi/feedparser/ 

    Basic usage

    >>> import feedparser
    >>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
    >>> d['feed']['title']             # feed data is a dictionary
    u'Sample Feed'
    >>> d.feed.title                   # get values attr-style or dict-style
    u'Sample Feed'
    >>> d.channel.title                # use RSS or Atom terminology anywhere
    u'Sample Feed'
    >>> d.feed.link                    # resolves relative links
    u'http://example.org/'
    >>> d.feed.subtitle                 # parses escaped HTML
    u'For documentation <em>only</em>'
    >>> d.channel.description          # RSS terminology works here too
    u'For documentation <em>only</em>'
    >>> len(d['entries'])              # entries are a list
    1
    >>> d['entries'][0]['title']       # each entry is a dictionary
    u'First entry title'
    >>> d.entries[0].title             # attr-style works here too
    u'First entry title'
    >>> d['items'][0].title            # RSS terminology works here too
    u'First entry title'
    >>> e = d.entries[0]
    >>> e.link                         # easy access to alternate link
    u'http://example.org/entry/3'
    >>> e.links[1].rel                 # full access to all Atom links
    u'related'
    >>> e.links[0].href                # resolves relative links here too
    u'http://example.org/entry/3'
    >>> e.author_detail.name           # author data is a dictionary
    u'Mark Pilgrim'
    >>> e.updated_parsed              # parses all date formats
    (2005, 11, 9, 11, 56, 34, 2, 313, 0)
    >>> e.content[0].value             # sanitizes dangerous HTML
    u'<div>Watch out for <em>nasty tricks</em></div>'
    >>> d.version                      # reports feed type and version
    u'atom10'
    >>> d.encoding                     # auto-detects character encoding
    u'utf-8'
    >>> d.headers.get('Content-type')  # full access to all HTTP headers
    u'application/xml'

    A standard item:

    <item>
    <title><![CDATA[厦门公交车放火案死者名单公布<br/>警方公布嫌犯犯罪证据]]></title>
    <link>http://www.infzm.com/content/91404</link>
    <description><![CDATA[6月11日下午,厦门BRT公交车放火案47名死亡者名单公布。厦门政府新闻办6月10日发布消息称,有证据表明,陈水总携带汽油上了闽DY7396公交车。且有多名幸存者指认其在车上纵火,致使整部车引起猛烈燃烧。经笔迹鉴定,陈水总6月7日致妻、女的两封绝笔书系陈水总本人所写。]]></description>
    <category>南方周末-热点新闻</category>
    <author>infzm</author>
    <pubDate>2013-06-11 11:24:32</pubDate>
    </item>

    What feedparser.parse() returns:

    >>> d = feedparser.parse(' ')
    >>> print d
    {'feed': {}, 'encoding': u'utf-8', 'bozo': 1, 'version': u'', 'namespaces': {}, 'entries': [], 'bozo_exception': SAXParseException('no element found',)}

    As the output shows, the result is a dictionary; feed is also a dictionary, and entries is a list.

  • Original article: https://www.cnblogs.com/youxin/p/3132713.html