  • <Web Scraping with Python>:Chapter 1 & 2

    <Web Scraping with Python>

    Chapter 1 & 2: Your First Web Scraper & Advanced HTML Parsing

    • BeautifulSoup 

    Key:

        P5:

    • urllib or urllib2?

     If you’ve used the urllib2 library in Python 2.x, you might have noticed that things have changed somewhat between urllib2 and urllib. In Python 3.x, urllib2 was renamed urllib and was split into several submodules: urllib.request, urllib.parse, and urllib.error. Although function names mostly remain the same, you might want to note which functions have moved to submodules when using the new urllib.

    I had already used this package before reading the book (I started learning Python with 3.x, although macOS ships with Python 2.x). I hit an error at the time and found the answer on Stack Overflow. Since the book raises the same point, it is worth reviewing here: code written against Python 2's urllib2 has to import from the urllib.request, urllib.parse, and urllib.error submodules in Python 3.
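    As a minimal sketch of where things live in Python 3, the snippet below imports from each of the three submodules named above. The helper name `fetch`, the `User-Agent` value, and the example.com URLs are illustrative assumptions and do not come from the book.

```python
from urllib.error import HTTPError, URLError   # exceptions live in urllib.error
from urllib.parse import urljoin, urlparse     # URL handling lives in urllib.parse
from urllib.request import Request, urlopen    # fetching lives in urllib.request


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    Hypothetical helper for illustration only.
    """
    request = Request(url, headers={"User-Agent": "notes-example"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except (HTTPError, URLError):
        # HTTPError is a subclass of URLError; both are listed for clarity.
        return None


# URL parsing needs no network access.
parts = urlparse("http://example.com/pages/page1.html")
next_page = urljoin("http://example.com/pages/page1.html", "page2.html")
print(parts.netloc)  # example.com
print(next_page)     # http://example.com/pages/page2.html
```

    In Python 2 the equivalent fetch would have used `urllib2.urlopen`, which is the kind of import that fails when old example code is run under 3.x.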

      

        P15:

    • When to get_text() and When to Preserve Tags?

    .get_text() strips all tags from the document you are working with and returns a string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away and you’ll be left with a tagless block of text.

    Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Calling .get_text() should always be the last thing you do, immediately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible.
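    The advice above can be illustrated with a small sketch. The HTML snippet and variable names are made up for this example, and it assumes the `bs4` package is installed. It uses the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment for illustration.
html = """
<div id="content">
  <p>Read the <a href="/docs">documentation</a> first.</p>
  <p>Then try the <a href="/examples">examples</a>.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# While the tags are still intact, links are easy to locate by tag name.
links = [a["href"] for a in soup.find_all("a")]

# Strip the tags only at the end, once the structure is no longer needed.
paragraphs = [p.get_text() for p in soup.find_all("p")]

print(links)       # ['/docs', '/examples']
print(paragraphs)  # ['Read the documentation first.', 'Then try the examples.']
```

    Had `.get_text()` been called on the whole document first, the `href` values would already be gone and the links could no longer be recovered from the resulting string.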

        P16:

    • find() and findAll() with BeautifulSoup?
  • Original article: https://www.cnblogs.com/Chinawolfman/p/5436494.html