    Ruby Screen-Scraper in 60 Seconds

    I often find myself trying to automate content extraction from a saved HTML file or a remote server. I've tried a number of approaches over the years, but the dynamic duo of Hpricot and Firebug blew me away - this is by far the fastest way to get what you want without compromising flexibility. Hpricot is an extremely powerful Ruby-based HTML parser, and Firebug is arguably the best on-the-fly development add-on for Firefox. Now, I said it will take you about 60 seconds. I lied; it should take less. Let's get right to it.
    Introducing open-uri

    Ruby comes with a very flexible, production-ready library that wraps all HTTP/HTTPS connections into a single method call: open. Among other things, open-uri will gracefully handle HTTP redirects, allow you to specify custom headers, and even work with FTP addresses. In other words, all the dirty work is already done, but you should still check the RDoc. I'll let the code speak for itself:

    require 'rubygems'
    require 'open-uri'

    @url = "http://www.igvita.com/blog"
    @response = ''

    # open-uri RDoc: http://stdlib.rubyonrails.org/libdoc/open-uri/rdoc/index.html
    open(@url, "User-Agent" => "Ruby/#{RUBY_VERSION}",
               "From" => "email@addr.com",
               "Referer" => "http://www.igvita.com/blog/") { |f|
      puts "Fetched document: #{f.base_uri}"
      puts "\t Content Type: #{f.content_type}\n"
      puts "\t Charset: #{f.charset}\n"
      puts "\t Content-Encoding: #{f.content_encoding}\n"
      puts "\t Last Modified: #{f.last_modified}\n\n"

      # Save the response body
      @response = f.read
    }
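
    If you are scraping a saved HTML file instead of a live URL, or you want the script to survive a failed request, @response can be filled in a couple of other ways. Here is a minimal sketch continuing from the snippet above - the local file name and the rescue behavior are my own additions, not part of the original script:

    # Option 1: read a previously saved copy of the page from disk
    # (the file name is hypothetical)
    @response = File.read("blog.html") if File.exist?("blog.html")

    # Option 2: fetch remotely, but rescue the OpenURI::HTTPError that
    # open-uri raises on a 404/500 so the script doesn't simply crash
    begin
      @response = open(@url, "User-Agent" => "Ruby/#{RUBY_VERSION}") { |f| f.read }
    rescue OpenURI::HTTPError => e
      puts "Request failed: #{e.message}"
    end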


    Firebug kung-fu

    Now that we have the document, we need to pull out the content that interests us - usually this is the tedious part, built on regular expressions, stream parsers, and so on. Instead, we're going to sidestep all of these issues and let Firebug do its magic. First, install the extension, then, while on this page, click in the bottom-right corner of your browser to bring it up. It should ask you if you want to enable Firebug (hint: say yes). You should now be greeted with the following screen:

    For the sake of an example, assume that we want to extract three things out of this very page: some quoted text (sample below), number of comments, and the list of my latest posts found at the bottom of this page. Here is an example of quoted text:

    So which came first, the parser, which will extract this, or this quote? - Extract me!

    In your Firebug window, click "Inspect" and hover your mouse over the quote. You will notice that Firebug navigates to the exact part of the DOM tree (HTML source code) as you do this. When you put your mouse over the quote, you should see the following:

    Here's the trick: right-click on the selected blockquote element in your Firebug window and select "Copy XPath". This puts the exact drill-down path through the DOM tree on your clipboard. In our case it should contain: "/html/body/div[2]/div/div/blockquote".
    Hpricot magic

    It is at this point that Hpricot comes into the picture, and you have probably guessed it already - it supports XPath. All we need to do is pass our HTML to it to build the internal tree, and then we're ready to go:

    # Hpricot RDoc: http://code.whytheluckystiff.net/hpricot/
    doc = Hpricot(@response)

    # Retrieve the number of comments
    # - Hover your mouse over the 'X Comments' heading at the end of this article
    # - Copy the XPath and confirm that it's the same as shown below
    puts (doc/"/html/body/div[3]/div/div/h2").inner_html

    # Pull out the first quote
    # - Note that we don't have to use the full XPath, we can simply search for all quotes
    # - Because this function can return more than one element, we only look at 'first'
    puts (doc/"blockquote/p").first.inner_html

    # Pull out all other posted stories and their post dates
    # - This search function will return multiple elements
    # - We print the date, and then the article name beside it
    (doc/"/html/body/div[4]/div/div[2]/ul/li/a/span").each do |article|
      puts "#{article.inner_html} :: #{article.next_node.to_s}"
    end



    As you can see, I provided a few other examples, but the idea is simple: open Firebug, navigate to the component you want to extract, copy the XPath, paste it straight into Hpricot's search function, and print out the results. How simple is that? I should also mention that Hpricot is not limited to XPath, nor did my examples cover all of its functionality - I strongly encourage you to check the official Hpricot page for more tips and tricks.
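
    For instance, Hpricot also understands CSS-style selectors in the same search call, which are often shorter and more resilient to page redesigns than the long auto-generated XPath strings. A quick sketch - the selectors below are my own illustrative guesses at this page's markup, not copied from Firebug:

    # Same parsed document as before
    doc = Hpricot(@response)

    # CSS selectors go into the same search function as XPath
    puts (doc/"blockquote p").first.inner_html   # first quoted paragraph
    puts doc.search("h2").first.inner_html       # first h2 heading on the page

    # doc % selector returns only the first match
    puts (doc % "blockquote p").inner_html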

    Download
    screen-scraper.rb (Combined final code)


    Running our screen-scraper produces:

    Fetched document: http://www.igvita.com/blog/
    Content Type: text/html
    Charset: utf-8
    Content-Encoding:
    Last Modified:

    No Comments
    So which came first, the parser, which will extract this, or this quote? - Extract me!
    04.02 :: Ruby Screen-Scraper in 60 Seconds
    31.01 :: World News With Geographic Heatmaps
    27.01 :: Correlating Netflix and IMDB Datasets
    ...

    Copy, paste, done. Now you have no excuse to put off that custom RSS generator you always wanted.

    Regin Gaarsmand and Harish Mallipeddi posted PHP and Python equivalents of this method. Awesome!

