    Scraping with Typhoeus and Nokogiri

    June 12th, 2009

    I’ve been working on some cool new functionality at OneSpot. We want to provide a widget that gives readers more context about a given article. Zemanta takes the article text and hands us back a set of semantic entities, including links to their Wikipedia pages, but we also wanted a short blurb about each entity and figured that the opening paragraph of its Wikipedia page would be a reasonable choice.

    To do this, we use Typhoeus to fetch the Wikipedia pages in parallel and Nokogiri to pull the relevant content using a custom XPath expression for Wikipedia’s page layout.

    Some notes:

    • We configure Typhoeus to use Rails’s cache store as its own cache. Wikipedia responses are cached for 7 days so that we are good netizens and don't overburden their servers.
    • Wikipedia's internal links are relative (they do not specify a hostname), so we rewrite them as absolute URLs; that way the links still work when embedded in another page.
    • We tried Curl::Multi, but it was giving us occasional bus errors.
    • My WordPress syntax highlighter is obviously subpar when it comes to regular expressions.

    require 'typhoeus'
    require 'nokogiri'

    class Wikipedia
      include Typhoeus
      #self.cache = Rails.cache.instance_variable_get(:@data)

      # Cache each response for 7 days (expressed in seconds).
      remote_defaults :cache_responses => 7*24*60*60,
          :user_agent => 'typhoeus crawler',
          :timeout => 5

      # Wikipedia.extract fetches a page and hands the body to the parser below.
      define_remote_method :extract,
          :on_success => lambda {|response| Wikipedia.extract_first_paragraph(response.body) }

      def self.extract_first_paragraph(content)
        nh = Nokogiri::HTML(content)
        # First <p> directly inside the main content div.
        str = nh.xpath("//div[@id='bodyContent']/p[1]").inner_html
        # Turn relative /wiki links into absolute ones.
        str.gsub(/href="\/wiki/, 'href="http://en.wikipedia.org/wiki')
      end
    end
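
    The link-rewriting step can be tried on its own. This is a minimal sketch in plain Ruby; `absolutize_wiki_links` and `WIKIPEDIA_BASE` are hypothetical names introduced for illustration, and the input fragment is made up.

```ruby
# Hypothetical constant; the host matches the one used in the post.
WIKIPEDIA_BASE = 'http://en.wikipedia.org'

# Hypothetical helper: prefixes the host to href values that start with /wiki/.
# Links that are already absolute do not match the pattern and are left alone.
def absolutize_wiki_links(html)
  html.gsub(%r{href="/wiki/}) { %(href="#{WIKIPEDIA_BASE}/wiki/) }
end

fragment = 'A <a href="/wiki/Bus_error">bus error</a> (<a href="http://example.org/">ref</a>).'
puts absolutize_wiki_links(fragment)
```

    Like the regular expression in the class, this only handles double-quoted `href` attributes. That is enough for this sketch, but a sturdier version would probably walk the `a` nodes with Nokogiri instead of pattern-matching the serialized HTML.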

    And here’s how you use the Wikipedia class.

    entities = %w(
      http://en.wikipedia.org/wiki/Garth_Marenghi's_Darkplace
      http://en.wikipedia.org/wiki/Bus_error
      http://en.wikipedia.org/wiki/Washington
    )
    content = entities.map do |url|
      Wikipedia.extract(:base_uri => url)
    end
    p content

    Tags: Ruby

  • Original article: https://www.cnblogs.com/lexus/p/1934915.html