Scraping with Typhoeus and Nokogiri

June 12th, 2009
I’ve been working on some cool new functionality at OneSpot. We want to provide a widget that gives the reader more context about a given article. Zemanta takes the article text and hands us back a set of semantic entities, including links to their Wikipedia pages. We also wanted a nice blurb about each entity, and figured that the opening paragraph of its Wikipedia page would be a reasonable choice.
To do this, we use Typhoeus to fetch the Wikipedia pages in parallel and Nokogiri to pull the relevant content using a custom XPath expression for Wikipedia’s page layout.
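As a rough illustration of the fetch-in-parallel idea, here is a minimal sketch that uses only Ruby's standard library. Typhoeus does its parallel work through libcurl rather than Ruby threads, so this swaps in plain threads and is not how Typhoeus itself works. The names `fetch_all` and `DEFAULT_FETCHER` are hypothetical and exist only for this example.

```ruby
require 'net/http'
require 'uri'

# Hypothetical default fetcher: maps a URL string to a response body.
DEFAULT_FETCHER = lambda { |url| Net::HTTP.get(URI(url)) }

# Hypothetical helper, not part of Typhoeus: fetch every URL concurrently,
# one thread per URL, and return the bodies in the same order as the input.
def fetch_all(urls, fetcher = DEFAULT_FETCHER)
  threads = urls.map do |url|
    Thread.new do
      begin
        fetcher.call(url)
      rescue StandardError
        nil # a failed fetch yields nil instead of aborting the whole batch
      end
    end
  end
  threads.map(&:value) # Thread#value joins the thread and returns its result
end
```

Passing the fetcher in as an argument makes it easy to try the helper without touching the network, for example `fetch_all(%w(a b), lambda { |u| u.upcase })`.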
Some notes:
We configure Typhoeus to use the Rails cache store as its own cache. We cache each Wikipedia response for 7 days in order to be good netizens and not overburden their servers.
Wikipedia links do not specify a hostname, so we make them absolute; that way they still work when embedded in another page.
We tried Curl::Multi but it was giving us occasional bus errors.
My WordPress syntax highlighter is obviously subpar when it comes to regular expressions.
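The first two notes can be sketched in plain Ruby, separately from the Wikipedia class that follows. This is only an illustration under stated assumptions: `TtlCache`, `SEVEN_DAYS`, `WIKIPEDIA_HOST` and `absolutize_wiki_links` are hypothetical names, and the in-memory cache merely stands in for the Rails cache store.

```ruby
SEVEN_DAYS = 7 * 24 * 60 * 60 # cache lifetime in seconds
WIKIPEDIA_HOST = 'http://en.wikipedia.org'

# Hypothetical in-memory cache with a time-to-live. Entries older than
# `ttl` seconds are treated as missing and recomputed.
class TtlCache
  def initialize(ttl, clock = -> { Time.now.to_f })
    @ttl = ttl
    @clock = clock   # injectable so expiry can be exercised without waiting
    @entries = {}    # key => [stored_at, value]
  end

  # Return the cached value for `key`, or run the block and store its result.
  def fetch(key)
    stored_at, value = @entries[key]
    return value if stored_at && (@clock.call - stored_at) < @ttl

    value = yield
    @entries[key] = [@clock.call, value]
    value
  end
end

# Prefix root-relative /wiki/ hrefs with the Wikipedia host so the links
# keep working when the HTML is embedded in another page.
def absolutize_wiki_links(html)
  html.gsub(%r{href="/wiki/}, %(href="#{WIKIPEDIA_HOST}/wiki/))
end
```

With these in place, something like `cache.fetch(url) { download(url) }` would only call the (hypothetical) `download` once per URL per week, and `absolutize_wiki_links` leaves links that already carry a hostname untouched.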
    require 'typhoeus'
    require 'nokogiri'

    class Wikipedia
      include Typhoeus
      #self.cache = Rails.cache.instance_variable_get(:@data)
      remote_defaults :cache_responses => 7*24*60*60,
                      :user_agent => 'typhoeus crawler',
                      :timeout => 5

      define_remote_method :extract,
                           :on_success => lambda {|response| Wikipedia.extract_first_paragraph(response.body) }

      def self.extract_first_paragraph(content)
        nh = Nokogiri::HTML(content)
        str = nh.xpath("//div[@id='bodyContent']/p[1]").inner_html
        str.gsub(/href="\/wiki/, 'href="http://en.wikipedia.org/wiki')
      end
    end
And here’s how you use it.
    entities = %w(
      http://en.wikipedia.org/wiki/Garth_Marenghi's_Darkplace
      http://en.wikipedia.org/wiki/Bus_error
      http://en.wikipedia.org/wiki/Washington
    )

    content = entities.map do |url|
      Wikipedia.extract(:base_uri => url)
    end

    p content
Tags: Ruby

Original address: https://www.cnblogs.com/lexus/p/1934915.html
