  • Using YQL as crawler for Javascript

    http://www.julianwong.net/blog/2009/06/using-yql-as-crawler-for-javascript/



    Yahoo! Query Language (YQL) is fun to play with. YQL is a service that enables applications to query, filter, and combine data from different sources across the Internet. Much of the data in the Yahoo! network can be retrieved through YQL with a SQL-like syntax.

    SELECT * FROM flickr.photos.search WHERE text="cat"

    This performs a Flickr search for photos whose text matches "cat". But what caught my attention is the ability to convert content (an HTML page) from an external site into well-formed XML or JSON.

    select * from html where url="http://news.yahoo.com/" and xpath="/html/body/div[@id='doc4']/div[@id='bd']/div[@id='yui-main']/div/div[@id='top-story']/div/div[1]/div[2]/h2/a"

    The YQL above returns the headline from Yahoo! News. The XPath part looks pretty scary, but with the XPather Firefox add-on you can get the XPath of any DOM element by right-clicking it and choosing Show in XPather. (P.S. One thing to watch out for with XPather is the tbody tag: Firefox adds it to its DOM tree for tables even when it does not exist in the source HTML. Such an extra tbody in the XPath makes YQL return nothing, because the element never appears in the actual HTML.)
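The tbody caveat and the query above can be put together in a small helper. This is a minimal sketch, not code from the original post: `stripTbody`, `buildHtmlQuery`, and `buildYqlUrl` are hypothetical names, the sample XPath is made up for illustration, and the endpoint is the public YQL REST URL as it was commonly documented (the public service has since been retired), so treat it as an assumption.

```javascript
// Assumed public YQL REST endpoint; not taken from the original post.
const YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql";

// Remove tbody steps that Firefox inserts into its DOM but that may not
// exist in the page source, e.g. ".../table/tbody/tr[1]" -> ".../table/tr[1]".
function stripTbody(xpath) {
  return xpath.replace(/\/tbody(\[\d+\])?(?=\/|$)/g, "");
}

// Build a query against the html table for one page URL and one XPath.
// Double quotes delimit the values, so they must not appear inside them.
function buildHtmlQuery(pageUrl, xpath) {
  if (pageUrl.includes('"') || xpath.includes('"')) {
    throw new Error("pageUrl and xpath must not contain double quotes");
  }
  return `select * from html where url="${pageUrl}" and xpath="${stripTbody(xpath)}"`;
}

// Wrap a YQL statement in a request URL, URL-encoding the statement.
function buildYqlUrl(query, format = "json") {
  const params = new URLSearchParams({ q: query, format });
  return `${YQL_ENDPOINT}?${params.toString()}`;
}

// Hypothetical XPath as copied from XPather, including the extra tbody.
const sampleQuery = buildHtmlQuery(
  "http://news.yahoo.com/",
  "/html/body/table/tbody/tr[1]/td/a"
);
console.log(sampleQuery);
console.log(buildYqlUrl(sampleQuery));
```

The regular expression only handles plain `tbody` steps (optionally with a numeric index); an XPath that mentions tbody inside a predicate would need a real XPath parser.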

    This is an excellent tool for JavaScript. Imagine you are going to implement an RSS reader. Without YQL, the application must prepare all the data on the server side and send it back to the client (as in Fig. 1). This is bad for performance, because the server-side curl calls are blocking, whereas YQL requests issued from the client browser can be asynchronous and parallel. It therefore makes sense to offload the data crawling to the client (as in Fig. 2).


    Fig. 1 The web application prepares all the data on the server side.

    Fig. 2 Offloading the blocking curl calls to client-side parallel YQL requests.
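The client-side approach of Fig. 2 can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: `feedQueryUrl`, `loadFeeds`, and `defaultFetchJson` are hypothetical names, the endpoint and the `rss` table are assumed from how YQL was commonly used, and the fetcher is injectable so the logic can be tried without network access.

```javascript
// Assumed public YQL REST endpoint; the rss table name is also an assumption.
const FEED_ENDPOINT = "https://query.yahooapis.com/v1/public/yql";

// Build the request URL that asks YQL to fetch and parse one feed.
function feedQueryUrl(feedUrl) {
  const params = new URLSearchParams({
    q: `select * from rss where url="${feedUrl}"`,
    format: "json",
  });
  return `${FEED_ENDPOINT}?${params.toString()}`;
}

// Default fetcher: a plain asynchronous HTTP GET that parses JSON.
async function defaultFetchJson(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response.json();
}

// Start every request before awaiting any of them, so the requests run in
// parallel instead of one blocking call after another. A failed feed does
// not prevent the others from being returned.
async function loadFeeds(feedUrls, fetchJson = defaultFetchJson) {
  const pending = feedUrls.map((feedUrl) => fetchJson(feedQueryUrl(feedUrl)));
  const settled = await Promise.allSettled(pending);
  return settled.map((result, i) => ({
    feedUrl: feedUrls[i],
    ok: result.status === "fulfilled",
    data: result.status === "fulfilled" ? result.value : null,
  }));
}
```

In a page, the reader would call `loadFeeds` with the user's subscription URLs and render each entry as its result arrives in the returned array; the server then only has to deliver the static page and the list of feed URLs.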

  • Original article: https://www.cnblogs.com/lexus/p/2213821.html