  • gae crawler

    http://stackoverflow.com/questions/3087871/crawler-on-appengine

    http://code.google.com/p/jyf-code/source/browse/trunk/gae/jyfbot/bot.py


    http://aws.amazon.com/search-engines/

    Amazon: Search Engines & Web Crawlers


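    Objectify-Appengine, or Objectify, is an ORM-like library that simplifies data
    persistence in Bigtable and, by extension, GAE. As a mapping layer, Objectify
    inserts itself between your POJOs and Google's heavy machinery through a concise API.
    You can use a familiar subset of JPA annotations (although Objectify does not
    implement the full specification), plus a few lifecycle annotations, to persist and
    retrieve data as Java objects. In essence, Objectify is a lightweight Hibernate
    designed specifically for Google's Bigtable.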


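    Objectify resembles Hibernate in that it lets you map and work with POJOs against
    Bigtable, which you see as an abstraction in GAE. Besides the subset of JPA
    annotations, Objectify uses annotations of its own that reflect the unique features
    of the GAE datastore. Objectify also supports relationships and exposes a query
    interface that supports GAE's filtering and sorting concepts.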


    Apache Droids
    droids-crawler
    https://cwiki.apache.org/DROIDS/droids-crawler.html
    http://code.google.com/p/gwt-platform/
    http://code.google.com/p/gwt-platform/source/browse/#hg%2Fgwtp-samples%2Fgwtp-sample-crawler-service

    Web crawlers and Google App Engine Hosted applications

    Is it impossible to run a web crawler on GAE alongside my app, considering that I am running the free startup version?

    4 Answers

    Accepted answer (2 votes):
    Until Google exposes scheduling, queue, and background task APIs,
    you can do processing only in response to an external HTTP request.
    You'd need a heartbeat service that processes one item from the
    crawler's queue at a time (so as not to hit GAE limits).



    To crawl from GAE, you have to split your application into three parts:
    a queue (which stores its data in the Datastore), a queue processor that
    reacts to the external HTTP heartbeat, and your actual crawling logic.
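
    The three-part split described above can be sketched roughly as follows. This is only an illustration, not the answerer's code: `CrawlQueue` is a hypothetical in-memory stand-in for a queue persisted in the Datastore, `handle_heartbeat` is a hypothetical name for the function a request handler would call on each heartbeat, and `fetch` is an injected function so the sketch makes no assumptions about GAE's own fetch API.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class CrawlQueue:
    """Hypothetical in-memory stand-in for a queue stored in the Datastore."""

    def __init__(self, seeds):
        self.pending = deque(seeds)
        self.seen = set(seeds)

    def pop(self):
        return self.pending.popleft() if self.pending else None

    def push(self, url):
        # Skip URLs that were already queued so pages are crawled once.
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)


def crawl_one(url, fetch):
    """Crawling logic: fetch one page and return its absolute outgoing links."""
    parser = LinkParser()
    parser.feed(fetch(url))
    return [urljoin(url, link) for link in parser.links]


def handle_heartbeat(queue, fetch):
    """Queue processor: handle exactly one URL per external heartbeat request."""
    url = queue.pop()
    if url is None:
        return None  # nothing left to crawl
    for link in crawl_one(url, fetch):
        queue.push(link)
    return url
```

    Processing a single URL per call keeps each request short, which is the point of the heartbeat design; in a real deployment the queue state would have to be loaded from and saved to persistent storage on every request.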



    You'd have to watch your quota usage manually: start the heartbeat when you have spare quota, and stop it when the quota is used up.
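
    The answer describes watching quota by hand. As a rough sketch of the same idea in code, a heartbeat could be skipped once a self-imposed budget is spent. Everything here is an assumption for illustration: `QuotaBudget`, `guarded_heartbeat`, and the cost units are hypothetical and do not correspond to real GAE quota figures or APIs.

```python
class QuotaBudget:
    """Tracks a self-imposed budget; the units are hypothetical, not real GAE quotas."""

    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self.used = 0

    def can_spend(self, cost):
        return self.used + cost <= self.daily_limit

    def spend(self, cost):
        self.used += cost

    def reset(self):
        """Call when the quota period rolls over."""
        self.used = 0


def guarded_heartbeat(budget, process_one, cost=1):
    """Run one unit of crawl work only if the budget allows it.

    Returns True if the work ran, False if the heartbeat was skipped.
    """
    if not budget.can_spend(cost):
        return False
    process_one()
    budget.spend(cost)
    return True
```

    With a budget of two units, the first two heartbeats do work and the third is skipped until `reset()` is called.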



    When Google introduces the APIs I mentioned at the beginning, you'd have
    to rewrite the parts that can be implemented more effectively with them.



    UPDATE: Google introduced the Task Queue API some time ago. See the Task Queue docs for Python and Java.












    App Engine code only runs in
    response to HTTP requests, so you can't run a persistent crawler in the
    background. With the upcoming release of scheduled tasks, you could
    write a crawler that uses that functionality, but it would be less than
    ideal.











    I suppose you can run it (i.e., it's not impossible), but it will be slow and you'll run into limits quite quickly. Since CPU quotas are going to be decreased even further at the end of May, I'd recommend against it.












    It's possible, but that's not really an application for App Engine,
    just as Arachnid wrote. If you manage to get it working, I doubt
    you'll stay within the quotas for free accounts.











    gae
  • Original article: https://www.cnblogs.com/lexus/p/2061384.html