  • Python mechanize study notes

    1: Basic usage

    import mechanize

    # Shortcut: response = mechanize.urlopen("http://www.hao123.com/")
    request = mechanize.Request("http://www.hao123.com/")
    response = mechanize.urlopen(request)
    print(response.geturl())  # URL of the response
    print(response.info())    # response headers
    # print(response.read())  # response body

    2: mechanize.urlretrieve

    >>> import mechanize
    >>> help(mechanize.urlretrieve)
    Help on function urlretrieve in module mechanize._opener:
    
    urlretrieve(url, filename=None, reporthook=None, data=None, timeout=<object object>)
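    • filename: the local path to save to. If it is omitted, the data is written to a temporary file.
    • reporthook: a callback invoked once when the connection to the server is established and again after each block of data has been transferred. It can be used to display download progress.
    • data: the data to POST to the server.
    • timeout: the timeout to apply to the request.

    urlretrieve returns a two-element tuple (filename, headers): filename is the local path the data was saved to, and headers holds the server's response headers.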

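    The callback has the signature reporthook(block_read, block_size, total_size): block_read is the number of blocks read so far, block_size is the size of each block in bytes, and total_size is the total size of the data being retrieved, in bytes. It can be used to show read progress.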

    A simple example:

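    import mechanize

    def cbk(block_read, block_size, total_size):
        # Print the raw progress values each time the hook is called
        print("%s %s %s" % (block_read, block_size, total_size))

    url = 'http://www.hao123.com/'
    local = 'd://hao.html'  # local path to save the page to
    mechanize.urlretrieve(url, local, cbk)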
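
The three raw numbers passed to cbk are easier to read as a percentage. The following is a minimal sketch and not part of mechanize: make_progress_hook and report are hypothetical names, and it assumes total_size is zero or negative when the server sends no Content-Length (the convention urllib's urlretrieve follows). The loop only simulates the calls a download would make, so no network access is needed.

```python
def make_progress_hook(report):
    """Build a reporthook that passes a percentage (or None) to report()."""
    def hook(block_read, block_size, total_size):
        if total_size <= 0:
            # Assumption: no usable total size, so no percentage can be computed.
            report(None)
            return
        # The last block is usually partial, so cap the count at total_size.
        transferred = min(block_read * block_size, total_size)
        report(100.0 * transferred / total_size)
    return hook


seen = []
hook = make_progress_hook(seen.append)

# Simulate a 20000-byte download read in 8192-byte blocks.
for block_read in range(4):
    hook(block_read, 8192, 20000)

print(seen)
```

A hook built this way could be passed as the third argument in place of cbk, for example mechanize.urlretrieve(url, local, hook).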

    3: Logging in through a form

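    import os
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # do not observe robots.txt
    br.open("http://www.zhaopin.com/")
    br.select_form(nr=0)  # select the first form on the page
    br['loginname'] = '**'  # register an account of your own and use its credentials
    br['password'] = '**'
    r = br.submit()
    # Save the response next to this script
    path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'login.html')
    print(path)
    with open(path, "wb") as h:
        h.write(r.read())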
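
One detail worth spelling out when saving a file next to the script: os.path.dirname() returns the directory without a trailing separator, so appending a file name with + glues it onto the directory name. A small sketch of the difference follows; the path project/login.py and the helper output_path are hypothetical and only serve as illustration.

```python
import os


def output_path(script_file, name):
    """Return `name` placed in the same directory as `script_file`."""
    return os.path.join(os.path.dirname(os.path.abspath(script_file)), name)


script = os.path.join("project", "login.py")  # hypothetical script location

# String concatenation loses the separator between directory and file name.
naive = os.path.dirname(script) + "login.html"

# os.path.join inserts the separator for the current platform.
joined = os.path.join(os.path.dirname(script), "login.html")

print(naive)   # projectlogin.html
print(joined)
```

Inside a real script, output_path(__file__, 'login.html') would give the path of login.html in the script's own directory.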

    4: Browser

    Read through the help() output below and you will know nearly everything there is to know about Browser.

    Help on class Browser in module mechanize._mechanize:
    
    class Browser(mechanize._useragent.UserAgentBase)
     |  Browser-like class with support for history, forms and links.
     |  
     |  BrowserStateError is raised whenever the browser is in the wrong state to
     |  complete the requested operation - e.g., when .back() is called when the
     |  browser history is empty, or when .follow_link() is called when the current
     |  response does not contain HTML data.
     |  
     |  Public attributes:
     |  
     |  request: current request (mechanize.Request)
     |  form: currently selected form (see .select_form())
     |  
     |  Method resolution order:
     |      Browser
     |      mechanize._useragent.UserAgentBase
     |      mechanize._opener.OpenerDirector
     |      mechanize._urllib2_fork.OpenerDirector
     |  
     |  Methods defined here:
     |  
     |  __getattr__(self, name)
     |  
     |  __init__(self, factory=None, history=None, request_class=None)
     |      Only named arguments should be passed to this constructor.
     |      
     |      factory: object implementing the mechanize.Factory interface.
     |      history: object implementing the mechanize.History interface.  Note
     |       this interface is still experimental and may change in future.
     |      request_class: Request class to use.  Defaults to mechanize.Request
     |      
     |      The Factory and History objects passed in are 'owned' by the Browser,
     |      so they should not be shared across Browsers.  In particular,
     |      factory.set_response() should not be called except by the owning
     |      Browser itself.
     |      
     |      Note that the supplied factory's request_class is overridden by this
     |      constructor, to ensure only one Request class is used.
     |  
     |  __str__(self)
     |  
     |  back(self, n=1)
     |      Go back n steps in history, and return response object.
     |      
     |      n: go back this number of steps (default 1 step)
     |  
     |  clear_history(self)
     |  
     |  click(self, *args, **kwds)
     |      See mechanize.HTMLForm.click for documentation.
     |  
     |  click_link(self, link=None, **kwds)
     |      Find a link and return a Request object for it.
     |      
     |      Arguments are as for .find_link(), except that a link may be supplied
     |      as the first argument.
     |  
     |  close(self)
     |  
     |  encoding(self)
     |  
     |  find_link(self, **kwds)
     |      Find a link in current page.
     |      
     |      Links are returned as mechanize.Link objects.
     |      
     |      # Return third link that .search()-matches the regexp "python"
     |      # (by ".search()-matches", I mean that the regular expression method
     |      # .search() is used, rather than .match()).
     |      find_link(text_regex=re.compile("python"), nr=2)
     |      
     |      # Return first http link in the current page that points to somewhere
     |      # on python.org whose link text (after tags have been removed) is
     |      # exactly "monty python".
     |      find_link(text="monty python",
     |                url_regex=re.compile("http.*python.org"))
     |      
     |      # Return first link with exactly three HTML attributes.
     |      find_link(predicate=lambda link: len(link.attrs) == 3)
     |      
     |      Links include anchors (<a>), image maps (<area>), and frames (<frame>,
     |      <iframe>).
     |      
     |      All arguments must be passed by keyword, not position.  Zero or more
     |      arguments may be supplied.  In order to find a link, all arguments
     |      supplied must match.
     |      
     |      If a matching link is not found, mechanize.LinkNotFoundError is raised.
     |      
     |      text: link text between link tags: e.g. <a href="blah">this bit</a> (as
     |       returned by pullparser.get_compressed_text(), ie. without tags but
     |       with opening tags "textified" as per the pullparser docs) must compare
     |       equal to this argument, if supplied
     |      text_regex: link text between tag (as defined above) must match the
     |       regular expression object or regular expression string passed as this
     |       argument, if supplied
     |      name, name_regex: as for text and text_regex, but matched against the
     |       name HTML attribute of the link tag
     |      url, url_regex: as for text and text_regex, but matched against the
     |       URL of the link tag (note this matches against Link.url, which is a
     |       relative or absolute URL according to how it was written in the HTML)
     |      tag: element name of opening tag, e.g. "a"
     |      predicate: a function taking a Link object as its single argument,
     |       returning a boolean result, indicating whether the links
     |      nr: matches the nth link that matches all other criteria (default 0)
     |  
     |  follow_link(self, link=None, **kwds)
     |      Find a link and .open() it.
     |      
     |      Arguments are as for .click_link().
     |      
     |      Return value is same as for Browser.open().
     |  
     |  forms(self)
     |      Return iterable over forms.
     |      
     |      The returned form objects implement the mechanize.HTMLForm interface.
     |  
     |  geturl(self)
     |      Get URL of current document.
     |  
     |  global_form(self)
     |      Return the global form object, or None if the factory implementation
     |      did not supply one.
     |      
     |      The "global" form object contains all controls that are not descendants
     |      of any FORM element.
     |      
     |      The returned form object implements the mechanize.HTMLForm interface.
     |      
     |      This is a separate method since the global form is not regarded as part
     |      of the sequence of forms in the document -- mostly for
     |      backwards-compatibility.
     |  
     |  links(self, **kwds)
     |      Return iterable over links (mechanize.Link objects).
     |  
     |  open(self, url, data=None, timeout=<object object>)
     |  
     |  open_local_file(self, filename)
     |  
     |  open_novisit(self, url, data=None, timeout=<object object>)
     |      Open a URL without visiting it.
     |      
     |      Browser state (including request, response, history, forms and links)
     |      is left unchanged by calling this function.
     |      
     |      The interface is the same as for .open().
     |      
     |      This is useful for things like fetching images.
     |      
     |      See also .retrieve().
     |  
     |  reload(self)
     |      Reload current document, and return response object.
     |  
     |  response(self)
     |      Return a copy of the current response.
     |      
     |      The returned object has the same interface as the object returned by
     |      .open() (or mechanize.urlopen()).
     |  
     |  select_form(self, name=None, predicate=None, nr=None)
     |      Select an HTML form for input.
     |      
     |      This is a bit like giving a form the "input focus" in a browser.
     |      
     |      If a form is selected, the Browser object supports the HTMLForm
     |      interface, so you can call methods like .set_value(), .set(), and
     |      .click().
     |      
     |      Another way to select a form is to assign to the .form attribute.  The
     |      form assigned should be one of the objects returned by the .forms()
     |      method.
     |      
     |      At least one of the name, predicate and nr arguments must be supplied.
     |      If no matching form is found, mechanize.FormNotFoundError is raised.
     |      
     |      If name is specified, then the form must have the indicated name.
     |      
     |      If predicate is specified, then the form must match that function.  The
     |      predicate function is passed the HTMLForm as its single argument, and
     |      should return a boolean value indicating whether the form matched.
     |      
     |      nr, if supplied, is the sequence number of the form (where 0 is the
     |      first).  Note that control 0 is the first form matching all the other
     |      arguments (if supplied); it is not necessarily the first control in the
     |      form.  The "global form" (consisting of all form controls not contained
     |      in any FORM element) is considered not to be part of this sequence and
     |      to have no name, so will not be matched unless both name and nr are
     |      None.
     |  
     |  set_cookie(self, cookie_string)
     |      Request to set a cookie.
     |      
     |      Note that it is NOT necessary to call this method under ordinary
     |      circumstances: cookie handling is normally entirely automatic.  The
     |      intended use case is rather to simulate the setting of a cookie by
     |      client script in a web page (e.g. JavaScript).  In that case, use of
     |      this method is necessary because mechanize currently does not support
     |      JavaScript, VBScript, etc.
     |      
     |      The cookie is added in the same way as if it had arrived with the
     |      current response, as a result of the current request.  This means that,
     |      for example, if it is not appropriate to set the cookie based on the
     |      current request, no cookie will be set.
     |      
     |      The cookie will be returned automatically with subsequent responses
     |      made by the Browser instance whenever that's appropriate.
     |      
     |      cookie_string should be a valid value of the Set-Cookie header.
     |      
     |      For example:
     |      
     |      browser.set_cookie(
     |          "sid=abcdef; expires=Wednesday, 09-Nov-06 23:12:40 GMT")
     |      
     |      Currently, this method does not allow for adding RFC 2965 cookies.
     |      This limitation will be lifted if anybody requests it.
     |  
     |  set_handle_referer(self, handle)
     |      Set whether to add Referer header to each request.
     |  
     |  set_response(self, response)
     |      Replace current response with (a copy of) response.
     |      
     |      response may be None.
     |      
     |      This is intended mostly for HTML-preprocessing.
     |  
     |  submit(self, *args, **kwds)
     |      Submit current form.
     |      
     |      Arguments are as for mechanize.HTMLForm.click().
     |      
     |      Return value is same as for Browser.open().
     |  
     |  title(self)
     |      Return title, or None if there is no title element in the document.
     |      
     |      Treatment of any tag children of <title> attempts to follow Firefox and IE
     |      (currently, tags are preserved).
     |  
     |  viewing_html(self)
     |      Return whether the current response contains HTML data.
     |  
     |  visit_response(self, response, request=None)
     |      Visit the response, as if it had been .open()ed.
     |      
     |      Unlike .set_response(), this updates history rather than replacing the
     |      current response.
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  default_features = ['_redirect', '_cookies', '_refresh', '_equiv', '_b...
     |  
     |  handler_classes = {'_basicauth': <class mechanize._urllib2_fork.HTTPBa...
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from mechanize._useragent.UserAgentBase:
     |  
     |  add_client_certificate(self, url, key_file, cert_file)
     |      Add an SSL client certificate, for HTTPS client auth.
     |      
     |      key_file and cert_file must be filenames of the key and certificate
     |      files, in PEM format.  You can use e.g. OpenSSL to convert a p12 (PKCS
     |      12) file to PEM format:
     |      
     |      openssl pkcs12 -clcerts -nokeys -in cert.p12 -out cert.pem
     |      openssl pkcs12 -nocerts -in cert.p12 -out key.pem
     |      
     |      
     |      Note that client certificate password input is very inflexible ATM.  At
     |      the moment this seems to be console only, which is presumably the
     |      default behaviour of libopenssl.  In future mechanize may support
     |      third-party libraries that (I assume) allow more options here.
     |  
     |  add_password(self, url, user, password, realm=None)
     |  
     |  add_proxy_password(self, user, password, hostport=None, realm=None)
     |  
     |  set_client_cert_manager(self, cert_manager)
     |      Set a mechanize.HTTPClientCertMgr, or None.
     |  
     |  set_cookiejar(self, cookiejar)
     |      Set a mechanize.CookieJar, or None.
     |  
     |  set_debug_http(self, handle)
     |      Print HTTP headers to sys.stdout.
     |  
     |  set_debug_redirects(self, handle)
     |      Log information about HTTP redirects (including refreshes).
     |      
     |      Logging is performed using module logging.  The logger name is
     |      "mechanize.http_redirects".  To actually print some debug output,
     |      eg:
     |      
     |      import sys, logging
     |      logger = logging.getLogger("mechanize.http_redirects")
     |      logger.addHandler(logging.StreamHandler(sys.stdout))
     |      logger.setLevel(logging.INFO)
     |      
     |      Other logger names relevant to this module:
     |      
     |      "mechanize.http_responses"
     |      "mechanize.cookies"
     |      
     |      To turn on everything:
     |      
     |      import sys, logging
     |      logger = logging.getLogger("mechanize")
     |      logger.addHandler(logging.StreamHandler(sys.stdout))
     |      logger.setLevel(logging.INFO)
     |  
     |  set_debug_responses(self, handle)
     |      Log HTTP response bodies.
     |      
     |      See docstring for .set_debug_redirects() for details of logging.
     |      
     |      Response objects may be .seek()able if this is set (currently returned
     |      responses are, raised HTTPError exception responses are not).
     |  
     |  set_handle_equiv(self, handle, head_parser_class=None)
     |      Set whether to treat HTML http-equiv headers like HTTP headers.
     |      
     |      Response objects may be .seek()able if this is set (currently returned
     |      responses are, raised HTTPError exception responses are not).
     |  
     |  set_handle_gzip(self, handle)
     |      Handle gzip transfer encoding.
     |  
     |  set_handle_redirect(self, handle)
     |      Set whether to handle HTTP 30x redirections.
     |  
     |  set_handle_refresh(self, handle, max_time=None, honor_time=True)
     |      Set whether to handle HTTP Refresh headers.
     |  
     |  set_handle_robots(self, handle)
     |      Set whether to observe rules from robots.txt.
     |  
     |  set_handled_schemes(self, schemes)
     |      Set sequence of URL scheme (protocol) strings.
     |      
     |      For example: ua.set_handled_schemes(["http", "ftp"])
     |      
     |      If this fails (with ValueError) because you've passed an unknown
     |      scheme, the set of handled schemes will not be changed.
     |  
     |  set_password_manager(self, password_manager)
     |      Set a mechanize.HTTPPasswordMgrWithDefaultRealm, or None.
     |  
     |  set_proxies(self, proxies=None, proxy_bypass=None)
     |      Configure proxy settings.
     |      
     |      proxies: dictionary mapping URL scheme to proxy specification.  None
     |        means use the default system-specific settings.
     |      proxy_bypass: function taking hostname, returning whether proxy should
     |        be used.  None means use the default system-specific settings.
     |      
     |      The default is to try to obtain proxy settings from the system (see the
     |      documentation for urllib.urlopen for information about the
     |      system-specific methods used -- note that's urllib, not urllib2).
     |      
     |      To avoid all use of proxies, pass an empty proxies dict.
     |      
     |      >>> ua = UserAgentBase()
     |      >>> def proxy_bypass(hostname):
     |      ...     return hostname == "noproxy.com"
     |      >>> ua.set_proxies(
     |      ...     {"http": "joe:password@myproxy.example.com:3128",
     |      ...      "ftp": "proxy.example.com"},
     |      ...     proxy_bypass)
     |  
     |  set_proxy_password_manager(self, password_manager)
     |      Set a mechanize.HTTPProxyPasswordMgr, or None.
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from mechanize._useragent.UserAgentBase:
     |  
     |  default_others = ['_unknown', '_http_error', '_http_default_error']
     |  
     |  default_schemes = ['http', 'ftp', 'file', 'https']
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from mechanize._opener.OpenerDirector:
     |  
     |  add_handler(self, handler)
     |  
     |  error(self, proto, *args)
     |  
     |  retrieve(self, fullurl, filename=None, reporthook=None, data=None, timeout=<object object>, open=<built-in function open>)
     |      Returns (filename, headers).
     |      
     |      For remote objects, the default filename will refer to a temporary
     |      file.  Temporary files are removed when the OpenerDirector.close()
     |      method is called.
     |      
     |      For file: URLs, at present the returned filename is None.  This may
     |      change in future.
     |      
     |      If the actual number of bytes read is less than indicated by the
     |      Content-Length header, raises ContentTooShortError (a URLError
     |      subclass).  The exception's .result attribute contains the (filename,
     |      headers) that would have been returned.
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from mechanize._opener.OpenerDirector:
     |  
     |  BLOCK_SIZE = 8192
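
The logging recipe shown in the set_debug_redirects docstring can be wrapped in a small helper. This is a minimal sketch that uses only the standard logging module: enable_mechanize_logging is a hypothetical name, while the logger names are the ones listed in the docstring.

```python
import logging
import sys


def enable_mechanize_logging(name="mechanize", level=logging.INFO, stream=None):
    """Attach a stream handler to the named logger and return the logger."""
    logger = logging.getLogger(name)
    logger.addHandler(logging.StreamHandler(stream or sys.stdout))
    logger.setLevel(level)
    return logger


# Only redirect information; use the default name "mechanize" for everything.
logger = enable_mechanize_logging("mechanize.http_redirects")
```

Judging by the docstring, the matching switch on the Browser (for example br.set_debug_redirects(True)) presumably still has to be turned on before anything is logged.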
  • Original article: https://www.cnblogs.com/qwj-sysu/p/3892043.html