  • Python Web Crawlers (Part 1)

    1. Identifying the Technology a Website Uses

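    The technology used to build a website affects how we crawl it. A useful tool for checking which technologies a site is built with is the builtwith module, which can be installed as follows: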

    pip install builtwith

    The module takes a URL as its argument, downloads and analyzes that URL, and returns the technologies the website uses.

    >>> import builtwith
    >>> builtwith.parse('http://123.127.249.126:8081/manage/login')
    {'programming-languages': ['Java'],
     'web-servers': ['Nginx'],
     'web-frameworks': ['Twitter Bootstrap'],
     'javascript-frameworks': ['jQuery']}

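    The result shows that the site is written in Java and served by Nginx, with Twitter Bootstrap (a popular front-end framework) and jQuery on the front end.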
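    Because the result is an ordinary dictionary that maps a category to a list of technology names, it is easy to post-process. The following is a minimal sketch, not part of builtwith itself: summarize_technologies is a hypothetical helper, and sample_result is copied from the output shown above so that the example runs without network access.

```python
# Sample result copied from the builtwith output shown above.
sample_result = {
    'programming-languages': ['Java'],
    'web-servers': ['Nginx'],
    'web-frameworks': ['Twitter Bootstrap'],
    'javascript-frameworks': ['jQuery'],
}


def summarize_technologies(result):
    """Return one 'category: tech1, tech2' line per category, sorted by category.

    Hypothetical helper: it only assumes that `result` maps a category
    name to a list of technology names, as in the output above.
    """
    lines = []
    for category in sorted(result):
        technologies = ", ".join(result[category])
        lines.append(f"{category}: {technologies}")
    return lines


for line in summarize_technologies(sample_result):
    print(line)
```

    In a real session the dictionary would come from builtwith.parse(url) rather than from a hard-coded sample.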

    2. Finding the Website Owner

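    For some websites we may care who the owner is. For example, if we know that the owner blocks web crawlers, we had better keep our download rate more conservative. To find a website's owner, we can use the WHOIS protocol to look up who registered the domain. Python has a wrapper library for this protocol, documented at https://pypi.python.org/pypi/python-whois, which we can install with pip: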

    pip install python-whois

    Below is the result returned when using this module to run a WHOIS query on http://www.baidu.com.

    >>> import whois
    >>> print(whois.whois('http://www.baidu.com'))
    {
      "domain_name": [
        "BAIDU.COM",
        "baidu.com"
      ],
      "registrar": "MarkMonitor, Inc.",
      "whois_server": "whois.markmonitor.com",
      "referral_url": null,
      "updated_date": [
        "2019-05-09 04:30:46",
        "2019-05-08 20:59:33"
      ],
      "creation_date": [
        "1999-10-11 11:05:17",
        "1999-10-11 04:05:17"
      ],
      "expiration_date": [
        "2026-10-11 11:05:17",
        "2026-10-11 00:00:00"
      ],
      "name_servers": [
        "NS1.BAIDU.COM",
        "NS2.BAIDU.COM",
        "NS3.BAIDU.COM",
        "NS4.BAIDU.COM",
        "NS7.BAIDU.COM",
        "ns7.baidu.com",
        "ns2.baidu.com",
        "ns1.baidu.com",
        "ns3.baidu.com",
        "ns4.baidu.com"
      ],
      "status": [
        "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
        "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
        "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
        "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
        "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",
        "serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited",
        "clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)",
        "clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)",
        "clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)",
        "serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)",
        "serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)",
        "serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)"
      ],
      "emails": [
        "abusecomplaints@markmonitor.com",
        "whoisrequest@markmonitor.com"
      ],
      "dnssec": "unsigned",
      "name": null,
      "org": "Beijing Baidu Netcom Science Technology Co., Ltd.",
      "address": null,
      "city": null,
      "state": "Beijing",
      "zipcode": null,
      "country": "CN"
    }

    The result shows that the domain belongs to Baidu, which is indeed the case.
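    The raw record repeats several values that differ only in letter case (for example "BAIDU.COM" and "baidu.com"), so it can help to reduce it to the few fields that identify the owner. The following is a minimal sketch under stated assumptions: unique_casefold and summarize_owner are hypothetical helpers, the record is assumed to be readable like a dictionary (as the printed output suggests), and sample_record is an abridged copy of the output above so that the example runs offline.

```python
def unique_casefold(values):
    """Drop duplicates that differ only in letter case, keeping first-seen order."""
    if values is None:
        return []
    if isinstance(values, str):
        # Some fields may hold a single string instead of a list.
        values = [values]
    seen = set()
    unique = []
    for value in values:
        key = value.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(value)
    return unique


def summarize_owner(record):
    """Pick the fields that identify who owns and hosts the domain.

    Hypothetical helper: the field names are taken from the output above.
    """
    return {
        "domain": unique_casefold(record.get("domain_name")),
        "registrar": record.get("registrar"),
        "org": record.get("org"),
        "country": record.get("country"),
        "name_servers": unique_casefold(record.get("name_servers")),
    }


# Abridged copy of the WHOIS output shown above.
sample_record = {
    "domain_name": ["BAIDU.COM", "baidu.com"],
    "registrar": "MarkMonitor, Inc.",
    "org": "Beijing Baidu Netcom Science Technology Co., Ltd.",
    "country": "CN",
    "name_servers": ["NS1.BAIDU.COM", "NS2.BAIDU.COM", "ns2.baidu.com", "ns1.baidu.com"],
}

summary = summarize_owner(sample_record)
print(summary["domain"])
print(summary["org"])
print(summary["name_servers"])
```

    The "org" and "registrar" fields are usually the quickest way to tell who operates a site, although many registrations hide the owner behind a privacy service, in which case these fields may be empty.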

  • Original article: https://www.cnblogs.com/jcjc/p/10870530.html