  • Python Web Crawlers (Part 1)

    1. Identifying the Technologies a Website Uses

    The technologies a website is built with affect how we crawl it. A very useful tool for checking them is the builtwith module, which is installed as follows:

    pip install builtwith

    The module takes a URL as its argument, downloads and analyzes that URL, and returns the technologies the website uses.

    >>> import builtwith
    >>> builtwith.parse('http://123.127.249.126:8081/manage/login')
    {'programming-languages': ['Java'],
     'web-servers': ['Nginx'],
     'web-frameworks': ['Twitter Bootstrap'],
     'javascript-frameworks': ['jQuery']}

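    The result shows that the site is implemented in Java, is served by Nginx, and uses Twitter Bootstrap (a popular front-end framework) together with jQuery.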
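
As a minimal sketch of how such a result might guide crawling decisions, the helper below takes a dictionary shaped like the one returned above and derives a few rough hints. The function name `crawl_hints`, the set of client-side frameworks, and the heuristics themselves are illustrative assumptions and are not part of builtwith.

```python
# Hypothetical helper: the name and the heuristics are assumptions for illustration.
def crawl_hints(tech):
    """Return rough crawling hints from a builtwith-style result dict."""
    hints = []
    js = tech.get('javascript-frameworks', [])
    # Frameworks that render pages in the browser usually mean the raw HTML
    # is incomplete; this set is an illustrative, non-exhaustive choice.
    client_side = {'AngularJS', 'React', 'Vue.js'}
    if client_side & set(js):
        hints.append('content may be rendered client-side')
    else:
        hints.append('static HTML parsing is likely sufficient')
    if tech.get('web-servers'):
        hints.append('served by ' + ', '.join(tech['web-servers']))
    return hints


# The result shown above, copied in so the sketch runs without network access.
result = {
    'programming-languages': ['Java'],
    'web-servers': ['Nginx'],
    'web-frameworks': ['Twitter Bootstrap'],
    'javascript-frameworks': ['jQuery'],
}
hints = crawl_hints(result)
print(hints)
```

In real use, the dictionary would come from `builtwith.parse(url)` instead of being copied in by hand.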

    2. Finding the Website's Owner

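    For some websites we may care who the owner is. For example, if we know that the owner blocks web crawlers, we had better keep our download rate more conservative. To find a website's owner, we can use the WHOIS protocol to look up who registered the domain. Python has a wrapper library for this protocol, documented at https://pypi.python.org/pypi/python-whois, which we can install with pip: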

    pip install python-whois

    Below is the result of a WHOIS query on http://www.baidu.com using this module.

    >>> import whois
    >>> print(whois.whois('http://www.baidu.com'))
    {
      "domain_name": [
        "BAIDU.COM",
        "baidu.com"
      ],
      "registrar": "MarkMonitor, Inc.",
      "whois_server": "whois.markmonitor.com",
      "referral_url": null,
      "updated_date": [
        "2019-05-09 04:30:46",
        "2019-05-08 20:59:33"
      ],
      "creation_date": [
        "1999-10-11 11:05:17",
        "1999-10-11 04:05:17"
      ],
      "expiration_date": [
        "2026-10-11 11:05:17",
        "2026-10-11 00:00:00"
      ],
      "name_servers": [
        "NS1.BAIDU.COM",
        "NS2.BAIDU.COM",
        "NS3.BAIDU.COM",
        "NS4.BAIDU.COM",
        "NS7.BAIDU.COM",
        "ns7.baidu.com",
        "ns2.baidu.com",
        "ns1.baidu.com",
        "ns3.baidu.com",
        "ns4.baidu.com"
      ],
      "status": [
        "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
        "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
        "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
        "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
        "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",
        "serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited",
        "clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)",
        "clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)",
        "clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)",
        "serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)",
        "serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)",
        "serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)"
      ],
      "emails": [
        "abusecomplaints@markmonitor.com",
        "whoisrequest@markmonitor.com"
      ],
      "dnssec": "unsigned",
      "name": null,
      "org": "Beijing Baidu Netcom Science Technology Co., Ltd.",
      "address": null,
      "city": null,
      "state": "Beijing",
      "zipcode": null,
      "country": "CN"
    }

    The output shows that the domain belongs to Baidu, which is indeed the case.
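
One way to act on this information is to choose the download delay based on the owner, as suggested at the start of this section. The sketch below is only an illustration: the function `choose_delay`, the `CAUTIOUS_OWNERS` tuple, and the delay values are hypothetical, and the record is a plain dict holding field values copied from the output above.

```python
# Hypothetical list of owners we assume are strict about crawlers; purely illustrative.
CAUTIOUS_OWNERS = ('baidu',)


def choose_delay(record, default=1.0, cautious=5.0):
    """Pick a delay in seconds between downloads from a WHOIS-style record."""
    # The "org" field may be missing or None, so fall back to an empty string.
    org = (record.get('org') or '').lower()
    if any(owner in org for owner in CAUTIOUS_OWNERS):
        return cautious
    return default


# Field values copied from the WHOIS output above.
record = {
    'org': 'Beijing Baidu Netcom Science Technology Co., Ltd.',
    'country': 'CN',
}
delay = choose_delay(record)
print(delay)
```

Assuming the object returned by `whois.whois(...)` supports dict-style `.get`, it could be passed to `choose_delay` directly, and the crawler would then sleep for `delay` seconds between requests.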

  • Original article: https://www.cnblogs.com/jcjc/p/10870530.html