  • A Simple Qichacha (qcc.com) Crawler

    After wrestling with the Qichacha site, I came away strongly convinced of the value of packet capture — so much so that from now on I'll use a capture tool to replay requests and give up on analyzing them in F12.

    Consider this post a eulogy for the late, departed F12~~~

    import requests
    from lxml import etree

    # Search results page for "天津滨海新区" (the keyword is URL-encoded in the query string)
    url = "https://www.qcc.com/search?key=%E5%A4%A9%E6%B4%A5%E6%BB%A8%E6%B5%B7%E6%96%B0%E5%8C%BA"

    # Headers copied straight from a packet capture; the cookie carries the logged-in session
    hed = {
        "host": "www.qcc.com",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
        "upgrade-insecure-requests": "1",
        "cookie": "QCCSESSID=vpk1mpc45ci95eu83etg528881; zg_did=%7B%22did%22%3A%20%221732cdcac86bf-0039dd6baef69a-4353761-100200-1732cdcac8844f%22%7D; UM_distinctid=1732cdcb0a713b-01b058b949aa5a-4353761-100200-1732cdcb0ab44e; hasShow=1; _uab_collina=159418552807339394444789; acw_tc=7d27c71c15941953776602556e6b8442bc8001e4e1270e8fead4b79557; CNZZDATA1254842228=1092104090-1594185078-https%253A%252F%252Fwww.baidu.com%252F%7C1594195878; Hm_lvt_78f134d5a9ac3f92524914d0247e70cb=1594194111,1594195892,1594195918,1594196042; Hm_lpvt_78f134d5a9ac3f92524914d0247e70cb=1594196294; zg_de1d1a35bfa24ce29bbf2c7eb17e6c4f=%7B%22sid%22%3A%201594185526424%2C%22updated%22%3A%201594196294349%2C%22info%22%3A%201594185526455%2C%22superProperty%22%3A%20%22%7B%7D%22%2C%22platform%22%3A%20%22%7B%7D%22%2C%22utm%22%3A%20%22%7B%5C%22%24utm_source%5C%22%3A%20%5C%22baidu1%5C%22%2C%5C%22%24utm_medium%5C%22%3A%20%5C%22cpc%5C%22%2C%5C%22%24utm_term%5C%22%3A%20%5C%22pzsy%5C%22%7D%22%2C%22referrerDomain%22%3A%20%22www.baidu.com%22%2C%22cuid%22%3A%20%22fd05f1ac2b561244aaa6b27b3bb617a4%22%7D",
    }

    # Fetch the page and build an lxml HTML tree from the raw bytes
    content = requests.get(url=url, headers=hed).content
    tree = etree.HTML(content)

    # Company names: the link text in the third cell of each result row
    title_list = []
    titles = tree.xpath('//*[@id="search-result"]//tr/td[3]/a//text()')
    for tit in titles:
        title_list.append(tit.replace(",", "").strip())

    # Registered addresses: the fourth paragraph in the same cell
    addr_list = []
    addrs = tree.xpath('//*[@id="search-result"]//tr/td[3]/p[4]//text()')
    for addr in addrs:
        addr_list.append(addr.replace(",", "").strip())

    print(title_list)
    print(addr_list)
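    The script above prints the names and addresses as two separate lists. Assuming the two XPath queries return one result per row in the same order (not guaranteed by the page, so verify the counts match), they can be zipped into per-company records:

    ```python
    # Pair each company name with its address, assuming both lists
    # are parallel: same length, same row order.
    def to_records(titles, addrs):
        if len(titles) != len(addrs):
            raise ValueError("title/address counts differ; XPath results are misaligned")
        return [{"name": t, "addr": a} for t, a in zip(titles, addrs)]

    records = to_records(["Company A", "Company B"], ["Addr A", "Addr B"])
    print(records)
    ```

    Failing fast on a length mismatch is deliberate: silently zipping misaligned lists would attach the wrong address to a company.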

    The code is simple, even crude, so why write it up? Because when I analyzed the request headers by hand and assembled them with Ctrl+C/Ctrl+V, the header data came out wrong; the request analysis from a packet-capture tool proved far faster and more reliable.
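    One way to avoid error-prone hand-copying is to paste the raw request text exported by the capture tool and parse it into a headers dict programmatically. A minimal sketch (the `raw` string below is a hypothetical capture, not from the original session):

    ```python
    def headers_from_raw(raw: str) -> dict:
        """Parse 'Name: value' header lines from a raw HTTP request dump."""
        headers = {}
        for line in raw.strip().splitlines()[1:]:  # skip the request line (GET /... HTTP/1.1)
            if ":" in line:
                name, _, value = line.partition(":")
                headers[name.strip().lower()] = value.strip()
        return headers

    raw = """GET /search?key=test HTTP/1.1
    Host: www.qcc.com
    User-Agent: Mozilla/5.0
    Upgrade-Insecure-Requests: 1"""

    print(headers_from_raw(raw))
    # {'host': 'www.qcc.com', 'user-agent': 'Mozilla/5.0', 'upgrade-insecure-requests': '1'}
    ```

    The resulting dict can be passed directly as `headers=` to `requests.get`, so the headers always match what the capture tool actually recorded.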

    Keep going, keep working hard.

    Wind and clouds bring their own rain; wind and frost seem to cling to my straw cloak.
  • Original post: https://www.cnblogs.com/meipu/p/13267792.html