  • Building a simple web crawler with the requests library

    Date: 2019-06-09

    Author: Sun

    We analyze the quotation site https://www.geyanw.com/ and crawl its content using the requests networking library together with the bs4 parsing library.
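
    Before writing the full spider, it helps to confirm that the page structure can actually be selected. A minimal sketch (the selector `#p_left dl.tbox` and the `dt a` title link come from the spider code below; the rest is only illustrative):

    import requests
    from bs4 import BeautifulSoup

    # fetch the home page and print the title of each category block under #p_left
    r = requests.get("https://www.geyanw.com/")
    r.encoding = r.apparent_encoding          # guard against a mis-detected charset
    soup = BeautifulSoup(r.text, "lxml")
    for tbox in soup.select("#p_left dl.tbox"):
        print(tbox.select("dt a")[0].text)    # one title per category block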

    Project steps

    1. Create the project folder structure

      --geyanwang
         ---spiders  # holds the spider code
            ---- geyan.py # the spider itself
         ---doc   # documentation of the steps
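
      One way to create this layout from the shell (the folder and file names are exactly those listed above):

      $ mkdir -p geyanwang/spiders geyanwang/doc
      $ touch geyanwang/spiders/geyan.py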
      
    2. Create a virtual environment

      cd geyanwang/
      virtualenv spider --python=python3  # create a virtualenv named "spider"
      
    3. Install the dependencies

      $ source spider/bin/activate
      (spider) $ pip install requests
      (spider) $ pip install lxml
      (spider) $ pip install bs4
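
      Optionally, the installed versions can be pinned so the environment can be re-created later (requirements.txt is the usual convention; the exact versions will simply be whatever pip resolved):

      (spider) $ pip freeze > requirements.txt
      (spider) $ pip install -r requirements.txt   # re-create the same environment elsewhere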
      
    4. Write the code in spiders/geyan.py

    # -*- coding: utf-8 -*-
    __author__ = 'sun'
    __date__ = '2019/6/19 2:22 PM'
    
    from bs4 import BeautifulSoup as BSP4
    
    import requests
    
    g_set = set()
    
    def store_file(file_name, r):
    	'''Save the response body to geyan_<file_name>.html in the current directory.'''
    	html_doc = r.text
    	with open("geyan_%s.html" % file_name, "w") as f:
    		f.write(html_doc)
    
    def download(url, filename='index'):
    	'''
    	:param url: address of the page to download
    	:return: the response object for the page
    	'''
    	r = requests.get(url)   # send the request and fetch the page content

    	store_file(filename, r)
    	return r
    
    
    def parse_tbox(tbox, base_domain):
    	'''
    	Parse one category block (a dl.tbox element)
    	:param tbox: the dl.tbox node for this category
    	:param base_domain: site root used to build absolute links
    	:return:
    	'''
    	tbox_tag = tbox.select("dt a")[0].text
    	print(tbox_tag)
    
    	index = 0
    	li_list = tbox.find_all("li")
    	for li in li_list:
    		link = base_domain + li.a['href']
    		print("index:%s, link:%s" % (index, link))
    		index += 1
    		if link not in g_set:
    			g_set.add(link)
    			filename = "%s_%s" % (tbox_tag, index)
    			download(link, filename)
    
    
    def parse(response):
    	'''
    	Parse the home page
    	:param response: the response object returned by download()
    	:return:
    	'''
    	base_domain = response.url[:-1]   # strip the trailing slash to get the site root
    	g_set.add(base_domain)
    	html_doc = response.content
    	soup = BSP4(html_doc, "lxml")
    	tbox_list = soup.select("#p_left dl.tbox")  # one dl.tbox per category
    	for tbox in tbox_list:
    		parse_tbox(tbox, base_domain)
    
    
    
    def main():
    	base_url = "https://www.geyanw.com/"
    	response = download(base_url)
    	parse(response)
    
    
    if __name__ == "__main__":
    	main()
    
    5. Run the code above; it saves a set of HTML files (one per downloaded page) to the current directory.
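
    For example, the home page is written to geyan_index.html. Because the site is in Chinese, the saved pages may come out garbled if requests mis-detects the charset; a hedged tweak to download() and store_file() (not part of the original post) is to re-detect the encoding and write the files out as UTF-8:

    def download(url, filename='index'):
    	r = requests.get(url)
    	r.encoding = r.apparent_encoding   # re-detect the charset instead of trusting the HTTP header
    	store_file(filename, r)
    	return r

    def store_file(file_name, r):
    	# write the page out as UTF-8 regardless of the source encoding
    	with open("geyan_%s.html" % file_name, "w", encoding="utf-8") as f:
    		f.write(r.text)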

    Homework

    The geyan.py file above only handles the home page.

    How would you crawl the content category by category, following each category's pagination, using multiple threads?

    e.g.:

    https://www.geyanw.com/lizhimingyan/

    https://www.geyanw.com/renshenggeyan/

    Save the crawled pages locally, with a separate folder for each category. (A hedged sketch of one possible approach follows below.)
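
    One possible approach, as a sketch rather than a definitive solution: give each category its own thread, save its pages into a folder named after the category, and discover further pages by following any link that stays inside the category path. The two category URLs come from the examples above; the link-following heuristic is an assumption, since the post does not describe the site's pagination structure.

    import os
    import threading
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    CATEGORY_URLS = [
        "https://www.geyanw.com/lizhimingyan/",
        "https://www.geyanw.com/renshenggeyan/",
    ]

    def fetch(url):
        r = requests.get(url)
        r.encoding = r.apparent_encoding
        return r.text

    def crawl_category(cat_url):
        # folder named after the category, e.g. "lizhimingyan"
        folder = cat_url.rstrip("/").rsplit("/", 1)[-1]
        os.makedirs(folder, exist_ok=True)

        seen = set()
        todo = [cat_url]
        while todo:
            url = todo.pop()
            if url in seen:
                continue
            seen.add(url)
            html = fetch(url)
            name = url.rstrip("/").rsplit("/", 1)[-1].replace(".html", "") or "index"
            with open(os.path.join(folder, "%s.html" % name), "w", encoding="utf-8") as f:
                f.write(html)
            # queue every link that stays inside this category (pagination pages and articles)
            soup = BeautifulSoup(html, "lxml")
            for a in soup.select("a[href]"):
                link = urljoin(url, a["href"]).split("#")[0]   # drop fragments
                if link.startswith(cat_url):
                    todo.append(link)

    threads = [threading.Thread(target=crawl_category, args=(u,)) for u in CATEGORY_URLS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    The work is I/O-bound, so plain threads (or a ThreadPoolExecutor) are sufficient here.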

  • Original article: https://www.cnblogs.com/sunBinary/p/11055662.html