  • Web crawler

    # -*- coding: utf-8 -*-
    # @Time : 2019/5/31 19:33
    # @Author : zejin
    # @File : pachong.py

    from urllib import request
    import re


    class Analysis():
        url = 'https://book.douban.com/'
        # [\s\S]*? lazily matches any character, including newlines
        root_patten = r'<div class="cover">([\s\S]*?)</div>'
        name_patten = r'alt="([\s\S]*?)">'
        adress_patten = r'href="([\s\S]*?)" title'

        def __face_connect(self):
            # Download the page and decode the response bytes as UTF-8
            r = request.urlopen(self.url)
            htmls = r.read()
            htmls = str(htmls, encoding='utf-8')
            return htmls

        def __analysis(self, htmls):
            # Extract every cover block, then the book name and link inside it
            root_htmls = re.findall(self.root_patten, htmls)
            # print(root_htmls)
            ancors = []
            for html in root_htmls:
                name = re.findall(self.name_patten, html)
                adress = re.findall(self.adress_patten, html)
                ancor = {"name": name, "adress": adress}
                ancors.append(ancor)
            # print(ancors)
            return ancors

        def __refine(self, ancors):
            # Placeholder for cleaning up the extracted data
            pass

        def go(self):
            htmls = self.__face_connect()
            ancors = self.__analysis(htmls)
            # self.__refine(ancors)
            # ancors = self.__refine(ancors)
            print(ancors)


    analysis = Analysis()
    analysis.go()
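
The extraction step can be tried offline, without fetching the live page. The sketch below is only an illustration: `SAMPLE_HTML` is made-up markup shaped to fit the three patterns (it is not Douban's actual page source), and the names `ROOT_PATTERN`, `NAME_PATTERN`, `ADDRESS_PATTERN` and `extract` are hypothetical stand-ins for the class attributes and the `__analysis` method above.

```python
import re

# Hypothetical markup, shaped only to fit the patterns used in the script.
SAMPLE_HTML = '''
<div class="cover">
  <a href="https://example.com/book/1" title="Book One">
    <img src="one.jpg" alt="Book One">
  </a>
</div>
<div class="cover">
  <a href="https://example.com/book/2" title="Book Two">
    <img src="two.jpg" alt="Book Two">
  </a>
</div>
'''

# Same patterns as in the script, written as raw strings.
ROOT_PATTERN = r'<div class="cover">([\s\S]*?)</div>'
NAME_PATTERN = r'alt="([\s\S]*?)">'
ADDRESS_PATTERN = r'href="([\s\S]*?)" title'


def extract(html):
    """Return one {"name": [...], "adress": [...]} dict per cover block."""
    results = []
    for block in re.findall(ROOT_PATTERN, html):
        results.append({
            "name": re.findall(NAME_PATTERN, block),
            "adress": re.findall(ADDRESS_PATTERN, block),
        })
    return results


books = extract(SAMPLE_HTML)
print(books)

# Without the backslashes, [sS] matches only the letters "s" and "S",
# so the multi-line cover blocks in this sample are not matched at all.
broken = re.findall(r'<div class="cover">([sS]*?)</div>', SAMPLE_HTML)
print(broken)
```

Each value in the result is a list because `re.findall` returns every match in a block. Flattening those lists into plain strings is the kind of cleanup the unimplemented `__refine` step could perform, although the original script does not say what it was meant to do.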
  • Original article: https://www.cnblogs.com/jinbaobao/p/10959606.html