  • Scraping the Asset Management Association of China (AMAC) website

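    Scraping requirements:

    Page range to scrape: page 5875 through the last page.
    On the second-level page opened by clicking a fund name, extract two fields: "Fund type" and "Management type".
    On the second-level page opened by clicking a private fund manager's name, extract the "Registration date" and "Establishment date" fields.
    Append the four second-level-page fields after the list-page columns to form a table.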
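    As a rough illustration of the last requirement, the sketch below appends detail-page fields to the list-page fields of one record. The keys `fundName`, `managerName` and `mandatorName` come from the list API used in the code below; the function `build_row` and the detail keys (`fund_type`, `manager_type`, `register_time`, `establish_time`) are hypothetical names chosen for this example only.

```python
from typing import Mapping, Tuple

# List-page keys as returned by the list API used in this post.
LIST_COLUMNS = ("fundName", "managerName", "mandatorName")
# Hypothetical keys for the four fields taken from the second-level pages.
DETAIL_COLUMNS = ("fund_type", "manager_type", "register_time", "establish_time")


def build_row(list_item: Mapping[str, str], detail: Mapping[str, str]) -> Tuple[str, ...]:
    """Return one table row: list-page fields first, detail-page fields after."""
    head = tuple(list_item.get(key, "") for key in LIST_COLUMNS)
    tail = tuple(detail.get(key, "") for key in DETAIL_COLUMNS)
    return head + tail
```

    Missing keys are filled with empty strings, so a page that lacks one field still yields a row of the same width.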
    

    Code:

    import codecs
    import csv
    from lxml import etree
    import requests
    import random
    import json
    import time
    import pandas as pd
    import threading
    
    
    # Convert a millisecond timestamp into a human-readable time string
    def timeStamp(timeNum):
        timeStamp = float(timeNum / 1000)
        timeArray = time.localtime(timeStamp)
        otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
        return otherStyleTime
    
    
    def save(rows):
        with codecs.open('证券.csv', 'ab', encoding='utf8') as f:
            writer = csv.writer(f)
            writer.writerows(rows)
    
    
    baocuo_list = []
    
    
    def craw(num):
        rows = []
        try:
            print('Start crawling =========', num)
            headers = {
                'Accept': 'application/json,text/javascript,*/*; q=0.01',
                'Accept-Encoding': 'gzip,deflate',
                'Connection': 'keep-alive',
                'Host': 'gs.amac.org.cn',
                'Content-Type': 'application/json;charset=UTF-8',
                'Origin': 'http://gs.amac.org.cn',
                'X-Requested-With': 'XMLHttpRequest',
                'Referer': 'http://gs.amac.org.cn/amac-infodisc/res/pof/fund/index.html',
                'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Mobile Safari/537.36'
            }
            r = random.random()
            url = "http://gs.amac.org.cn/amac-infodisc/api/pof/fund?rand=" + str(r) + "&page=" + str(num) + "&size=20"
            data = {}
            data = json.dumps(data)
            response = requests.post(url=url, data=data, headers=headers)
            data_list = json.loads(response.text)["content"]
            count = 0
            for data in data_list:
                # print(data)
                fund_name = data['fundName']
                manager_name = data['managerName']
                mandator_name = data['mandatorName']
                establishDate = timeStamp(data['establishDate'])
                # Keep only the date part (YYYY-MM-DD) of the fund's establishment time
                putOnRecordDate = str(establishDate)[:10]
                count += 1
                # Detail-page URLs on the AMAC site (fund page and manager page)
                url = 'http://gs.amac.org.cn/amac-infodisc/res/pof/fund/' + data['url']
                manager_url = 'http://gs.amac.org.cn/amac-infodisc/res/pof/' + data.get('managerUrl')[3:]
                response = requests.get(url=url, headers=headers)
                response.encoding = 'utf-8'
                # Management type
                manager_type = data['managerType']
                # Fund type
                text = response.text
                text = etree.HTML(text)
                basic_type = text.xpath('/html/body/div[3]/div/div[2]/div[1]/div/table/tbody')[0].xpath(
                    'string(.)').strip().split(":")
                a = 0
    
                for i in basic_type:
                    if '基金类型' in i:
                        a = basic_type.index(i)
                basic_type = basic_type[a + 1]
                basic_type = basic_type.split()[0].strip()
    
                # Filing date
                beian_time = text.xpath('/html/body/div[3]/div/div[2]/div[1]/div/table/tbody/tr[4]/td[2]')[0].xpath(
                    'string(.)').replace('\n', '').replace(" ", "").replace("\t", "")
                response = requests.get(url=manager_url, headers=headers)
                response.encoding = 'utf-8'
                text = response.text
                text = etree.HTML(text)
                # Establishment date (of the manager)
                establish_time = text.xpath('/html/body/div[3]/div/div[4]/div[2]/div[2]/table/tbody/tr[6]/td[2]')[0].xpath(
                    'string(.)').replace('\n', '').replace(" ", "").replace("\t", "").split(':')[-1]
                # Registration date
                register_time = text.xpath('/html/body/div[3]/div/div[4]/div[2]/div[2]/table/tbody/tr[5]/td[2]')[0].xpath(
                    'string(.)').replace('\n', '').replace(" ", "").replace("\t", "").split(':')[-1]
                row = (
                    fund_name, manager_name, mandator_name, putOnRecordDate, beian_time, basic_type, manager_type,
                    register_time,
                    establish_time)
                rows.append(row)
                # list.pop() takes an index, not a value; remove the page number by value
                if num in baocuo_list:
                    baocuo_list.remove(num)
    
    
            if len(rows) > 0:
                save(rows)
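            print('Finished crawling ==========', num)
        except Exception as e:
            print('Failed to crawl ======', num)
            print('Reason ======', e)
            # Record the page for a retry, without adding duplicates
            if num not in baocuo_list:
                baocuo_list.append(num)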
    
    
    if __name__ == '__main__':
        with codecs.open('证券.csv', 'ab', encoding='utf8') as f:
            writer = csv.writer(f)
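            writer.writerow(["Fund name", "Private fund manager name", "Custodian name", "Establishment date", "Filing date", "Fund type", "Management type", "Registration date", "Establishment date (manager)"])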
        for num in range(5874, 6620):
            t = threading.Thread(target=craw, args=(num,))
            t.start()
            t.join()
    
        print(baocuo_list)
        # Retry failed pages until none remain; iterate over a copy because craw() mutates baocuo_list
        while baocuo_list:
            for i in list(baocuo_list):
                print('Retrying ===========', i)
                t = threading.Thread(target=craw, args=(i,))
                t.start()
                t.join()
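
    The retry loop above keeps running until baocuo_list is empty, so a page that fails every time would keep it looping forever. A minimal sketch of a bounded retry follows. It assumes a fetch function that raises on failure (unlike craw, which catches its own exceptions); the names `crawl_with_retries`, `fetch_page`, `max_attempts` and `delay` are hypothetical and not part of the original script.

```python
import time
from typing import Callable, Iterable, List


def crawl_with_retries(pages: Iterable[int],
                       fetch_page: Callable[[int], None],
                       max_attempts: int = 3,
                       delay: float = 0.0) -> List[int]:
    """Call fetch_page for every page and return the pages that still fail."""
    pending = list(pages)
    for attempt in range(max_attempts):
        failed = []
        for page in pending:
            try:
                fetch_page(page)
            except Exception as exc:
                print('failed:', page, exc)
                failed.append(page)
        if not failed:
            return []
        pending = failed
        # Pause between rounds, but not after the final one
        if attempt < max_attempts - 1:
            time.sleep(delay)
    return pending
```

    Each round only revisits the pages that failed in the previous round, and whatever is left after max_attempts rounds is returned so it can be logged or rerun later.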
    
    
  • Original article: https://www.cnblogs.com/ghh520/p/13624029.html