zoukankan      html  css  js  c++  java
  • python scrapy爬虫存储数据库方法带去重步骤

    import pymongo
    import requests
    import random
    import time
    import pymysql
    
    db = pymongo.MongoClient()['cs']['dn']
    db1 = pymysql.connect(user='root',password='root',db='cs',charset='utf8')
    cursor = db1.cursor()
    
    class CsdnPipeline(object):
        def __init__(self):
            self.set = set()
        def process_item(self, item, spider):
            if item not in self.set:
                title = item['title']
                content_text = item['content_text']
                create_time_datetime = item['create_time_datetime']
                nickName = item['nickName']
                read_count = item['read_count']
                content_img = item['content_img']
                keyword = item['keyword']
                if len(content_img)>0:
                    path = []
                    for img in content_img:
                        img_name = 'F:\34\tu\'+str(time.time()).split('.')[1]+str(random.randrange(1,9999999999999999999999999))+'.jpg'
                        img_source = requests.get(img).content
                        op = open(img_name,'wb')
                        op.write(img_source)
                        op.close()
                        path.append(img_name)
                    item['content_img'] = path
    
                else:
                    item['content_img'] = '暂无图片'
                db.insert(dict(item))
                import json
                data = json.dumps(dict(item))
                sql = "insert into dn1(`data`) VALUES ('{}')".format(data)
                cursor.execute(sql)
                db1.commit()
                self.set.add(item)
                return item
            else:
                print('已经存在')
                return item
  • 相关阅读:
    前端安全
    关于HTTPS的概念性了解
    数组去重
    防抖与节流
    对meta标签的再次认识
    关于路由, 我好奇的那些点
    关于构造函数,实例,原型对象一纯手工的理解
    数据库查找操作-java
    python之图像加载和简单处理
    python之excel表格操作
  • 原文地址:https://www.cnblogs.com/duanlinxiao/p/9851206.html
Copyright © 2011-2022 走看看