Python + SQLite Web Scraping in Practice

Topic: scrape job postings from a recruitment website and store them in a SQLite database.

Environment:

Python 3.5

SQLite

Navicat for SQLite (convenient for browsing the database)

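Steps:

I. Install SQLite

Download page: http://www.sqlite.org/download.html

This walkthrough uses Windows 10, so find the sqlite tools package under "Precompiled Binaries for Windows", download and unzip it, and place **sqlite3.exe** in the Python installation directory.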
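
The sqlite3 module in Python's standard library talks to the SQLite library directly and does not go through sqlite3.exe, so a quick check from Python can confirm that a working SQLite engine is available. A minimal sketch (the function name `sqlite_info` is made up for this illustration):

```python
import sqlite3


def sqlite_info():
    """Return the version string of the SQLite library Python is using."""
    # An in-memory database needs no file and shows that the engine works.
    conn = sqlite3.connect(":memory:")
    try:
        (version,) = conn.execute("SELECT sqlite_version()").fetchone()
    finally:
        conn.close()
    return version


if __name__ == "__main__":
    print("SQLite version:", sqlite_info())
```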

II. Initialize the SQLite database

1. Create a file named job_model.py in the d:\study\spyder directory

from peewee import *

db = SqliteDatabase('job.db')

class Job(Model):
    job_id = IntegerField(unique=True)    #key
    salary_min = IntegerField()
    salary_max = IntegerField()
    job_exp = CharField(max_length=100)
    company = CharField(max_length=100)
    company_id = IntegerField()
    company_info = CharField(max_length=100)
    url = CharField(max_length=100)
    attract = CharField(max_length=100)
    detail = TextField()
    address = CharField(max_length=100)
    publish_time = DateField()
    keyword = CharField(max_length=100)
    city = CharField(max_length=100)
    position = CharField(max_length=100)
    create_time = DateTimeField()

    class Meta:
        database = db

2. Open a command prompt in the d:\study\spyder directory (the directory that contains job_model.py) and enter the following lines in order

```
python -i job_model.py
db.connect()
db.create_tables([Job])
```

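If the directory has no job.db file, one is created, along with a table named job whose structure follows the model defined in job_model.py.

If the directory already has a job.db file, a table named job is added to the existing database.

At this point you can connect to job.db with Navicat and check whether the job table has been added.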
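
The same check can be done without Navicat. The sketch below uses the standard-library sqlite3 module (not peewee) to look the table up in SQLite's `sqlite_master` catalog; the helper names `table_exists` and `list_tables` are made up for this illustration.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def table_exists(conn, name):
    """Return True if a table called `name` exists in the database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None
```

Run from the directory that holds job.db, `table_exists(sqlite3.connect("job.db"), "job")` should return True once the table has been created. Note that `sqlite3.connect` creates an empty job.db if the file does not exist yet.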

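III. How do you write data to SQLite from Python?

This uses the third-party Python library peewee; install it from the command line with pip3 install peewee. The job_model.py file in step II already imports peewee, so do this installation before running that step.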

from spyder.job_model import Job
import peewee

class spyder:

    # ...... (the rest of the class is elided in the original)

    # The argument is a dict holding the fields of one job posting
    def storeDataToSqlite(self, dic):
        try:
            Job.create(job_id = dic['job_id'],
                        salary_min = dic['salary_min'],
                        salary_max = dic['salary_max'],
                        company = dic['company'],
                        company_id = 0,  # set to 0 for now
                        company_info = dic['company_info'],
                        url = dic['url'],
                        attract = dic['attract'],
                        detail = dic['detail'],
                        address = dic['address'],
                        publish_time = dic['publish_time'],
                        keyword = dic['keyword'],
                        city = dic['city'],
                        job_exp = dic['exp'],
                        position = dic['position'],
                        # today_date is not shown in the original; presumably set in the elided code
                        create_time = self.today_date)

        except peewee.IntegrityError:
            # job_id is declared unique, so inserting the same posting twice raises IntegrityError
            print("Insert failed: ID %s, company %s already exists" % (dic['job_id'],dic['company']))
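
The duplicate handling above relies on the uniqueness constraint that `unique=True` puts on job_id; peewee reports the database driver's integrity error as `peewee.IntegrityError`. The sketch below reproduces that behavior with the standard-library sqlite3 module only, on a cut-down two-column table. The table layout and the helper names `make_demo_db` and `insert_job` are assumptions for illustration, not the article's schema.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a simplified job table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job ("
        "job_id INTEGER NOT NULL UNIQUE, "
        "company VARCHAR(100) NOT NULL)"
    )
    return conn


def insert_job(conn, job_id, company):
    """Insert one row; return True if stored, False if job_id already exists."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO job (job_id, company) VALUES (?, ?)",
                (job_id, company),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised because of the UNIQUE constraint on job_id.
        return False
```

Catching the error per row, as the article's method does, lets a scraper run again over postings it has already stored without stopping at the first duplicate.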

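IV. Create a new file spyder_job.py and start designing and writing the scraper

Option 1: the requests library + BeautifulSoup

Option 2: selenium 3 + Chrome

Approach: search by city and job keyword -> step through the result pages -> collect the URL of each job posting -> visit each posting page in turn -> extract the information from each page -> write each record to the database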
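
The flow above can be sketched as a pair of nested loops. This is only an outline under stated assumptions: it uses the standard-library `html.parser` in place of BeautifulSoup so it runs without extra installs, the `job-link` CSS class and the URL template are hypothetical (the article does not name the site or its markup), and `fetch` and `store` are injected callables standing in for the HTTP request and the database write.

```python
from html.parser import HTMLParser
from urllib.parse import quote, urljoin


class JobLinkParser(HTMLParser):
    """Collect href values of <a class="job-link"> tags (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "job-link" in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def extract_job_links(html, base_url):
    """Return absolute URLs of the job postings found on one result page."""
    parser = JobLinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(fetch, store, list_url, city, keyword, max_pages):
    """Walk the result pages and store every posting once.

    fetch(url) returns the page HTML; store(url, html) saves one posting.
    list_url is a template with {city}, {keyword} and {page} placeholders.
    Returns the number of distinct postings stored.
    """
    seen = set()
    for page in range(1, max_pages + 1):
        page_url = list_url.format(
            city=quote(city), keyword=quote(keyword), page=page
        )
        links = extract_job_links(fetch(page_url), page_url)
        if not links:
            break  # no postings on this page: assume the results have ended
        for link in links:
            if link in seen:
                continue  # the same posting can appear on several pages
            seen.add(link)
            store(link, fetch(link))
    return len(seen)
```

Passing `fetch` and `store` in as parameters keeps the loop independent of whether pages come from requests or selenium, and makes it easy to try out with canned HTML before touching a real site.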

V. Connect to job.db with Navicat and check whether new rows have been added to the job table
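
As an alternative to opening Navicat, the row count and the most recently stored rows can be read with the standard-library sqlite3 module. A minimal sketch, assuming the table is named job and has the job_id, company, position and create_time columns from the model; the function name `summarize_jobs` is made up for this illustration.

```python
import sqlite3


def summarize_jobs(conn, limit=5):
    """Return (total row count, newest rows) from the job table."""
    total = conn.execute("SELECT COUNT(*) FROM job").fetchone()[0]
    newest = conn.execute(
        "SELECT job_id, company, position, create_time FROM job "
        "ORDER BY create_time DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return total, newest
```

Comparing the total before and after a scraping run shows how many new postings were stored.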

Original article: https://www.cnblogs.com/hlphlp/p/6855777.html