How to Efficiently Scrape Lianjia Housing Listings (Part 2)

"Part two of a Lianjia web scraper implemented in Python."

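This series uses Lianjia's Nanjing site as an example and implements a Python scraper for Lianjia's second-hand housing listings, storing the scraped data in a database for later use.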

Part one of this series:
How to Efficiently Scrape Lianjia Housing Listings (Part 1)

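This post is part two: it scrapes residential community (xiaoqu) information and stores it in the database. Some of the code depends on part one.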

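The previous post already collected the URL of each major district, so scraping every community only requires iterating over those URLs:
# Scrape the community information for every district
for regionurl in regionurls:
    do_xiaoqu_spider(db_xq, regionurl)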

Scraping every community within one district requires pagination:
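import json
import random
import threading
import urllib.error
import urllib.request

from bs4 import BeautifulSoup


def do_xiaoqu_spider(db_xq, url=u"https://nj.lianjia.com/xiaoqu/gulou/"):
    """
    Scrape every community in one district.
    """
    try:
        # hds (a list of request headers) is not defined in this post;
        # it is assumed to come from part one.
        req = urllib.request.Request(url, headers=random.choice(hds))
        source_code = urllib.request.urlopen(req, timeout=5).read()
        plain_text = source_code.decode('utf-8')
        soup = BeautifulSoup(plain_text, "html.parser")
    except (urllib.error.HTTPError, urllib.error.URLError) as e:
        print(e)
        return
    except Exception as e:
        print(e)
        return

    # The pager's page-data attribute is expected to hold JSON with a totalPage key.
    # Parse it with json.loads rather than exec, which would run text taken from the page.
    page_box = soup.find('div', {'class': 'page-box house-lst-page-box'})
    if page_box is None:
        print(u"No pager found on %s" % url)
        return
    total_pages = json.loads(page_box.get('page-data'))['totalPage']

    # Start one thread per result page, then wait for all of them
    threads = []
    for i in range(total_pages):
        url_page = url + u"pg%d/" % (i + 1)
        print(url_page)
        t = threading.Thread(target=xiaoqu_spider, args=(db_xq, url_page))
        threads.append(t)
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(u"Finished scraping all communities in %s" % url)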
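
The pagination step can be isolated from the network code, which makes it easier to check. The sketch below is not the author's code: `build_page_urls` and `run_bounded` are hypothetical helper names, and the `page-data` value is assumed to be a JSON string containing a `totalPage` key. It also swaps the one-thread-per-page approach for a bounded `ThreadPoolExecutor`, which may be gentler on the site when a district has many pages.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def build_page_urls(region_url, page_data):
    """Turn a page-data attribute value into the list of page URLs."""
    # page_data is assumed to be JSON with a totalPage key
    total_pages = int(json.loads(page_data)["totalPage"])
    return ["%spg%d/" % (region_url, i) for i in range(1, total_pages + 1)]


def run_bounded(worker, urls, max_workers=8):
    """Call worker(url) for every URL using at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all results and re-raises any worker exception
        return list(pool.map(worker, urls))


if __name__ == "__main__":
    urls = build_page_urls("https://nj.lianjia.com/xiaoqu/gulou/",
                           '{"totalPage":3,"curPage":1}')
    # len is a stand-in worker; a real run would pass a function
    # that fetches and parses one page
    print(run_bounded(len, urls))
```

With the functions in this post, the worker could be something like `lambda u: xiaoqu_spider(db_xq, u)`, with `max_workers` tuned to what the site tolerates.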

Scraping the communities on a single page:
import re
import urllib.error
# random, urllib.request and BeautifulSoup are used the same way as in do_xiaoqu_spider above


def xiaoqu_spider(db_xq, url_page=u"https://nj.lianjia.com/xiaoqu/gulou/pg1/"):
    """
    Scrape the communities listed on one result page.
    """
    try:
        req = urllib.request.Request(url_page, headers=random.choice(hds))
        source_code = urllib.request.urlopen(req, timeout=10).read()
        plain_text = source_code.decode('utf-8')
        soup = BeautifulSoup(plain_text, "html.parser")
    except (urllib.error.HTTPError, urllib.error.URLError) as e:
        print(e)
        return  # runs in a worker thread, so skip this page instead of calling exit()
    except Exception as e:
        print(e)
        return

    xiaoqu_list = soup.find_all('li', {'class': 'clear xiaoquListItem'})
    for xq in xiaoqu_list:
        info_dict = {}
        title = xq.find('div', {'class': 'title'})
        info_dict.update({u'小区名称': title.text})  # community name
        for item in title.find_all('a'):
            info_dict.update({u'url': item['href']})

        # Inner HTML of the position block, with all whitespace removed
        position_info = xq.find('div', {'class': 'positionInfo'}).decode_contents().strip()
        content = "".join(position_info.split())
        info = re.match(r".+district.+>(.+)</a>.+bizcircle.+>(.+)</a>(.+)", content)
        if info:
            info = info.groups()
            info_dict.update({u'大区域': info[0]})    # district
            info_dict.update({u'小区域': info[1]})    # business circle
            info_dict.update({u'建造时间': info[2]})  # build year
        # gen_xiaoqu_insert_command is not shown in this post; presumably it comes from part one
        command = gen_xiaoqu_insert_command(info_dict)
        db_xq.execute(command, 1)
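
To see what the regular expression in `xiaoqu_spider` captures, it helps to run it on a small fragment outside the scraper. The sketch below is illustrative only: `parse_position_info` is a hypothetical helper name, and `SAMPLE` is a hand-written fragment that assumes the `positionInfo` block contains one link with class `district`, one with class `bizcircle`, and trailing text with the build year. The real markup may differ and can change over time.

```python
import re

# Same pattern as in xiaoqu_spider, compiled once
POSITION_RE = re.compile(r".+district.+>(.+)</a>.+bizcircle.+>(.+)</a>(.+)")


def parse_position_info(inner_html):
    """Return (district, business circle, build-year text), or None if no match."""
    # Remove all whitespace first, as the scraper does
    content = "".join(inner_html.split())
    match = POSITION_RE.match(content)
    if match is None:
        return None
    district, bizcircle, tail = match.groups()
    # Dropping the "/" separators is an addition of this sketch
    return district, bizcircle, tail.strip("/")


# Hand-written sample, assumed to resemble the real positionInfo block
SAMPLE = """
<span class="positionIcon"></span>
<a href="https://nj.lianjia.com/xiaoqu/gulou/" class="district">鼓楼</a>
<a href="https://nj.lianjia.com/xiaoqu/longjiang/" class="bizcircle">龙江</a>
/ 2005年建成
"""

if __name__ == "__main__":
    print(parse_position_info(SAMPLE))
```

Because the pattern is greedy and tied to the class names, a page whose markup lacks either link simply yields no match, which is why the scraper guards the groups with `if info:`.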

The scraped community information is stored in a database table for later use.
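
The database wrapper (`db_xq`) and `gen_xiaoqu_insert_command` are not shown in this post, so the following is only a rough sketch of what the storage step might look like with Python's built-in `sqlite3` module. The table name, column names, and the `save_xiaoqu` helper are all assumptions made for illustration, not the schema from part one. It uses a parameterized query instead of a generated SQL string, so community names containing quotes do not break the statement.

```python
import sqlite3

# Hypothetical schema; the real table is defined elsewhere in the series
SCHEMA = """
CREATE TABLE IF NOT EXISTS xiaoqu (
    name TEXT PRIMARY KEY,
    url TEXT,
    district TEXT,
    bizcircle TEXT,
    build_year TEXT
)
"""

# Keys of info_dict, in the same order as the columns above
KEYS = (u"小区名称", u"url", u"大区域", u"小区域", u"建造时间")


def save_xiaoqu(conn, info_dict):
    """Insert or replace one community row; missing keys are stored as NULL."""
    values = tuple(info_dict.get(key) for key in KEYS)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO xiaoqu VALUES (?, ?, ?, ?, ?)", values
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    save_xiaoqu(conn, {u"小区名称": u"示例小区",
                       u"url": u"https://nj.lianjia.com/xiaoqu/example/",
                       u"大区域": u"鼓楼"})
    print(conn.execute("SELECT name, district FROM xiaoqu").fetchall())
```

By default a `sqlite3` connection may only be used from the thread that created it, so a multithreaded scraper would need one connection per thread or a single writer guarded by a lock.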

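Upcoming posts will explain how to scrape second-hand homes currently for sale and historical second-hand transaction records. Stay tuned.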

Original article: https://www.cnblogs.com/protosec/p/11673337.html