  • A First Try at Web Scraping, Part 1 (Scraping Douban Movies)

    Tools

      Python 3.5

      BeautifulSoup

    Steps:

      1. Fetch the Douban movie page HTML from its URL and parse it
      2. Use BeautifulSoup to pick out the target nodes and write their attributes into dictionaries
      3. Save the dictionary data to a file
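    Step 2 hinges on the fact that Douban's now-playing page stores each movie's details in `data-*` attributes on `<li class="list-item">` nodes. Here is a minimal, self-contained sketch of that extraction against a made-up HTML fragment (the fragment only imitates the page's structure; the titles and values are invented):

    ```python
    from bs4 import BeautifulSoup

    # Hypothetical fragment imitating the structure of Douban's now-playing page
    sample_html = '''
    <ul>
      <li class="list-item" data-category="nowplaying" data-title="Movie A"
          data-score="8.5" data-duration="120min" data-actors="Actor X"></li>
      <li class="list-item" data-category="upcoming" data-title="Movie B"
          data-score="0" data-duration="95min" data-actors="Actor Y"></li>
    </ul>
    '''

    soup = BeautifulSoup(sample_html, 'html.parser')
    movies = []
    for li in soup.find_all('li', attrs={'class': 'list-item'}):
        if li.attrs['data-category'] == 'nowplaying':  # skip films not yet showing
            movies.append({'title': li.attrs['data-title'],
                           'score': li.attrs['data-score']})
    print(movies)  # → [{'title': 'Movie A', 'score': '8.5'}]
    ```

    Only the first `<li>` passes the `data-category` filter, so `movies` ends up holding a single dictionary.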

    # -*- coding: utf-8 -*-
    import requests
    from bs4 import BeautifulSoup


    # Send the request and return the response text ("" on any failure)
    def getHTMLText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return ""


    # Parse the HTML and collect the movies that are now playing
    def getMovieInfo(mlist, html):
        soup = BeautifulSoup(html, 'html.parser')
        lists = soup.find_all('li', attrs={'class': 'list-item'})
        for ls in lists:
            if ls.attrs['data-category'] == 'nowplaying':  # only films currently showing
                mdict = {}
                mdict['title'] = ls.attrs['data-title']
                mdict['score'] = ls.attrs['data-score']
                mdict['duration'] = ls.attrs['data-duration']
                mdict['actors'] = ls.attrs['data-actors']
                mlist.append(mdict)


    # Write the collected list to a text file
    def saveMovieInfo(mlist, path):
        with open(path, 'w', encoding='utf-8') as f:
            f.write(str(mlist))  # the with block closes the file automatically


    def main():
        mlist = []
        url = 'https://movie.douban.com/cinema/nowplaying/shenzhen/'
        path = 'D://pachong//movie.txt'
        html = getHTMLText(url)
        print(len(html))
        getMovieInfo(mlist, html)
        saveMovieInfo(mlist, path)

    if __name__ == '__main__':
        main()
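    `saveMovieInfo` writes `str(mlist)`, i.e. Python's repr of the list, which other tools cannot easily parse back. A small sketch of an alternative writer using the standard-library `json` module (the function name and file path here are my own, not from the original post):

    ```python
    import json
    import os
    import tempfile

    def save_movie_info_json(mlist, path):
        with open(path, 'w', encoding='utf-8') as f:
            # ensure_ascii=False keeps Chinese titles human-readable in the file
            json.dump(mlist, f, ensure_ascii=False, indent=2)

    movies = [{'title': '测试电影', 'score': '8.5'}]
    path = os.path.join(tempfile.gettempdir(), 'movies.json')
    save_movie_info_json(movies, path)

    with open(path, encoding='utf-8') as f:
        print(json.load(f) == movies)  # → True: the round trip preserves the data
    ```

    Unlike the repr dump, the resulting file can be read back with `json.load` from any language's JSON library.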
    

      

  • Original post: https://www.cnblogs.com/waylon/p/6796269.html