  • Python Crawler Example (4): Scraping NetEase News

    With some spare time on my hands, I wrote a crawler for NetEase News. The main work is analyzing the site: I used a packet-capture tool to examine every request the page makes. The results are stored in SQLite. This example only extracts the text of each news page; images are not parsed.

    For reference only; corrections and suggestions are welcome.
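
    Each list page responds with JSONP rather than plain HTML or JSON: the body is a data_callback([...]) call wrapping an array of article records. A rough sketch of the shape the parser below relies on (field values here are illustrative, not real data):

    data_callback([
        {
            'title': '...',                      # headline
            'tlink': 'http://news.163.com/...',  # article page URL
            'commenturl': '...',                 # comment page URL
            'tienum': 1234,                      # comment count
            'time': '...',                       # publish time
            ...
        },
        ...
    ])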

    # coding:utf-8

    import re
    import sqlite3

    import requests
    from bs4 import BeautifulSoup

    # Python 2 only: make utf-8 the default codec for the implicit
    # str/unicode conversions used throughout this script.
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    session = requests.session()

    def md5(text):
        # Hex MD5 digest; used below as a dedup key for article URLs.
        import hashlib
        m = hashlib.md5()
        m.update(text)
        return m.hexdigest()
    
    def wangyi():
        # The headline list is paginated: page 1 is cm_yaowen.js, later pages
        # are cm_yaowen_02.js, cm_yaowen_03.js, ... all returned as JSONP.
        for page in range(1, 3):
            if page == 1:
                k = ""
            else:
                k = "_0" + str(page)
            url = "http://temp.163.com/special/00804KVA/cm_yaowen" + k + ".js?callback=data_callback"
            print url
            headers = {
    
                "Host":"temp.163.com",
                "Connection":"keep-alive",
                "Accept":"*/*",
                "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 LBBROWSER",
                "Referer":"http://news.163.com/",
                "Accept-Encoding":"gzip, deflate, sdch",
                "Accept-Language":"zh-CN,zh;q=0.8",
    
            }
            result = session.get(url=url, headers=headers).text
            try:
                # Strip the JSONP wrapper "data_callback([...])" and evaluate the
                # remaining JS array literal (its strings are single-quoted, so
                # json.loads would reject it; eval of a fetched payload is a
                # known risk, kept here for simplicity).
                payload = re.sub(r'^\s*data_callback\(', '', result)
                payload = re.sub(r'\)\s*;?\s*$', '', payload)
                result1 = eval(payload)
            except Exception:
                continue  # skip this page rather than iterate an undefined result1
            try:
                for item in result1:
                    tlink = item['tlink']
                    headers2 = {
    
                            "Host":"news.163.com",
                            "Connection":"keep-alive",
                            "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
                            "Upgrade-Insecure-Requests":"1",
                            "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 LBBROWSER",
                            "Accept-Encoding":"gzip, deflate, sdch",
                            "Accept-Language":"zh-CN,zh;q=0.8",
    
                    }
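                    # Article pages are served from news.163.com, hence the
                    # different Host and Accept headers from the list request.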
                    print "tlinktlinktlinktlink",tlink
                    return_data = session.get(url=tlink,headers=headers2).text
                    try:
                        # The article body lives in <div id="endText">; collect
                        # the inner HTML of its plain and centered <p> tags and
                        # join the paragraphs with a '<-->' separator.
                        soup = BeautifulSoup(return_data, 'html.parser')
                        returnSoup = soup.find_all("div", attrs={"id": "endText"})[0]
                        print returnSoup
                        print "==============================="

                        returnList = re.findall('<p>(.*?)</p>', str(returnSoup))
                        content1 = '<-->'.join(returnList)

                        returnList1 = re.findall('<p class="f_center">(.*?)</p>', str(returnSoup))
                        content2 = '<-->'.join(returnList1)

                        content = content1 + content2

                    except Exception:
                        content = ""  # e.g. no endText div on this page
    
                    # Raw string: backslash sequences such as \x in this Windows
                    # path would otherwise be treated as string escapes.
                    cx = sqlite3.connect(r"C:\Users\xuchunlin\PycharmProjects\study\db.sqlite3", check_same_thread=False)
                    cx.text_factory = str
    
                    try:
                        print "inserting data for link %s" % tlink

                        title = (item['title']).decode('unicode_escape')
                        commenturl = item['commenturl']
                        tienum = item['tienum']
                        opentime = item['time']

                        print title
                        print tlink
                        print commenturl
                        print tienum
                        print opentime
                        print content

                        # MD5 of the article URL serves as a stable dedup key.
                        url2 = md5(str(tlink))

                        cx.execute("INSERT INTO wangyi (title,tlink,commenturl,tienum,opentime,content,url) VALUES (?,?,?,?,?,?,?)",
                                   (str(title), str(tlink), str(commenturl), str(tienum), str(opentime), str(content), str(url2)))

                    except Exception as e:
                        print e
                        print "insert failed"
                    cx.commit()
                    cx.close()
            except Exception:
                pass  # give up on this list page and move on to the next
    
    
    wangyi()
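
    The script assumes db.sqlite3 already contains a table named wangyi; the original post does not show the DDL. Below is a minimal one-off setup sketch whose column names come from the INSERT statement above (the TEXT types and the PRIMARY KEY choice are assumptions):

    # Setup sketch: create the table the crawler writes into.
    # Column names match the INSERT above; types are assumed.
    import sqlite3

    cx = sqlite3.connect(r"C:\Users\xuchunlin\PycharmProjects\study\db.sqlite3")
    cx.execute("""
        CREATE TABLE IF NOT EXISTS wangyi (
            title      TEXT,
            tlink      TEXT,
            commenturl TEXT,
            tienum     TEXT,
            opentime   TEXT,
            content    TEXT,
            url        TEXT PRIMARY KEY  -- md5(tlink), deduplicates articles
        )
    """)
    cx.commit()
    cx.close()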
  • Original post: https://www.cnblogs.com/xuchunlin/p/7097731.html