  • Fetching a Web Page's HTML Text with Python

    Python Crawler Basics

      1. Fetching page text

          Use the urllib2 package (Python 2) to fetch a page's HTML text by URL and return it.

    #coding:utf-8
    import sys
    import urllib2
    
    # Python 2 only: force the default string encoding to utf-8
    reload(sys)
    sys.setdefaultencoding("utf-8")
    
    def getHtml(url):
        # Open the URL and read the whole response body as a byte string
        response = urllib2.urlopen(url)
        html = response.read()
        # Optionally decode according to the page's encoding
        #html = unicode(html, 'utf-8')
        return html
    
    url = 'https://www.cnblogs.com/'
    print getHtml(url)
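    The commented-out unicode(html, 'utf-8') call above hard-codes the encoding. As a minimal sketch, the charset can instead be read from the response's Content-Type header when one is present. The helper name decode_html and its utf-8 fallback are assumptions made for illustration; they are not part of the original post.

```python
import re

def decode_html(raw, content_type=None, default="utf-8"):
    """Decode raw response bytes, preferring the charset named in Content-Type."""
    charset = default
    if content_type:
        # Matches e.g. 'text/html; charset=GBK' or 'charset="utf-8"'
        match = re.search(r"charset=[\"']?([\w.-]+)", content_type, re.I)
        if match:
            charset = match.group(1)
    try:
        # "replace" substitutes malformed bytes instead of raising
        return raw.decode(charset, "replace")
    except LookupError:
        # The server named a codec Python does not know; use the fallback
        return raw.decode(default, "replace")
```

    With a response object returned by urlopen(), this would typically be called as decode_html(response.read(), response.info().get("Content-Type")).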

    Alternatively, build a Request object first:

    def getHtml(url):
        # Instantiate urllib2.Request(), passing the target URL as its argument
        request = urllib2.Request(url)
        # Pass the Request to urlopen(), which sends it to the server
        # and returns a file-like response object
        response = urllib2.urlopen(request)
        # The file-like object supports file methods such as read(),
        # which returns the entire response body as a string
        html = response.read()
        # Optionally decode according to the page's encoding
        #html = unicode(html, 'utf-8')
        return html
        
    url = 'https://www.cnblogs.com/'
    print getHtml(url)
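    A Request object can be inspected before it is sent, which makes it easier to see what urlopen() will do with it. The sketch below sends nothing over the network. The fallback import is an assumption added so the snippet also runs on Python 3, where the same classes live in urllib.request.

```python
try:
    import urllib2 as urlrequest          # Python 2
except ImportError:
    import urllib.request as urlrequest   # Python 3 home of the same classes

url = "https://www.cnblogs.com/"

# Without a data argument the request is a GET
get_request = urlrequest.Request(url)
print(get_request.get_full_url())   # https://www.cnblogs.com/
print(get_request.get_method())     # GET

# Supplying data turns the same call into a POST
post_request = urlrequest.Request(url, data=b"key=value")
print(post_request.get_method())    # POST
```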

    Next, add a User-Agent (UA) and a timeout:

    def getHtml(url):
        # Build the UA header
        ua_header = {"User-Agent":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
        # Build the Request from the URL plus headers; it will carry an IE 9.0 User-Agent
        request = urllib2.Request(url,headers=ua_header)
        # Set the timeout, in seconds
        response = urllib2.urlopen(request,timeout=60)
        html = response.read()
        return html
        
    url = 'https://www.cnblogs.com/'
    print getHtml(url)
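    One way to organize the UA and timeout handling is to split request construction from the network call, so the headers can be checked without touching the network. This is a sketch under assumptions: the names IE9_UA, build_request and get_html are hypothetical, and the fallback import is there only so the snippet also runs on Python 3.

```python
try:
    import urllib2 as urlrequest          # Python 2
except ImportError:
    import urllib.request as urlrequest   # Python 3 home of the same classes

# A well-formed IE 9 User-Agent string
IE9_UA = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

def build_request(url, user_agent=IE9_UA):
    """Return a Request carrying the given User-Agent; nothing is sent yet."""
    return urlrequest.Request(url, headers={"User-Agent": user_agent})

def get_html(url, timeout=60):
    """Fetch the page body, giving up if the server is silent for `timeout` seconds."""
    response = urlrequest.urlopen(build_request(url), timeout=timeout)
    try:
        return response.read()
    finally:
        # Release the underlying socket even if read() fails
        response.close()
```

    Calling get_html('https://www.cnblogs.com/') would then behave like the getHtml() function above, with the response closed explicitly afterwards.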

    Add header fields:

    def getHtml(url):
        ua = {"User-Agent":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
        # Pass the UA dict so that it is actually sent with the request
        request = urllib2.Request(url,headers=ua)
        # Request.add_header() adds or overrides one specific header
        request.add_header("Connection","keep-alive")
        response = urllib2.urlopen(request)
        html = response.read()
        # Inspect the response status code
        print 'Response code:',response.code
        # Request.get_header() reads a header back from the Request
        print "Connection:",request.get_header("Connection")
        # Or, with a keyword argument
        print request.get_header(header_name = "Connection")
        #print html
        return html
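    One detail worth knowing when reading headers back: Request stores each header name via str.capitalize(), while get_header() looks the name up exactly as given. "Connection" is unaffected, but "User-Agent" is stored as "User-agent". The sketch below shows the effect without sending anything. The helper get_header_ci is a hypothetical name, and the fallback import is an assumption so the snippet also runs on Python 3.

```python
try:
    import urllib2 as urlrequest          # Python 2
except ImportError:
    import urllib.request as urlrequest   # Python 3 home of the same classes

def get_header_ci(request, name, default=None):
    """Look up a header using the same capitalization Request applies on storage."""
    return request.get_header(name.capitalize(), default)

request = urlrequest.Request("https://www.cnblogs.com/")
request.add_header("User-Agent", "demo-agent/1.0")
request.add_header("Connection", "keep-alive")

print(request.get_header("User-Agent"))       # None: stored under "User-agent"
print(request.get_header("User-agent"))       # demo-agent/1.0
print(get_header_ci(request, "User-Agent"))   # demo-agent/1.0
print(request.get_header("Connection"))       # keep-alive
```

    get_header() reports what is stored on the Request object. The library's HTTP handler sets Connection: close on the outgoing request itself, so the stored keep-alive value is not necessarily what the server receives.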

    Add a random UA:

    #coding:utf-8
    import sys
    import urllib2
    import random
    
    
    # Python 2 only: force the default string encoding to utf-8
    reload(sys)
    sys.setdefaultencoding("utf-8")
    
    def getHtml(url):
        # UA pool; one entry is picked at random on each call
        ua_list = [
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
            "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        ]
        user_agent = random.choice(ua_list)
        #print user_agent
        request = urllib2.Request(url)
        request.add_header("Connection","keep-alive")
        request.add_header("User-Agent",user_agent)
        response = urllib2.urlopen(request,data=None,timeout=60)
        html = response.read()
        #print 'Response code:',response.code
        #print 'URL:',response.geturl()
        #print 'Info:',response.info()
        return html
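    The random-UA idea can also be checked offline by separating the choice of UA from the network call. The sketch below is one possible arrangement under assumptions: UA_POOL, build_request and the rng parameter are hypothetical names, the pool is shortened to two entries, and the fallback import exists only so the snippet also runs on Python 3.

```python
import random

try:
    import urllib2 as urlrequest          # Python 2
except ImportError:
    import urllib.request as urlrequest   # Python 3 home of the same classes

# Shortened UA pool for illustration
UA_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
]

def build_request(url, rng=random):
    """Build a Request whose User-Agent is drawn at random from UA_POOL.

    rng can be a seeded random.Random instance to make the choice repeatable.
    """
    request = urlrequest.Request(url)
    request.add_header("Connection", "keep-alive")
    request.add_header("User-Agent", rng.choice(UA_POOL))
    return request

# Nothing is sent here; the request is only constructed and inspected
demo = build_request("https://www.cnblogs.com/", random.Random(0))
print(demo.get_header("User-agent") in UA_POOL)   # True
```

    Passing the resulting request to urlopen(request, timeout=60) would then perform the fetch as in the function above.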
  • Original article: https://www.cnblogs.com/Jims2016/p/8440517.html