  • Garbled Chinese text when scraping web pages with Python

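    When I scrape a web page with Python while imitating a browser (by sending a browser User-Agent header), the Chinese characters I get back are garbled. How can I fix this?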

    import codecs
    import re
    import urllib2

    import BeautifulSoup  # BeautifulSoup 3 (Python 2)

    url = "http://newhouse.hfhouse.com/"
    req = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0"})
    reqHtml = urllib2.urlopen(req).read()
    #print reqHtml
    songtasteHtmlEncoding = 'utf-8'
    soup = BeautifulSoup.BeautifulStoneSoup(reqHtml, fromEncoding=songtasteHtmlEncoding)
    #print soup
    re_h = re.compile(r'</?\w+[^>]*>')  # matches any HTML tag
    finda = soup.findAll('a', {"class": "area_list"})
    s = len(finda)
    i = 0
    while i < s:
        # Text of the link with its tags stripped
        quyuz = re_h.sub('', str(finda[i])).strip()
        try:
            quyu = quyuz.decode('utf-8').encode('gbk')
        except:
            if quyuz[:3] == codecs.BOM_UTF8:
                quyu = quyuz[3:]
                print quyu.decode("utf-8").encode('gbk')
        #quyu = quyu.decode('utf-8').encode('gbk')
        #number = int(filter(str.isdigit, quyuz))
        #dir2 = make_dir(dir1,quyu)
        value = finda[i]['val']
        houseid = finda[i]['href']
        print houseid, value, quyu
        i += 1  # advance to the next link; the loop never ends without this

    It always raises UnicodeEncodeError: 'gbk' codec can't encode character u'\xe7' in position 0: illegal multibyte sequence. The page's <head> declares the encoding as utf-8. What should I do?
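
    A note on the error message itself: the character that GBK rejects is u'\xe7', which is not a Chinese character but the Latin letter c-cedilla. 0xE7 is also the first byte of many Chinese characters in UTF-8, so one plausible reading (an assumption; it cannot be confirmed from the post alone) is that the UTF-8 bytes were at some point decoded byte for byte as if they were Latin-1. The sketch below reproduces that situation on a made-up sample string and shows two possible workarounds: reversing the wrong decode, and encoding for a GBK console with replacement instead of an exception. The names repair_mojibake and to_gbk_bytes and the sample text are hypothetical and do not come from the code above. Only built-in string methods are used, so it should run on both Python 2.7 and Python 3.

```python
def repair_mojibake(text):
    """Undo a wrong Latin-1 decode of UTF-8 bytes.

    Returns the text unchanged when it does not look like that kind of
    damage (it cannot be encoded as Latin-1, or the resulting bytes are
    not valid UTF-8).
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def to_gbk_bytes(text):
    """Encode text for a GBK console, replacing unencodable characters with '?'."""
    return text.encode('gbk', 'replace')


# Hypothetical sample: the Chinese word for "example" (two characters),
# written with escapes so the file needs no encoding declaration.
sample = u'\u793a\u4f8b'

raw = sample.encode('utf-8')   # bytes as they would arrive from the server
good = raw.decode('utf-8')     # decoded once, with the right codec
bad = raw.decode('latin-1')    # decoded with the wrong codec; starts with u'\xe7'

repaired = repair_mojibake(bad)    # equal to good again
console_bytes = to_gbk_bytes(good) # safe to write to a GBK terminal
```

    If repair_mojibake gives back readable text, the real fix belongs earlier, at the point where the bytes are first decoded: decoding reqHtml once as UTF-8 and keeping unicode objects until the final output avoids the round trip altogether.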

  • Original article: https://www.cnblogs.com/vampirejt/p/python1.html