  • Garbled Chinese text when scraping web pages

    Source code

    # -*- coding: utf-8 -*-
    """
    Created on Tue Mar 15 08:53:08 2016
    Scraper for the chemical-industry standards backfill project
    @author: Administrator
    """
    import requests, bs4
    text = open("hb.txt", 'w', encoding='utf-8')
    webpage = "http://www.bzwxw.com/html/2016/1988_0116/9.html"
    res = requests.get(webpage)
    # True when the request succeeded
    res.status_code == requests.codes.ok

    # All of the Chinese text comes out garbled
    res.text

    # from_encoding only applies to bytes input, so it is ignored for res.text
    #soup1=bs4.BeautifulSoup(res.text,"lxml",from_encoding="gb18030")
    soup1 = bs4.BeautifulSoup(res.text, "lxml")

    elems = soup1.select('title')
    len(elems)
    content = elems[0].getText()

    #text.write("hello")
    text.write(content)

    text.close()

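    The title that bs4 extracts comes out garbled.

    Viewing the page source shows charset=gbk, a Chinese encoding, so requests is probably decoding the response with the wrong codec.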
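    To see why a wrong codec produces garbage, here is a minimal offline sketch using only the standard library. The sample bytes stand in for res.content, and the helper name sniff_meta_charset is hypothetical. The ISO-8859-1 decode is an assumption about what happened here: requests falls back to that codec for text responses whose HTTP header names no charset, but the header this server actually sent is not shown in the article.

```python
import re


def sniff_meta_charset(raw: bytes):
    """Return the charset named in the HTML source, lowercased, or None."""
    match = re.search(
        rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', raw[:4096], re.IGNORECASE
    )
    return match.group(1).decode("ascii").lower() if match else None


# Hypothetical stand-in for res.content: a small page encoded as GBK.
raw = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=gbk">'
    '<title>化工标准</title></head></html>'
).encode("gbk")

# Decoding with the wrong codec never fails, it just yields mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the charset the page declares recovers the text.
declared = sniff_meta_charset(raw)
right = raw.decode(declared)

print(declared)
print("化工标准" in wrong, "化工标准" in right)
```

    With requests itself, comparing res.encoding against the charset in the page source is a quick way to confirm the mismatch before overriding it.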

    The fix is one added line, res.encoding = 'gbk':

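    # -*- coding: utf-8 -*-
    """
    Created on Tue Mar 15 08:53:08 2016
    Scraper for the chemical-industry standards backfill project
    @author: Administrator
    """
    import requests, bs4
    text = open("hb.txt", 'w', encoding='utf-8')
    webpage = "http://www.bzwxw.com/html/2016/1988_0116/9.html"
    res = requests.get(webpage)
    # Tell requests which codec to use before res.text is read
    res.encoding = 'gbk'
    # True when the request succeeded
    res.status_code == requests.codes.ok

    # The Chinese text was garbled here before res.encoding was set
    res.text

    # from_encoding only applies to bytes input, so it is ignored for res.text
    #soup1=bs4.BeautifulSoup(res.text,"lxml",from_encoding="gb18030")
    soup1 = bs4.BeautifulSoup(res.text, "lxml")

    elems = soup1.select('title')
    len(elems)
    content = elems[0].getText()

    #text.write("hello")
    text.write(content)

    text.close()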

    The output is now correct, and the Chinese text written to the txt file displays correctly as well.
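    The decode-then-write step can also be checked without touching the network. The sketch below is an illustration under assumptions, not the original script: the sample bytes stand in for res.content, the file goes to a temporary directory, and gb18030 (the codec named in the commented-out line of the listing) is used because it decodes ordinary GBK text the same way while covering more characters.

```python
import os
import tempfile

# Hypothetical stand-in for res.content: a page title encoded as GBK.
raw = "化工标准补录".encode("gbk")

# gb18030 handles ordinary GBK text identically and covers more characters.
content = raw.decode("gb18030")

# Write as UTF-8; the with block closes the file even if write() raises.
path = os.path.join(tempfile.mkdtemp(), "hb.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write(content)

# Reading back with the same encoding should return the original string.
with open(path, encoding="utf-8") as saved:
    round_trip = saved.read()

print(round_trip == content)
```

    Opening the file with an explicit encoding='utf-8', as the script already does, keeps the result independent of the platform's default codec.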

  • Original article: https://www.cnblogs.com/webRobot/p/5278179.html