python3: garbled output when scraped content contains Chinese

Requirement: the user enters the title of a movie they like, and the program scrapes the matching download links from 电影天堂 (https://www.ygdy8.com) and prints them.

Problem: the retrieved magnet links contain Chinese characters, and they come out garbled when printed.
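The garbling can be reproduced without any network access. The sketch below assumes the page bytes are GBK-encoded (an assumption suggested by the script encoding its search keyword as GBK) and shows what happens when those bytes are decoded as ISO-8859-1 instead:

```python
# Simulate the bytes a GBK-encoded page would send for a Chinese title.
title = '沉睡魔咒'
raw = title.encode('gbk')

# ISO-8859-1 maps every byte to some character, so decoding with it never
# raises an error; the text is silently garbled instead.
garbled = raw.decode('iso-8859-1')

# Decoding with the codec the bytes were written in restores the text.
restored = raw.decode('gbk')

print(garbled == title)   # False
print(restored == title)  # True
```

Because the wrong decode succeeds silently, the only visible symptom is the unreadable output.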

Solution: specify the encoding manually:

if res.encoding == 'ISO-8859-1':
    encodings = requests.utils.get_encodings_from_content(res.text)
    if encodings:
        encoding = encodings[0]
    else:
        encoding = res.apparent_encoding
else:
    encoding = res.encoding
encode_content = res.content.decode(encoding, 'replace').encode('utf-8', 'replace')
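The selection logic in the snippet above can be sketched with the standard library alone, which makes it easy to try offline. Here `pick_encoding` is a hypothetical helper, not part of requests. Its regular expression only approximates what `requests.utils.get_encodings_from_content` looks for, and a fixed fallback stands in for `res.apparent_encoding`, since the standard library has no encoding detector:

```python
import re

def pick_encoding(header_encoding, body, fallback='utf-8'):
    """Choose a codec for `body` (bytes), mirroring the branch above."""
    # Trust the header unless it is the ISO-8859-1 default or missing.
    if header_encoding and header_encoding.upper() != 'ISO-8859-1':
        return header_encoding
    # Otherwise look for a charset declared in a <meta> tag of the HTML.
    match = re.search(rb'<meta[^>]*charset=["\']?([\w-]+)', body, re.IGNORECASE)
    if match:
        return match.group(1).decode('ascii')
    # No declaration found: use the fallback (the original uses detection here).
    return fallback

# A made-up page that declares gb2312 in its <meta> tag.
page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=gb2312"></head><body>'
        + '沉睡魔咒'.encode('gb2312') + b'</body></html>')

encoding = pick_encoding('ISO-8859-1', page)
text = page.decode(encoding, 'replace')
print(encoding)
print('沉睡魔咒' in text)
```

The header value is checked first because a charset explicitly sent by the server is usually the most reliable signal; the `<meta>` lookup only runs when the header is missing or is the default.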

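# Goal: the user enters the title of a movie they like, and the program scrapes the
# matching download links from 电影天堂 (https://www.ygdy8.com) and prints them

import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

# Send a browser-like User-Agent so the request is less likely to be rejected by anti-scraping checks
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 OPR/65.0.3467.78 (Edition Baidu)'}

# Fetch the magnet link from a movie's detail page
def getMovieDownloadLink(filmlink):
    res = requests.get(filmlink, headers=headers)
    if res.status_code == 200:

        # Fix for garbled Chinese in the response:
        # requests reports 'ISO-8859-1' when the Content-Type header carries no charset.
        # In that case, look for a charset declared inside the HTML itself (e.g. a <meta> tag);
        # if none is declared, fall back to the encoding detected from the response body.
        if res.encoding == 'ISO-8859-1':
            encodings = requests.utils.get_encodings_from_content(res.text)
            if encodings:
                encoding = encodings[0]
            else:
                encoding = res.apparent_encoding
        else:
            encoding = res.encoding
        encode_content = res.content.decode(encoding, 'replace').encode('utf-8', 'replace')

        # The bytes are now UTF-8, so tell BeautifulSoup not to guess from the page's <meta> tag
        soup = BeautifulSoup(encode_content, 'html.parser', from_encoding='utf-8')
        Zoom = soup.select_one('#Zoom')
        table = Zoom.find('table') if Zoom else None
        link = table.find('a') if table else None
        if link is None:
            print('No download link found on {}'.format(filmlink))
            return
        fileurl = link.text
        print(fileurl)
        # Write as UTF-8 explicitly so the Chinese text is saved correctly on any platform
        with open('./17-电影天堂磁力.txt', 'a', encoding='utf-8', newline='') as file:
            file.write(fileurl + '\n')

    else:
        print('Request for movie page {} failed!'.format(filmlink))

def main():
    dyurl = 'https://www.ygdy8.com'
    # movie = input('Enter a movie title: ')
    movie = '沉睡魔咒'
    # Percent-encode the keyword from its GBK bytes, as the original script does for this site's search
    url = 'http://s.ygdy8.com/plus/s0.php?typeid=1&keyword={0}'.format(quote(movie.encode('gbk')))
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        htmltext = res.text
        soup = BeautifulSoup(htmltext, 'html.parser')
        co_content8 = soup.find('div', class_='co_content8')
        ul = co_content8.find('ul') if co_content8 else None
        tables = ul.find_all('table') if ul else []
        if not tables:
            print('No matching resources found; try searching on the site: {0}'.format(dyurl))
        else:
            for table in tables:
                filmlink = dyurl + table.find('a')['href']
                getMovieDownloadLink(filmlink)

    else:
        print('Request failed!')

if __name__ == '__main__':
    main()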
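The search URL in `main()` depends on the keyword being percent-encoded from GBK bytes rather than UTF-8. A minimal standalone sketch of the difference, using only `urllib.parse` (that the site's search expects GBK is taken from the script above, not verified independently):

```python
from urllib.parse import quote, unquote

keyword = '沉睡魔咒'

# Percent-encode using GBK bytes, as the script does for the search URL.
gbk_encoded = quote(keyword, encoding='gbk')

# The default for str input is UTF-8, which yields a different string.
utf8_encoded = quote(keyword)

# Decoding with the matching codec recovers the original keyword.
round_trip = unquote(gbk_encoded, encoding='gbk')

print(gbk_encoded)
print(utf8_encoded)
print(round_trip == keyword)  # True
```

`quote(keyword, encoding='gbk')` and `quote(keyword.encode('gbk'))` produce the same result; both avoid `pathname2url`, which is meant for file paths rather than query strings.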

References:

https://blog.csdn.net/guoxinian/article/details/82978067

http://blog.csdn.net/a491057947/article/details/47292923

http://docs.python-requests.org/en/latest/user/quickstart/#response-content

Original article: https://www.cnblogs.com/KeenLeung/p/12160712.html