Cleaning up reposted blog HTML with Python

Background

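When reposting someone else's blog post, we usually copy the HTML and paste it into an editor. However, the HTML often contains a lot of extra markup, such as script and svg tags, which breaks the layout.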

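For example, one layout problem is caused by the ul tag, and another by the svg tag.

Of course, you could simply delete the unwanted parts by hand. But as a programmer, anything the computer can do is something I would rather not do myself, which leads to the next step.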

Implementation

The code is written in Python 3 because Python has BeautifulSoup, which handles HTML files well, for example by removing specified tags.

Analyzing the causes of the layout problems

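Numbers appear below the code lines because of the <ul> tag.
The beginning of the post displays incorrectly because of comments and the <svg> tag.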
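Before removing anything, one way to confirm which elements are present is to count them. The sketch below runs on a small inline sample (`sample_html` is hypothetical, not taken from the original page), and `count_suspects` is an illustrative helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample that imitates the kind of markup a copied post may contain.
sample_html = """
<div class="post">
  <!-- leftover comment -->
  <svg><path d="M0 0"></path></svg>
  <pre><code>print("hi")</code></pre>
  <ul><li>1</li></ul>
  <script>var x = 1;</script>
</div>
"""


def count_suspects(html):
    """Count the tags and comments that this post blames for layout problems."""
    soup = BeautifulSoup(html, "html.parser")
    counts = {name: len(soup.find_all(name)) for name in ("ul", "svg", "script")}
    # Comments are text nodes of type Comment, so they are matched by type.
    counts["comment"] = len(soup.find_all(string=lambda s: isinstance(s, Comment)))
    return counts


print(count_suspects(sample_html))
```

If every count is zero for a given page, the layout problem probably has a different cause and the removal step would not help.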

How to remove specified tags and comments

# Remove every ul tag (soup is an already parsed BeautifulSoup object)
for s in soup("ul"):
    s.extract()
# Remove every svg tag
for s in soup("svg"):
    s.extract()
# Remove every script tag
for s in soup("script"):
    s.extract()
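The fragment above assumes an existing `soup` object and covers only tags. As a minimal self-contained sketch, the following applies the same removal to a short inline string and also strips comments. The `html` value is a made-up input used only for demonstration.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical input, used only to demonstrate the removal calls.
html = (
    "<div><!-- note --><p>Body text</p>"
    "<ul><li>1</li></ul><svg></svg><script>alert(1)</script></div>"
)

soup = BeautifulSoup(html, "html.parser")

# Remove the unwanted tags together with everything inside them.
for tag in soup(["ul", "svg", "script"]):
    tag.extract()

# Comments are text nodes of type Comment, so they are matched by type.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

cleaned = str(soup)
print(cleaned)
```

Note that removing every ul tag also removes any genuine bulleted lists in the article body, so this approach fits posts whose content does not rely on them.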

Python code

# Given a URL, fetch the post and print its cleaned HTML body,
# ready to paste into a Markdown editor
import requests
from bs4 import BeautifulSoup, Comment


def get_page_source(url):
    """Return the page HTML as text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return None


if __name__ == '__main__':

    blogUrl = "https://blog.csdn.net/qq_36124194/article/details/83686823"

    # blogUrl = input("Enter the URL of the post to repost:\n")

    blogText = get_page_source(blogUrl)
    if blogText is None:
        raise SystemExit("Failed to fetch " + blogUrl)

    soup = BeautifulSoup(blogText, 'html.parser')

    # Remove the ul, svg and script tags
    for s in soup(["ul", "svg", "script"]):
        s.extract()
    # Remove comments
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        comment.extract()
    # Get the article body
    articleText = soup.find('div', attrs={'class': 'markdown_views prism-atom-one-dark'})
    # Prepend a note stating where the post was reposted from
    finalStr = "## Reposted from   \n" + "## " + blogUrl + "  \n" + str(articleText)

    print(finalStr)
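The full class string passed to `soup.find` matches only when the page's class attribute is exactly that text. As one possible alternative, the sketch below wraps the cleanup in a function and selects the body with a CSS selector through `select_one`, which can match on a single class name. The function name `clean_article`, its default selector, and the sample page are assumptions for illustration. The selector that fits a given blog platform has to be checked against the real page.

```python
from bs4 import BeautifulSoup, Comment


def clean_article(html, selector="div.markdown_views", source_url=None):
    """Strip ul/svg/script tags and comments, then return the selected body as HTML.

    Returns None when the selector matches nothing.
    """
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["ul", "svg", "script"]):
        tag.extract()
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    body = soup.select_one(selector)
    if body is None:
        return None
    header = "" if source_url is None else "## Source\n## " + source_url + "\n"
    return header + str(body)


# Hypothetical page fragment; a real page would come from get_page_source(url).
page = (
    "<html><body><script>x()</script>"
    '<div class="markdown_views prism-atom-one-dark">'
    "<!-- c --><p>Hello</p><svg></svg></div>"
    "</body></html>"
)
result = clean_article(page, source_url="https://example.com/post")
print(result)
```

Returning None for a missing body lets the caller report the problem instead of printing the literal text "None" into the reposted article.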

Original article: https://www.cnblogs.com/qq874455953/p/10264451.html