
Extracting CDATA data with Python and BeautifulSoup

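I have been playing with web scraping lately and ran into a page whose content sits inside a CDATA section. BeautifulSoup got stuck on it, and I am not good at writing regular expressions, so what to do?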

After some searching, I found a way to parse this kind of data.

import requests
from bs4 import BeautifulSoup, CData


def get_Response(_url):
    # Fetch the page and force UTF-8 so that response.text decodes correctly
    temp_response = requests.get(_url)
    temp_response.encoding = 'utf-8'
    return temp_response


response = get_Response('http://www.ninghai.gov.cn/col/col111591/index.html')
html = response.text
soup = BeautifulSoup(html, 'lxml')
# The table with class "btlb" is the node that wraps the CDATA block
msg = soup.find('table', attrs={'class': 'btlb'})
print(msg.text)

Here msg is the node that contains the CDATA block, and msg.text is its text content.

From there you can do one of the following.

Method 1

soup.find(text=lambda tag: isinstance(tag, CData)).string.strip()

However, if this approach returns garbled text, I don't know how to convert the encoding, so I use the second approach instead.
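On the garbled text: one common cause (an assumption here, since I did not check what this particular site sends) is that UTF-8 bytes were decoded as ISO-8859-1, which is what requests falls back to when a text response declares no charset. Under that assumption, the damage can be undone by reversing the wrong decode. The helper name repair_mojibake and the sample string below are made up for illustration.

```python
def repair_mojibake(garbled, wrong_encoding="iso-8859-1", real_encoding="utf-8"):
    """Undo a wrong decode: re-encode with the wrong codec, then decode with the real one."""
    return garbled.encode(wrong_encoding).decode(real_encoding)


# Simulate the problem: UTF-8 bytes decoded as ISO-8859-1 come out garbled.
original = "宁海县公告"  # hypothetical sample text
garbled = original.encode("utf-8").decode("iso-8859-1")

repaired = repair_mojibake(garbled)
print(repaired == original)
```

With requests it is usually simpler to avoid the problem up front by setting response.encoding (for example to 'utf-8', or to response.apparent_encoding) before reading response.text.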

Method 2

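# Walk every text node and keep only the CDATA ones
for cd in soup.find_all(string=True):
    if isinstance(cd, CData):
        print(cd)
        # The CDATA payload is itself HTML, so parse it a second time
        ss = BeautifulSoup(cd, 'lxml')
        print('--------')
        print(ss.text)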
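To try the second method without hitting the network, here is a minimal self-contained sketch. The sample HTML, the link texts and the function name extract_cdata_links are all invented for illustration; the real page will look different. It also uses the built-in html.parser instead of lxml, which is my substitution: BeautifulSoup turns CDATA sections into CData nodes with that parser, and I have not verified how lxml treats them on every kind of page.

```python
from bs4 import BeautifulSoup, CData

# Hypothetical sample that mimics a table whose links are wrapped in CDATA.
SAMPLE_HTML = """
<table class="btlb"><tr><td>
<![CDATA[<a href="/art/1.html" target="_blank">First notice</a><a href="/art/2.html" target="_blank">Second notice</a>]]>
</td></tr></table>
"""


def extract_cdata_links(html):
    """Return (text, href) pairs for every <a> found inside CDATA sections."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for node in soup.find_all(string=True):
        if isinstance(node, CData):
            # The CDATA payload is plain text to the outer parser,
            # so parse it again as its own HTML fragment.
            inner = BeautifulSoup(str(node), "html.parser")
            for a in inner.find_all("a"):
                links.append((a.get_text(strip=True), a.get("href")))
    return links


for text, href in extract_cdata_links(SAMPLE_HTML):
    print(text, href)
```

Parsing the payload a second time is the key step: the outer soup only sees the CDATA content as one string, so tags inside it are not searchable until they get their own BeautifulSoup object.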

Honestly, I think this is no better than using a regular expression, so if you are comfortable with regex, just use regex.
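For comparison, the regex route could look roughly like this. It is only a sketch: the sample HTML and the names CDATA_RE, LINK_RE, cdata_sections and links_in are made up, and the link pattern assumes simple double-quoted href attributes, which will not hold for every page.

```python
import re

# Everything between <![CDATA[ and the first ]]>, across line breaks.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
# A deliberately simple <a href="..."> matcher (assumes double-quoted href).
LINK_RE = re.compile(r'<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>', re.DOTALL)

# Hypothetical sample input.
SAMPLE_HTML = """
<table class="btlb"><tr><td>
<![CDATA[<a href="/art/1.html" target="_blank">First notice</a>
<a href="/art/2.html" target="_blank">Second notice</a>]]>
</td></tr></table>
"""


def cdata_sections(html):
    """Return the raw content of every CDATA section in html."""
    return [section.strip() for section in CDATA_RE.findall(html)]


def links_in(fragment):
    """Return (text, href) pairs for the links in an HTML fragment."""
    return [(text.strip(), href) for href, text in LINK_RE.findall(fragment)]


for section in cdata_sections(SAMPLE_HTML):
    for text, href in links_in(section):
        print(text, href)
```

The non-greedy (.*?) together with re.DOTALL keeps each match inside a single CDATA section even when the payload spans several lines.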

References:

https://stackoverflow.com/questions/34639623/using-beautifulsoup-to-extract-cdata?noredirect=1

https://stackoverflow.com/questions/2032172/how-can-i-grab-cdata-out-of-beautifulsoup


Original article: https://www.cnblogs.com/lingLuoChengMi/p/9473313.html