zoukankan      html  css  js  c++  java
  • python识别html主要文本框

    在抓取网页的时候只想抓取主要的文本框,例如 csdn 中的主要文本框为下图红色框:

    抓取的思想是,利用 bs4 查找所有的 div,用正则筛选出每个 div 里面的中文,找到中文字数最多的 div 就是属于正文的 div 了。定义一个抓取的头部抓取网页内容:

    import requests
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Host': 'blog.csdn.net'}
    session = requests.session()
     
    def getHtmlByRequests(url):
        headers.update(
            dict(Referer=url, Accept="*/*", Connection="keep-alive"))
        htmlContent = session.get(url=url, headers=headers).content
        return htmlContent.decode("utf-8", "ignore")
    

    识别每个 div 中文字的正则:

    import re
    # 统计中文字数
    def countContent(string):
        pattern = re.compile(u'[u1100-uFFFD]+?')
        content = pattern.findall(string)
        return content
    

    遍历每一个 div ,利用正则判断里面中文的字数长度,找到长度最长的 div

    # 分析页面信息
    def analyzeHtml(html):
        # 初始化网页
        soup = BeautifulSoup(html, "html.parser")
        part = soup.select('div')
        match = ""
        for paragraph in part:
            content = countContent(str(paragraph))
            if len(content) > len(match):
                match = str(paragraph)
        return match
    

    得到主要的 div 后,提取里面的文字出来:

    def main():
        url = "http://blog.csdn.net/"
        html = getHtmlByRequests(url)
        mainContent = analyzeHtml(html)
        soup = BeautifulSoup(mainContent, "html.parser")
        print(soup.select('div')[0].text)
    

    完整的代码如下:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    from bs4 import BeautifulSoup
    import requests
    import re
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Host': 'blog.csdn.net'}
    session = requests.session()
    
    
    def getHtmlByRequests(url):
        headers.update(
            dict(Referer=url, Accept="*/*", Connection="keep-alive"))
        htmlContent = session.get(url=url, headers=headers).content
        return htmlContent.decode("utf-8", "ignore")
    
    # 统计中文字数
    def countContent(string):
        pattern = re.compile(u'[u1100-uFFFD]+?')
        content = pattern.findall(string)
        return content
    
    # 分析页面信息
    def analyzeHtml(html):
        # 初始化网页
        soup = BeautifulSoup(html, "html.parser")
        part = soup.select('div')
        match = ""
        for paragraph in part:
            content = countContent(str(paragraph))
            if len(content) > len(match):
                match = str(paragraph)
        return match
    
    
    def main():
        url = "http://blog.csdn.net/"
        html = getHtmlByRequests(url)
        mainContent = analyzeHtml(html)
        soup = BeautifulSoup(mainContent, "html.parser")
        print(soup.select('div')[0].text)
    
    
    if __name__ == '__main__':
        main()
    
    
  • 相关阅读:
    httpclient 使用问题记录:org.apache.http.HttpException: Unsupported Content-Coding:GLZip
    Gitserver端密码变更,但是本地gitconfig配置未变更账号和密码问题解决
    线程池ThreadPoolExecutor学习
    Java 网络编程
    org.apache.ibatis.binding.BindingException: Invalid bound statement Mybatis绑定错误问题解决
    Java string类
    maven3.6.2 版本 在idea 2019.2.2下遇到的问题解决记录
    python
    django-URL与视图配置
    python 的datetime模块使用
  • 原文地址:https://www.cnblogs.com/TTyb/p/7723431.html
Copyright © 2011-2022 走看看