  • Getting the JavaScript window object from Python

    Python dependencies

    pip install PyExecJS
    pip install lxml
    pip install beautifulsoup4
    pip install requests

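    Node.js dependencies

    Install jsdom globally with either command:

    npm install jsdom -g
    or
    yarn global add jsdom

    Once jsdom is installed, the following code runs correctly in Node: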

    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;

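    With jsdom installed globally, the code above works when run directly in Node. When the same code is run from Python through execjs, however, a globally installed jsdom is not found by default.
    The call then fails with the following error, saying that jsdom cannot be found: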

    execjs._exceptions.ProgramError: Error: Cannot find module 'jsdom'

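    There are two ways to fix this:
    1. Install jsdom with npm in the directory the Python script runs from.
    2. Pass the cwd argument to point at the directory that contains the module. For example, with jsdom installed globally, running npm root -g in cmd prints the global module path: C:\Users\w001\AppData\Roaming\npm\node_modules
    The code can then be written as follows: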

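    import execjs

    # Read the JavaScript source to run (placeholder file name)
    with open(r'your_script.js', 'r', encoding='utf-8') as f:
        js = f.read()
    # cwd points at the global node_modules directory so require("jsdom") resolves
    ct = execjs.compile(js, cwd=r'C:\Users\w001\AppData\Roaming\npm\node_modules')
    print(ct.call('Rohr_Opt.reload', '1'))
    print(ct.eval("window.pageData"))  # eval on the compiled context, not on the source string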
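
    Rather than hard-coding the path, the global module directory can be queried at run time. The following is a minimal sketch, assuming npm is on PATH; global_node_modules and its run parameter are hypothetical names introduced here for illustration, not part of execjs.

```python
import shutil
import subprocess


def global_node_modules(run=subprocess.run):
    """Return the directory printed by `npm root -g`.

    `run` is injectable (hypothetical parameter) so the helper can be
    exercised without a real npm installation.
    """
    # shutil.which resolves npm.cmd on Windows; fall back to the bare name.
    npm = shutil.which("npm") or "npm"
    result = run([npm, "root", "-g"], capture_output=True, text=True, check=True)
    # npm prints the path followed by a newline.
    return result.stdout.strip()
```

    The result can then be passed along as execjs.compile(js, cwd=global_node_modules()).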

    A Python crawler example

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    # @Author: Irving Shi
    
    import execjs
    import json
    import requests
    from bs4 import BeautifulSoup
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
    }
    
    
    def get_company(key):
        res = requests.get("https://aiqicha.baidu.com/s?q=" + key, headers=headers)
        soup = BeautifulSoup(res.text, features="lxml")
        tag = soup.find_all("script")[2].decode_contents()
        tag = """const jsdom = require("jsdom");
        const { JSDOM } = jsdom;
        const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
        window = dom.window;
        document = window.document;
        XMLHttpRequest = window.XMLHttpRequest; """ + tag
        js = execjs.compile(tag, cwd=r'C:\Users\Administrator\AppData\Roaming\npm\node_modules')
    
        res = js.eval("window.pageData").get("result").get("resultList")[0]
        return res
    
    
    res = get_company("91360000158304717T")
    # for i in res.items():
    #     print(i)
    
    pid = res.get("pid")
    r = requests.get("https://aiqicha.baidu.com/detail/basicAllDataAjax?pid=" + pid, headers=headers)
    data = json.loads(r.text).get("data").get("basicData")
    for i in data.items():
        print(i)
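
    The example above picks the page script by position (find_all("script")[2]), which stops working if the page adds or reorders its script tags. Below is a minimal sketch of selecting the script by content instead. build_source, JSDOM_PRELUDE, and the marker parameter are hypothetical names, and the sketch assumes the target script contains the literal text window.pageData.

```python
# Same jsdom setup that the crawler prepends to the page script.
JSDOM_PRELUDE = """const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""


def build_source(scripts, marker="window.pageData"):
    """Return the jsdom prelude followed by the first script mentioning marker."""
    for script in scripts:
        if marker in script:
            return JSDOM_PRELUDE + script
    # Fail loudly instead of silently evaluating the wrong script.
    raise ValueError(f"no script contains {marker!r}")
```

    In the crawler this could replace the index lookup, for example: tag = build_source(s.decode_contents() for s in soup.find_all("script")).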

    Running JavaScript through Python's execjs can raise this error:

    UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 41: illegal multibyte sequence

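    The cause is an encoding mismatch: subprocess decodes the child process output with the system locale encoding (gbk here), and the output contains bytes that are not valid gbk. The direct fix used here is to change the default value of the encoding parameter in the __init__ constructor of the Popen class in subprocess.py to utf-8.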

    Before

        _child_created = False  # Set here since __del__ checks it
    
        def __init__(self, args, bufsize=-1, executable=None,
                     stdin=None, stdout=None, stderr=None,
                     preexec_fn=None, close_fds=_PLATFORM_DEFAULT_CLOSE_FDS,
                     shell=False, cwd=None, env=None, universal_newlines=False,
                     startupinfo=None, creationflags=0,
                     restore_signals=True, start_new_session=False,
                     pass_fds=(), *, encoding=None, errors=None):
            """Create new Popen instance."""
            _cleanup()
            # Held while anything is calling waitpid before returncode has been
            # updated to prevent clobbering returncode if wait() or poll() are
            # called from multiple threads at once.  After acquiring the lock,
            # code must re-check self.returncode to see if another thread just
            # finished a waitpid() call.
            self._waitpid_lock = threading.Lock()

    After

        _child_created = False  # Set here since __del__ checks it
    
        def __init__(self, args, bufsize=-1, executable=None,
                     stdin=None, stdout=None, stderr=None,
                     preexec_fn=None, close_fds=_PLATFORM_DEFAULT_CLOSE_FDS,
                     shell=False, cwd=None, env=None, universal_newlines=False,
                     startupinfo=None, creationflags=0,
                     restore_signals=True, start_new_session=False,
                     pass_fds=(), *, encoding="utf-8", errors=None):
            """Create new Popen instance."""
            _cleanup()
            # Held while anything is calling waitpid before returncode has been
            # updated to prevent clobbering returncode if wait() or poll() are
            # called from multiple threads at once.  After acquiring the lock,
            # code must re-check self.returncode to see if another thread just
            # finished a waitpid() call.
            self._waitpid_lock = threading.Lock()

    Because this modifies the standard library source, it is best done inside a virtual environment (venv):

    pip install virtualenv
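
    A different approach, not the one used in this post, is to leave subprocess.py untouched and override the default at run time. The sketch below is offered under assumptions: Utf8Popen is a hypothetical name, and it assumes execjs picks up Popen from the subprocess module at import time, so the patch has to run before import execjs.

```python
import subprocess


class Utf8Popen(subprocess.Popen):
    """Popen that decodes child output as UTF-8 unless told otherwise."""

    def __init__(self, *args, **kwargs):
        # Only fill in the default; an explicit encoding= from the caller wins.
        kwargs.setdefault("encoding", "utf-8")
        super().__init__(*args, **kwargs)


# Assumption: this must run before `import execjs`, so that execjs binds
# the patched class when it imports Popen from subprocess.
subprocess.Popen = Utf8Popen
```

    Note that this switches every Popen call in the process to text mode, so code that expects bytes from a child process would need to be checked before relying on it.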
  • Original article: https://www.cnblogs.com/shizhengwen/p/14092614.html