  • spark-submit: using Python eggs to resolve third-party dependencies

     Suppose your Spark job uses the third-party package purl (https://github.com/ultrabluewolf/p.url), which in turn depends on the third-party package future (six is not a problem, since Anaconda2 already ships it).

    The PySpark code is as follows:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("My test App")
    sc = SparkContext(conf=conf)

    def get_purl(x):
        # Import purl inside the function so the import is resolved on the
        # executors, where the egg shipped via --py-files is on the Python path.
        from purl import Purl
        url = Purl('https://github.com/search?q={}'.format(x))
        return str(url.add_query('name', 'dog'))

    int_rdd = sc.parallelize([1, 2, 3, 4])
    r = int_rdd.map(lambda x: get_purl(x))
    print(r.collect())
    

    The following shows how to build and package the eggs.

    Download the source from https://pypi.org/project/p.url/#files, extract it, and in the extracted directory run:

    python setup.py bdist_egg

    An egg file is generated in the dist directory.

    Similarly, download the source for future from https://pypi.org/project/future/#files, extract it, and build its egg file the same way.
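
    Before shipping the eggs to the cluster, it can help to sanity-check them on the driver machine. Pure-Python eggs are zip-importable, so putting them on sys.path is enough. Below is a minimal sketch; the egg paths are assumptions based on the filenames produced above, so adjust them to wherever bdist_egg placed the files.

    import sys

    # Hypothetical local check that the freshly built eggs import correctly.
    sys.path.insert(0, 'dist/p.url-0.1.0a4-py2.7.egg')
    sys.path.insert(0, 'dist/future-0.17.1-py2.7.egg')

    from purl import Purl  # resolved out of the egg via zipimport
    print(Purl('https://github.com/search?q=1').add_query('name', 'dog'))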

    Finally, run:

    spark-submit --py-files p.url-0.1.0a4-py2.7.egg,future-0.17.1-py2.7.egg main_dep.py

    The output is:

    ['https://github.com/search?q=1&name=dog', 'https://github.com/search?q=2&name=dog', 'https://github.com/search?q=3&name=dog', 'https://github.com/search?q=4&name=dog']
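
    As an alternative to listing the eggs with --py-files, the same eggs can be attached from the driver code with SparkContext.addPyFile, which ships them to the executors and puts them on the workers' Python path. A minimal sketch, assuming the two egg files sit next to the script:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("My test App")
    sc = SparkContext(conf=conf)

    # Ship the dependency eggs at runtime instead of via --py-files;
    # the relative paths are assumptions about where the eggs were copied.
    sc.addPyFile('p.url-0.1.0a4-py2.7.egg')
    sc.addPyFile('future-0.17.1-py2.7.egg')

    def get_purl(x):
        from purl import Purl
        return str(Purl('https://github.com/search?q={}'.format(x)).add_query('name', 'dog'))

    print(sc.parallelize([1, 2, 3, 4]).map(get_purl).collect())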
    

    As a supplement, here is the relevant passage from the official documentation, which, somewhat frustratingly, does not spell out the concrete steps:

    Complex Dependencies

    Some operations rely on complex packages that also have many dependencies. For example, the following code snippet imports the Python pandas data analysis library:

    def import_pandas(x):
        import pandas  # the import runs on the executor that processes x
        return x

    int_rdd = sc.parallelize([1, 2, 3, 4])
    int_rdd.map(lambda x: import_pandas(x)).collect()

    pandas depends on NumPy, SciPy, and many other packages. Although pandas is too complex to distribute as a *.py file, you can create an egg for it and its dependencies and send that to executors.

    Limitations of Distributing Egg Files

    In both self-contained and complex dependency scenarios, sending egg files is problematic because packages that contain native code must be compiled for the specific host on which they will run. When doing distributed computing with industry-standard hardware, you must assume that the hardware is heterogeneous. However, because of the required C compilation, a Python egg built on a client host is specific to the client CPU architecture. Therefore, distributing an egg for complex, compiled packages like NumPy, SciPy, and pandas often fails. Instead of distributing egg files, you should install the required Python packages on each host of the cluster and specify the path to the Python binaries for the worker hosts to use.
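
    In practice, that last recommendation means installing the heavy packages on every node (for example with a cluster-wide Anaconda install or pip on each host) and then telling Spark which interpreter the workers should use. Below is a minimal sketch of doing so from the driver; the interpreter path is an assumption, and the spark.pyspark.python config key requires Spark 2.1 or later (older versions rely on the PYSPARK_PYTHON environment variable alone).

    import os
    from pyspark import SparkConf, SparkContext

    # Assumed path to a Python installed identically on every node, with the
    # heavy packages (pandas, NumPy, SciPy, ...) already installed into it.
    WORKER_PYTHON = '/opt/anaconda2/bin/python'

    os.environ['PYSPARK_PYTHON'] = WORKER_PYTHON            # environment-variable route
    conf = (SparkConf()
            .setAppName("My test App")
            .set("spark.pyspark.python", WORKER_PYTHON))    # config route (Spark 2.1+)
    sc = SparkContext(conf=conf)

    # Executors now run WORKER_PYTHON, so "import pandas" works on the workers
    # without shipping any egg.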

     