def output_mapper(line):
    """Parse one (position, text) record read via GBKFileInputFormat.

    The input file is GBK encoded; Spark's GBKFileInputFormat converts each
    line to UTF-8 Text automatically. Keys are the position in the file and
    values are the line of text.

    Args:
        line: tuple of (position, "bidword sp tag_info").

    Returns:
        [bidword, sp, tag_info, theDate] on success, or None when the record
        cannot be parsed (malformed rows are logged and filtered out by the
        downstream .filter()).
    """
    try:
        # NOTE: `theDate` is a module-level value read here; `global` is only
        # required for assignment, so the original `global theDate` was removed.
        value = line[1]
        # Exactly three space-separated fields are expected; anything else
        # raises ValueError and the record is dropped.
        bidword, sp, tag_info = value.strip().split(' ')
        return [bidword, sp, tag_info, theDate]
    except Exception as e:
        # Fixed: the message previously referenced "add_date_mapper", which is
        # not this function's name — corrected so log lines point to the
        # actual failing mapper.
        logging.error("output_mapper error: {}".format(e))
        return None


# Read the GBK-encoded file with the custom input format, parse each record,
# drop unparsable rows, and convert the result to a DataFrame.
test_df = (
    sc.hadoopFile(
        test_file,
        "org.apache.spark.input.GBKFileInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
    )
    .map(output_mapper)
    .filter(lambda x: x is not None)
    .toDF()
)
Reference (Javadoc of GBKFileInputFormat):
/**
 * FileInputFormat for gbk encoded files. Files are broken into lines. Either linefeed
* or carriage-return are used to signal end of line. Keys are the position in the file,
* and values are the line of text and will be converted to UTF-8 Text.
*/