Fixing ValueError: Some of types cannot be determined by the first 100 rows

When converting an RDD to a DataFrame in Spark, you may sometimes get: ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling

Cause

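The internal structure of an RDD's elements is unknown to Spark: it does not know which fields each element contains or what type each field has. A DataFrame, by contrast, requires full knowledge of that structure, so when no schema is supplied Spark has to infer one from the data.

By default it inspects the first 100 rows. If the type of some field still cannot be determined from those rows (typically because the field is None in every one of them), this error is raised.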
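To make the failure mode concrete, here is a minimal pure-Python sketch of the idea. It is a simplified model, not Spark's actual implementation: the function name infer_column_types, the dict-based rows and the sample data are all made up for illustration.

```python
def infer_column_types(rows, limit=100):
    """Simplified model of head-based inference (NOT Spark's real code).

    Looks only at the first `limit` rows and records, per column,
    the type name of the first non-None value it sees.
    """
    inferred = {}
    for row in rows[:limit]:
        for name, value in row.items():
            if inferred.get(name) is None:
                inferred[name] = None if value is None else type(value).__name__
    unresolved = sorted(name for name, t in inferred.items() if t is None)
    if unresolved:
        raise ValueError(
            f"cannot determine types of {unresolved} from the first {limit} rows"
        )
    return inferred


# Hypothetical data: "score" is None in the first 120 rows and an int afterwards.
rows = [{"name": f"user_{i}", "score": None if i < 120 else i} for i in range(150)]

try:
    infer_column_types(rows)  # only rows 0..99 are inspected
except ValueError as err:
    print("inference failed:", err)

# Looking at more rows resolves the column.
print(infer_column_types(rows, limit=150))
```

The first call fails because every inspected row has score set to None. The second call succeeds only because it happens to reach rows that carry a real value.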

Solutions

1. Increase the sampling ratio
sqlContext.createDataFrame(rdd, samplingRatio=0.2)
The samplingRatio parameter is the fraction of rows sampled for type inference. Try 0.2 first, and raise it if the error persists.

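The drawback of this approach is that it remains fragile. If the sample still cannot determine a field's type, the call fails again; and if the types are determined from the sample but later rows hold values of a different type, the program crashes.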
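How likely is it that sampling still misses every informative row? A rough estimate, assuming each row is kept independently with probability samplingRatio (Bernoulli sampling; this is an assumption about the sampler, and the function name miss_probability is made up for this sketch): if only k rows carry a non-None value for a field, the chance that none of them lands in the sample is (1 - samplingRatio)^k.

```python
def miss_probability(sampling_ratio: float, informative_rows: int) -> float:
    """Chance that a Bernoulli sample contains none of the informative rows.

    Assumes every row is kept independently with probability `sampling_ratio`.
    """
    return (1.0 - sampling_ratio) ** informative_rows


# With samplingRatio=0.2, see how the risk shrinks as more rows carry a value.
for k in (1, 5, 20):
    print(f"informative rows = {k:2d}, miss probability = {miss_probability(0.2, k):.4f}")
```

With samplingRatio=0.2 and only 5 informative rows, the miss probability is 0.8^5, roughly 0.33, so raising the ratio reduces the risk but does not remove it for sparse columns.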

This is what motivates the second solution.

2. Explicitly declare the structure (schema) of the DataFrame to be created
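# Import the type classes (StructType, StructField, StringType, IntegerType, ...), one per data type you need
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Build the schema; the third argument of StructField is `nullable`
schema = StructType([
    StructField("column_1", StringType(), True),
    StructField("column_2", IntegerType(), True),
    # ... one StructField per remaining column
])
# Pass the schema in
df = sqlContext.createDataFrame(rdd, schema=schema)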
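Once the schema is declared explicitly and passed to createDataFrame, the samplingRatio parameter is no longer needed.

In real projects, prefer the explicit schema: it avoids errors caused by odd or unexpected data.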

Author: 旧旧的 <393210556@qq.com> (motto: "The way to solve a problem is to solve it once")

Original article: https://www.cnblogs.com/widgetbox/p/13151166.html