Keywords: carbondata spark thrift data warehouse
【Install thrift 0.9.3】
Note: to build thrift-java you must install ant first.
Some guides say boost is also required, but on CentOS 6 everything ran fine without it; my guess is that boost is only needed for the C/C++ bindings, not for Java/Python.
The thrift tarball can be downloaded from the official site; mind the version. Manual download: http://www.apache.org/dyn/closer.cgi?path=/thrift/0.9.3.
sudo yum -y install ant libevent-devel zlib-devel openssl-devel

# Install bison
wget http://ftp.gnu.org/gnu/bison/bison-2.5.1.tar.gz
tar xvf bison-2.5.1.tar.gz
cd bison-2.5.1
./configure --prefix=/usr
make
sudo make install
cd ..

# Install libevent
wget --no-check-certificate https://github.com/libevent/libevent/releases/download/release-2.0.22-stable/libevent-2.0.22-stable.tar.gz -O libevent-2.0.22-stable.tar.gz
tar -xzvf libevent-2.0.22-stable.tar.gz
cd libevent-2.0.22-stable
./configure --prefix=/usr
make
sudo make install
cd ..

# Install thrift
wget http://apache.parentingamerica.com/thrift/0.9.3/thrift-0.9.3.tar.gz
tar -xzvf thrift-0.9.3.tar.gz
cd thrift-0.9.3
./configure --prefix=/usr --with-libevent=/usr --with-java
sudo make
sudo make install
cd ..
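After `make install`, a quick way to confirm the build landed is to check for the binary and its version string. A minimal sketch, assuming `/usr/bin` (the `--prefix` above) is on the PATH:

```shell
# Verify the installed thrift compiler; prints its version, or a notice if it is missing.
if command -v thrift >/dev/null 2>&1; then
  thrift -version    # after the install above this should report 0.9.3
else
  echo "thrift: not on PATH"
fi
```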
For other language bindings, first install that language's toolchain and the related libraries. The Java binding needs the JDK and Ant.
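Before building the Java binding, it can help to confirm those tools are actually on the PATH. A small pre-flight sketch (the tool list is just the two mentioned above; extend it as needed):

```shell
# Pre-flight check: report whether each required build tool is on the PATH.
for tool in java ant; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```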
【Package and Install CarbonData】
Reference: https://github.com/apache/carbondata/tree/master/build
Download carbondata 1.1.0, unpack it, and run the following in the carbondata source directory (for other Spark versions, adjust the profile and the spark.version parameter accordingly):
mvn -DskipTests -Pspark-1.6 -Dspark.version=1.6.2 clean package
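For example, the command for a different Spark line can be composed the same way. A sketch that just prints the resulting invocation; `spark-2.1` is assumed to be a profile defined in the CarbonData 1.1.0 pom, and the point version is illustrative:

```shell
# Hypothetical helper: compose the mvn invocation for a chosen Spark profile/version.
SPARK_PROFILE="spark-2.1"   # assumption: this profile exists in the pom
SPARK_VERSION="2.1.0"       # illustrative point version
echo "mvn -DskipTests -P${SPARK_PROFILE} -Dspark.version=${SPARK_VERSION} clean package"
```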
If Maven downloads are slow, you can use the aliyun mirror instead of Apache Central by editing ~/.m2/settings.xml:
<settings>
  ...
  <mirrors>
    <mirror>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
  ...
</settings>
【Run carbondata in spark-shell】
Reference: http://carbondata.apache.org/quick-start-guide.html
Prepare a data file
# in linux, prepare an example data file
cd carbondata
cat > sample.csv << EOF
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF
hdfs dfs -put sample.csv /tmp/
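The heredoc can be sanity-checked locally before pushing the file to HDFS. A self-contained sketch (it writes to /tmp so it does not depend on the carbondata directory):

```shell
# Recreate the sample file and confirm the header line before `hdfs dfs -put`.
cat > /tmp/sample.csv << 'EOF'
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF
head -n 1 /tmp/sample.csv    # prints: id,name,city,age
```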
Prepare the assembly jars
# in linux, copy the assembly jars to a lib directory
cd $CARBONDATA_HOME
mkdir -p lib
cp assembly/target/scala-2.10/carbondata_2.10-1.1.0-shade-hadoop2.2.0.jar lib/
cp integration/spark/target/carbondata-spark-1.1.0.jar lib/
Run Spark in shell mode
spark-shell --jars $CARBONDATA_HOME/lib/carbondata_2.10-1.1.0-shade-hadoop2.2.0.jar,$CARBONDATA_HOME/lib/carbondata-spark-1.1.0.jar
SparkShell >
// in spark shell, cluster mode
import org.apache.spark.sql.CarbonContext

// remember to add hdfs:// if you want to use hdfs mode
val cc = new CarbonContext(sc, "hdfs:///tmp/carbon/data/")

cc.sql("CREATE TABLE IF NOT EXISTS hdfs_sample (id string, name string, city string, age Int) STORED BY 'carbondata'")
cc.sql("LOAD DATA INPATH 'hdfs:///tmp/sample.csv' INTO TABLE hdfs_sample")
cc.sql("SELECT * FROM hdfs_sample").show()
cc.sql("SELECT city, avg(age), sum(age) FROM hdfs_sample GROUP BY city").show()