  • hadoop

    1. Object Storage

    1.1 Aliyun OSS

    1.2 AWS S3

    2. Hadoop

    References:

    2.1 Hadoop with S3 (MinIO)

    [root@node7131 hadoop-2.10.0]# cat etc/hadoop-minio/core-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
    
        http://www.apache.org/licenses/LICENSE-2.0
    
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
    <property>
            <name>fs.defaultFS</name>
            <!--value>hdfs://localhost:9000</value-->
            <value>s3a://minio-buc</value>
    </property>
    
    <property>
      <name>fs.s3a.endpoint</name>
      <value>http://10.192.71.32:9000</value>
      <description>AWS S3 endpoint to connect to. An up-to-date list is
        provided in the AWS Documentation: regions and endpoints. Without this
        property, the standard region (s3.amazonaws.com) is assumed.
      </description>
    </property>
    
    <property>
      <name>fs.s3a.access.key</name>
      <value>ak_123456</value>
      <description>AWS access key ID.
       Omit for IAM role-based or provider-based authentication.</description>
    </property>
    
    <property>
      <name>fs.s3a.secret.key</name>
      <value>sk_123456</value>
      <description>AWS secret key.
       Omit for IAM role-based or provider-based authentication.</description>
    </property>
    
    <property>
      <name>fs.s3a.aws.credentials.provider</name>
      <description>
        Comma-separated class names of credential provider classes which implement
        com.amazonaws.auth.AWSCredentialsProvider.
    
        These are loaded and queried in sequence for a valid set of credentials.
        Each listed class must implement one of the following means of
        construction, which are attempted in order:
        1. a public constructor accepting java.net.URI and
            org.apache.hadoop.conf.Configuration,
        2. a public static method named getInstance that accepts no
           arguments and returns an instance of
           com.amazonaws.auth.AWSCredentialsProvider, or
        3. a public default constructor.
    
        Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
        anonymous access to a publicly accessible S3 bucket without any credentials.
        Please note that allowing anonymous access to an S3 bucket compromises
        security and therefore is unsuitable for most use cases. It can be useful
        for accessing public data sets without requiring AWS credentials.
    
        If unspecified, then the default list of credential provider classes,
        queried in sequence, is:
        1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
           Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
        2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
            configuration of AWS access key ID and secret access key in
            environment variables named AWS_ACCESS_KEY_ID and
            AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
        3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
            of instance profile credentials if running in an EC2 VM.
      </description>
    </property>
    
    <property>
      <name>fs.s3a.impl</name>
      <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
      <description>The implementation class of the S3A Filesystem</description>
    </property>
    
    </configuration>
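    With this core-site.xml in place, the S3A connection can be smoke-tested from the Hadoop CLI before starting any services. A minimal sketch, assuming the commands are run from the unpacked hadoop-2.10.0 directory and the minio-buc bucket already exists on the MinIO endpoint above; /tmp/hello.txt is just an illustrative file:

    # Point Hadoop at the MinIO-specific config directory shown above.
    export HADOOP_CONF_DIR=$(pwd)/etc/hadoop-minio
    # hadoop-aws and the bundled AWS SDK ship under share/hadoop/tools/lib in Hadoop 2.x and must be on the classpath.
    export HADOOP_CLASSPATH="$(pwd)/share/hadoop/tools/lib/*"

    # An empty listing with no error means endpoint and credentials are accepted.
    bin/hadoop fs -ls s3a://minio-buc/

    # Round-trip a small file through S3A.
    echo "hello s3a" > /tmp/hello.txt
    bin/hadoop fs -put /tmp/hello.txt s3a://minio-buc/hello.txt
    bin/hadoop fs -cat s3a://minio-buc/hello.txt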
    
    

    2.2 Hadoop with HAS S3

    On top of the S3A settings from 2.1, the request signing algorithm is overridden (S3SignerType selects the legacy AWS V2 signature):

    <property>
      <name>fs.s3a.signing-algorithm</name>
      <value>S3SignerType</value>
      <description>Override the default request signing algorithm; S3SignerType
        forces the legacy AWS V2 signature, which some S3-compatible stores
        require.</description>
    </property>
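    Whether a given endpoint actually needs the V2 signer can be checked without editing core-site.xml by passing the property as a one-off generic option. A sketch; s3a://has-buc/ is a hypothetical bucket name on the HAS S3 endpoint:

    # Fails with a signature/authorization error if the store rejects the selected signer.
    bin/hadoop fs -Dfs.s3a.signing-algorithm=S3SignerType -ls s3a://has-buc/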
    

    2.3 Hadoop with Aliyun OSS

    [root@node7131 hadoop-2.10.0]# cat etc/hadoop-aliyun/core-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
    
        http://www.apache.org/licenses/LICENSE-2.0
    
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
    <property>
            <name>fs.defaultFS</name>
            <value>oss://buc-pic</value>
    </property>
    
    <property>
      <name>fs.oss.endpoint</name>
      <value>oss-cn-hangzhou.aliyuncs.com</value>
      <description>Aliyun OSS endpoint to connect to. An up-to-date list is
        provided in the Aliyun OSS Documentation.
       </description>
    </property>
    
    <property>
       <name>fs.oss.impl</name>
       <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
    </property>
    
    <property>
      <name>fs.oss.accessKeyId</name>
      <value>LTAI4FgP65SyfaP5BmaV1x5C</value>
      <description>Aliyun access key ID</description>
    </property>
    
    <property>
      <name>fs.oss.accessKeySecret</name>
      <value>zRRfmJuGbuyz4laXbkRfzDKq8ugeTw</value>
      <description>Aliyun access key secret</description>
    </property>
    
    <property>
      <name>fs.oss.credentials.provider</name>
      <description>
        Class name of a credentials provider that implements
        com.aliyun.oss.common.auth.CredentialsProvider. Omit if using access/secret keys
        or another authentication mechanism. The specified class must provide an
        accessible constructor accepting java.net.URI and
        org.apache.hadoop.conf.Configuration, or an accessible default constructor.
      </description>
    </property>
    
    </configuration>
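    The OSS connector can be smoke-tested the same way as S3A. A minimal sketch, assuming the commands are run from the hadoop-2.10.0 directory and the buc-pic bucket already exists in the Hangzhou region:

    # hadoop-aliyun and the Aliyun OSS SDK also live under share/hadoop/tools/lib in Hadoop 2.10.
    export HADOOP_CONF_DIR=$(pwd)/etc/hadoop-aliyun
    export HADOOP_CLASSPATH="$(pwd)/share/hadoop/tools/lib/*"

    # List, create a directory, and list again.
    bin/hadoop fs -ls oss://buc-pic/
    bin/hadoop fs -mkdir oss://buc-pic/test-dir
    bin/hadoop fs -ls oss://buc-pic/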
    
    

    3. Hive

    References:

    Configuring Hive for S3 is the same as configuring it for HDFS: the S3A settings in Hadoop's core-site.xml are reused as-is, and Hive needs nothing extra.
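    Since the trace below runs against the buc-ld bucket, the session behind it looks roughly like this sketch; the ld database, the users table, and the inserted values come from the trace, while the column names are assumed:

    hive> create database ld;      -- lands under s3a://buc-ld/user/hive/warehouse/ld.db/
    hive> use ld;
    hive> create table users(id int, name string, pass string, ak string, sk string);
    hive> insert into users values(1, "liudong", "pass", "ak", "sk");   -- this is the step that currently fails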

    Known issues:

    1. When the Hive CLI talks to S3, it can create the corresponding database/table directories in S3, but the data-insert path still has problems and is being debugged.
      The problem is that inserting data creates temporary directories, and the resulting object key exceeds 127 characters, which makes the upload fail.

    Steps Hive goes through when uploading data:

    hive> insert into users values(1, "liudong", "pass", "ak", "sk");
    
    1. Create the directory: PUT /buc-ld/tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/_tmp_space.db/Values__Tmp__Table__4/
      At the same time, the following parent-directory keys are deleted:
      tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/_tmp_space.db/
      tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/
      tmp/hive/root/
      tmp/hive/
      tmp/

    2. HEAD /buc-ld/tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/_tmp_space.db/Values__Tmp__Table__4/data_file

    3. HEAD /buc-ld/tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/_tmp_space.db/Values__Tmp__Table__4/data_file/

    4. HEAD /buc-ld/tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/_tmp_space.db/Values__Tmp__Table__4
      NOT FOUND

    5. HEAD /buc-ld/tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/_tmp_space.db/Values__Tmp__Table__4/
      200 OK

    6. PUT /buc-ld/tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/_tmp_space.db/Values__Tmp__Table__4/data_file HTTP/1.1
      Host: 10.192.71.31
      Authorization: AWS HIK3xk48642XJ88h7d70nW6613us4x28:RF0GxG7CrW6ohifnoZKItxCXZfs=
      User-Agent: Hadoop 2.10.0, aws-sdk-java/1.11.271 Linux/3.10.0-123.el7.x86_64 OpenJDK_64-Bit_Server_VM/11.0.3+7 java/11.0.3 groovy/2.4.4 com.amazonaws.services.s3.transfer.TransferManager/1.11.271
      amz-sdk-invocation-id: a6433047-696f-0143-2216-475c1ce77bb0
      amz-sdk-retry: 0/0/500
      Date: Sun, 19 Jan 2020 02:42:13 GMT
      Content-MD5: QlifiGByqjSTxcfTVojBcg==
      Content-Type: application/octet-stream
      Content-Length: 21
      Connection: Keep-Alive
      Expect: 100-continue

    HTTP/1.1 100 Continue

    1.liudong.pass.ak.sk
    HTTP/1.1 200 OK
    ETag: 42589f886072aa3493c5c7d35688c172
    Date: Sun, 19 Jan 2020 02:42:13 GMT
    Connection: keep-alive
    Content-Length: 0

    1. Delete tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/_tmp_space.db/Values__Tmp__Table__4/ and every level of parent directory

    2. HEAD /buc-ld/user/hive/warehouse/ld.db/users/.hive-staging_hive_2020-01-19_10-42-12_891_8904992031068082033-1
      NOT FOUND

    HEAD /buc-ld/user/hive/warehouse/ld.db/users/.hive-staging_hive_2020-01-19_10-42-12_891_8904992031068082033-1/
    NOT FOUND

    1. GET /buc-ld/?prefix=user/hive/warehouse/ld.db/users/.hive-staging_hive_2020-01-19_10-42-12_891_8904992031068082033-1/&delimiter=/&max-keys=1&encoding-type=url

    HEAD /buc-ld/user/hive/warehouse/ld.db/users
    NOT FOUND

    HEAD /buc-ld/user/hive/warehouse/ld.db/users/
    NOT FOUND

    GET /buc-ld/?prefix=user/hive/warehouse/ld.db/users/&delimiter=/&max-keys=1&encoding-type=url
    OK

    HEAD /buc-ld/user/hive/warehouse/ld.db/users/.hive-staging_hive_2020-01-19_10-42-12_891_8904992031068082033-1
    NOT FOUND

    HEAD /buc-ld/user/hive/warehouse/ld.db/users/.hive-staging_hive_2020-01-19_10-42-12_891_8904992031068082033-1/
    NOT FOUND

    PUT /buc-ld/user/hive/warehouse/ld.db/users/.hive-staging_hive_2020-01-19_10-42-12_891_8904992031068082033-1/ HTTP/1.1

    1. PUT /buc-ld/user/hive/warehouse/ld.db/users/.hive-staging_hive_2020-01-19_10-42-12_891_8904992031068082033-1/_tmp.-ext-10002/

    2. GET /buc-ld/?prefix=tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/_tmp_space.db/Values__Tmp__Table__4/&delimiter=/&max-keys=1&encoding-type=url HTTP/1.1
      OK

    HEAD /buc-ld/tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/_tmp_space.db/Values__Tmp__Table__4/data_file HTTP/1.1
    OK

    GET /buc-ld/tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/_tmp_space.db/Values__Tmp__Table__4/data_file HTTP/1.1
    1.liudong.pass.ak.sk

    HEAD /buc-ld/user/hive/warehouse/ld.db/users/.hive-staging_hive_2020-01-19_10-42-12_891_8904992031068082033-1/_task_tmp.-ext-10002/_tmp.000000_0 HTTP/1.1
    Host: 10.192.71.31
    Authorization: AWS HIK3xk48642XJ88h7d70nW6613us4x28:wFAXYdh95S9PDQAIepKB5X/jrkU=
    User-Agent: Hadoop 2.10.0, aws-sdk-java/1.11.271 Linux/3.10.0-123.el7.x86_64 OpenJDK_64-Bit_Server_VM/11.0.3+7 java/11.0.3 groovy/2.4.4
    amz-sdk-invocation-id: 70ecf423-2f71-d1ae-9138-60ebf6864c38
    amz-sdk-retry: 0/0/500
    Date: Sun, 19 Jan 2020 02:42:14 GMT
    Content-Type: application/octet-stream
    Connection: Keep-Alive
    
    HTTP/1.1 400 Bad Request
    Content-Length: 171
    Date: Sun, 19 Jan 2020 02:42:14 GMT
    Connection: close
    
    <?xml version="1.0" encoding="utf-8"?>
    <Error>
    	<Code>InvalidArgument</Code>
    	<Message>Invalid Argument</Message>
    	<Resource></Resource>
    	<RequestId></RequestId>
    </Error>
    
    

    DELETE /buc-ld/user/hive/warehouse/ld.db/users/.hive-staging_hive_2020-01-19_10-42-12_891_8904992031068082033-1/ HTTP/1.1

    1. Upload to the temporary file /tmp/hive/root/92f872b5-0c96-4208-907a-747133d4522c/_tmp_space.db/Values__Tmp__Table__2/data_file/
    2. Delete the temporary file

    Configuration for Hive to support delete operations

    1. Configuration file (hive-site.xml)
    <property>
      <name>hive.support.concurrency</name>
      <value>true</value>
    </property>

    <property>
      <name>hive.enforce.bucketing</name>
      <value>true</value>
    </property>

    <property>
      <name>hive.exec.dynamic.partition.mode</name>
      <value>nonstrict</value>
    </property>

    <property>
      <name>hive.txn.manager</name>
      <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
    </property>

    <property>
      <name>hive.compactor.initiator.on</name>
      <value>true</value>
    </property>

    <property>
      <name>hive.compactor.worker.threads</name>
      <value>1</value>
    </property>

    <property>
      <name>hive.in.test</name>
      <value>true</value>
    </property>

    <property>
      <name>hive.auto.convert.join.noconditionaltask.size</name>
      <value>10000000</value>
    </property>
    

    Create a table that supports update/delete:

    create table test(id int, name string) clustered by (id) into 5 buckets
    row format delimited fields terminated by ',' lines terminated by '\n'
    stored as orc tblproperties('transactional'='true');
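
    With the table declared transactional, row-level statements can be exercised; a small usage sketch (the values are illustrative):

    hive> insert into test values (1, 'aaa'), (2, 'bbb');
    hive> update test set name = 'ccc' where id = 1;
    hive> delete from test where id = 2;
    hive> select * from test;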
    

    4. HBase

    References:

    HBase configuration files (hbase-site.xml first, then the core-site.xml carrying the S3A settings):

    <configuration>
    
      <property>
        <name>hbase.rootdir</name>
        <!--value>s3a://10.192.71.31:80/</value-->
        <value>s3a://hbase-1/</value>
      </property>
      <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/data/hbase-2.2.2/data/zookeeper</value>
      </property>
      <property>
        <name>hbase.unsafe.stream.capability.enforce</name>
        <value>false</value>
        <description>
          Controls whether HBase will check for stream capabilities (hflush/hsync).
    
          Disable this if you intend to run on LocalFileSystem, denoted by a rootdir
          with the 'file://' scheme, but be mindful of the NOTE below.
    
          WARNING: Setting this to false blinds you to potential data loss and
          inconsistent system state in the event of process and/or node failures. If
          HBase is complaining of an inability to use hsync or hflush it's most
          likely not a false positive.
        </description>
      </property>
    
    
    </configuration>
    
    
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
    
        http://www.apache.org/licenses/LICENSE-2.0
    
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
    <property>
            <name>fs.defaultFS</name>
            <!--value>hdfs://localhost:9000</value-->
            <value>s3a://hbase-1</value>
    </property>
    
    <property>
      <name>fs.s3a.endpoint</name>
      <value>http://10.192.71.31:80</value>
      <description>AWS S3 endpoint to connect to. An up-to-date list is
        provided in the AWS Documentation: regions and endpoints. Without this
        property, the standard region (s3.amazonaws.com) is assumed.
      </description>
    </property>
    
    <property>
      <name>fs.s3a.access.key</name>
      <value>HIK3xk48642XJ88h7d70nW6613us4x28</value>
      <description>AWS access key ID.
       Omit for IAM role-based or provider-based authentication.</description>
    </property>
    
    <property>
      <name>fs.s3a.secret.key</name>
      <value>HIKco547Q032JB34Q200J16xt0Q5UKE4</value>
      <description>AWS secret key.
       Omit for IAM role-based or provider-based authentication.</description>
    </property>
    
    <property>
      <name>fs.s3a.aws.credentials.provider</name>
      <description>
        Comma-separated class names of credential provider classes which implement
        com.amazonaws.auth.AWSCredentialsProvider.
    
        These are loaded and queried in sequence for a valid set of credentials.
        Each listed class must implement one of the following means of
        construction, which are attempted in order:
        1. a public constructor accepting java.net.URI and
            org.apache.hadoop.conf.Configuration,
        2. a public static method named getInstance that accepts no
           arguments and returns an instance of
           com.amazonaws.auth.AWSCredentialsProvider, or
        3. a public default constructor.
    
        Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
        anonymous access to a publicly accessible S3 bucket without any credentials.
        Please note that allowing anonymous access to an S3 bucket compromises
        security and therefore is unsuitable for most use cases. It can be useful
        for accessing public data sets without requiring AWS credentials.
    
        If unspecified, then the default list of credential provider classes,
        queried in sequence, is:
        1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
           Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
        2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
            configuration of AWS access key ID and secret access key in
            environment variables named AWS_ACCESS_KEY_ID and
            AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
        3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
            of instance profile credentials if running in an EC2 VM.
      </description>
    </property>
    
    <property>
      <name>fs.s3a.impl</name>
      <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
      <description>The implementation class of the S3A Filesystem</description>
    </property>
    
    <property>
      <name>fs.s3a.attempts.maximum</name>
      <value>1</value>
      <description>How many times we should retry commands on transient errors.</description>
    </property>
    
    <property>
      <name>fs.s3a.signing-algorithm</name>
      <value>S3SignerType</value>
      <description>Override the default request signing algorithm; S3SignerType
        selects the legacy AWS V2 signature.</description>
    </property>
    
    <property>
      <name>fs.s3a.paging.maximum</name>
      <value>999</value>
      <description>Maximum number of keys to request per S3 object listing.</description>
    </property>
    
    </configuration>
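    With both files in place, HBase can be started against the S3A root directory and checked from its shell. A rough smoke-test sketch, assuming the hbase-1 bucket already exists, the commands are run from the hbase-2.2.2 directory, and the hadoop-aws/AWS SDK jars are on HBase's classpath:

    # Start HBase against the s3a://hbase-1/ rootdir configured above.
    bin/start-hbase.sh

    # HBase's directory layout should appear under the bucket root.
    hadoop fs -ls s3a://hbase-1/

    # Basic read/write check from the HBase shell.
    bin/hbase shell
    hbase(main):001:0> create 'smoke', 'cf'
    hbase(main):002:0> put 'smoke', 'r1', 'cf:c1', 'v1'
    hbase(main):003:0> scan 'smoke'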
    
    