  How to Reuse Old PCs for Solr Search Platform?

    Are the old PCs at home or at the office too underpowered? So slow you feel like smashing them? Do friends or colleagues have machines they are about to throw away? Here is a way to put that junk to use: I collected four old PCs, built a Hadoop cluster in fully distributed mode, ran HBase on top of Hadoop, ran Nutch, and stored the data for Solr in HBase.

    PC Specs

    Name                 CPU             RAM
    pigpigpig-client2    T2400 1.82GHz   2GB
    pigpigpig-client4    E7500 2.93GHz   4GB
    pigpigpig-client5    E2160 1.80GHz   4GB
    pigpigpig-client6    T7300 2.00GHz   2GB

    Roles

    Name                 Roles
    pigpigpig-client2    HQuorumPeer, SecondaryNameNode, ResourceManager, Solr
    pigpigpig-client4    NodeManager, HRegionServer, DataNode
    pigpigpig-client5    NodeManager, HRegionServer, DataNode
    pigpigpig-client6    NameNode, HMaster, Nutch

    Version

    Configuration

    When I first ran Nutch I did not change the default configuration files at all. After roughly 10 hours a RegionServer would invariably crash at random, and the error messages were mostly Out Of Memory variants. Our constraint is that resources are fixed: these old PCs cannot be upgraded any further, unlike EC2 where you can simply scale up when resources run short, so performance tuning is a very important topic for us.

    in hadoop-env.sh

    Memory is precious. With only two DataNodes we do not need as much as the default 512MB, so every heap is cut in half:

    export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS -Xmx256m"

    export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS -Xmx256m"

    export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS -Xmx256m"

    export HADOOP_PORTMAP_OPTS="-Xmx256m $HADOOP_PORTMAP_OPTS"

    export HADOOP_CLIENT_OPTS="-Xmx256m $HADOOP_CLIENT_OPTS"
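
    A quick way to confirm the smaller heaps actually took effect after restarting HDFS is to look at the -Xmx flag of each running daemon. This is only a minimal sketch; jps ships with the JDK, and the grep pattern is merely illustrative:

    # Print the -Xmx settings of the running HDFS daemons
    for daemon in NameNode DataNode SecondaryNameNode; do
      pid=$(jps | awk -v d="$daemon" '$2 == d {print $1}')
      [ -n "$pid" ] && echo "$daemon:" && ps -o args= -p "$pid" | tr ' ' '\n' | grep -- '-Xmx'
    done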

    in hdfs-site.xml

    To avoid HDFS timeout errors, the timeouts are extended:

    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>1200000</value>
    </property>
    
    <property>
      <name>dfs.socket.timeout</name>
      <value>1200000</value>
    </property>
    
    <property>
      <name>dfs.client.socket-timeout</name>
      <value>1200000</value>
    </property>
    

    in mapred-env.sh

    export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=256

    in mapred-site.xml

    The CPUs are slow and there are not many nodes, so mapred.task.timeout is raised to give MapReduce jobs enough time to finish. This matters especially after the Nutch inject, generate, fetch, parse, and updatedb jobs have run for a few rounds: each pass then handles millions of records, and with too low a timeout the tasks never get to complete.

      <property>
        <name>mapred.task.timeout</name>
        <value>216000000</value> <!-- 60 hours -->
      </property>
    
      <property>
        <name>mapreduce.map.output.compress</name>
        <value>true</value>
      </property>
    
      <property>
        <name>mapreduce.map.output.compress.codec</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
      </property>
    
      <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1024</value>
      </property>
    
      <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>1024</value>
      </property>
    
      <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx200M</value>
      </property>
    
      <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx200M</value>
      </property>
    
      <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>1024</value>
      </property>
    
      <property>
        <name>yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx200M</value>
      </property>
    

    in yarn-env.sh

    JAVA_HEAP_MAX=-Xmx256m

    YARN_HEAPSIZE=256

    in yarn-site.xml

    On the workers, 4GB of RAM has to be shared by the OS, the NodeManager, the HRegionServer, and the DataNode, so resources are really tight. Half of the memory is given to YARN, so yarn.nodemanager.resource.memory-mb is set to 2048; each CPU has 2 cores, so mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, and yarn.scheduler.maximum-allocation-mb are set to 1024. yarn.nodemanager.vmem-pmem-ratio is raised to avoid errors like "running beyond virtual memory limits. Killing container". The resulting memory budget is worked out in the short sketch after the configuration block below.

    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>2048</value>
    </property>
    
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>128</value>
    </property>
    
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>1024</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>2</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>true</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.vmem-pmem-ratio</name>
      <value>3.15</value>
    </property>
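
    A minimal sanity check of that budget, assuming 4GB per worker node and using the values from yarn-site.xml and mapred-site.xml above:

    # Rough per-node memory budget (values taken from the configs above)
    NODE_RAM_MB=4096
    YARN_MB=2048       # yarn.nodemanager.resource.memory-mb
    CONTAINER_MB=1024  # mapreduce.{map,reduce}.memory.mb and yarn.scheduler.maximum-allocation-mb
    echo "max concurrent containers per node: $((YARN_MB / CONTAINER_MB))"       # 2, one per core
    echo "left for OS + DataNode + HRegionServer: $((NODE_RAM_MB - YARN_MB)) MB" # 2048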
    

    in hbase-env.sh

    # export HBASE_HEAPSIZE=1000

    export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Xmx192m -Xms192m -Xmn72m"

    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Xmx1024m -Xms1024m -verbose:gc -Xloggc:/mnt/hadoop-2.4.1/hbase/logs/hbaseRgc.log -XX:+PrintAdaptiveSizePolicy -XX:+PrintGC -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/hadoop-2.4.1/hbase/logs/java_pid{$$}.hprof"

    export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS $HBASE_JMX_BASE -Xmx192m -Xms72m"

    in hbase-site.xml

    The RegionServer's out-of-memory crashes are related to hbase.hregion.max.filesize, hbase.hregion.memstore.flush.size, and hbase.hregion.memstore.block.multiplier.

    Drawbacks when hbase.hregion.max.filesize is too small

    1. Each RegionServer ends up hosting too many regions (P.S. each ColumnFamily of each region occupies 2MB of MSLAB)
    2. Splits and compactions happen too frequently
    3. Too many storefiles are kept open (P.S. Potential Number of Open Files = (StoreFiles per ColumnFamily) x (regions per RegionServer); a worked example follows the next list)

    Drawbacks when hbase.hregion.max.filesize is too large

    1. Too few regions, so the benefit of distributed mode is lost
    2. Pauses during splits and compactions become too long
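
    A tiny illustration of that open-file estimate; the region and storefile counts are hypothetical assumptions, not measurements from this cluster:

    # Hypothetical numbers, only to illustrate the formula above
    REGIONS_PER_RS=100     # regions hosted by one RegionServer (assumed)
    CF_PER_REGION=1        # column families per region (assumed)
    STOREFILES_PER_CF=3    # storefiles per column family (assumed)
    echo "potential open storefiles: $((REGIONS_PER_RS * CF_PER_REGION * STOREFILES_PER_CF))"  # 300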

    The server-side memory used by write buffers is (hbase.client.write.buffer) * (hbase.regionserver.handler.count), so setting hbase.client.write.buffer and hbase.regionserver.handler.count too high eats too much memory, while setting them too low increases the number of RPCs.
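
    As a rough worked example, assuming the client write buffer is left at its default of about 2MB (an assumption; this post does not change it) and using the 15 handlers configured below:

    # Worst-case server-side memory reserved for client write buffers
    WRITE_BUFFER_MB=2   # hbase.client.write.buffer default (~2MB), assumed unchanged
    HANDLER_COUNT=15    # hbase.regionserver.handler.count from hbase-site.xml below
    echo "write-buffer memory per RegionServer: $((WRITE_BUFFER_MB * HANDLER_COUNT)) MB"  # 30 MB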

    If hbase.zookeeper.property.tickTime and zookeeper.session.timeout are too short, ZooKeeper SessionExpired errors occur. Setting hbase.ipc.warn.response.time higher suppresses the responseTooSlow warnings.

    hbase.hregion.memstore.flush.size and hbase.hregion.memstore.block.multiplier also affect how often splits and compactions happen.
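
    As a side note on those two values: HBase blocks updates to a region once its memstore exceeds flush.size multiplied by block.multiplier, so the settings below give roughly this per-region ceiling:

    # Per-region memstore ceiling before HBase blocks further updates
    FLUSH_SIZE_MB=64    # hbase.hregion.memstore.flush.size (67108864 bytes)
    BLOCK_MULTIPLIER=8  # hbase.hregion.memstore.block.multiplier
    echo "memstore blocking threshold per region: $((FLUSH_SIZE_MB * BLOCK_MULTIPLIER)) MB"  # 512 MB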

    <property>
      <name>hbase.client.scanner.timeout.period</name>
      <value>1200000</value>
    </property>
    
    <property>
      <name>hbase.zookeeper.property.tickTime</name>
      <value>60000</value>
    </property>
    
    <property>
      <name>zookeeper.session.timeout</name>
      <value>1200000</value>
    </property>
    
    <property>
      <name>hbase.rpc.timeout</name>
      <value>1800000</value>
    </property>
    
    <property>
      <name>hbase.ipc.warn.response.time</name>
      <value>1200000</value>
    </property>
    
    <property>
      <name>hbase.regionserver.handler.count</name>
      <value>15</value>
    </property>
    
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>10737418240</value>
    </property>
    
    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>67108864</value>
    </property>
    
    <property>
      <name>hbase.hregion.memstore.block.multiplier</name>
      <value>8</value>
    </property>
    

    Start Servers

    1. run hdfs namenode -format on pigpigpig-client6
    2. run start-dfs.sh on pigpigpig-client6
    3. run start-yarn.sh on pigpigpig-client2
    4. run start-yarn.sh on pigpigpig-client4
    5. run start-hbase.sh on pigpigpig-client6
    6. run java -Xmx1024m -Xms1024m -XX:+UseConcMarkSweepGC -jar start.jar in solr folder on pigpigpig-client2
    7. run hadoop fs -mkdir /user;hadoop fs -mkdir /user/pigpigpig;hadoop fs -put urls /user/pigpigpig in nutch folder on pigpigpig-client6
    8. run hadoop jar apache-nutch-2.4-SNAPSHOT.job org.apache.nutch.crawl.InjectorJob urls -crawlId webcrawl in nutch folder on pigpigpig-client6
    9. run hadoop jar apache-nutch-2.4-SNAPSHOT.job org.apache.nutch.crawl.GeneratorJob -crawlId webcrawl in nutch folder on pigpigpig-client6
    10. run hadoop jar apache-nutch-2.4-SNAPSHOT.job org.apache.nutch.fetcher.FetcherJob -all -crawlId webcrawl in nutch folder on pigpigpig-client6
    11. run hadoop jar apache-nutch-2.4-SNAPSHOT.job org.apache.nutch.parse.ParserJob -all -crawlId webcrawl in nutch folder on pigpigpig-client6
    12. run hadoop jar apache-nutch-2.4-SNAPSHOT.job org.apache.nutch.crawl.DbUpdaterJob -all -crawlId webcrawl in nutch folder on pigpigpig-client6
    13. run hadoop jar apache-nutch-2.4-SNAPSHOT.job org.apache.nutch.indexer.IndexingJob -D solr.server.url=http://pigpigpig-client2/solr/nutch/ -all -crawlId webcrawl in nutch folder on pigpigpig-client6
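
    Steps 8 through 12 form one crawl round and are normally repeated many times before indexing. Below is a minimal sketch of looping them from the nutch folder on pigpigpig-client6; the commands are exactly steps 9 to 13 above, and only the round count is an arbitrary assumption:

    # Repeat the generate -> fetch -> parse -> updatedb cycle, then index into Solr
    ROUNDS=3   # arbitrary; pick as many rounds as the crawl needs
    for i in $(seq 1 "$ROUNDS"); do
      hadoop jar apache-nutch-2.4-SNAPSHOT.job org.apache.nutch.crawl.GeneratorJob -crawlId webcrawl
      hadoop jar apache-nutch-2.4-SNAPSHOT.job org.apache.nutch.fetcher.FetcherJob -all -crawlId webcrawl
      hadoop jar apache-nutch-2.4-SNAPSHOT.job org.apache.nutch.parse.ParserJob -all -crawlId webcrawl
      hadoop jar apache-nutch-2.4-SNAPSHOT.job org.apache.nutch.crawl.DbUpdaterJob -all -crawlId webcrawl
    done
    hadoop jar apache-nutch-2.4-SNAPSHOT.job org.apache.nutch.indexer.IndexingJob -D solr.server.url=http://pigpigpig-client2/solr/nutch/ -all -crawlId webcrawl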

    Stop Servers

    1. run stop-hbase.sh on pigpigpig-client6
    2. run stop-yarn.sh on pigpigpig-client2
    3. run stop-yarn.sh on pigpigpig-client4
    4. run stop-dfs.sh on pigpigpig-client6

    Screenshots

      

    Resources

    1. The complete configuration files can be downloaded from https://github.com/EugenePig/Experiment1
    2. https://github.com/EugenePig/Gora/tree/Gora-0.6.1-SNAPSHOT-Hadoop27-Solr5
    3. https://github.com/EugenePig/nutch/tree/2.4-SNAPSHOT-Hadoop27-Solr5
    4. https://github.com/EugenePig/ik-analyzer-solr5
    Original post: https://www.cnblogs.com/seaspring/p/5587016.html