  In-Depth Analysis of the Injector Job



    The Injector Job creates a table in HBase named after the crawlId and injects the seed URLs from a text file into that table.
    (I) Command execution
    1. Run the command
    [jediael@master local]$ bin/nutch inject seeds/ -crawlId sourcetest
    InjectorJob: starting at 2015-03-10 14:59:19
    InjectorJob: Injecting urlDir: seeds
    InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
    InjectorJob: total number of urls rejected by filters: 0
    InjectorJob: total number of urls injected after normalization and filtering: 1
    Injector: finished at 2015-03-10 14:59:26, elapsed: 00:00:06

    2. View the table contents
    hbase(main):004:0> scan 'sourcetest_webpage'
    ROW                                       COLUMN+CELL                                                                                                           
     com.163.money:http/                      column=f:fi, timestamp=1425970761871, value=\x00'\x8D\x00
     com.163.money:http/                      column=f:ts, timestamp=1425970761871, value=\x00\x00\x01L\x02{\x08_
     com.163.money:http/                      column=mk:_injmrk_, timestamp=1425970761871, value=y                                                                 
     com.163.money:http/                      column=mk:dist, timestamp=1425970761871, value=0                                                                      
     com.163.money:http/                      column=mtdt:_csh_, timestamp=1425970761871, value=?\x80\x00\x00
     com.163.money:http/                      column=s:s, timestamp=1425970761871, value=?\x80\x00\x00
    1 row(s) in 0.0430 seconds
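
    The row key com.163.money:http/ is the reversed form of http://money.163.com/: the host labels are flipped and the protocol (plus port, if any) is appended before the path. Below is a minimal sketch of that transformation, assuming java.net.URL parsing; it is not Nutch's actual TableUtil.reverseUrl implementation, which the mapper uses later on.

    import java.net.URL;

    public class ReverseUrlSketch {
        // Reverses the host labels and appends ":" + protocol (+ ":" + port) + path.
        static String reverseUrl(String urlString) throws Exception {
            URL url = new URL(urlString);
            String[] labels = url.getHost().split("\\.");
            StringBuilder sb = new StringBuilder();
            for (int i = labels.length - 1; i >= 0; i--) {
                sb.append(labels[i]);
                if (i > 0) sb.append('.');
            }
            sb.append(':').append(url.getProtocol());
            if (url.getPort() != -1) sb.append(':').append(url.getPort());
            sb.append(url.getFile()); // path + query, "/" for the root
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(reverseUrl("http://money.163.com/")); // com.163.money:http/
        }
    }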

    3. Read the contents from the database
    Because the HBase table stores its values as raw bytes, use the following command to dump them in readable form:
    [jediael@master local]$ bin/nutch readdb  -dump ./test -crawlId sourcetest -content
    WebTable dump: starting
    WebTable dump: done
    [jediael@master local]$ cat test/part-r-00000
    http://money.163.com/   key:    com.163.money:http/
    baseUrl:        null
    status: 0 (null)
    fetchTime:      1425970759775
    prevFetchTime:  0
    fetchInterval:  2592000
    retriesSinceFetch:      0
    modifiedTime:   0
    prevModifiedTime:       0
    protocolStatus: (null)
    parseStatus:    (null)
    title:  null
    score:  1.0
    marker _injmrk_ :       y
    marker dist :   0
    reprUrl:        null
    metadata _csh_ :        ?锟
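
    The fetchInterval of 2592000 seconds (30 days) and the score of 1.0 seen in the dump are the injection defaults taken from configuration. A small illustrative snippet, assuming the standard nutch-default.xml property names db.fetch.interval.default and db.score.injected (this is not Nutch's own code):

    import org.apache.hadoop.conf.Configuration;

    public class InjectDefaultsSketch {
        // Reads the two properties that presumably produce the values above:
        // db.fetch.interval.default -> 2592000 s (30 days), db.score.injected -> 1.0
        static void printInjectDefaults(Configuration conf) {
            int interval = conf.getInt("db.fetch.interval.default", 2592000);
            float score = conf.getFloat("db.score.injected", 1.0f);
            System.out.println("fetchInterval=" + interval + ", score=" + score);
        }
    }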


    (II) Source code flow analysis
    Class: org.apache.nutch.crawl.InjectorJob
    1. Program entry point
     
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(),
            args);
        System.exit(res);
      }
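
    For comparison, the command from section (I) could also be issued programmatically through the same ToolRunner entry point. This is a hypothetical driver written for illustration, not part of Nutch:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.crawl.InjectorJob;
    import org.apache.nutch.util.NutchConfiguration;

    public class InjectDriver {
        public static void main(String[] args) throws Exception {
            // Equivalent of "bin/nutch inject seeds/ -crawlId sourcetest"
            int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(),
                new String[] { "seeds/", "-crawlId", "sourcetest" });
            System.exit(res);
        }
    }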

    2. run(String[] args), invoked via ToolRunner
    This step mainly calls the inject() method; the rest is argument validation.
     
    public int run(String[] args) throws Exception {
        // ... argument validation ...
        inject(new Path(args[0]));
        // ...
    }


    3. The inject() method
    Nutch runs every concrete job through Map<String, Object> run(Map<String, Object> args): the job takes its arguments as a Map and returns its results as a Map.
    public void inject(Path urlDir) throws Exception {
    
        run(ToolUtil.toArgMap(Nutch.ARG_SEEDDIR, urlDir));
    
      }
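
    ToolUtil.toArgMap presumably just packs alternating key/value pairs into a Map. A rough sketch of such a helper, not the actual Nutch implementation:

    import java.util.HashMap;
    import java.util.Map;

    public final class ArgMapSketch {
        // Builds a Map from varargs given as (key1, value1, key2, value2, ...).
        public static Map<String, Object> toArgMap(Object... args) {
            Map<String, Object> map = new HashMap<>();
            for (int i = 0; i + 1 < args.length; i += 2) {
                map.put(String.valueOf(args[i]), args[i + 1]);
            }
            return map;
        }
    }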


    
    

    4. run(Map<String, Object> args): configure the job and create the HBase table
    public Map<String, Object> run(Map<String, Object> args) throws Exception {
       
        numJobs = 1;
        currentJobNum = 0;
        currentJob = new NutchJob(getConf(), "inject " + input);
        FileInputFormat.addInputPath(currentJob, input);
        currentJob.setMapperClass(UrlMapper.class);
        currentJob.setMapOutputKeyClass(String.class);
        currentJob.setMapOutputValueClass(WebPage.class);
        currentJob.setOutputFormatClass(GoraOutputFormat.class);
    
        DataStore<String, WebPage> store = StorageUtils.createWebStore(
            currentJob.getConfiguration(), String.class, WebPage.class);
        GoraOutputFormat.setOutput(currentJob, store, true);
    
        currentJob.setReducerClass(Reducer.class);
        currentJob.setNumReduceTasks(0);
    
        currentJob.waitForCompletion(true);
        ToolUtil.recordJobStatus(null, currentJob, results);
        return results;
    }
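
    StorageUtils.createWebStore is what ends up creating the sourcetest_webpage table through the Gora HBase store. A hedged sketch of how the table name is presumably derived, assuming the standard property names storage.crawl.id (set by -crawlId) and storage.schema.webpage (default "webpage") from nutch-default.xml; the real lookup lives in Nutch's storage layer:

    import org.apache.hadoop.conf.Configuration;

    public class SchemaNameSketch {
        // "sourcetest" + "_" + "webpage" -> "sourcetest_webpage", the table seen in section (I).
        static String webPageSchemaName(Configuration conf) {
            String schema = conf.get("storage.schema.webpage", "webpage");
            String crawlId = conf.get("storage.crawl.id", "");
            return crawlId.isEmpty() ? schema : crawlId + "_" + schema;
        }
    }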
    
    
      


    5. The mapper method
    Since the Injector Job has no reducer, only the mapper needs attention.
    The mapper does three things:
    (1) Parse each seed line and extract any tab-separated metadata parameters.
    (2) Normalize the URL and filter it with the configured URL filters.
    (3) Reverse the URL to form the key, create a WebPage object as the value, and write the pair to the table.
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String url = value.toString().trim(); // value is line of text
    
          if (url != null && (url.length() == 0 || url.startsWith("#"))) {
            /* Ignore line that start with # */
            return;
          }
    
          // if tabs : metadata that could be stored
          // must be name=value and separated by \t
          float customScore = -1f;
          int customInterval = interval;
          Map<String, String> metadata = new TreeMap<String, String>();
          if (url.indexOf("\t") != -1) {
            String[] splits = url.split("\t");
            url = splits[0];
            for (int s = 1; s < splits.length; s++) {
              // find separation between name and value
              int indexEquals = splits[s].indexOf("=");
              if (indexEquals == -1) {
                // skip anything without a =
                continue;
              }
              String metaname = splits[s].substring(0, indexEquals);
              String metavalue = splits[s].substring(indexEquals + 1);
              if (metaname.equals(nutchScoreMDName)) {
                try {
                  customScore = Float.parseFloat(metavalue);
                } catch (NumberFormatException nfe) {
                }
              } else if (metaname.equals(nutchFetchIntervalMDName)) {
                try {
                  customInterval = Integer.parseInt(metavalue);
                } catch (NumberFormatException nfe) {
                }
              } else
                metadata.put(metaname, metavalue);
            }
          }
          try {
            url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
            url = filters.filter(url); // filter the url
          } catch (Exception e) {
            LOG.warn("Skipping " + url + ":" + e);
            url = null;
          }
          if (url == null) {
            context.getCounter("injector", "urls_filtered").increment(1);
            return;
          } else { // if it passes
            String reversedUrl = TableUtil.reverseUrl(url); // collect it
            WebPage row = WebPage.newBuilder().build();
            row.setFetchTime(curTime);
            row.setFetchInterval(customInterval);
    
            // now add the metadata
            Iterator<String> keysIter = metadata.keySet().iterator();
            while (keysIter.hasNext()) {
              String keymd = keysIter.next();
              String valuemd = metadata.get(keymd);
              row.getMetadata().put(new Utf8(keymd),
                  ByteBuffer.wrap(valuemd.getBytes()));
            }
    
            if (customScore != -1)
              row.setScore(customScore);
            else
              row.setScore(scoreInjected);
    
            try {
              scfilters.injectedScore(url, row);
            } catch (ScoringFilterException e) {
              if (LOG.isWarnEnabled()) {
                LOG.warn("Cannot filter injected score for url " + url
                    + ", using default (" + e.getMessage() + ")");
              }
            }
            context.getCounter("injector", "urls_injected").increment(1);
            row.getMarkers()
                .put(DbUpdaterJob.DISTANCE, new Utf8(String.valueOf(0)));
            Mark.INJECT_MARK.putMark(row, YES_STRING);
            context.write(reversedUrl, row);
          }
        }
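
    As a usage example, a seed line may carry tab-separated name=value metadata after the URL. The snippet below shows how such a line is split, mirroring the loop above; nutch.score and nutch.fetchInterval are the usual values of nutchScoreMDName and nutchFetchIntervalMDName, and "category" is a made-up custom key that would land in the row's metadata map:

    public class SeedLineExample {
        public static void main(String[] args) {
            // Fields are separated by real tab characters in the seed file.
            String line = "http://money.163.com/\tnutch.score=2.5\tnutch.fetchInterval=86400\tcategory=finance";
            String[] splits = line.split("\t");
            System.out.println("url = " + splits[0]);
            for (int s = 1; s < splits.length; s++) {
                int eq = splits[s].indexOf('=');
                if (eq == -1) continue; // skip anything without '=', as the mapper does
                System.out.println(splits[s].substring(0, eq) + " -> " + splits[s].substring(eq + 1));
            }
        }
    }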



    (III) Key source code to study


    Original article: https://www.cnblogs.com/eaglegeek/p/4557808.html