zoukankan      html  css  js  c++  java
  • PHP中文分词

    最常见的词语二分法:

    $str = '这是我的网站www.7di.net!';  
    //$str = iconv('GB2312','UTF-8',$str);  
    $result = spStr($str);  
    print_r($result);  
      
    /** 
     * UTF-8版 中文二元分词 
     */  
    function spStr($str)  
    {  
        $cstr = array();  
      
        $search = array(",", "/", "\\", ".", ";", ":", "\"", "!", "~", "`", "^", "(", ")", "?", "-", "\t", "\n", "'", "<", ">", "\r", "\r\n", "{1}quot;", "&", "%", "#", "@", "+", "=", "{", "}", "[", "]", ":", ")", "(", ".", "。", ",", "!", ";", "“", "”", "‘", "’", "[", "]", "、", "—", " ", "《", "》", "-", "…", "【", "】",);  
      
        $str = str_replace($search, " ", $str);  
        preg_match_all("/[a-zA-Z]+/", $str, $estr);  
        preg_match_all("/[0-9]+/", $str, $nstr);  
      
        $str = preg_replace("/[0-9a-zA-Z]+/", " ", $str);  
        $str = preg_replace("/\s{2,}/", " ", $str);  
      
        $str = explode(" ", trim($str));  
      
        foreach ($str as $s) {  
            $l = strlen($s);  
      
            $bf = null;  
            for ($i= 0; $i< $l; $i=$i+3) {  
                $ns1 = $s{$i}.$s{$i+1}.$s{$i+2};  
                if (isset($s{$i+3})) {  
                    $ns2 = $s{$i+3}.$s{$i+4}.$s{$i+5};  
                    if (preg_match("/[\x80-\xff]{3}/",$ns2)) $cstr[] = $ns1.$ns2;  
                } else if ($i == 0) {  
                    $cstr[] = $ns1;  
                }  
            }  
        }  
      
        $estr = isset($estr[0])?$estr[0]:array();  
        $nstr = isset($nstr[0])?$nstr[0]:array();  
      
        return array_merge($nstr,$estr,$cstr);  
    } 
    

     執行結果是:

    Array ( [0] => 7 [1] => www [2] => di [3] => net [4] => 这是 [5] => 是我 [6] => 我的 [7] => 的网 [8] => 网站 ) 
    

     接下来,将以上结果转换为区位码,PHP代码是:

    foreach ($result as $s) {  
        $s = iconv('UTF-8','GB2312',$s);  
        $code[] = gbCode($s);  
    }  
    $code = implode(" ", $code);  
    echo $code;  
      
    function gbCode($str) {  
        $return = null;  
      
        if (!preg_match("/^[\x80-\xff]{2,}$/",$str)) return $str;  
      
        $len = strlen($str);  
        for ($i= 0; $i< $len; $i=$i+2) {  
            $return .= sprintf("%02d%02d",ord($str{$i})-160,ord($str{$i+1})-160);  
        }  
      
        return $return;  
    } 
    
  • 相关阅读:
    Git 处理tag和branch的命令
    手把手教您使用第三方登录
    iOS 中隐藏UITableView最后一条分隔线
    Android简易实战教程--第四十四话《ScrollView和HorizontalScrollView简单使用》
    iOS-改变UITextField的Placeholder颜色的三种方式
    react-native 关闭黄屏警告
    reactnative js onclick 模拟单击/双击事件
    reactnative 监听屏幕方向变化
    reactnative0.61.2 使用react-native-webrtc
    use react-navigation@2.18.2
  • 原文地址:https://www.cnblogs.com/see7di/p/2720668.html
Copyright © 2011-2022 走看看