zoukankan      html  css  js  c++  java
  • PHP代码-数据爬取(a标签和a标签所对应的内容)

    public function export(){
            set_time_limit(1000);
    // header("Content-type: text/html; charset=utf-8");
            $a = file_get_contents('http://chuangye.yjbys.com/zhengce/');
            $reg = '/</span><a href="(.*)" (.*)>(.*)</isU';
            $result = preg_match_all($reg,$a,$match_result);
            $arr = array();
            foreach($match_result[1] as $k=>$v){
                $tnum = strlen($match_result[3][$k]);
                if(substr($v,0,1) == 'h' && $tnum>21){
                    $arr[$k]['art_url'] = $v;
                    $arr[$k]['art_title'] = mb_convert_encoding($match_result[3][$k], "UTF-8",'gbk');
    //                    $match_result[3][$k];
                        mb_convert_encoding($match_result[3][$k], "UTF-8",'gbk');
                    $b = file_get_contents($v);
                    preg_match('/<div class="content">(.*)</div>/s',$b,$match);
                    $match[0] = iconv("gbk", "utf-8", $match[0]);
                    $num =  strpos($match[0],'<script type="text/javascript">a("content_body");</script>');
                    $cont = substr($match[0],0,$num)."</div>";
                    $cony = str_replace('<div class="ad_top_left"><script type="text/javascript">a("content_1");</script></div>',"",$cont);
                    $cont = str_replace('<div class="ad_top_left2"><script type="text/javascript">a("content_2");</script></div>',"",$cony);
    //                $cont = str_replace('“','“',$cont);
    //                $cont = str_replace('”','”',$cont);
    //                $cont = str_replace('…','~',$cont);
    //                $cont = str_replace('—','-',$cont);
    //                $cont = str_replace('"','“',$cont);
    //                $cont = str_replace('•','•',$cont);
                    $arr[$k]['art_content'] = html_entity_decode($cont);
                    $arr[$k]['state'] = 0;
                    $arr[$k]['type'] = 4;
                    $arr[$k]['userid'] = 4;
                }
            }
            $arr = array_values($arr);
    //        print_r($arr);die;
    //        $arr2=array_iconv("gbk","utf-8",$arr);
    //        print_r($arr);die;
            $article = M('cxpt_user_article');
            var_dump($article->addAll($arr));echo $article->getLastSql();die;
    //        foreach($arr as &$v){
    //            $b = file_get_contents($v['url']);
    //            preg_match('/<div class="content">(.*)</div>/s',$b,$match);
    //            $num =  strpos($match[0],'<script type="text/javascript">a("content_body");</script>');
    //            $v['content'] = substr($match[0],0,$num); 
    //        }
    //        foreach($arr as $v){
    //            $info['art_title'] = $v['title'];
    //            $info['art_content'] = $v['content'];
    //        }
    //        print_r($arr);die;
            
        }
    

      

  • 相关阅读:
    linux下详解shell中>/dev/null 2>&1
    关于使用sublime的一些报错异常退出的解决方法
    Linux下如何挂载文件,并设置开机自动挂载
    关于/var/log/maillog 时间和系统时间不对应的问题 -- 我出现的是日志时间比系统时间慢12个小时
    如何在含有json类型的字段上建立多列索引
    文件大小
    SVN
    索引
    MD5验证
    协议适配器错误的问题
  • 原文地址:https://www.cnblogs.com/renshi/p/6491418.html
Copyright © 2011-2022 走看看