  • Implementing a Content Scraper


    A scraper is normally run on your own machine. Putting it on a hosted web space is unwise: it uses a lot of resources, and the host also has to support the remote-fetch functions, such as file_get_contents($urls) and file($url).
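    Whether the remote-fetch functions work at all depends on the PHP configuration, so it can help to check before starting a run. The following is only a minimal sketch: remoteFetchAllowed() and fetchPage() are hypothetical helper names that do not appear in the original post, and the 10-second timeout is an arbitrary assumption.

```php
<?php
// Sketch only: remoteFetchAllowed() and fetchPage() are hypothetical helpers.

// file() and file_get_contents() can only open http:// URLs when the
// allow_url_fopen ini setting is enabled.
function remoteFetchAllowed(): bool
{
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

// Returns the page body, or null when remote fetching is disabled or fails.
function fetchPage(string $url): ?string
{
    if (!remoteFetchAllowed()) {
        return null;
    }
    // Assumed timeout so that one slow page does not stall the whole run.
    $context = stream_context_create(['http' => ['timeout' => 10]]);
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}
```

    If remoteFetchAllowed() returns false on a given host, the scraper cannot rely on file() or file_get_contents() there.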
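    1. Switching between article list pages automatically, and obtaining the article paths.
    2. Extracting the title and the content.
    3. Storing the result in the database.
    4. Problems.
    1. Switching between article list pages automatically, and obtaining the article paths.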

    a. Automatic switching between list pages usually relies on dynamic pages whose URL carries a page number. For example:
    http://www.im286.com/foru ... d=1&page=$i
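    The pages can then be walked by incrementing $i automatically or by giving it a range, for example $i++;
    Or, as in the demo by penzi, you can go from one given page to another: the code only has to control the range of $i.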
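    The idea of controlling the range of $i can be sketched roughly as follows. This is an illustration under assumptions, not the original author's code: listPageUrls() is a hypothetical helper, and the example.com URL template is a placeholder, since the real list URL is abbreviated in the post.

```php
<?php
// Sketch only: listPageUrls() and the URL template are hypothetical.

// Build the list-page URLs for pages $firstPage..$lastPage (inclusive).
// $template is a sprintf() pattern in which %d stands for the page number.
function listPageUrls(string $template, int $firstPage, int $lastPage): array
{
    $urls = [];
    for ($i = $firstPage; $i <= $lastPage; $i++) {
        $urls[] = sprintf($template, $i);
    }
    return $urls;
}

// Usage: walk pages 1 to 3 of a placeholder forum listing.
foreach (listPageUrls('http://example.com/forumdisplay.php?fid=1&page=%d', 1, 3) as $url) {
    echo $url, "\n";
}
```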

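    b. There are two ways to obtain the article paths: with a regular expression supplied by the user, and without one.
    1) Without a regular expression, you simply collect every link on the article list page above.
      It is still best to filter and process those links: detect duplicates and keep only one copy, and turn relative paths such as ../ and ./ into absolute paths.
    Below is my own rather messy function for this: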

    <?php
    // 2004-11-22 clinch
    // $e = clinchgeturl("http://im286.com/forumdisplay.php?fid=1");
    // var_dump($e);
    function clinchgeturl($url)
    {
        // $url = "http://127.0.0.1/1.htm";
        // $rootpath = "http://fsrootpathfsfsf/yyyyyy/";
        // var_dump($rrr);

        // Work out the directory of the page ($rootpath) from its URL.
        if (eregi('(.)*[\.](.)*', $url)) {
            // Splitting on "/" yields e.g.: "http:", "", "127.0.0.1", "xnml", "index.php"
            $roopath = split("\/", $url);
            $rootpath = "http://" . $roopath[2] . "/";
            $nnn = count($roopath) - 1;
            for ($yu = 3; $yu < $nnn; $yu++) {
                $rootpath .= $roopath[$yu] . "/";
            }
            // var_dump($rootpath);
        } else {
            $rootpath = $url;
            // var_dump($rootpath);
        }

        if (isset($url)) {
            echo "$url contains the following links:<br>";
            $fcontents = file($url);
            while (list(, $line) = each($fcontents)) {
                while (eregi('(href[[:space:]]*=[[:space:]]*"?[[:alnum:]:@/._-]+[\?]?[^\"]*"?)', $line, $regs)) {
                    // $regs[1] = eregi_replace('(href[[:space:]]*=[[:space:]]*\"?)([[:alnum:]:@/._-]+)(\"?)', "\\2", $regs[1]);
                    // Strip the href="..." wrapper and keep only the link target.
                    $regs[1] = eregi_replace('(href[[:space:]]*=[[:space:]]*[\"]?)([[:alnum:]:@/._-]+[\?]?[^\"]*)(\.*)[^\"\/]*([\"]?)', "\\2", $regs[1]);

                    // Relative link: it has to be made absolute.
                    if (!eregi('^http:\/\/', $regs[1])) {

                        if (eregi('^\.\.', $regs[1])) {
                            // $roopath = eregi_replace('(http:\/\/)?([[:alnum:]:@/._-]+)[[:alnum:]+](\.*)[[:alnum:]+]', "http:\/\/\\2", $url);

                            // Link starts with "..": rebuild the root path one level up.
                            $roopath = split("\/", $rootpath);
                            $rootpath = "http://" . $roopath[2] . "/";
                            // echo "this is the root dir: " . "\n";
                            $nnn = count($roopath) - 1;
                            for ($yu = 3; $yu < $nnn; $yu++) {
                                $rootpath .= $roopath[$yu] . "/";
                            }
                            // var_dump($rootpath);
                            if (eregi('^\.\.[\/[:alnum:]]', $regs[1])) {
                                // echo "this is ../directory/ : " . "\n";
                                // $regs[1] = "../xx/xxxxxx.xx";
                                // $rr = split("\/", $regs[1]);
                                // for ($oooi = 1; $oooi < count($rr); $oooi++)
                                $rrr = $regs[1];
                                // { $rrr .= "/" . $rr[$oooi];
                                $rrr = eregi_replace("^[\.][\.][\/]", '', $rrr);
                                // ... (the listing breaks off here in the source)
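    The listing above is built on eregi(), split() and each(). The ereg functions and split() were removed in PHP 7.0, and each() in PHP 8.0, so it no longer runs on current PHP. As a rough sketch of the same idea (collect every link, drop duplicates, turn ../ and ./ into absolute paths), something like the following could be used. It is not a reconstruction of the missing part of the original function: resolveUrl() and extractLinks() are hypothetical names, the base URL is assumed to be a well-formed http URL, and corner cases such as protocol-relative links (//host/path) are not handled.

```php
<?php
// Sketch only: resolveUrl() and extractLinks() are hypothetical helpers.

// Turn $href, as found on the page $base, into an absolute URL.
function resolveUrl(string $base, string $href): string
{
    // Already absolute (has a scheme): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }
    $parts = parse_url($base); // assumes $base is a well-formed absolute URL
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath = $parts['path'] ?? '/';

    if ($href !== '' && $href[0] === '/') {
        $path = $href;                 // root-relative link
    } elseif ($href !== '' && $href[0] === '?') {
        $path = $basePath . $href;     // same page, different query string
    } else {
        // Drop the file name of the base page and keep its directory.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }

    // Keep the query string or fragment out of the segment handling.
    $suffix = '';
    if (preg_match('/[?#]/', $path, $m, PREG_OFFSET_CAPTURE)) {
        $suffix = substr($path, $m[0][1]);
        $path = substr($path, 0, $m[0][1]);
    }

    // Resolve "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }
    $trailing = ($out && substr($path, -1) === '/') ? '/' : '';
    return $root . '/' . implode('/', $out) . $trailing . $suffix;
}

// Return every distinct link in $html as an absolute URL.
function extractLinks(string $html, string $baseUrl): array
{
    preg_match_all('/href\s*=\s*["\']?([^"\'\s>]+)/i', $html, $m);
    $seen = [];
    foreach ($m[1] as $href) {
        // Skip in-page anchors and non-http pseudo links.
        if (preg_match('/^(#|javascript:|mailto:)/i', $href)) {
            continue;
        }
        $absolute = resolveUrl($baseUrl, html_entity_decode($href));
        $seen[$absolute] = true; // array keys remove duplicates
    }
    return array_keys($seen);
}
```

    Using the array keys for de-duplication keeps the links in the order they first appear on the page, which is convenient when the list page is sorted by date.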

  • Original article: https://www.cnblogs.com/ooooo/p/2253785.html