zoukankan      html  css  js  c++  java
  • PHP登入网站抓取并且抓取数据

    有时候需要登入网站,然后去抓取一些有用的信息,人工做的话,太累了。有的人可以很快的做到登入,但是需要在登入后再去访问其他页面始终都访问不了,因为他们没有带Cookie进去而被当做是两次会话。下面看看代码

    <?php  //test.php
    function getWebContent($host,$page="/",$paramstr="",$cookies='',$medth="POST",$port=80){
        $fp = fsockopen($host,$port);
        if(!$fp){
            return false;
        }
        $medth = strtoupper($medth);
        $medth = $medth=="POST" ? "POST":"GET";
        $length = strlen($paramstr);
        if($medth == "GET" && $paramstr){
            $page .= "?".$paramstr;
        }
        $out = "$medth $page  HTTP/1.1 ";
        $out .= "Accept: */* "; 
        $out .= "Host: www.exaple.com "; 
        $out .= "Content-Length: ".$length." ";
        $out .= "Content-Type: application/x-www-form-urlencoded ";
        if($cookies){
            $out .= "Cookie: ".$cookies." ";
        }
        $out .= "Connection: Keep-Alive ";
        if($medth=='POST' && $paramstr){
            $out .= $paramstr." ";
        }
        fwrite($fp, $out);
        $cookie = "";
        $content = "";
        while (!feof($fp)) {
            $str = fgets($fp);
            if(preg_match("/Set-Cookie:([^ ]*)/",$str,$matchs)){
                if($cookie){
                    $cookie .= ";".$matchs[1];
                }else{
                    $cookie = $matchs[1];
                }
            }
            $content .= $str;
            echo $str;
        }
        fclose($fp);
        return array('content'=>$content,'cookie'=>$cookie);
    }

    $params = "name=admin&pwd=admin";
    $rs = getWebContent("127.0.0.1","/test/login.php",$params,"","POST",8080);
    echo $rs['content'];
    $rs = getWebContent("127.0.0.1","/test/index.php","",$rs['cookie'],"POST",8080);
    //这里传入上次cookie是关键,否则会被当成两次会话
    echo $rs['content'];
    ?>

    <?php //login.php
        $name = $_REQUEST['name'];
        $pwd = $_REQUEST['pwd'];
        if($name == "admin" && $pwd == "admin"){
            setcookie("cname",$name);
            echo "success";
        }else{
            echo "failed";   
        }
    ?>

    <?php //index.php
    if(isset($_COOKIE['cname']) && $_COOKIE['cname']){
        echo "<ul><li>1</li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li></ul>";
    }else{
        echo "please login first!";
    }
    ?>

    将上面三个文件分别保存,login.php和index.php放在root目录下的test目录下。然后test.php放在任意目录,然后去命令行运行php test.php,结果就能出来。

    还有一种更简单的方式,就是用curl,代码如下,可以用下面的代码替换test.php
    <?php
    $post_data = array (
        "name" => "admin",
        "pwd" => "admin",
    );
    $cookie_jar = tempnam('./', 'cookie');//新建cookie文件
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://localhost:8080/test/login.php");
    //设定返回的数据是否自动显示
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // 我们在POST数据哦!
    curl_setopt($ch, CURLOPT_POST, 1);
    // 把post的变量加上
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
    //把返回来的cookie信息保存在$cookie_jar文件中
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_jar);
    echo curl_exec($ch);
    curl_close($ch);

    $ch2 = curl_init();
    curl_setopt($ch2, CURLOPT_URL, "http://localhost:8080/test/index.php");
    curl_setopt($ch2, CURLOPT_HEADER, false);
    curl_setopt($ch2, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch2, CURLOPT_COOKIEFILE, $cookie_jar);
    echo curl_exec($ch2);
    unlink($cookie_jar);
    curl_close($ch2);
    ?>

  • 相关阅读:
    C++ 引用与常量
    树的三种存储方法
    二叉树 基本运算
    Win10阻止电脑关机时弹出正在关闭应用的方法及恢复
    使用spring+quartz配置多个定时任务(包括项目启动先执行一次,之后再间隔时间执行)
    MySQL为某数据库中所有的表添加相同的字段
    Mysql定时任务详情
    MYSQL存储过程即常用逻辑
    后台发送请求,HttpClient的post,get各种请求,带header的请求
    java后台发送请求获取数据,并解析json数据
  • 原文地址:https://www.cnblogs.com/grimm/p/5048993.html
Copyright © 2011-2022 走看看