  • Android (Java): Simulating a Zhihu Login and Scraping User Information

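    Not long ago I read an article titled "I used a crawler to 'steal' one million Zhihu users in one day, just to prove that PHP is the best language in the world". That article logs in by copying the cookie straight into the code. My goal here is not to scrape data; I only want to walk through the basic process of simulating a login in Java. An earlier article of mine, "Android project in practice: building the one-tap timetable import of Super Timetable (超级课程表)", also falls under simulated login, and I have recently seen many people on Zhihu asking how Super Timetable is implemented. In essence it is simulated login, so once you have worked through this article you should no longer have trouble fetching that kind of data. This article also builds on two earlier ones: "Automated cookie management with OkHttp on Android" for keeping cookies, and "A complete guide to the Jsoup library" for parsing. To keep things simple I use plain Java SE rather than Android. If you port the code to Android, the only change needed is probably to move the network requests onto a worker thread.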
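
    On Android, blocking network calls must not run on the main thread. The following is a minimal sketch of that hand-off using a plain `ExecutorService`. `BackgroundLogin` and `blockingLogin` are hypothetical names standing in for the real login request; they are not part of the original project.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BackgroundLogin {
    // Hypothetical stand-in for a blocking network call such as the login request.
    static String blockingLogin(String email) {
        return "logged in as " + email;
    }

    // Runs the blocking call on a worker thread and returns its result.
    static String runOffMainThread(final String email) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = executor.submit(new Callable<String>() {
                @Override
                public String call() {
                    return blockingLogin(email);
                }
            });
            // On Android you would post the result back to the UI thread
            // instead of blocking on get() as this sketch does.
            return future.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown();
        }
    }
}
```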

    First, open the Zhihu home page in Chrome and click Log in. You will see the following screen.
    [Screenshot: the Zhihu login form]

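    In Chrome, press F12 to open the developer tools, switch to the Network tab, and check Preserve log. Do not skip this: without it the request list is cleared when the page navigates after login, and you will not see the login request.

    [Screenshot: the Network tab with Preserve log checked]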

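    Once everything is ready, enter your account and password in the form and click Log in. After a successful login you will see a record like this:

    [Screenshot: the login request in the Network tab]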

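    Click the email entry shown in the screenshot. At the bottom you will see that the request submitted 4 parameters, and at the top you will see that the request URL is http://www.zhihu.com/login/email

    [Screenshots: the request URL and the four submitted form parameters]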

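    You may be surprised to find that Zhihu sends the password in plaintext. The parameters are easy to understand: email is the account, password is the password, and remember_me says whether to remember the login (just pass true). There is one more parameter, _xsrf. Judging by its name it is an anti-CSRF (cross-site request forgery) token, which also gets in the way of naive crawlers.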

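    So before submitting, we have to extract this value from the page source. It sits in a hidden field of the form.

    [Screenshot: the hidden _xsrf field in the page source]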
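
    To make the extraction step concrete, here is a dependency-free sketch that pulls the token out with a regular expression. The article itself uses Jsoup for this later on; the regex is only a substitute to keep the example self-contained. It assumes the hidden input is named `_xsrf` (inferred from the parameter name) and that its attributes appear in the order type, name, value. `XsrfExtractor` is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XsrfExtractor {
    // Assumes a hidden input named _xsrf with attributes in the order type, name, value.
    private static final Pattern XSRF = Pattern.compile(
            "<input[^>]*type=\"hidden\"[^>]*name=\"_xsrf\"[^>]*value=\"([^\"]*)\"");

    // Returns the token, or null when the page has no matching hidden field.
    static String extract(String html) {
        Matcher m = XSRF.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

    For real pages an HTML parser such as Jsoup is the more robust choice, since attribute order and quoting can vary.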

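    With everything in place, you happily simulate the login in code, only to get back a "wrong CAPTCHA" error. In fact we also have to submit a CAPTCHA. Its parameter name is captcha, and the CAPTCHA image is served at:

    http://www.zhihu.com/captcha.gif?r=<timestamp>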
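
    As a small illustration, the CAPTCHA URL can be assembled like this. The helper name `CaptchaUrl` is hypothetical, and using a millisecond timestamp is an assumption taken from the `System.currentTimeMillis()` call in the article's own code; the `r` parameter presumably acts as a cache-buster.

```java
class CaptchaUrl {
    private static final String BASE = "http://www.zhihu.com/captcha.gif?r=";

    // Appends the given timestamp (milliseconds since the epoch) to the base URL.
    static String at(long epochMillis) {
        return BASE + epochMillis;
    }

    // Convenience variant that uses the current time.
    static String now() {
        return at(System.currentTimeMillis());
    }
}
```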

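    That gives us the following request description.

    • Request URL
    http://www.zhihu.com/login/email
    • Request parameters
    _xsrf: the value of the hidden field extracted from the form
    captcha: the CAPTCHA text
    email: the email address
    password: the password
    remember_me: whether to remember the login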
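
    The request above is an ordinary form post. As a sketch of what ends up on the wire, the following builds the `application/x-www-form-urlencoded` body from the five parameters using only the JDK. `LoginForm` is a hypothetical helper; the article's actual code lets OkHttp build the body.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class LoginForm {
    // Builds the URL-encoded form body from the five login parameters.
    static String encode(String xsrf, String captcha, String email, String password) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("_xsrf", xsrf);
        params.put("captcha", captcha);
        params.put("email", email);
        params.put("password", password);
        params.put("remember_me", "true");

        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> entry : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(entry.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
        return body.toString();
    }
}
```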

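    One question remains: how do we obtain the CAPTCHA value? The answer is manual input. Save the CAPTCHA image locally, read it by eye, type it in, and then log in.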
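
    The manual step can be as simple as reading one line from the console after opening the saved image. A minimal sketch, with `CaptchaPrompt` as a hypothetical helper name:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

class CaptchaPrompt {
    // Reads one line typed by the user and strips surrounding whitespace.
    static String read(Reader source) {
        try {
            String line = new BufferedReader(source).readLine();
            return line == null ? "" : line.trim();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

    In a console program you would call it as `CaptchaPrompt.read(new java.io.InputStreamReader(System.in))` once the image has been written to disk.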

    Network requests use OkHttp and HTML parsing uses Jsoup; we will also need Gson. Add all three as Maven dependencies:

        <dependencies>
            <dependency>
                <groupId>com.squareup.okhttp</groupId>
                <artifactId>okhttp</artifactId>
                <version>2.4.0</version>
            </dependency>
            <dependency>
                <groupId>org.jsoup</groupId>
                <artifactId>jsoup</artifactId>
                <version>1.8.3</version>
            </dependency>
            <dependency>
                <groupId>com.google.code.gson</groupId>
                <artifactId>gson</artifactId>
                <version>2.3.1</version>
            </dependency>
        </dependencies>

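    Before writing code, we need to think about how to keep the login state, in other words how to keep the cookies. We only want to log in once and collect data directly on later runs, so the cookies have to be persisted. I adapted an Android class from the earlier article so that it works on plain Java. As you can see, instead of saving to SharedPreferences it now saves to a file in JSON form, which is why Gson is needed.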

    package cn.edu.zafu.zhihu;
    
    
    
    import com.google.gson.Gson;
    import com.google.gson.GsonBuilder;
    import com.google.gson.reflect.TypeToken;
    
    import java.io.*;
    import java.net.CookieStore;
    import java.net.HttpCookie;
    import java.net.URI;
    import java.net.URISyntaxException;
    import java.util.*;
    import java.util.concurrent.ConcurrentHashMap;
    
    /**
     * User:lizhangqu(513163535@qq.com)
     * Date:2015-07-18
     * Time: 16:54
     */
    public class PersistentCookieStore implements CookieStore {
        private static final Gson gson= new GsonBuilder().setPrettyPrinting().create();
        private static final String LOG_TAG = "PersistentCookieStore";
        private static final String COOKIE_PREFS = "CookiePrefsFile";
        private static final String COOKIE_NAME_PREFIX = "cookie_";
    
        private final HashMap<String, ConcurrentHashMap<String, HttpCookie>> cookies;
        private  Map<String,String> cookiePrefs=new HashMap<String, String>();
    
        /**
         * Construct a persistent cookie store.
         *
         */
        public PersistentCookieStore() {
            String cookieJson = readFile("cookie.json");
            Map<String,String> fromJson = gson.fromJson(cookieJson,new TypeToken<Map<String, String>>() {}.getType());  
            if(fromJson!=null){
                System.out.println(fromJson);
                cookiePrefs=fromJson;
            }
    
    
            cookies = new HashMap<String, ConcurrentHashMap<String, HttpCookie>>();
    
            // Load any previously stored cookies into the store
    
            for(Map.Entry<String, ?> entry : cookiePrefs.entrySet()) {
                if (((String)entry.getValue()) != null && !((String)entry.getValue()).startsWith(COOKIE_NAME_PREFIX)) {
                    String[] cookieNames = split((String) entry.getValue(), ",");
                    for (String name : cookieNames) {
                        String encodedCookie = cookiePrefs.get(COOKIE_NAME_PREFIX + name);
                        if (encodedCookie != null) {
                            HttpCookie decodedCookie = decodeCookie(encodedCookie);
                            if (decodedCookie != null) {
                                if(!cookies.containsKey(entry.getKey()))
                                    cookies.put(entry.getKey(), new ConcurrentHashMap<String, HttpCookie>());
                                cookies.get(entry.getKey()).put(name, decodedCookie);
                            }
                        }
                    }
    
                }
            }
        }
    
        public void add(URI uri, HttpCookie cookie) {
            String name = getCookieToken(uri, cookie);
            String host = uri.getHost();

            // Save cookie into local store, or remove if expired
            if (!cookie.hasExpired()) {
                if(!cookies.containsKey(host))
                    cookies.put(host, new ConcurrentHashMap<String, HttpCookie>());
                cookies.get(host).put(name, cookie);
                cookiePrefs.put(COOKIE_NAME_PREFIX + name, encodeCookie(new SerializableHttpCookie(cookie)));
            } else {
                // Look the host up with the same key that was used when storing
                if(cookies.containsKey(host))
                    cookies.get(host).remove(name);
                cookiePrefs.remove(COOKIE_NAME_PREFIX + name);
            }
            // Only update the name index when this host actually has an entry
            if(cookies.containsKey(host))
                cookiePrefs.put(host, join(",", cookies.get(host).keySet()));

            String json=gson.toJson(cookiePrefs);
            saveFile(json.getBytes(), "cookie.json");

        }
    
        protected String getCookieToken(URI uri, HttpCookie cookie) {
            return cookie.getName() + cookie.getDomain();
        }
    
        public List<HttpCookie> get(URI uri) {
            ArrayList<HttpCookie> ret = new ArrayList<HttpCookie>();
            if(cookies.containsKey(uri.getHost()))
                ret.addAll(cookies.get(uri.getHost()).values());
            return ret;
        }
    
        public boolean removeAll() {
            cookiePrefs.clear();
            cookies.clear();
            // Persist the empty state so stale cookies are not reloaded on the next start
            saveFile(gson.toJson(cookiePrefs).getBytes(), "cookie.json");
            return true;
        }


        public boolean remove(URI uri, HttpCookie cookie) {
            String name = getCookieToken(uri, cookie);

            if(cookies.containsKey(uri.getHost()) && cookies.get(uri.getHost()).containsKey(name)) {
                cookies.get(uri.getHost()).remove(name);
                if(cookiePrefs.containsKey(COOKIE_NAME_PREFIX + name)) {
                    cookiePrefs.remove(COOKIE_NAME_PREFIX + name);
                }
                cookiePrefs.put(uri.getHost(), join(",", cookies.get(uri.getHost()).keySet()));
                // Write the change through to cookie.json
                saveFile(gson.toJson(cookiePrefs).getBytes(), "cookie.json");

                return true;
            } else {
                return false;
            }
        }
    
        public List<HttpCookie> getCookies() {
            ArrayList<HttpCookie> ret = new ArrayList<HttpCookie>();
            for (String key : cookies.keySet())
                ret.addAll(cookies.get(key).values());
    
            return ret;
        }
    
        public List<URI> getURIs() {
            ArrayList<URI> ret = new ArrayList<URI>();
            for (String key : cookies.keySet())
                try {
                    ret.add(new URI(key));
                } catch (URISyntaxException e) {
                    e.printStackTrace();
                }
    
            return ret;
        }
    
        /**
         * Serializes Cookie object into String
         *
         * @param cookie cookie to be encoded, can be null
         * @return cookie encoded as String
         */
        protected String encodeCookie(SerializableHttpCookie cookie) {
            if (cookie == null)
                return null;
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            try {
                ObjectOutputStream outputStream = new ObjectOutputStream(os);
                outputStream.writeObject(cookie);
            } catch (IOException e) {
                System.out.println("IOException in encodeCookie"+ e);
                return null;
            }
    
            return byteArrayToHexString(os.toByteArray());
        }
    
        /**
         * Returns cookie decoded from cookie string
         *
         * @param cookieString string of cookie as returned from http request
         * @return decoded cookie or null if exception occurred
         */
        protected HttpCookie decodeCookie(String cookieString) {
            byte[] bytes = hexStringToByteArray(cookieString);
            ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
            HttpCookie cookie = null;
            try {
                ObjectInputStream objectInputStream = new ObjectInputStream(byteArrayInputStream);
                cookie = ((SerializableHttpCookie) objectInputStream.readObject()).getCookie();
            } catch (IOException e) {
                System.out.println("IOException in decodeCookie"+e);
            } catch (ClassNotFoundException e) {
                System.out.println("ClassNotFoundException in decodeCookie"+e);
            }
    
            return cookie;
        }
    
        /**
         * Using some super basic byte array &lt;-&gt; hex conversions so we don't have to rely on any
         * large Base64 libraries. Can be overridden if you like!
         *
         * @param bytes byte array to be converted
         * @return string containing hex values
         */
        protected String byteArrayToHexString(byte[] bytes) {
            StringBuilder sb = new StringBuilder(bytes.length * 2);
            for (byte element : bytes) {
                int v = element & 0xff;
                if (v < 16) {
                    sb.append('0');
                }
                sb.append(Integer.toHexString(v));
            }
            return sb.toString().toUpperCase(Locale.US);
        }
    
        /**
         * Converts hex values from strings to byte array
         *
         * @param hexString string of hex-encoded values
         * @return decoded byte array
         */
        protected byte[] hexStringToByteArray(String hexString) {
            int len = hexString.length();
            byte[] data = new byte[len / 2];
            for (int i = 0; i < len; i += 2) {
                data[i / 2] = (byte) ((Character.digit(hexString.charAt(i), 16) << 4) + Character.digit(hexString.charAt(i + 1), 16));
            }
            return data;
        }
        public static String join(CharSequence delimiter, Iterable tokens) {
            StringBuilder sb = new StringBuilder();
            boolean firstTime = true;
            for (Object token: tokens) {
                if (firstTime) {
                    firstTime = false;
                } else {
                    sb.append(delimiter);
                }
                sb.append(token);
            }
            return sb.toString();
        }
        public static String[] split(String text, String expression) {
            if (text.length() == 0) {
                return new String[]{};
            } else {
                return text.split(expression, -1);
            }
        }
    
        public static void saveFile(byte[] bfile, String fileName) {
            BufferedOutputStream bos = null;
            FileOutputStream fos = null;
            File file = null;
            try {
                file = new File(fileName);
                fos = new FileOutputStream(file);
                bos = new BufferedOutputStream(fos);
                bos.write(bfile);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (bos != null) {
                    try {
                        bos.close();
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                }
                if (fos != null) {
                    try {
                        fos.close();
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                }
            }
        }
        public static String readFile(String fileName) {
            BufferedInputStream bis = null;
            FileInputStream fis = null;
            File file = null;
            try {
                file = new File(fileName);
                fis = new FileInputStream(file);
                bis = new BufferedInputStream(fis);
    
                int available = bis.available();
                byte[] bytes=new byte[available];
                bis.read(bytes);
                String str=new String(bytes);
                return str;
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (bis != null) {
                    try {
                        bis.close();
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                }
                if (fis != null) {
                    try {
                        fis.close();
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                }
            }
            return "";
        }
    }
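
    The class above depends on `SerializableHttpCookie`, which is not listed in this article. `java.net.HttpCookie` does not implement `Serializable`, so some wrapper is needed. The following is only a sketch of what such a wrapper could look like, not the author's original class; which fields get persisted is an assumption.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.HttpCookie;

class SerializableHttpCookie implements Serializable {
    private static final long serialVersionUID = 1L;

    // HttpCookie itself is not Serializable, so it is written field by field.
    private transient HttpCookie cookie;

    SerializableHttpCookie(HttpCookie cookie) {
        this.cookie = cookie;
    }

    HttpCookie getCookie() {
        return cookie;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.writeObject(cookie.getName());
        out.writeObject(cookie.getValue());
        out.writeObject(cookie.getDomain());
        out.writeObject(cookie.getPath());
        out.writeLong(cookie.getMaxAge());
        out.writeBoolean(cookie.getSecure());
        out.writeInt(cookie.getVersion());
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        String name = (String) in.readObject();
        String value = (String) in.readObject();
        // The creation time is not preserved, so the max-age countdown restarts on load.
        cookie = new HttpCookie(name, value);
        cookie.setDomain((String) in.readObject());
        cookie.setPath((String) in.readObject());
        cookie.setMaxAge(in.readLong());
        cookie.setSecure(in.readBoolean());
        cookie.setVersion(in.readInt());
    }
}
```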

    Next, create an OkHttp client and set its cookie handler to the class we just wrote.

    private static OkHttpClient client = new OkHttpClient();
    static {
        client.setCookieHandler(new CookieManager(new PersistentCookieStore(), CookiePolicy.ACCEPT_ALL));
    }
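
    To see why a cookie handler keeps the login state, here is a small sketch using only `java.net.CookieManager`. It stores one `Set-Cookie` response header and then asks which `Cookie` header would be attached to the next request for the same URI. `CookieRoundTrip` and the sample cookie are made up for illustration, and the sketch uses the JDK's in-memory store rather than the persistent one above.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CookieRoundTrip {
    // Feeds one Set-Cookie header into the manager, then returns the
    // Cookie header values it would send back to the same URI.
    static String replay(String setCookie) {
        // Passing null as the store makes CookieManager use its in-memory store.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        URI uri = URI.create("http://www.zhihu.com/");
        try {
            Map<String, List<String>> response = new HashMap<String, List<String>>();
            response.put("Set-Cookie", Arrays.asList(setCookie));
            manager.put(uri, response);

            Map<String, List<String>> request =
                    manager.get(uri, Collections.<String, List<String>>emptyMap());
            return String.valueOf(request.get("Cookie"));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

    With `PersistentCookieStore` plugged in instead of the in-memory store, the same mechanism survives a restart of the program.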

    Now we can fetch _xsrf and the CAPTCHA. The CAPTCHA image is saved as code.png in the project root directory.

    private static String xsrf;
    public static void getCode() throws IOException{
            Request request = new Request.Builder()
            .url("http://www.zhihu.com/")
            .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
            .build();
    
            Response response = client.newCall(request).execute();
            String result = response.body().string();
    
            Document parse = Jsoup.parse(result);
            System.out.println(parse + "");
            result = parse.select("input[type=hidden]").get(0).attr("value")
                    .trim();
            xsrf=result;
            System.out.println("_xsrf:" + result);
            String codeUrl = "http://www.zhihu.com/captcha.gif?r=";
            codeUrl += System.currentTimeMillis();
            System.out.println("codeUrl:" + codeUrl);
            Request getcode = new Request.Builder()
                    .url(codeUrl)
                    .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
                    .build();
    
            Response code = client.newCall(getcode).execute();
    
            byte[] bytes = code.body().bytes();
            saveCode(bytes, "code.png");
        }
        public static void saveCode(byte[] bfile, String fileName) {
            BufferedOutputStream bos = null;
            FileOutputStream fos = null;
            File file = null;
            try {
                file = new File(fileName);
                fos = new FileOutputStream(file);
                bos = new BufferedOutputStream(fos);
                bos.write(bfile);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (bos != null) {
                    try {
                        bos.close();
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                }
                if (fos != null) {
                    try {
                        fos.close();
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                }
            }
        }

    Then submit the extracted parameters together with the account and password to log in.

        public static void login(String randCode,String email,String password) throws IOException{
            RequestBody formBody = new FormEncodingBuilder()
            .add("_xsrf", xsrf)
            .add("captcha", randCode)
            .add("email", email)
            .add("password", password)
            .add("remember_me", "true")
            .build();
            Request login = new Request.Builder()
            .url("http://www.zhihu.com/login/email")
            .post(formBody)
            .addHeader("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
            .build();
    
    
            Response execute = client.newCall(login).execute();
            System.out.println(decode(execute.body().string()));
    
        }
        // Turns backslash-u escape sequences in the response into the characters they encode
        public static String decode(String unicodeStr) {
            if (unicodeStr == null) {
                return null;
            }
            StringBuffer retBuf = new StringBuffer();
            int maxLoop = unicodeStr.length();
            for (int i = 0; i < maxLoop; i++) {
                if (unicodeStr.charAt(i) == '\\') {
                    if ((i < maxLoop - 5)
                            && ((unicodeStr.charAt(i + 1) == 'u') || (unicodeStr
                            .charAt(i + 1) == 'U')))
                        try {
                            retBuf.append((char) Integer.parseInt(
                                    unicodeStr.substring(i + 2, i + 6), 16));
                            i += 5;
                        } catch (NumberFormatException localNumberFormatException) {
                            retBuf.append(unicodeStr.charAt(i));
                        }
                    else
                        retBuf.append(unicodeStr.charAt(i));
                } else {
                    retBuf.append(unicodeStr.charAt(i));
                }
            }
            return retBuf.toString();
        }
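
    The login response escapes non-ASCII text as `\uXXXX` sequences, which is why it is decoded before printing. For comparison, here is a compact regex-based sketch of the same idea. `UnicodeUnescape` is a hypothetical name, and this is an alternative formulation rather than the article's own method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescape {
    // Matches a backslash, the letter u (either case) and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\[uU]([0-9a-fA-F]{4})");

    // Replaces every escape sequence with the character it encodes.
    static String decode(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against decoded backslashes or dollar signs.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```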

    When you see the following output, the login has succeeded.

    [Screenshot: console output after a successful login]

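    After that you can fetch the information you want. Here is a simple example: printing the nicknames from 轮子哥's (vczh) follow list. Note that the URL in the code requests the followees page (the accounts he follows), even though the method is named getFollowers. Pagination is left for you to handle.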

    public static void getFollowers() throws IOException{
            Request request = new Request.Builder()
            .url("http://www.zhihu.com/people/zord-vczh/followees")
            .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
            .build();
            Response response = client.newCall(request).execute();
    
            String result=response.body().string();
    
            Document parse = Jsoup.parse(result);
    
            Elements select = parse.select("div.zm-profile-card");
            StringBuilder builder=new StringBuilder();
            for (int i=0;i<select.size();i++){
                Element element = select.get(i);
                String name=element.select("h2").text();
                System.out.println(name+"");
                builder.append(name);
                builder.append("\n");
            }
        }

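    The screenshot below shows the fetched information. Once you are logged in, you can fetch any page that your account is allowed to see.
    [Screenshot: the fetched nicknames printed to the console]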

    Finally, the source code, an IntelliJ Maven project:
    http://download.csdn.net/detail/sbsujjbcy/8984375

  • Original article: https://www.cnblogs.com/llguanli/p/7400421.html