  • Analyzing Internet User Behavior in the Context of Big Data

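    Requirement:

    Write a program that captures everything a user does during one online session. For example, the user browses a Baidu page for 3 minutes, then Sina for 6 minutes, then other pages after that; the program should capture all of this browsing behavior.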
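    As a rough illustration of what "3 minutes on Baidu, then 6 minutes on Sina" means in data terms, the sketch below sums the time spent per site from a list of timestamped events. The function name dwell_times, the event format, and the sample session are assumptions made for this example; how the events get captured is a separate problem.

```python
from collections import defaultdict


def dwell_times(events, session_end):
    """Sum the seconds spent per site.

    events: list of (timestamp_in_seconds, site) tuples sorted by time.
    Each visit lasts until the next event starts; the last visit lasts
    until session_end.
    """
    totals = defaultdict(float)
    following = events[1:] + [(session_end, None)]
    for (start, site), (next_start, _) in zip(events, following):
        totals[site] += next_start - start
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical session: Baidu for 3 minutes, then Sina for 6 minutes.
    session = [(0, "baidu.com"), (180, "sina.com.cn")]
    print(dwell_times(session, session_end=540))
```

    Any capture method that yields a time and a site name per page change could feed this kind of aggregation.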

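    Preparation:
    Packages that may be needed:
    pynput.mouse: contains classes for controlling and monitoring a mouse or trackpad.
    
    pynput.keyboard: contains classes for controlling and monitoring the keyboard.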
    

      

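    A mouse listener is a thread (threading.Thread), and all callbacks are invoked from that thread.

    To stop the listener, call pynput.mouse.Listener.stop from anywhere, raise pynput.mouse.Listener.StopException from a callback, or return False from a callback.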
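    The "return False to stop" rule can be sketched with a small recorder that keeps timestamped clicks and stops listening after a fixed number of events. The class name ClickRecorder and the max_events parameter are hypothetical; only the on_click signature and mouse.Listener come from pynput.

```python
import time


class ClickRecorder:
    """Collects click events and asks the listener to stop after max_events."""

    def __init__(self, max_events):
        self.max_events = max_events
        self.events = []

    def on_click(self, x, y, button, pressed):
        # Same parameters that pynput passes to an on_click callback.
        self.events.append((time.time(), x, y, str(button), pressed))
        if len(self.events) >= self.max_events:
            return False  # returning False stops the listener
        return None


if __name__ == "__main__":
    # Imported here so the class itself can be used without pynput installed.
    from pynput import mouse

    recorder = ClickRecorder(max_events=4)
    with mouse.Listener(on_click=recorder.on_click) as listener:
        listener.join()
    for event in recorder.events:
        print(event)
```

    Because the callback runs in the listener thread, it only appends to a list here; heavier processing is probably better done after join() returns.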

    Controlling the mouse:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    '''
    Administrator
    2018/8/16
    '''

    from pynput.mouse import Button, Controller
    import time

    mouse = Controller()
    print(mouse.position)
    time.sleep(3)
    print('The current pointer position is {0}'.format(mouse.position))

    # Set the pointer position
    mouse.position = (277, 645)
    print('Now we have moved it to {0}'.format(mouse.position))

    # Move the pointer by (x, y) relative to its current position
    mouse.move(5, -5)
    print(mouse.position)

    # Press and release the left button
    mouse.press(Button.left)
    mouse.release(Button.left)

    # Double click: the second argument is the number of clicks
    mouse.click(Button.left, 2)

    # Scroll two steps vertically; the size of a step is platform-dependent
    mouse.scroll(0, 2)

    Monitoring the mouse:

    from pynput import mouse

    def on_move(x, y):
        print('Pointer moved to {0}'.format(
            (x, y)))

    def on_click(x, y, button, pressed):
        print('{0} at {1}'.format(
            'Pressed' if pressed else 'Released',
            (x, y)))
        if not pressed:
            # Stop the listener once the button is released
            return False

    def on_scroll(x, y, dx, dy):
        print('Scrolled {0} at {1}'.format(
            'down' if dy < 0 else 'up',
            (x, y)))

    # Collect events until a mouse button is released
    with mouse.Listener(
            on_move=on_move,
            on_click=on_click,
            on_scroll=on_scroll) as listener:
        listener.join()

    Handling mouse listener errors

    from pynput import mouse

    class MyException(Exception): pass

    def on_click(x, y, button, pressed):
        if button == mouse.Button.left:
            raise MyException(button)

    # An exception raised in a callback stops the listener
    # and is re-raised in the thread that calls join()
    with mouse.Listener(
            on_click=on_click) as listener:
        try:
            listener.join()
        except MyException as e:
            print('{0} was clicked'.format(e.args[0]))

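    A keyboard listener is a thread (threading.Thread), and all callbacks are invoked from that thread.

    To stop the listener, call pynput.keyboard.Listener.stop from anywhere, raise pynput.keyboard.Listener.StopException from a callback, or return False from a callback.

    The key parameter passed to callbacks is a pynput.keyboard.Key for special keys, a pynput.keyboard.KeyCode for normal alphanumeric keys, or None for unknown keys.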
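    Since a callback can receive three kinds of values (Key, KeyCode, or None), a logger usually needs one place that turns them into text. The helper below is a minimal sketch of that idea; key_label is a hypothetical name, and it relies only on KeyCode having a char attribute and on Key being an enum whose members have a name.

```python
def key_label(key):
    """Return a printable label for whatever a keyboard callback receives."""
    if key is None:
        return "<unknown>"
    # KeyCode objects carry the typed character in .char; Key members do not.
    char = getattr(key, "char", None)
    if char is not None:
        return char
    # Key is an enum, so special keys expose their name (for example "esc").
    name = getattr(key, "name", None)
    return "<{0}>".format(name) if name else str(key)
```

    Inside an on_press callback this could be used as print(key_label(key)), which avoids the try/except AttributeError pattern.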

    Controlling the keyboard

    from pynput.keyboard import Key, Controller

    keyboard = Controller()

    # Press and release space
    keyboard.press(Key.space)
    keyboard.release(Key.space)

    # Type a lower case A; this will work even if no key on the
    # physical keyboard is labelled 'A'
    keyboard.press('a')
    keyboard.release('a')

    # Type two upper case As
    keyboard.press('A')
    keyboard.release('A')
    with keyboard.pressed(Key.shift):
        keyboard.press('a')
        keyboard.release('a')

    # Type 'Hello World' using the shortcut type method
    keyboard.type('Hello World')

    Monitoring the keyboard

    from pynput import keyboard

    def on_press(key):
        try:
            print('alphanumeric key {0} pressed'.format(
                key.char))
        except AttributeError:
            # Special keys (Key members) have no .char attribute
            print('special key {0} pressed'.format(
                key))

    def on_release(key):
        print('{0} released'.format(
            key))
        if key == keyboard.Key.esc:
            # Stop the listener when Esc is released
            return False

    # Collect events until Esc is released
    with keyboard.Listener(
            on_press=on_press,
            on_release=on_release) as listener:
        listener.join()

    Handling keyboard listener errors

    from pynput import keyboard

    class MyException(Exception): pass

    def on_press(key):
        if key == keyboard.Key.esc:
            raise MyException(key)

    # An exception raised in a callback stops the listener
    # and is re-raised in the thread that calls join()
    with keyboard.Listener(
            on_press=on_press) as listener:
        try:
            listener.join()
        except MyException as e:
            print('{0} was pressed'.format(e.args[0]))

    Viewing the browser history with Python

    # coding: utf8
    '''
    Created on 2018-08-16

    @author: Administrator
    '''
    # Count the sites recorded in the browser history.
    # Open se://version/ in the browser to find where its profile files are stored.

    import os
    import sqlite3
    import operator
    from collections import OrderedDict
    import matplotlib.pyplot as plt

    def parse(url):
        # Return the host part of a URL without "www.", or None if the URL has no "//"
        try:
            parsed_url_components = url.split('//')
            sublevel_split = parsed_url_components[1].split('/', 1)
            domain = sublevel_split[0].replace("www.", "")
            return domain
        except IndexError:
            print('URL format error!')

    def analyze(results):
        prompt = input("[.] Type <c> to print or <p> to plot\n[>] ")

        if prompt == "c":
            # Write one "site<TAB>count" line per site
            with open('./history.txt', 'w') as f:
                for site, count in results.items():
                    f.write(site + '\t' + str(count) + '\n')
        elif prompt == "p":
            key = []
            value = []
            for k, v in results.items():
                key.append(k)
                value.append(v)
            # Plot at most the 25 most visited sites
            n = min(25, len(value))
            X = range(n)
            Y = value[:n]
            plt.bar(X, Y, align='edge')
            plt.xticks(X, key[:n], rotation=45)
            for x, y in zip(X, Y):
                plt.text(x + 0.4, y + 0.05, str(y), ha='center', va='bottom')
            plt.show()
        else:
            print("[.] Uh?")
            quit()

    if __name__ == '__main__':
        # Path to the user's history database (360 browser profile directory)
        data_path = r"D:\360浏览器\360se6\User Data\Default"
        files = os.listdir(data_path)
        # r"D:\360浏览器\360se6\User Data\Default\History"
        history_db = os.path.join(data_path, 'History')
        print(history_db)

        # Query the database: the join yields one row per recorded visit
        c = sqlite3.connect(history_db)
        cursor = c.cursor()
        select_statement = "SELECT urls.url, urls.visit_count FROM urls, visits WHERE urls.id = visits.url;"
        cursor.execute(select_statement)

        results = cursor.fetchall()  # list of (url, visit_count) tuples

        sites_count = {}  # dict makes iterations easier :D

        for url, count in results:
            url = parse(url)
            if url is None:
                continue  # skip URLs that parse() could not split
            if url in sites_count:
                sites_count[url] += 1
            else:
                sites_count[url] = 1

        sites_count_sorted = OrderedDict(sorted(sites_count.items(), key=operator.itemgetter(1), reverse=True))

        analyze(sites_count_sorted)

    How it works: locate the SQLite database file in which the browser stores its history, read the data from it with code, and then process the data.
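    The principle can be sketched more compactly as follows. This is a minimal example and not the author's code: it assumes the same urls and visits tables queried in the listings here, works on a temporary copy of the file (on the assumption that a running browser may hold a lock on the original), and uses urllib.parse instead of manual string splitting. The function name count_visits is hypothetical.

```python
import os
import shutil
import sqlite3
import tempfile
from collections import Counter
from urllib.parse import urlsplit


def count_visits(history_path):
    """Return a Counter mapping hostname -> number of recorded visits."""
    # Work on a copy, since the browser may keep the original file locked.
    tmp_dir = tempfile.mkdtemp()
    tmp_path = os.path.join(tmp_dir, "History.copy")
    shutil.copy2(history_path, tmp_path)
    try:
        conn = sqlite3.connect(tmp_path)
        try:
            rows = conn.execute(
                "SELECT urls.url FROM urls JOIN visits ON urls.id = visits.url"
            ).fetchall()
        finally:
            conn.close()
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)

    counts = Counter()
    for (url,) in rows:
        host = urlsplit(url).hostname  # None when the URL has no host part
        if host:
            counts[host[4:] if host.startswith("www.") else host] += 1
    return counts
```

    Calling count_visits(path).most_common(25) would give the same kind of ranking that the plotting code consumes.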

    # -*- coding: utf-8 -*-


    # This version is ready to be built on.

    # Count the sites recorded in the browser history.
    # Open se://version/ in the browser to find where its profile files are stored.
    # matplotlib is the package used to display the data. It is widely used in
    # scientific computing and is worth learning properly, as is numpy.

    import os  # access to the operating system
    import sqlite3  # connects to the database file
    import operator
    from collections import OrderedDict
    import matplotlib.pyplot as plt
    import re

    def parse(url):
        try:
            parsed_url_components = url.split('//')
            sublevel_split = parsed_url_components[1].split('/', 1)
            domain = sublevel_split[0].replace("www.", "")
            return domain
        except IndexError:
            print('URL format error!')

    def filter_data(url):
        # Reduce a URL to "name.suffix" for the listed suffixes;
        # hosts with any other suffix are collected in yuming_count
        try:
            parsed_url_components = url.split('//')
            sublevel_split = parsed_url_components[1].split('/', 1)
            data = re.search(r'\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)', sublevel_split[0])
            if data:
                return data.group()
            else:
                yuming_count.add(sublevel_split[0])
                return "ok"
        except IndexError:
            print('URL format error!')

    def analyze(results):
        prompt = input("[.] Type <c> to print or <p> to plot\n[>] ")

        if prompt == "c":
            with open('./history.txt', 'w') as f:
                for site, count in results.items():
                    f.write(site + '\t' + str(count) + '\n')
        elif prompt == "p":
            key = []
            value = []
            for k, v in results.items():
                key.append(k)
                value.append(v)
            # Plot at most the 25 most visited sites
            n = min(25, len(value))
            X = range(n)
            Y = value[:n]
            plt.bar(X, Y, align='edge')
            plt.xticks(X, key[:n], rotation=45)
            for x, y in zip(X, Y):
                plt.text(x + 0.4, y + 0.05, str(y), ha='center', va='bottom')
            plt.show()
        else:
            print("[.] Uh?")
            quit()

    def analyze2(results):
        print("I can tell at a glance that you want a line chart")
        key = []
        value = []
        for k, v in results.items():
            key.append(k)
            value.append(v)
        n = 20
        X = key[:n]
        Y = value[:n]

        # Visit counts on the x axis, site names on the y axis
        plt.plot(Y, X, label="number count")
        plt.xticks(rotation=45)
        plt.xlabel('numbers')
        plt.ylabel('webname')
        plt.title('number count')
        plt.show()

    if __name__ == '__main__':
        # Get it running first.
        # This reads the 360 browser history.

        history_db = r"C:\Users\Administrator\Desktop\History"
        # files=os.listdir(data_path)

        # history_db = os.path.join(data_path, 'History')
        # print(history_db)

        # Query the database: the join yields one row per recorded visit
        c = sqlite3.connect(history_db)
        cursor = c.cursor()
        select_statement = "SELECT urls.url, urls.visit_count FROM urls, visits WHERE urls.id= visits.url;"
        cursor.execute(select_statement)

        results = cursor.fetchall()  # list of (url, visit_count) tuples

        sites_count = {}  # dict makes iterations easier :D
        yuming_count = set()  # hosts whose suffix is not in the filter_data() list
        for url, count in results:
            url = filter_data(url)
            if url is None:
                continue  # skip URLs that filter_data() could not split
            if url in sites_count:
                sites_count[url] += 1
            else:
                sites_count[url] = 1
        print(yuming_count)
        # print(sites_count)
        sites_count.pop("ok", None)  # drop the placeholder for unmatched hosts, if present
        sites_count_sorted = OrderedDict(sorted(sites_count.items(), key=operator.itemgetter(1), reverse=True))
        #
        # # analyze(sites_count_sorted)
        analyze2(sites_count_sorted)

    Using Python to listen on the interface the browser uses, so that the sites and data a user accesses can be seen in real time


     

    https://blog.csdn.net/xuanhun521/article/details/51779292 

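    Python Hacker Programming 3: Network Data Sniffing and Filtering

    The lab environment for the course is as follows:

    •      Operating system: Windows 7

    •      IDE: PyCharm

    •      Python version: 3.6.4

    •      Main Python modules involved: pypcap, dpkt, scapy, scapy-http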
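    As a rough idea of what such sniffing might look like with scapy, the sketch below prints the Host header of plain HTTP requests seen on the network. It is an assumption-laden example, not taken from the linked course: http_host is a hypothetical helper, the capture filter "tcp port 80" assumes unencrypted HTTP, and the script needs scapy, a capture driver, and sufficient privileges. HTTPS traffic is encrypted and would not show up this way.

```python
def http_host(payload):
    """Return the Host header of a raw HTTP request payload (bytes), or None."""
    head = payload.split(b"\r\n\r\n", 1)[0]
    lines = head.split(b"\r\n")
    # Treat the payload as a request only if its first line names an HTTP version
    # after the method and path (responses start with "HTTP/" and are skipped).
    if b" HTTP/" not in lines[0]:
        return None
    for line in lines[1:]:
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            return value.strip().decode("ascii", errors="replace")
    return None


if __name__ == "__main__":
    # Assumes scapy is installed and the script may capture packets.
    from scapy.all import Raw, sniff

    def show(packet):
        # Only packets with a payload can contain an HTTP request.
        if packet.haslayer(Raw):
            host = http_host(bytes(packet[Raw].load))
            if host:
                print(host)

    # store=False avoids keeping every captured packet in memory.
    sniff(filter="tcp port 80", prn=show, store=False)
```

    Keeping the parsing in a plain function makes it possible to check it against sample payloads without capturing any traffic.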


    https://blog.csdn.net/sinat_22659313/article/details/53420492

    Analyzing Internet User Behavior in the Context of Big Data





  • Original article: https://www.cnblogs.com/Mengchangxin/p/9487278.html