
FOFA link crawler (fofa spider)

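I had been using someone else's FOFA crawling script from GitHub, but a couple of days ago it could only fetch the links on the first page. My guess is that FOFA changed some of its rules (or that I accidentally deleted some files and broke the script).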

So I rewrote the FOFA crawling code. It isn't great :(

FOFA's login page is https://i.nosec.org/login?service=https%3A%2F%2Ffofa.so%2Fusers%2Fservice

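Logging in to FOFA works differently from a typical site. After a successful login on nosec you only hold the nosec cookie, not a FOFA cookie, so FOFA still treats you as logged out. You have to visit https://fofa.so/users/sign_in afterwards for the FOFA cookie to be generated.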

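So I switched to a different approach: adding the _fofapro_ars_session cookie by hand to log in. After logging in to FOFA, you can read the _fofapro_ars_session value in the browser developer tools (F12). This step is a bit tedious.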
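
As a minimal sketch of this step, the request headers can be built from the copied value. The helper name `build_headers` is an illustrative assumption; only the cookie name `_fofapro_ars_session` comes from the site.

```python
def build_headers(session_value):
    """Return request headers carrying the manually copied FOFA session cookie."""
    # session_value is whatever was copied from the browser's developer tools.
    session_value = session_value.strip()
    if not session_value:
        raise ValueError("_fofapro_ars_session is empty; copy it from the browser first")
    return {
        "Connection": "keep-alive",
        "Cookie": "_fofapro_ars_session=" + session_value,
    }
```

Failing early on an empty value avoids crawling in the logged-out state without noticing.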

With the session in place, we base64-encode the search input, because the FOFA site itself base64-encodes whatever you type before running the search.
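
The encoding step might look like the sketch below. The helper name `build_search_url` is hypothetical. The `https://fofa.so/result` endpoint and the `page` and `qbase64` parameters are the ones used in this post. Percent-encoding the value with `urlencode` is an addition of this sketch, on the assumption that `+`, `/` and `=` in base64 output should not be placed raw in a query string.

```python
import base64
from urllib.parse import urlencode


def build_search_url(query, page=None):
    """Base64-encode a FOFA query and build the result-page URL."""
    qbase64 = base64.b64encode(query.encode("utf-8")).decode("ascii")
    params = {"qbase64": qbase64}
    if page is not None:
        params = {"page": page, "qbase64": qbase64}
    # urlencode percent-encodes '+', '/' and '=' so the value survives the query string.
    return "https://fofa.so/result?" + urlencode(params)
```

For example, `build_search_url('app="Apache"', page=2)` gives the URL of the second result page for that query.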

Then we parse the page to extract the result links, and keep moving on to the next page.
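
The link extraction can be sketched with the standard library alone. This uses `html.parser` in place of lxml, purely so the example has no third-party dependency. The `list_mod_t` class and the `target="_blank"` attribute are taken from the XPath in the script. The names `ResultLinkParser` and `extract_links` are illustrative.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect href values of <a target="_blank"> inside <div class="list_mod_t">."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside a matching div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1  # nested div inside a result block
            elif attrs.get("class") == "list_mod_t":
                self.depth = 1
        elif tag == "a" and self.depth > 0 and attrs.get("target") == "_blank":
            href = attrs.get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def extract_links(html):
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.links
```

Calling `extract_links(page_html)` once per result page and appending the output to a file mirrors what the script does with `tree.xpath(...)`.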

Note that FOFA has a mechanism against rapid crawling, so we add a delay between requests. Without it, some of the IP addresses get missed.
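
A rough sketch of the throttled loop is below. The function `crawl_pages` and its `fetch_page` callback are hypothetical names. The 10 second default matches the delay used in the script, and the `sleep` parameter exists only so the pause can be swapped out when trying the code.

```python
import time


def crawl_pages(fetch_page, last_page, delay=10, sleep=time.sleep):
    """Call fetch_page(n) for pages 1..last_page, pausing between requests."""
    results = []
    for page in range(1, last_page + 1):
        results.extend(fetch_page(page))
        if page < last_page:
            # Wait before the next request; no need to wait after the final page.
            sleep(delay)
    return results
```

Here `fetch_page` would be a function that downloads one result page and returns the list of links found on it.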

Once the search has returned, the script first prints how many pages of results there are. The user then decides how many pages to crawl.
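
Reading the page count and applying the user's stop page can be sketched as follows. The pager pattern is the one the script matches (the number just before the `next_page` link). Falling back to a single page when no pager is found is an assumption of this sketch, and both function names are illustrative.

```python
import re

PAGER_RE = re.compile(r'>(\d+)</a> <a class="next_page" rel="next"')


def parse_page_count(html):
    """Return the last page number shown in the pager, or 1 if there is no pager."""
    match = PAGER_RE.search(html)
    return int(match.group(1)) if match else 1


def pages_to_crawl(total_pages, stop_page):
    """Clamp the user's stop page to the range 1..total_pages."""
    return range(1, max(1, min(stop_page, total_pages)) + 1)
```

Clamping the stop page keeps the loop from requesting pages that do not exist when the user enters a number larger than the total.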

The code is as follows (with a pretty ASCII-art logo):

# -*- coding:utf-8 -*-
import requests
from lxml import etree
import base64
import re
import time

# Paste your _fofapro_ars_session value here before running.
cookie = ''


def logo():
    # Raw string, so the backslashes in the logo are not treated as escape sequences.
    print(r'''
                
            
             /$$$$$$$$ /$$$$$$  /$$$$$$$$ /$$$$$$                                   
            | $$_____//$$__  $$| $$_____//$$__  $$                                  
            | $$     | $$  \ $$| $$     | $$  \ $$                                  
            | $$$$$  | $$  | $$| $$$$$  | $$$$$$$$                                  
            | $$__/  | $$  | $$| $$__/  | $$__  $$                                  
            | $$     | $$  | $$| $$     | $$  | $$                                  
            | $$     |  $$$$$$/| $$     | $$  | $$                                  
            |__/      \______/ |__/     |__/  |__/                                  
                                                                                    
                                                                                    
                                                                                    
                                /$$$$$$            /$$       /$$                    
                               /$$__  $$          |__/      | $$                    
                              | $$  \__/  /$$$$$$  /$$  /$$$$$$$  /$$$$$$   /$$$$$$ 
                              |  $$$$$$  /$$__  $$| $$ /$$__  $$ /$$__  $$ /$$__  $$
                               \____  $$| $$  \ $$| $$| $$  | $$| $$$$$$$$| $$  \__/
                               /$$  \ $$| $$  | $$| $$| $$  | $$| $$_____/| $$      
                              |  $$$$$$/| $$$$$$$/| $$|  $$$$$$$|  $$$$$$$| $$      
                               \______/ | $$____/ |__/ \_______/ \_______/|__/      
                                        | $$                                        
                                        | $$                                        
                                        |__/                                        
                                
                                                                                version:1.0
    ''')


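def spider():
    # Send the manually copied session cookie with every request.
    header = {
        "Connection": "keep-alive",
        "Cookie": "_fofapro_ars_session=" + cookie,
    }
    search = input('please input your key: \n')
    # FOFA takes the query base64-encoded in the qbase64 parameter.
    searchbs64 = base64.b64encode(search.encode('utf-8')).decode('utf-8')
    print("spider website is :https://fofa.so/result?&qbase64=" + searchbs64)
    # Passing the value through params lets requests URL-encode '+', '/' and '='.
    html = requests.get(url="https://fofa.so/result", params={"qbase64": searchbs64}, headers=header).text
    # The number right before the "next page" link is the last page number.
    pagenum = re.findall(r'>(\d+)</a> <a class="next_page" rel="next"', html)
    # No pager found (single page of results or a failed search): fall back to one page.
    total_pages = int(pagenum[0]) if pagenum else 1
    print("have page: " + str(total_pages))
    stop_page = int(input("please input stop page: \n"))
    with open("hello_world.txt", "a+") as doc:
        # range() excludes its end value, so add 1 to include the last page.
        for i in range(1, total_pages + 1):
            print("Now write " + str(i) + " page")
            page = requests.get("https://fofa.so/result", params={"page": i, "qbase64": searchbs64}, headers=header)
            tree = etree.HTML(page.text)
            urllist = tree.xpath('//div[@class="list_mod_t"]//a[@target="_blank"]/@href')
            for j in urllist:
                doc.write(j + "\n")
            if i == stop_page:
                break
            # Pause between pages so FOFA's anti-crawling mechanism does not drop results.
            time.sleep(10)
    print("OK,Spider is End .")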

def start():
    print("Hello! My name is Spring bird. First, make sure _fofapro_ars_session is set!!!")
    print("The delay between pages is 10s")

def main():
    logo()
    start()
    spider()

if __name__ == '__main__':
    main()

GitHub link: https://github.com/Cl0udG0d/Fofa-script

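I set the time.sleep() delay to 10 seconds; adjust it to suit your needs. Also, although the code base64-encodes the keyword, encoding problems sometimes still prevent the search from returning the expected results, and pagenum[0] ends up as 0. If changing the keyword doesn't help, run the search on the FOFA site yourself, copy the base64-encoded keyword from the URL, and use it in the code in place of searchbs64. That way the search is sure to find results. I'll improve these shortcomings in the next revision. Let's go!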
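
The workaround of copying the encoded keyword out of the browser URL can be sketched like this. The helper name `qbase64_from_url` is hypothetical, and it assumes the copied URL carries the keyword in the `qbase64` parameter, as the URLs in this post do.

```python
from urllib.parse import urlsplit, parse_qs


def qbase64_from_url(url):
    """Return the qbase64 value from a FOFA result URL copied from the browser."""
    values = parse_qs(urlsplit(url).query).get("qbase64")
    if not values:
        raise ValueError("no qbase64 parameter in URL")
    # parse_qs turns a literal '+' into a space; base64 never contains spaces, so undo that.
    return values[0].replace(" ", "+")
```

The returned string can then be assigned to `searchbs64` instead of encoding the keyword in the script.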


Original article: https://www.cnblogs.com/Cl0ud/p/12384457.html