  • A Basic Website Crawler, in Python, in 12 Lines of Code. « Null Byte

    Step 1 Lay out the logic.

    OK, as far as crawlers (web spiders) go, this one cannot be more basic. Well, it can, if you remove lines 11-12, but then it's about as useful as a broken pencil - there's just no point. (Get it? Hehe...he...I'm a sad person... )

    So what does a web crawler do? Well, it scours a page for URLs (in our case) and puts them in a neat list. But it does not stop there. Nooooo sir. It then iterates through each URL it found, goes into it, and retrieves the URLs in that page. And so on (if you code it further).

    What we are coding is a very scaled-down version of what makes Google its millions. Well, it used to be. Now it's 50% searches, 20% advertising, 10% users' profile sales and 20% data theft. But hey, who's counting.

    This has a LOT of potential, and should you wish to expand on it, I'd love to see what you come up with.

    So let's plan the program.

    The logic here is fairly straightforward:

    • user enters the starting URL
    • crawler goes in and goes through the source code, gathering all URLs inside
    • crawler then visits each URL in another for loop, gathering child URLs from the initial parent URLs
    • profit???
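
    The plan above can be sketched without touching the network. This is a minimal Python 3 sketch, not the article's code (that comes in Step 2 and targets Python 2). The names extract_links, crawl_two_levels and PAGES are made up for illustration, the example.com URLs are placeholders, and fetch is a hypothetical callable that turns a URL into HTML text. The regular expression is the one the article's code uses.

```python
import re

# Same pattern as the article's code: capture whatever sits between the quotes
# of an href attribute (at least two characters, case-insensitive).
HREF_RE = re.compile(r'''href=["'](.[^"']+)["']''', re.I)


def extract_links(html):
    """Return every href value found in an HTML string, in page order."""
    return HREF_RE.findall(html)


def crawl_two_levels(start_url, fetch):
    """Collect the links on start_url (parents), then the links on each parent (children).

    `fetch` is a hypothetical callable: it takes a URL and returns HTML text.
    """
    parents = extract_links(fetch(start_url))
    children = []
    for parent in parents:
        children.extend(extract_links(fetch(parent)))
    return parents, children


# Canned pages standing in for the network (placeholder URLs, not real data).
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a> '
                           '<A HREF=\'http://example.com/b\'>B</A>',
    "http://example.com/a": '<a href="http://example.com/a1">A1</a>',
    "http://example.com/b": '<p>no links here</p>',
}

parents, children = crawl_two_levels("http://example.com/", PAGES.get)
print(parents)
print(children)
```

    Passing the fetch step in as an argument is just a convenience for trying the loop offline; swap in a real downloader and the two for loops stay the same.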

    Step 2 The Code:

    #! C:\python27

    import re, urllib

    textfile = file('depth_1.txt','wt')
    print "Enter the URL you wish to crawl.."
    print 'Usage  - "http://phocks.org/stumble/creepy/" <-- With the double quotes'
    myurl = input("@> ")
    for i in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I):
            print i 
            for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(i).read(), re.I):
                    print ee
                    textfile.write(ee+'\n')
    textfile.close()

    That's it... No really.. That. Is. It.

    So we create a file called depth_1.txt. We prompt the user to enter a URL, which should be typed in the following format: "http://www.google.com/"

    Yes, with the quotation marks.
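
    Why the quotes? The script runs on Python 2, where input() evaluates whatever is typed as a Python expression, so a bare URL is a syntax error while a quoted one is a string literal. raw_input() in Python 2, or input() in Python 3, hands back the typed text unchanged. Here is a small sketch, assuming Python 3 and a hypothetical helper named clean_url_input, that accepts the URL with or without the quotes:

```python
def clean_url_input(raw):
    """Strip surrounding whitespace and one pair of matching quotes, if present."""
    text = raw.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1]
    return text


# Both spellings end up as the same plain URL string.
print(clean_url_input('"http://www.google.com/"'))
print(clean_url_input("http://www.google.com/"))
```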

    Then we loop through the page we passed in, parse the source and pull out its URLs. For each of those we fetch the page, get the child URLs, print them on the screen and write them to the file. Finally, we close the file.
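
    Two things the 12-line version glosses over: a relative link such as /about cannot be fetched as-is, and a link that cannot be opened stops the script with an exception. Below is a hedged Python 3 sketch of the same walk-through that resolves relative links with urljoin and skips pages that fail. The function names (fetch_html, absolute_links, crawl_to_file), the 10-second timeout and the fetch parameter are assumptions added for illustration, not part of the original script.

```python
import re
import urllib.request
from urllib.parse import urljoin

# The article's href pattern, unchanged.
HREF_RE = re.compile(r'''href=["'](.[^"']+)["']''', re.I)


def fetch_html(url):
    """Download a page and return its body as text (undecodable bytes are replaced)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def absolute_links(page_url, html):
    """Extract href values and resolve relative ones against page_url."""
    return [urljoin(page_url, href) for href in HREF_RE.findall(html)]


def crawl_to_file(start_url, out_path="depth_1.txt", fetch=fetch_html):
    """Print parent and child links; write the child links to out_path."""
    with open(out_path, "w", encoding="utf-8") as textfile:
        for parent in absolute_links(start_url, fetch(start_url)):
            print(parent)
            try:
                child_html = fetch(parent)
            except (OSError, ValueError) as err:
                # URLError and HTTPError are OSError subclasses; ValueError
                # covers strings urlopen cannot parse as a URL at all.
                print("  skipped:", err)
                continue
            for child in absolute_links(parent, child_html):
                print(child)
                textfile.write(child + "\n")
```

    To try it against a live site, call crawl_to_file("http://www.google.com/") and look at depth_1.txt afterwards. The with block closes the file even if something goes wrong halfway through.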

    Done!

    Finishing Statement

    So, I hope this aids you in some way, and again, if you improve on it - please share it with us!

    Regards

    Mr.F

  • Original address: https://www.cnblogs.com/lexus/p/2480915.html