A Basic Website Crawler, in Python, in 12 Lines of Code. « Null Byte

Step 1 Lay out the logic.
OK, as far as crawlers (web spiders) go, this one cannot be more basic. Well, it can, if you remove lines 11-12, but then it's about as useful as a broken pencil - there's just no point. (Get it? Hehe...he...I'm a sad person...)
So what does a web crawler do? Well, it scours a page for URLs (in our case) and puts them in a neat list. But it does not stop there. Nooooo sir. It then iterates through each URL it found, goes into it, and retrieves the URLs on that page. And so on (if you code it further).
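To make the "scour a page for URLs" part concrete, here is a minimal sketch in Python 3 that applies the same href pattern used by the script in Step 2 to a chunk of HTML. The function name extract_links and the sample HTML are made up for illustration; they are not part of the original script.

```python
import re

# Same pattern the crawler in Step 2 uses: capture whatever sits
# between the quotes of an href attribute, ignoring case.
HREF_PATTERN = re.compile(r'''href=["'](.[^"']+)["']''', re.I)


def extract_links(html):
    """Return every href value found in the given HTML text, in page order."""
    return HREF_PATTERN.findall(html)


# Hypothetical page source, just for the demo.
sample_html = '''
<a href="http://example.com/one">One</a>
<A HREF='http://example.com/two'>Two</A>
<a href="/relative/path">Three</a>
'''

links = extract_links(sample_html)
for link in links:
    print(link)
```

Note that the pattern needs at least two characters between the quotes, so a one-character link such as href="#" is skipped, and relative links come back exactly as written in the page.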
What we are coding is a very scaled-down version of what makes Google its millions. Well, it used to be. Now it's 50% searches, 20% advertising, 10% users' profile sales and 20% data theft. But hey, who's counting?
This has a LOT of potential, and should you wish to expand on it, I'd love to see what you come up with.
So let's plan the program.
The logic here is fairly straightforward:
User enters the starting URL.
Crawler goes in and goes through the source code, gathering all URLs inside.
Crawler then visits each URL in another for loop, gathering child URLs from the initial parent URLs.
Profit???
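The plan above can be sketched as a small Python 3 function. This is only an illustration under a few assumptions: the names crawl_two_levels, fetch and FAKE_PAGES are invented here, and the page fetcher is passed in as a parameter so the sketch can run against an in-memory "web" instead of the real network.

```python
import re

HREF = re.compile(r'''href=["'](.[^"']+)["']''', re.I)


def crawl_two_levels(start_url, fetch):
    """Collect the links on start_url, then the links on each of those pages.

    fetch is any callable that takes a URL and returns the page source as text.
    """
    # Go through the source of the starting page, gathering all URLs inside.
    parents = HREF.findall(fetch(start_url))
    children = []
    # Visit each parent URL in another loop, gathering its child URLs.
    for parent in parents:
        children.extend(HREF.findall(fetch(parent)))
    return parents, children


# A tiny in-memory "web" (hypothetical pages) so no network access is needed.
FAKE_PAGES = {
    "http://a.test/": '<a href="http://a.test/x">x</a> <a href="http://a.test/y">y</a>',
    "http://a.test/x": '<a href="http://a.test/x1">x1</a>',
    "http://a.test/y": '<a href="http://a.test/y1">y1</a> <a href="http://a.test/y2">y2</a>',
}

parents, children = crawl_two_levels("http://a.test/", FAKE_PAGES.get)
print(parents)
print(children)
```

To crawl real pages, fetch could be swapped for something like lambda url: urllib.request.urlopen(url).read().decode("utf-8", "replace"), after importing urllib.request.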
Step 2 The Code:
#! C:\python27
import re, urllib
textfile = open('depth_1.txt', 'wt')  # output file for the child URLs
print "Enter the URL you wish to crawl.."
print 'Usage - "http://phocks.org/stumble/creepy/" <-- With the double quotes'
myurl = input("@> ")  # Python 2 input() evaluates the entry, hence the quotes
for i in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I):
    print i  # a URL found on the starting page
    for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(i).read(), re.I):
        print ee  # a child URL found on that page
        textfile.write(ee+'\n')
textfile.close()
That's it... No really.. That. Is. It.
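So we create a file called depth_1.txt. We prompt the user to enter a URL, which should be typed in the following format: "http://www.google.com/"
With the quotation marks. (Python 2's input() evaluates whatever you type as a Python expression, so the URL has to be entered as a quoted string.)
Then we loop through the page we passed, parse the source and pull out its URLs, visit each of those to get the child URLs, print everything on the screen, write the child URLs to the file, and close the file.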
Done!
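One thing to watch if you expand on this: the script hands every href straight to urlopen, and many pages use relative links (like /about) or non-web links (like mailto:), which cannot be opened that way. Below is a minimal Python 3 sketch of one way to clean the list up first, using urljoin and urlparse from the standard library. The function name resolve_links and the sample values are assumptions for illustration, not part of the original script.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(base_url, hrefs):
    """Turn raw href values into absolute http(s) URLs, dropping duplicates."""
    seen = set()
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # relative link -> absolute URL
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar links
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


# Hypothetical hrefs as they might come out of the regex.
hrefs = ["/about", "contact.html", "http://example.org/",
         "mailto:me@example.com", "/about"]
result = resolve_links("http://example.com/blog/", hrefs)
print(result)
```

Skipping duplicates also keeps the crawler from fetching the same page over and over, which matters once you go deeper than one level.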
Finishing Statement
So, I hope this aids you in some way, and again, if you improve on it - please share it with us!
Regards
Mr.F

Original article: https://www.cnblogs.com/lexus/p/2480915.html
