If you're creating a search engine, you'll need a way to collect documents. In this excerpt from Toby Segaran's Programming Collective Intelligence, the author shows you how to set up a simple web crawler using existing tools.
I'll assume for now that you don't have a big collection of HTML documents sitting on your hard drive waiting to be indexed, so I'll show you how to build a simple crawler. It will be seeded with a small set of pages to index and will then follow any links on those pages to find other pages, whose links it will also follow. This process is called crawling or spidering.
To do this, your code will have to download the pages, pass them to the indexer (which you'll build in the next section), and then parse the pages to find all the links to the pages that have to be crawled next. Fortunately, there are a couple of libraries that can help with this process.
For the examples in this chapter, I have set up a copy of several thousand files from Wikipedia, which will remain static at http://kiwitobes.com/wiki.
You're free to run the crawler on any set of pages you like, but you can use this site if you want to compare your results to those in this chapter.
Using urllib2
urllib2 is a library bundled with Python that makes it easy to download pages—all you have to do is supply the URL. You'll use it in this section to download the pages that will be indexed. To see it in action, start up your Python interpreter and try this:
>>> import urllib2
>>> c=urllib2.urlopen('http://kiwitobes.com/wiki/Programming_language.html')
>>> contents=c.read()
>>> print contents[0:50]
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Trans
All you have to do to store a page's HTML code in a string is create a connection and read its contents.
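If you are following along on Python 3, note that urllib2 no longer exists there; its functionality moved into urllib.request. A minimal sketch of the same download step under that assumption might look like the following. The helper name fetch_text and the data: URL (used only so the example runs without a network connection) are illustrative and not part of the chapter's code.

```python
import urllib.request
from urllib.parse import quote


def fetch_text(url, encoding="utf-8"):
    """Open url and return its body decoded as text (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        # read() returns bytes on Python 3, so decode before slicing as text.
        return response.read().decode(encoding, errors="replace")


# A data: URL stands in for a real page so the sketch runs offline.
sample_html = '<!DOCTYPE html><html><body>Hello</body></html>'
sample_url = "data:text/html," + quote(sample_html)

contents = fetch_text(sample_url)
print(contents[0:15])
```

Swapping sample_url for a real http address should behave the same way, although a real crawler would probably also want a timeout and some thought about character encodings.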
Crawler Code
The crawler will use the Beautiful Soup API, an excellent library that builds a structured representation of web pages. It is very tolerant of web pages with broken HTML, which is useful when constructing a crawler because you never know what pages you might come across.
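To get a feel for what pulling links out of messy markup involves before installing anything, here is a rough sketch that uses the standard library's html.parser module as a stand-in for Beautiful Soup. It is not the library this chapter uses, and it assumes Python 3. The class name LinkCollector and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, even in sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Unclosed <p> and <a> tags and an unquoted attribute value,
# roughly the kind of thing a crawler meets in the wild.
broken_html = '<p>See <a href="Perl.html">Perl<p><a href=Python.html>Python'

collector = LinkCollector()
collector.feed(broken_html)
collector.close()
print(collector.links)
```

Beautiful Soup goes further by building a full tree you can search, which is why the crawler below relies on it; the sketch only shows that the links survive the broken markup.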
Using urllib2 and Beautiful Soup, you can build a crawler that will take a list of URLs to index and crawl their links to find other pages to index. First, add these import statements to the top of searchengine.py:
import urllib2
from BeautifulSoup import *
from urlparse import urljoin
# Create a list of words to ignore
ignorewords=set(['the','of','to','and','a','in','is','it'])
Now you can fill in the code for the crawler function. It won't actually save anything it crawls yet, but it will print the URLs as it goes so you can see that it's working. You need to put this at the end of the file (so it's part of the crawler class):
    def crawl(self,pages,depth=2):
        # Each pass indexes the current pages and collects their links
        for i in range(depth):
            newpages=set()
            for page in pages:
                try:
                    c=urllib2.urlopen(page)
                except Exception:
                    # Skip pages that fail to open; Ctrl-C still interrupts
                    print "Could not open %s" % page
                    continue
                soup=BeautifulSoup(c.read())
                self.addtoindex(page,soup)

                links=soup('a')
                for link in links:
                    if ('href' in dict(link.attrs)):
                        url=urljoin(page,link['href'])
                        if url.find("'")!=-1: continue
                        url=url.split('#')[0]  # remove location portion
                        if url[0:4]=='http' and not self.isindexed(url):
                            newpages.add(url)
                        linkText=self.gettextonly(link)
                        self.addlinkref(page,url,linkText)

                self.dbcommit()

            # The links found in this pass are crawled in the next one
            pages=newpages
This function loops through the list of pages, calling addtoindex on each one (right now this does nothing except print the URL, but you'll fill it in the next section). It then uses Beautiful Soup to get all the links on that page and adds their URLs to a set called newpages. At the end of the loop, newpages becomes pages, and the process repeats.
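The link handling inside the loop (resolve the href against the current page, drop the fragment, keep only http URLs) can be tried out in isolation. This sketch assumes Python 3, where urljoin lives in urllib.parse rather than urlparse. The function name normalize_link is hypothetical and does not appear in searchengine.py.

```python
from urllib.parse import urljoin


def normalize_link(page, href):
    """Return an absolute http(s) URL without its #fragment, or None."""
    url = urljoin(page, href)      # resolve relative links against the page
    if "'" in url:                 # same guard as the crawler: skip quotes
        return None
    url = url.split('#')[0]        # remove location portion
    if not url.startswith('http'):
        return None                # e.g. mailto: or javascript: links
    return url


page = 'http://kiwitobes.com/wiki/Perl.html'
print(normalize_link(page, 'Python.html#History'))
print(normalize_link(page, 'mailto:someone@example.com'))
```

Stripping the fragment matters because two links that differ only after the # point at the same document, and you don't want to index it twice.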
This function could be defined recursively so that each link calls the function again, but a breadth-first search makes the code easier to modify later, either to keep crawling continuously or to save a list of unindexed pages for later crawling. It also avoids the risk of overflowing the stack.
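The level-by-level structure is easier to see on a toy example. The sketch below replaces the network with a small in-memory dictionary of links and assumes Python 3. The page names are invented, crawl_levels is a hypothetical name, and the visited set is just a simple stand-in for whatever check decides that a page has already been handled.

```python
def crawl_levels(link_graph, seeds, depth=2):
    """Visit pages breadth-first and return the pages seen at each level."""
    visited = set()
    levels = []
    pages = list(seeds)
    for _ in range(depth):
        newpages = set()
        current = []
        for page in pages:
            if page in visited:
                continue
            visited.add(page)
            current.append(page)
            # Queue every outgoing link that has not been visited yet.
            for url in link_graph.get(page, []):
                if url not in visited:
                    newpages.add(url)
        levels.append(current)
        pages = sorted(newpages)   # sorted only to make the output stable
    return levels


toy_web = {
    'A': ['B', 'C'],
    'B': ['A', 'D'],
    'C': ['D'],
    'D': [],
}
print(crawl_levels(toy_web, ['A'], depth=3))
```

Because the pending pages are held in an ordinary collection rather than on the call stack, it would be straightforward to write them to disk between passes or to keep looping indefinitely.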
You can test this function in the Python interpreter (there's no need to let it finish, so press Ctrl-C when you get bored):
>>> import searchengine
>>> pagelist=['http://kiwitobes.com/wiki/Perl.html']
>>> crawler=searchengine.crawler('')
>>> crawler.crawl(pagelist)
Indexing http://kiwitobes.com/wiki/Perl.html
Could not open http://kiwitobes.com...ramming%29.html
Indexing http://kiwitobes.com...ry_Project.html
Indexing http://kiwitobes.com...face.html
You may notice that some pages are repeated. There is a placeholder in the code for another function, isindexed, which will determine if a page has been indexed recently before adding it to newpages. This will let you run this function on any list of URLs at any time without worrying about doing unnecessary work.
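To make the idea of "indexed recently" concrete, here is one possible in-memory sketch, assuming Python 3. It is not the implementation this chapter builds; the class name IndexedTracker, the mark_indexed method, and the max_age_seconds parameter are all assumptions made for illustration.

```python
import time


class IndexedTracker:
    """Remember when each URL was last indexed (in-memory, hypothetical)."""

    def __init__(self, max_age_seconds=3600.0):
        self.max_age_seconds = max_age_seconds
        self._last_indexed = {}

    def mark_indexed(self, url, now=None):
        # Record the time of indexing; `now` can be injected for testing.
        self._last_indexed[url] = time.time() if now is None else now

    def isindexed(self, url, now=None):
        """True if url was indexed within the last max_age_seconds."""
        now = time.time() if now is None else now
        stamp = self._last_indexed.get(url)
        return stamp is not None and (now - stamp) <= self.max_age_seconds


tracker = IndexedTracker(max_age_seconds=60)
tracker.mark_indexed('http://kiwitobes.com/wiki/Perl.html', now=1000.0)
print(tracker.isindexed('http://kiwitobes.com/wiki/Perl.html', now=1030.0))
print(tracker.isindexed('http://kiwitobes.com/wiki/Perl.html', now=2000.0))
```

A tracker like this forgets everything when the program exits, so a crawler that should survive restarts would need to keep the same information in persistent storage.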