Sunday 17 February 2013

A simple Python web-crawler.

The soup is beautiful:
BeautifulSoup is an HTML/XML parser for Python. This little library is very useful when we are parsing the HTML code of a website, and in this project we use it to build our tiny web-crawler in Python. If you do not have BeautifulSoup, I suggest you download it from here (DOWNLOAD).
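As a quick illustration (a minimal sketch of my own, not part of the crawler below; the HTML string is made up for the example), here is how BeautifulSoup parses a piece of HTML and pulls out the links:

from BeautifulSoup import BeautifulSoup

html = '<html><body><a href="/numbers/">Numbers</a><a href="/algebra/">Algebra</a></body></html>'
soup = BeautifulSoup(html)
for link in soup.findAll('a'):   # every <a> tag in the document
    print link.get('href')       # the value of its href attribute, e.g. /numbers/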
The serious work:


The entire code is given below.


# This program was created by Aritra Sanyal. Use and reproduction of this code is allowed, but misuse of it is not supported by the author.


from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import urllib2

linksarray = []
page = 'http://www.mathsisfun.com/'   # base URL, used to resolve relative links

c = urllib2.urlopen(page)
data = c.read()
soup = BeautifulSoup(data)

links = soup.findAll('a')             # finds all the <a> tags in the page
for link in links:
    str_links = link.get('href')
    if str_links:                     # skip <a> tags that have no href attribute
        # urljoin resolves relative hrefs against the base URL and leaves
        # absolute ones alone, instead of blindly gluing the base onto every link
        linksarray.append(urljoin(page, str_links))

# save all the links as clickable anchors in links2.html
file_links = open('links2.html', 'w')
for url in linksarray:
    hyperlink = '<a href="' + url + '">' + url + '</a><br>'   # href value must be quoted
    file_links.write(hyperlink)
file_links.close()

# now open every link and save each page under its index number
for i in range(len(linksarray)):
    try:
        nextdata = urllib2.urlopen(linksarray[i])
        data2 = nextdata.read()
        name = str(i) + '.html'
        file1 = open(name, 'w')
        file1.write(data2)
        file1.close()
        print i
    except Exception:
        print "could not open link:", linksarray[i]

What this code does is fetch the full HTML of a given website, search it for all the links, and save them as a clickable list in the file links2.html; each of those links is then opened in turn, and the page behind it is saved to a file named after its index.
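One detail worth pointing out (this small example is mine, not part of the original script): simply gluing the base URL onto every href breaks links that are already absolute, which is why the code above uses urljoin from the standard urlparse module. urljoin resolves relative links against the base and leaves absolute ones unchanged:

from urlparse import urljoin

page = 'http://www.mathsisfun.com/'
print urljoin(page, 'numbers/index.html')        # http://www.mathsisfun.com/numbers/index.html
print urljoin(page, 'http://example.com/other')  # http://example.com/other  (left as-is)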