The soup is beautiful:
BeautifulSoup is an HTML/XML parser for Python. This little library is very useful when you need Python to parse the HTML code of a website, and in this project we have used it to build our tiny web crawler. If you do not have BeautifulSoup, I suggest you download it from here (DOWNLOAD).
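To see what the library gives you, here is a minimal sketch; the HTML string below is made up purely for illustration:

from BeautifulSoup import BeautifulSoup

html = '<html><body><a href="/numbers.html">Numbers</a></body></html>'
soup = BeautifulSoup(html)
tag = soup.find('a')   # the first <a> tag in the document
print tag['href']      # prints: /numbers.html
print tag.string       # prints: Numbers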
The serious work:
The entire code is as given below.
# This program is created by Aritra Sanyal. Use and reproduction of this code
# is allowed, but misuse of this code is not supported by the author.
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import urllib2

linksarray = []
page = 'http://www.mathsisfun.com/'  # base URL used to complete relative links
c = urllib2.urlopen('http://www.mathsisfun.com')
data = c.read()

soup = BeautifulSoup(data)
links = soup.findAll('a')  # finds all the <a> tags in the page
for link in links:
    str_links = link.get('href')
    if str_links:  # skip <a> tags that have no href attribute
        # urljoin completes relative links and leaves absolute ones
        # alone (see the note after the code)
        linksarray.append(urljoin(page, str_links))

# Save every link as a clickable anchor in links2.html.
file_links = open('links2.html', 'w')
for linking in range(len(linksarray)):
    hyperlink = '<a href="' + linksarray[linking] + '">' + linksarray[linking] + '</a>'
    file_links.write(hyperlink)
file_links.close()

# Download each linked page and save it under its index name: 0.html, 1.html, ...
for i in range(len(linksarray)):
    try:
        nextdata = urllib2.urlopen(linksarray[i])
        name = str(i) + '.html'
        data2 = nextdata.read()
        file1 = open(name, 'w')
        file1.write(data2)
        file1.close()
        print i
    except Exception:
        print "could not open link:", linksarray[i]
What this code does is fetch the data (the full page HTML) from a given website, search it for all the links, and save them in an HTML file, links2.html. Each of those links is then opened, and the page behind it is saved to a file named after its index (0.html, 1.html, and so on).
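If you are on Python 3, the same idea still works, only the library names change: bs4 replaces the old BeautifulSoup module, and urllib.request/urllib.parse replace urllib2/urlparse. Here is a rough equivalent sketch, assuming you have installed the beautifulsoup4 package:

from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page = 'http://www.mathsisfun.com/'
soup = BeautifulSoup(urlopen(page).read(), 'html.parser')

# Collect every href, resolved against the base URL.
linksarray = [urljoin(page, a['href']) for a in soup.find_all('a', href=True)]

# Save the links as clickable anchors.
with open('links2.html', 'w') as f:
    for url in linksarray:
        f.write('<a href="%s">%s</a>' % (url, url))

# Download each page and save it under its index name.
for i, url in enumerate(linksarray):
    try:
        with open('%d.html' % i, 'wb') as f:  # urlopen returns bytes, so 'wb'
            f.write(urlopen(url).read())
        print(i)
    except Exception:
        print('could not open link:', url)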