Web Spider, Crawler, Indexing and Search Engine Scripts

The heart of a web spider is just two processes:

1. Parsing HTTP text (HTML).
2. Traversing the links.

Parsing HTTP Text:

Generally the cURL library is used to grab the HTML content. From the HTTP text we have to grab the URLs as follows:
1. Links (anchor tags)
2. Form action attributes
3. Meta tags
Store the links in the database.
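
A minimal fetch with cURL might look like this sketch (the user-agent string and timeout are just example values):

<?php
// Grab the raw HTML of a page with cURL (a minimal sketch).
function fetch_html($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return body as string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);            // don't hang on slow hosts
    curl_setopt($ch, CURLOPT_USERAGENT, 'MyBot/0.1'); // identify the bot
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}
?>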

Traverse Links:

After we have got the URLs, we can traverse them using DFS (depth-first search) or BFS (breadth-first search). The database design is very important; for this kind of application PHP should be used as CGI. Traverse the links from the database.
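
A BFS loop with a visited set and a depth limit might look like this sketch (fetch_html is the helper sketched above, and extract_links is sketched later in this post):

<?php
// Breadth-first traversal with a depth limit and a visited set,
// so no URL is ever crawled twice (a minimal sketch).
function crawl($start_url, $max_depth)
{
    $queue   = array(array($start_url, 0)); // (url, depth) pairs
    $visited = array($start_url => true);

    while (!empty($queue)) {
        list($url, $depth) = array_shift($queue); // FIFO order gives BFS
        $html = fetch_html($url);
        if ($html === null) {
            continue;
        }
        // ... store $html in the content table here ...
        if ($depth >= $max_depth) {
            continue; // depth limit reached: do not follow sub-links
        }
        foreach (extract_links($html, $url) as $link) {
            if (!isset($visited[$link])) {
                $visited[$link] = true;
                $queue[] = array($link, $depth + 1);
            }
        }
    }
}
?>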

Everything is based on the type of application you want. Can you please tell me what you really need?

For implementing an effective engine, 3 steps are involved:
1. AddURL
2. Bot
3. Search Engine

AddURL:
Create a domain for the search engine and have a link for AddURL, so that visitors can add their domain name to your site; get their email and site URL.
Ex:
Google : http://www.google.com/addurl.html
Yahoo: http://submit.search.yahoo.com/free/request
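
A minimal AddURL handler might just validate the input and queue the domain for the bot. In this sketch the PDO connection details are placeholders and the master table follows the database design later in this post; storing the visitor's email would need an extra column that the design below does not list, so it is left out here:

<?php
// Minimal AddURL handler: validate the submitted URL and queue the
// domain for the bot (a sketch; the PDO credentials are placeholders).
$email = filter_input(INPUT_POST, 'email', FILTER_VALIDATE_EMAIL);
$url   = filter_input(INPUT_POST, 'url', FILTER_VALIDATE_URL);

if ($email && $url) {
    $db   = new PDO('mysql:host=localhost;dbname=engine', 'user', 'pass');
    $stmt = $db->prepare('INSERT INTO master (domain_name) VALUES (?)');
    $stmt->execute(array(parse_url($url, PHP_URL_HOST)));
}
?>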

Bot:
This has to get the submitted URLs from the database one by one.
Before we start traversing these domains, we first have to be aware of these 3 things:

1. Robots Exclusion Protocol (robots.txt)
2. Sitemap
3. Meta tags
Most sites use at least one of these methods, and they vary from site to site.
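
A very rough robots.txt check might look like the sketch below; it only honours Disallow prefixes in the "User-agent: *" block, which is a simplification of the real protocol:

<?php
// Very rough robots.txt check: honour the Disallow prefixes in the
// "User-agent: *" block only (a sketch, not a full parser).
function is_allowed($base_url, $path)
{
    $robots = @file_get_contents(rtrim($base_url, '/') . '/robots.txt');
    if ($robots === false) {
        return true; // no robots.txt, assume everything is allowed
    }
    $applies = false;
    foreach (preg_split('/\r?\n/', $robots) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            $applies = (trim(substr($line, 11)) === '*');
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return false; // path matches a Disallow prefix
            }
        }
    }
    return true;
}
?>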

Then start indexing from the index page, and get all web links.
Links:
1. Anchor tag => href attribute
2. Frame, iframe tag => src attribute
3. Form tag => action attribute
Most search engines just follow anchor tags (as Yahoo does), hence initially the anchor tag is sufficient.
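
Extracting all three kinds of links with PHP's DOMDocument could look like this sketch (the relative-URL handling is deliberately crude and only covers simple cases):

<?php
// Extract anchor hrefs, frame/iframe srcs and form actions from a
// page (a sketch; relative-URL resolution only covers simple cases).
function extract_links($html, $base_url)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings on sloppy real-world HTML

    $targets = array('a'      => 'href',
                     'frame'  => 'src',
                     'iframe' => 'src',
                     'form'   => 'action');
    $links = array();
    foreach ($targets as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $value = $node->getAttribute($attr);
            if ($value === '' || preg_match('#^(mailto|javascript):#i', $value)) {
                continue; // empty or non-crawlable scheme
            }
            if (!preg_match('#^https?://#i', $value)) {
                // crude join of relative links against the base URL
                $value = rtrim($base_url, '/') . '/' . ltrim($value, '/');
            }
            $links[] = $value;
        }
    }
    return array_unique($links);
}
?>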
Classify the URL:
Static content: .html, .pdf, .txt

Dynamic pages: .php, .py, .pl, .rb, .asp, .aspx, ...

Exception: Since nowadays frameworks support custom extensions like .do ..., we have to take care of those as well.

There are libraries available to grab the content of PDF files, but initially just parse HTML and the known dynamic pages.
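
A simple classifier by extension could look like this (the extension lists come straight from the groups above; anything unrecognised, such as a custom .do extension, falls through to a catch-all):

<?php
// Classify a URL as static or dynamic by its extension; anything
// unrecognised (custom extensions like .do) falls through (a sketch).
function classify_url($url)
{
    $path = parse_url($url, PHP_URL_PATH);
    $ext  = strtolower(pathinfo((string) $path, PATHINFO_EXTENSION));

    $static  = array('html', 'htm', 'pdf', 'txt');
    $dynamic = array('php', 'py', 'pl', 'rb', 'asp', 'aspx');

    if ($ext === '' || in_array($ext, $static)) {
        return 'static';  // treat extension-less paths as static pages
    }
    if (in_array($ext, $dynamic)) {
        return 'dynamic';
    }
    return 'other';       // custom extensions (.do, ...) land here
}
?>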

Depth:
The number of times a link can recursively travel to its sub-links; it is very important for an effective search engine.
Database Design:
Each and every traversed link should be properly recorded in the database, and no repetition of URLs should be allowed, as repetition would result in infinite looping.

Master table:
Holds the domain name, bot rules and max depth allowed.
domain_id
domain_name
domain_bot_rules
domain_max_depth

Link table:
This should contain every parsed link (e.g. the href attribute values) along with its current depth.

link_id
link_domain_id
link
link_depth

Content table:
The content of each and every traversed page is stored here; when a visitor searches, we just search this content table.

content_id
content_link_id
content_text
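
Under the assumption of MySQL, the three tables above could be created as follows (the column types, the unique key on the link, and the FULLTEXT index are assumptions, not part of the original design):

<?php
// Create the three tables sketched above (MySQL; the column types,
// the unique key and the FULLTEXT index are assumptions).
$db = new PDO('mysql:host=localhost;dbname=engine', 'user', 'pass');

$db->exec('CREATE TABLE master (
    domain_id        INT AUTO_INCREMENT PRIMARY KEY,
    domain_name      VARCHAR(255) NOT NULL,
    domain_bot_rules TEXT,
    domain_max_depth INT DEFAULT 3
)');

$db->exec('CREATE TABLE link (
    link_id        INT AUTO_INCREMENT PRIMARY KEY,
    link_domain_id INT NOT NULL,
    link           VARCHAR(2048) NOT NULL,
    link_depth     INT NOT NULL,
    UNIQUE KEY uniq_link (link(255))
)');

$db->exec('CREATE TABLE content (
    content_id      INT AUTO_INCREMENT PRIMARY KEY,
    content_link_id INT NOT NULL,
    content_text    MEDIUMTEXT,
    FULLTEXT KEY ft_content (content_text)
)');
?>

The unique key on the link column is what enforces the no-repetition rule mentioned above.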

Search Engine:
Just get the search text from the user and search it in the content table.
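
With the FULLTEXT index assumed above, the search page reduces to one query (MATCH ... AGAINST is MySQL-specific; a plain LIKE would also work on small tables):

<?php
// Search the content table for the visitor's query (a sketch; the
// FULLTEXT index assumed above is what MATCH ... AGAINST relies on).
$q  = isset($_GET['q']) ? $_GET['q'] : '';
$db = new PDO('mysql:host=localhost;dbname=engine', 'user', 'pass');

$stmt = $db->prepare(
    'SELECT content_link_id
       FROM content
      WHERE MATCH(content_text) AGAINST (?)'
);
$stmt->execute(array($q));

foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $link_id) {
    echo $link_id, "\n"; // join against the link table here to show URLs
}
?>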

Reference Links:
http://sphider.eu
http://kscripts.com
