Web spider, crawler, indexing and search engine scripts
The heart of a web spider is just two processes:
1. Parsing the HTTP text (HTML).
2. Traversing the links.
Parsing HTTP text:
Generally the cURL library is used to grab the HTML content. From that HTTP text we have to grab the URLs from:
1. Links (the anchor tag)
2. The form action attribute
3. The meta tag
Store the links in a database.
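The "grab the content" step can be sketched like this. The post uses cURL from PHP; as an assumption, this sketch uses Python's standard-library `urllib` instead, since any HTTP client does the same job here:

```python
# Minimal fetch sketch (assumption: urllib stands in for the cURL library).
from urllib.request import Request, urlopen

def fetch_html(url: str) -> str:
    """Download the raw HTML of a page, identifying ourselves as a bot."""
    req = Request(url, headers={"User-Agent": "MySpider/0.1"})
    with urlopen(req, timeout=10) as resp:
        # Fall back to UTF-8 when the server sends no charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

# Usage (needs network access):
# html = fetch_html("http://www.example.com/")
```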
After we get the URLs, we can traverse them using DFS (depth-first search) or BFS (breadth-first search). The database design is very important; for this kind of application PHP should be used as CGI. Traverse the links from the database.
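The traversal itself is the textbook algorithm. A small sketch, assuming a hypothetical `extract_links(url)` helper that returns the URLs found on a page:

```python
# BFS traversal over discovered links; a visited set prevents loops.
from collections import deque

def crawl_bfs(seed, extract_links, max_pages=100):
    """Visit links breadth-first, never revisiting a URL."""
    visited = set()
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()   # for DFS, use queue.pop() instead
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in extract_links(url):
            if link not in visited:
                queue.append(link)
    return order

# Usage with a fake link graph instead of real pages:
graph = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": []}
print(crawl_bfs("a", lambda u: graph.get(u, [])))  # ['a', 'b', 'c', 'd']
```

Swapping the `popleft()` for `pop()` turns the queue into a stack and the crawl into DFS.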
Everything depends on the type of application you want. Can you please tell me what you really need?
Implementing an effective engine involves three steps.
Create a domain for the search engine and add a link called AddURL, so that visitors can submit their domain name to your site; collect their email and site URL.
Google: http://www.google.com/addurl.html
The crawler then has to fetch the submitted URLs from the database one by one.
Before we start crawling these domains, we first have to be aware of:
1. The Robots Exclusion Protocol (robots.txt)
2. The robots meta tag
Most sites use one or another of these methods, and they vary from site to site.
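The robots.txt check does not need to be written by hand; a sketch using Python's standard-library `robotparser` (the user-agent name `MySpider` is just an assumption):

```python
# Parse a robots.txt body and ask whether our bot may fetch a URL.
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, url: str, agent: str = "MySpider") -> bool:
    """Return True if robots.txt permits this agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
print(allowed(rules, "http://example.com/private/x.html"))  # False
print(allowed(rules, "http://example.com/public.html"))     # True
```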
Then start indexing from the index page and collect all the web links:
1. Anchor tag => href attribute
2. Frame, iframe tags => src attribute
3. Form tag => action attribute
Most search engines, like Yahoo, just follow anchor tags, so initially the anchor tag alone is sufficient.
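Extracting those three kinds of links can be sketched with Python's built-in `HTMLParser` (an assumption; the PHP/cURL setup from the post would do the same job with its own parser):

```python
# Collect href (a), src (frame/iframe) and action (form) attribute values.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Gather the URLs a spider should follow from one HTML page."""
    WANTED = {"a": "href", "frame": "src", "iframe": "src", "form": "action"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        want = self.WANTED.get(tag)
        for name, value in attrs:
            if name == want and value:
                self.links.append(value)

p = LinkExtractor()
p.feed('<a href="/about">About</a><form action="/search"></form>')
print(p.links)  # ['/about', '/search']
```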
Classify the URLs:
Static content: .html, .pdf, .txt
Dynamic pages: .php, .py, .pl, .rb, .asp, .aspx, ...
Exception: nowadays frameworks support custom extensions like .do, so we have to take care of those as well.
There are libraries available to grab the content of PDF files, but initially just parse HTML and the known dynamic page types.
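A small sketch of that classification by extension (the extension lists simply mirror the ones above; anything unrecognised is treated as a possible custom extension):

```python
# Classify a URL as static, dynamic, or unknown by its path extension.
import os
from urllib.parse import urlparse

STATIC = {".html", ".htm", ".pdf", ".txt"}
DYNAMIC = {".php", ".py", ".pl", ".rb", ".asp", ".aspx"}

def classify(url: str) -> str:
    """Return 'static', 'dynamic' or 'unknown' (e.g. custom .do routes)."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext in STATIC:
        return "static"
    if ext in DYNAMIC:
        return "dynamic"
    return "unknown"

print(classify("http://x.com/a.pdf"))     # static
print(classify("http://x.com/a.php?q=1")) # dynamic
print(classify("http://x.com/a.do"))      # unknown
```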
The number of times a link can recursively traverse its sub-links (the maximum crawl depth) is very important for an effective search engine.
Each traversed link should be properly recorded in the database, and no repetition of URLs should be allowed, since duplicates result in infinite looping.
Domain table: domain name, bot rules, max depth allowed.
Link table: this should contain every parsed link (e.g. the href attribute values) along with its current depth.
Content table: the content of each traversed page is stored here; when a visitor searches, we just search this content table.
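The three tables could look like this in SQLite; the column names are my assumptions based on the descriptions above, and a `UNIQUE` constraint on the URL enforces the no-repetition rule at the database level:

```python
# Sketch of the domain / link / content tables (column names assumed).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE domain (
    id        INTEGER PRIMARY KEY,
    name      TEXT UNIQUE,   -- domain name
    bot_rules TEXT,          -- cached robots.txt
    max_depth INTEGER        -- max depth allowed
);
CREATE TABLE link (
    id        INTEGER PRIMARY KEY,
    url       TEXT UNIQUE,   -- UNIQUE prevents infinite re-crawl loops
    depth     INTEGER        -- depth at which the link was found
);
CREATE TABLE content (
    link_id   INTEGER REFERENCES link(id),
    body      TEXT           -- crawled page text to search against
);
""")
conn.execute("INSERT INTO link (url, depth) VALUES ('http://x.com/', 0)")
```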
Search engine:
Just get the search text from the user and search for it in the content table.
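That last step is a plain substring match. A self-contained sketch with a tiny in-memory content table (the table layout here is an assumption for illustration):

```python
# Search the content table with a LIKE substring match.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (url TEXT, body TEXT)")
conn.executemany("INSERT INTO content VALUES (?, ?)", [
    ("http://x.com/a", "web spiders crawl pages"),
    ("http://x.com/b", "cooking recipes"),
])

def search(conn, text):
    """Return URLs whose stored body contains the search text."""
    q = "SELECT url FROM content WHERE body LIKE ?"
    return [row[0] for row in conn.execute(q, (f"%{text}%",))]

print(search(conn, "spider"))  # ['http://x.com/a']
```

A real engine would index the text (e.g. an inverted index or SQLite FTS) instead of scanning with LIKE, but this is the shape of the query.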