Wednesday, September 15, 2010

THE INFORMATION RETRIEVAL PROCESS




The following steps take place before a query is submitted to the search engine:


1. To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots called crawlers. As the name suggests, a crawler moves from web page to web page by following each link in a document, storing each newly found page in the database.




WEB CRAWLER


2. Text operations transform the original documents and generate a logical view of the data, i.e. words are extracted from each document so that they represent it. (The logical view is created by applying text operations such as stop-word removal and stemming to the document.)
These pages are stored in the database of the search engine.

3. The database manager uses the database module to build an index of the text from the logical view of the documents.
The database module is a program that creates an index for fast searching, based on the logical view created in the previous step.
The index is a critical data structure: it allows fast searching over large volumes of data.
Different indexing mechanisms can be used; inverted file indexing is the most popular and widely used.
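
As a rough illustration of steps 2 and 3, the sketch below applies two text operations (stop-word removal and a very crude stemmer) and then builds an inverted index that maps each term to the documents containing it. The stop-word list, the suffix rules, the sample documents and the function names are all assumptions made for this example; a real search engine would use a proper stemming algorithm and a far larger stop-word list.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "are", "for"}


def stem(word):
    """Crude suffix stripping that stands in for a real stemmer."""
    for suffix in ("ing", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def logical_view(text):
    """Lowercase, tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in logical_view(text):
            index[term].add(doc_id)
    return index


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Crawlers follow links between web pages",
    2: "The index allows fast searching of pages",
}
index = build_inverted_index(docs)
print(sorted(index["page"]))    # ids of documents containing the term "page"
print(sorted(index["search"]))  # ids of documents containing the term "search"
```

Looking up a term in this structure is a single dictionary access rather than a scan over every document, which is why an inverted index makes searching fast.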


The Retrieval process


THE FOLLOWING OPERATIONS OCCUR WHEN A QUERY IS SUBMITTED:

1. The user first specifies an information need, which is parsed and then transformed by the same text operations that were applied to the documents. These operations put the query into a form the system can process.

2. The query is then processed and documents are searched in the database. Here the index comes into play again: fast query processing is made possible by the index structure built previously.

3. The retrieved documents are then ranked according to relevance before they are sent to the user.

4. Not every retrieved document belongs to the answer set the user wants. The user examines the set of retrieved and ranked documents for useful information.

The user may be interested in a subset of these documents and initiate a user feedback cycle.
The user feedback cycle is the process in which documents selected by the user are used to reformulate the query and so improve retrieval.
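
The four steps above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the text operation is reduced to lowercasing and tokenizing, relevance is scored by summed term frequency (one simple choice among many ranking schemes), and feedback is modelled by appending the terms of the documents the user selected to the query. The sample documents and all function names are hypothetical.

```python
import re
from collections import Counter, defaultdict


def normalize(text):
    """Stand-in for the text operations: lowercase and split into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to a Counter of {doc_id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term][doc_id] += 1
    return index


def search(query, index):
    """Return document ids ranked by summed term frequency of the query terms."""
    scores = Counter()
    for term in normalize(query):  # same text operations as for the documents
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties are broken by the smaller document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


def refine_query(query, selected_ids, docs):
    """Feedback step: extend the query with terms from user-selected documents."""
    terms = normalize(query)
    for doc_id in selected_ids:
        for term in normalize(docs[doc_id]):
            if term not in terms:
                terms.append(term)
    return " ".join(terms)


# Hypothetical sample collection.
docs = {
    1: "web crawler follows links",
    2: "search engine index makes search fast",
    3: "the index of a search engine",
}
index = build_index(docs)

first_results = search("index", index)            # initial ranked answer
new_query = refine_query("index", [3], docs)      # user marks document 3 as useful
second_results = search(new_query, index)         # ranking after feedback
print(first_results, second_results)
```

In this toy collection, documents 2 and 3 tie for the query "index"; once the user selects document 3, the reformulated query moves it to the top of the ranking.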



This is a simple overview of how a search engine works.