India has launched SANDHAN, an Indian-language search engine for the tourism domain, to meet the information needs of people not conversant with English.
Unveiling the search engines for Bengali, Hindi, Marathi, Tamil and Telugu, J. Satyanarayana, the IT Secretary, said six years of research had resulted in this milestone but that it was only the beginning. He said that for the search engines to succeed, it is equally important to develop and promote content in Indian languages. The real success, he said, would come when even village-level e-services are available in local languages.
SANDHAN was developed over six years by 120 researchers from 12 institutions, led by Dr. Pushpak Bhattacharya under the supervision of TDIL, DeitY. The project aims to satisfy users' information needs through text documents available on the web, said a statement.
The search engine accepts a query in one of five Indian languages: Bengali, Hindi, Marathi, Tamil or Telugu. The query is processed to retrieve a set of relevant documents in the same language from tourism-domain data crawled from the World Wide Web (WWW). The retrieved documents are presented to the user as a list ordered by relevance.
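The flow described above (take a query in a known language, keep only documents in that language, return them ordered by relevance) can be sketched in a few lines of Python. This is only an illustration: the statement does not say how SANDHAN scores relevance, so a basic TF-IDF weighting stands in for it here, and the `CORPUS` contents, the language codes and the `search` function are all hypothetical.

```python
import math
from collections import Counter

# Toy corpus of (language code, document text) pairs, invented for illustration.
CORPUS = [
    ("hi", "ताज महल आगरा में है"),
    ("hi", "आगरा का किला आगरा में है"),
    ("hi", "गोवा के समुद्र तट"),
    ("ta", "சென்னை மெரினா கடற்கரை"),
]


def search(query, lang, corpus=CORPUS, top_k=10):
    """Return up to top_k (score, text) pairs in the query language, best first.

    Assumption: whitespace tokenization and TF-IDF scoring are stand-ins,
    not SANDHAN's actual query processing or ranking.
    """
    # Keep only documents written in the same language as the query.
    docs = [text.split() for doc_lang, text in corpus if doc_lang == lang]
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))

    results = []
    for tokens in docs:
        tf = Counter(tokens)
        score = sum(
            tf[term] * math.log(1 + n_docs / df[term])
            for term in query.split()
            if term in tf
        )
        if score > 0:
            results.append((score, " ".join(tokens)))

    # Order the list by relevance, highest score first.
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results[:top_k]
```

Calling `search("आगरा", "hi")` ranks the document that mentions the term twice above the one that mentions it once, while the same query with `lang="ta"` returns an empty list because no Tamil document contains it.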
Apart from tourism, sectors such as business and academia would also benefit from SANDHAN, and it can be deployed as part of e-governance and e-learning, it said.
Here are some key features of SANDHAN:
• The system is developed to satisfy users' information needs in the tourism domain.
• Users can submit a query using either an in-script keyboard or a phonetic keyboard. With the in-script option, the user can type on the physical keyboard or use the on-screen keyboard.
• It processes the query according to its language and retrieves results only from that language.
• Snippets generated for each retrieved document help the user understand the context of the query terms in that document.
• A summary is generated for each retrieved document, giving the user an idea of its overall content without opening it.
• An additional URL-based semantic search facility is provided for the Tamil language.
• Ten results are displayed at a time to improve readability.
• Many Indian-language web pages use custom fonts, which makes it difficult for the system to retrieve documents. SANDHAN uses a font transcoder that converts custom-font text into Unicode for processing.
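Two of the features listed above, handling a query according to its language and showing a snippet around the query terms, can be illustrated with a short Python sketch. It is not SANDHAN's implementation: the function names `detect_script` and `make_snippet`, the `SCRIPT_LANGS` table and the `width` parameter are hypothetical. It also assumes the text is already Unicode, that is, after any font transcoding. Because Hindi and Marathi share the Devanagari script, script detection alone cannot tell them apart, so the table maps that script to both languages.

```python
import unicodedata
from collections import Counter

# Scripts used by the five supported languages (hypothetical lookup table).
# Devanagari is shared by Hindi and Marathi, so it maps to both.
SCRIPT_LANGS = {
    "BENGALI": ["Bengali"],
    "DEVANAGARI": ["Hindi", "Marathi"],
    "TAMIL": ["Tamil"],
    "TELUGU": ["Telugu"],
}


def detect_script(text):
    """Return the most frequent supported script in text, or None."""
    counts = Counter()
    for ch in text:
        # Unicode character names start with the script name,
        # e.g. "TAMIL LETTER TA" or "DEVANAGARI VOWEL SIGN AA".
        script = unicodedata.name(ch, "").split(" ", 1)[0]
        if script in SCRIPT_LANGS:
            counts[script] += 1
    return counts.most_common(1)[0][0] if counts else None


def make_snippet(text, query, width=3):
    """Return the words around the first query term found in text.

    Falls back to the opening words of the text when no term matches.
    """
    tokens = text.split()
    terms = set(query.split())
    for i, token in enumerate(tokens):
        if token in terms:
            start = max(0, i - width)
            return " ".join(tokens[start:i + width + 1])
    return " ".join(tokens[:2 * width + 1])
```

A real system would need a further step, such as a user setting or a statistical language identifier, to separate Hindi queries from Marathi ones.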
SANDHAN is a project of a consortium of academic and research institutions and industry partners. The institutes involved are IIT Bombay (Consortium leader), CDAC Noida (Co-Consortium leader), IIT Kharagpur, Dhirubhai Ambani Institute of Information & Communication Technology Gandhinagar, Anna University-Centre for Electronics, Anna University-Knowledge-based Computing Centre, CDAC Pune, Gauhati University, Indian Institute of Information Technology Bhubaneswar, International Institute of Information Technology Hyderabad, ISI Kolkata and Jadavpur University. It was conceptualized, developed and funded as a national-level project in the emerging area of Information Retrieval and Access in Indian Languages by the Department of Electronics & Information Technology (DeitY).
The SANDHAN project has been put together by Technology Development for Indian Languages (TDIL), a flagship programme of DeitY involved in research, development, standardization and proliferation of language technology in India in the 22 constitutionally recognized Indian languages. The TDIL Programme is also associated with international standardization bodies such as the Unicode Consortium, W3C, IETF and ELRA.
The link for SANDHAN is www.tdil-dc.in/sandhan