Technical searching in a foreign language: which search engine?
Up To infotechnologies (www.uptoit.org)
Mediterranean Notebook #4
Published in:
Business Information Searcher, 2001;11(1):17-20
Introduction
When searching for scientific and medical information, researchers turn to professional databases rather than the "open" Web. However, the Web is increasingly useful for investigating the commercial, technical or legislative aspects of the scientific and medical fields. On the Web, for example, one can find: medical care guidelines endorsed by a health ministry, reimbursement policies for medical devices, information on companies producing wind-driven energy generators, descriptions of conferences on robotics and automation, and the texts of public protests against radiation from high-frequency antennas. This type of information adds value and perspective to a scientific search using traditional sources, especially when it relates to a particular European nation.
In order to offer an information research service from Italy, researchers at Up To infotechnologies needed to identify the best search engines for technical searching in the Italian Web. This problem has arisen mostly in the last few years, as the number of regional (national) search engines has increased and as "global" search engines have created foreign language interfaces (e.g. Excite, Altavista, Lycos). Despite many comparisons of search engines published in industry journals and online (e.g. www.searchenginewatch.com), little has been written about Web searching in languages other than English. Italian search engines were compared to their global counterparts in a 1999 study by G. Tonini¹ of ENEA, a member of MIDAS-NET (INFO2000 program). This study concluded that Italian search engines offered little advantage over global ones in terms of the size or quality of their databases. In addition, the study noted that the interfaces of Italian engines were less sophisticated and made complex searching more difficult. Because the Web has grown considerably in the 2 years since that study, especially in Italy, it seemed opportune to examine Italian search engines once again, to identify the best tool for technical searching in the Italian Web.
Methods
Search engine performance was compared on 10 different technical queries. The queries (Table 1) were selected from recent newspapers and represent medical and scientific subjects of current interest. Of the 10 queries, 7 were phrases, 2 were single words and 1 involved a Boolean operator. Since the study focused on the search engines' databases and not their interfaces, more complex queries were avoided. Additionally, all but 3 queries consisted of words unique to the Italian language, so it was not necessary to specify a search language; in the other 3 cases, results were limited to pages in Italian. The queries were slightly modified in accordance with each search engine's rules, e.g. "apicectomi*" was used for Altavista but "apicectomia OR apicectomie" was written for Google.
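The query adaptation described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study: the function name and the capability table are hypothetical, and only the two behaviours shown (truncation on Altavista, an explicit OR on Google) come from the text.

```python
def adapt_truncated_term(stem, endings, supports_wildcard):
    """Build the query for a term that has several word endings.

    If the engine accepts the '*' truncation operator, the stem is
    truncated; otherwise every word form is spelled out and joined with OR.
    """
    if supports_wildcard:
        return stem + "*"
    return " OR ".join(stem + ending for ending in endings)


# Hypothetical capability table, limited to the two engines named in the text.
ENGINE_SUPPORTS_WILDCARD = {"Altavista": True, "Google": False}

for engine, wildcard in ENGINE_SUPPORTS_WILDCARD.items():
    print(engine, "->", adapt_truncated_term("apicectomi", ["a", "e"], wildcard))
```

Keeping the engine rules in a single table makes it easy to check that every engine received an equivalent query.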
Table 1. Technical search queries and their English language equivalents. Quotation marks indicate a phrase search.
| Set | Search query | English equivalent |
| --- | --- | --- |
| Medical | "incontinenza fecale" | fecal incontinence |
| Medical | "encefalite spongiforme" | spongiform encephalopathy |
| Medical | apicectomi* (apicectomia OR apicectomie) § | apicoectomy |
| Medical | "cefalea a grappolo" | cluster headache |
| Medical | "cura canalare" | root canal |
| Scientific | "energia eolica" § | wind power |
| Scientific | "lamiere grecate" | profiled (corrugated) metal sheets |
| Scientific | "trasporto intelligente" | intelligent transport systems |
| Scientific | domotica § | domotics (robotics for the home) |
| Scientific | elettrosmog AND "impatto ambientale" | "radiation from electromagnetic fields" AND "environmental impact" |
§ In these cases, search results were restricted to the Italian language.
Italian search engines were selected empirically. To be tested, an Italian search engine had to provide information explaining how to formulate a search. For example, it was necessary to know if truncation or Boolean operators were permitted, and how to indicate a phrase search. If a search engine repeatedly gave errors on preliminary searches, it was eliminated. For comparison, 3 well-known global search engines were selected: Google, Lycos and Altavista. The advanced search interface was used whenever possible (Lycos was accessed at http://lycospro.lycos.com).
The test was run on two consecutive days in late April 2001, in the late morning and early afternoon CET, using Netscape Navigator 4.7 unless otherwise stated. After entering the query, three observations were made:
- Total number of results indicated by the search engine. If a search engine was inconsistent about the total number of hits (i.e. the total changed from one page of results to another), a best approximation was made from 5 or more results pages.
- Number of "dead links", i.e. those not corresponding to an active page, among the first 20 results examined individually.
- Number of "irrelevant links" in the first set of 20 results. A result was considered relevant if the search terms were present anywhere on the page or in the meta-tags.
The number of valid results in the first set of 20 was determined as: 20 - (dead + irrelevant). The percent accuracy was calculated as: valid/20 x 100. The total number of valid results was estimated as: (percent accuracy/100) x total number of results.
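The three calculations above can be written as a short Python sketch. The function name and the example figures are illustrative assumptions, not data from the study.

```python
def score_first_page(total_results, dead, irrelevant, sample_size=20):
    """Apply the scoring rules described above to one query on one engine.

    Returns (valid, accuracy_pct, estimated_valid), where
      valid           = sample_size - (dead + irrelevant)
      accuracy_pct    = valid / sample_size x 100
      estimated_valid = accuracy_pct / 100 x total_results
    """
    if dead < 0 or irrelevant < 0 or dead + irrelevant > sample_size:
        raise ValueError("dead + irrelevant must lie between 0 and sample_size")
    valid = sample_size - (dead + irrelevant)
    accuracy_pct = 100 * valid / sample_size
    estimated_valid = total_results * valid / sample_size
    return valid, accuracy_pct, estimated_valid


# Illustrative numbers only (not taken from the study):
# 600 reported hits, 2 dead links and 1 irrelevant link in the first 20.
valid, accuracy_pct, estimated_valid = score_first_page(600, dead=2, irrelevant=1)
print(valid, accuracy_pct, estimated_valid)
```

Note that the estimate extrapolates the accuracy of the first 20 results to the whole result set, which assumes that later results are about as reliable as the first page.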
Results
Four Italian search engines were chosen for study:
- Arianna (http://arianna.iol.it) is owned by Italia On Line and uses a proprietary spider.
- Ciaoweb, whose results suggest that it draws on the same database as Altavista.
- Excite (http://www.excite.it), the Italian interface of Excite@Home, is 70% owned by Tiscali and claims to index 25,000,000 Italian pages.
- Il Trovatore (http://www.iltrovatore.it) uses proprietary technology.
Two search engines were considered for analysis but then disqualified. SuperEva permits searching in the Italian Web or the entire Web (via Google), but was disqualified because it gave too little guidance on how to formulate a search. Katalogo, powered by Inktomi, was disqualified for server errors and a failed preliminary search (i.e. the phrase query "incontinenza fecale" returned results corresponding to "incontinenza OR fecale"). Ciaoweb had fatal errors in Netscape (it could not return the second page of results), and so was tested with Microsoft's Internet Explorer instead.
When the 10 queries were run on the 7 search engines, the average number of results returned per query ranged from 18 (SD, 15) for "cura canalare" to 617 (SD, 351) for "energia eolica" (Fig. 1). The number of results given by different search engines for any one query varied 4- to 10-fold, indicating a wide variability in performance. Lycos ranked first in returning the highest number of results on all searches but one, and Google ranked second most often. Il Trovatore most often ranked last. Ciaoweb and Altavista always gave an identical number of results per query, although the first 20 results listed were not identical for the two search engines. This implies that the two search engines use the same database but have different ranking or listing criteria.
Fig. 1. Number of results per search query, by search engine

When data from the 10 searches were pooled, Lycos had the greatest number of total results (Fig. 2), followed by Google. Among Italian search engines, Arianna gave the highest number of hits (slightly more than Altavista). On these pooled results, search engine accuracy ranged between 82% (SD, 12%) for Altavista and 95% (SD, 8%) for Excite.it, for an average of 12% invalid (dead or irrelevant) results. Interestingly, a slightly higher accuracy was found for Italian search engines, whose databases are smaller than those of the global search engines (Table 2). The mean accuracy of Ciaoweb on the first 20 hits was slightly higher than that of Altavista, probably because of the different order of presentation of the results.
Fig. 2. Total and estimated valid results on a set of 10 technical term searches, by search engine
Table 2. Percent accuracy of results, by search engine. Values are means (standard deviations) for 10 searches. Accuracy was determined as the percent of valid (neither dead nor irrelevant) results among the first 20 results listed by the search engine.
| Group | Search engine | Accuracy, % |
| --- | --- | --- |
| Global | Altavista | 82 (12) |
| Global | Google | 86 (12) |
| Global | Lycos | 88 (9) |
| Regional (Italian) | Arianna | 85 (10) |
| Regional (Italian) | Ciaoweb | 93 (8) |
| Regional (Italian) | Excite.it | 95 (8) |
| Regional (Italian) | Il Trovatore | 90 (8) |
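As a check on the figures in Table 2, the group means and the overall share of invalid results can be recomputed from the per-engine accuracies. The sketch below assumes that the overall figure is the unweighted mean across the 7 engines; the article does not say how the average was formed.

```python
from statistics import mean

# Mean accuracy (%) per engine, copied from Table 2.
ACCURACY_PCT = {
    "Altavista": 82, "Google": 86, "Lycos": 88,
    "Arianna": 85, "Ciaoweb": 93, "Excite.it": 95, "Il Trovatore": 90,
}
ITALIAN_ENGINES = {"Arianna", "Ciaoweb", "Excite.it", "Il Trovatore"}


def summarize(accuracy, italian):
    """Return (global_mean, italian_mean, overall_mean) as unweighted means."""
    global_mean = mean(v for name, v in accuracy.items() if name not in italian)
    italian_mean = mean(v for name, v in accuracy.items() if name in italian)
    overall_mean = mean(accuracy.values())
    return global_mean, italian_mean, overall_mean


global_mean, italian_mean, overall_mean = summarize(ACCURACY_PCT, ITALIAN_ENGINES)
print(f"Global engines:  {global_mean:.1f}%")
print(f"Italian engines: {italian_mean:.1f}%")
print(f"Invalid results overall: {100 - overall_mean:.1f}%")
```

Under this assumption the invalid share rounds to the 12% quoted in the text, and the Italian engines average a few percentage points higher than the global ones.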
Discussion
For technical searching in the Italian Web, Lycos - followed by Google - gave the greatest number of total and estimated valid results. Based on the results of this late-April 2001 study, these search engines are better tools than the strictly Italian engines for performing a thorough search of the Italian Web. Among Italian search engines, Arianna gave the highest number of results, although its accuracy (85%) was the lowest of the Italian engines tested. Overall, the differences in search engine accuracy were minor in comparison to the variability in number of results. Across all engines, approximately 12% of the first 20 results were invalid, which suggests that a similar share of indexed Web pages are transient, temporarily inaccessible or subject to frequent changes. These errors represent the "background noise" of Web searching.
The present study agrees with many of the conclusions reached in 1999 by Tonini. He commented that Italian search engines were unsophisticated ("rozzo") compared to their global counterparts; unfortunately, this is still true today. Tonini found an average of 14% invalid results in 1999, similar to the present 12%. Tonini concluded that Italian search engines offered little advantage, while the present study found global search engines decidedly better. Finally, Tonini suggested that his analysis would remain valid for only a few months, due to the rapid evolution of the Web. Since 1999, the search engine rankings have clearly changed, and such rankings remain just as transient today.
Altavista, the engine that taught us Web searching years ago, gave disappointingly few results. Altavista's accuracy (82%), however, was notably better than in a preliminary study (67%) done in January 2001; in contrast, there were no changes in the accuracies of Lycos, Google or Arianna. Altavista has recently modified its graphical layout. According to G. Notess², Altavista has also made changes to its database and no longer includes RealNames, GoTo or LookSmart. Perhaps Altavista has done some spring cleaning, yielding fewer overall hits with an acceptable accuracy.
That Lycos and Google gave the greatest number of results on the test searches is not surprising. These search engines vie to have the most complete coverage of the Web, and Google claims to contain 1,346,966,000 pages in its database. While the present study indicates that Lycos presently has the best coverage of the Italian Web, a preliminary study in January 2001 placed Google in an easy lead. This change reflects the dynamic nature of search engines and of the Web itself - both are subject to continual turnover and growth. Thus, the search engine rankings of the present study can be expected to change monthly if not daily. Today's best search tool may not be so tomorrow, as has been seen with Altavista. The wise searcher will do a preliminary evaluation of different search engines before embarking on any major Web research project, especially since poor performance on a given day may be due to transient search engine difficulties.
Given the dynamic nature of the Web, estimates of the total number of pages indexed may not be relevant to searching. Excite.it claims to index 25,000,000 Italian Web pages. A special search ("l:ita") on Lycos found 10,207,207 pages in Italian (up from 8,260,000 in January 2001). Yet in the present study, Lycos found on average 3 times more results than Excite.it. This discrepancy may reflect differences in search algorithm, in database quality, or in the way the search engines estimate the number of results.
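The arithmetic behind this comparison can be made explicit with the page counts quoted above. The sketch below only restates those figures; the variable and function names are illustrative.

```python
# Page counts quoted in the text.
EXCITE_IT_CLAIMED_PAGES = 25_000_000   # Italian pages claimed by Excite.it
LYCOS_ITALIAN_PAGES_JAN = 8_260_000    # Lycos "l:ita" search, January 2001
LYCOS_ITALIAN_PAGES_NOW = 10_207_207   # Lycos "l:ita" search, present study


def percent_growth(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100


def size_ratio(a, b):
    """How many times larger a is than b."""
    return a / b


growth = percent_growth(LYCOS_ITALIAN_PAGES_JAN, LYCOS_ITALIAN_PAGES_NOW)
claimed = size_ratio(EXCITE_IT_CLAIMED_PAGES, LYCOS_ITALIAN_PAGES_NOW)
print(f"Lycos Italian pages grew by {growth:.1f}% since January 2001")
print(f"Excite.it claims an index {claimed:.2f} times the size of Lycos's Italian set")
```

On paper, Excite.it's index is roughly 2.4 times larger, yet Lycos returned about 3 times more results on the test queries, which is why index size alone is a poor guide to search performance.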
The number of results is not the only important characteristic of a search engine. Equally important is a search interface that permits the construction of a complex query, leading to fewer, more relevant results rather than a slew of less-relevant hits. Although the present study did not compare search interfaces, global search engines provided more information on how to execute a search than did the regional (Italian) search engines. This lack of documentation and the poorly conceived interfaces lead professional researchers to use global search engines even for the non-English Web. This problem is exemplified by a comparison between Altavista and Ciaoweb: although the databases are apparently identical, Ciaoweb suffered from server failures and a less favorable search interface.
Interestingly, a recent survey of 850 Web users by Hi-Flier (www.hi-flier.com), a Florence-based firm, revealed that 60% of Italian "navigators" do not know how to use search engines effectively and encounter frequent difficulties in finding the right information. Based on the experience of the present study, these difficulties likely stem from confusing and inadequately explained search interfaces. The "simple" search interfaces are too simple to provide meaningful results, while the "advanced" interfaces are cumbersome and illogical. Rather than provide a traditional text box in which the researcher can compose a complex query using parentheses and Boolean operators (as Altavista did until recently), search engines are opting for "form searching". Here, the user enters query terms in a complex form, selecting among radio buttons and drop-down boxes of the sort "must contain" or "can contain". Not even the best documentation explains the relationships between one part of the form and another, and only with empirical testing can a searcher be sure. Clearly, greater dialog between searchers and search engine programmers is necessary.
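To illustrate why form searching is ambiguous, the sketch below translates a "must contain" / "can contain" form into a Boolean text query under one assumed interpretation: every "must contain" term is required, and at least one "can contain" term is required. This is not the documented behaviour of any search engine, the function names are hypothetical, and the term "antenne" is only an example.

```python
def quote_if_phrase(term):
    """Wrap multi-word terms in quotation marks to mark a phrase search."""
    return f'"{term}"' if " " in term else term


def form_to_boolean(must_contain, can_contain):
    """Translate form fields into a Boolean query (assumed interpretation).

    Assumption: "must contain" terms are joined with AND; "can contain"
    terms form a single OR group that is itself ANDed with the rest.
    """
    parts = [quote_if_phrase(term) for term in must_contain]
    if can_contain:
        group = " OR ".join(quote_if_phrase(term) for term in can_contain)
        parts.append(f"({group})" if len(can_contain) > 1 else group)
    return " AND ".join(parts)


print(form_to_boolean(["elettrosmog"], ["impatto ambientale", "antenne"]))
```

An engine could just as plausibly treat "can contain" terms as purely optional ranking hints, which would change the result set; a text box with explicit operators leaves no such doubt.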
¹ http://www2.enea.bologna.it/MIDAS_NET/searchengines_IT_table.html
² http://www.searchenginewatch.com
Up To info technologies scrl
Via G. Battistella 4/2
31053 Pieve di Soligo (TV) Italy
Tel./fax +39-0438-842337
Contact info@uptoit.org for further information.
© 2001 Up To info technologies s.c.r.l. All rights reserved.