Current SPIRES and CDS differences in author searching

Author searches in SPIRES and CDS currently differ considerably in precision and recall. For example, if you search for "Ellis, J R" in CDS, you obtain exact matches only, followed by a box proposing similar author names:

  See also: similar author names
     1  Ellis, John Rolfe
   746  Ellis, Jonathan Richard
     1  Ellis, Juanita

See Ellis, J R at CDS. This way end users can see potentially interesting similar (in some cases identical) author names and can check them out further if they wish.

SPIRES directly enriches the user query with similar results right from the start, which is often what users want, but which may also lead to false positives. E.g. compare the CDS and SPIRES results for "Ellis, Jacqueline": Ellis, Jacqueline in CDS, Ellis, Jacqueline in SPIRES.

The Invenio software comes with the CDS behaviour by default, but it can be easily altered to mimic the SPIRES behaviour if we consider this to be a better default. Note that there was already a question in this direction in the recent HEP IS poll.
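To make the precision/recall trade-off concrete, here is a minimal Python sketch of the two matching styles. It is not the actual CDS or SPIRES algorithm; the function names (parse_name, strict_match, loose_match) and the prefix-compatibility rule are assumptions made purely for illustration.

```python
def parse_name(name):
    """Split 'Surname, Given Names' into (surname, [given-name tokens]), lowercased."""
    surname, _, given = name.partition(",")
    tokens = given.replace(".", " ").split()
    return surname.strip().lower(), [t.lower() for t in tokens]


def strict_match(query, candidate):
    """CDS-like behaviour (assumed): the name must match exactly."""
    return parse_name(query) == parse_name(candidate)


def loose_match(query, candidate):
    """SPIRES-like behaviour (assumed): same surname, and each given-name
    token is compatible with the one in the same position, i.e. one is a
    prefix of the other ('J' is compatible with 'John' and 'Jacqueline')."""
    q_surname, q_given = parse_name(query)
    c_surname, c_given = parse_name(candidate)
    if q_surname != c_surname:
        return False
    return all(c.startswith(q) or q.startswith(c)
               for q, c in zip(q_given, c_given))


# Hypothetical index of author name forms, taken from the examples above.
names = ["Ellis, J R", "Ellis, John Rolfe", "Ellis, Jonathan Richard",
         "Ellis, Juanita", "Ellis, Jacqueline"]

print([n for n in names if strict_match("Ellis, J R", n)])
print([n for n in names if loose_match("Ellis, J R", n)])
print([n for n in names if loose_match("Ellis, Jacqueline", n)])
```

In this sketch the loose search for "Ellis, Jacqueline" also picks up "Ellis, J R", since the initial "J" is compatible with "Jacqueline". That is the kind of false positive the comparison above refers to.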

Search results clustering in Inspire: a proposal

1. The strict-or-loose matching behaviour should be user configurable.

2. As for choosing the default behaviour, we can possibly come up with a compromise between currently-somewhat-strict CDS and currently-somewhat-loose SPIRES. One idea would be to propose built-in clustering by author names for author searches by default, along the lines of what we already do for clustering by collections for any query: higgs boson at CDS.

An ASCII-art sketch of how it could look. For collections:

|   Query: higgs boson               Found 1234 hits in 0.50 seconds.
|   Results split by collections:
|   All results (1234)  Preprints (768) Books (20) Videos (350) [more]
|    1. On the foo and bar...

and for author searches:

|   Query: Ellis J                   Found 1234 hits in 0.50 seconds.
|   Results split by author names:
|   All results (1234) Ellis, John (895) Ellis, Juanita (333) [...]
|    1. On the foo and bar...

with cluster names being of course clickable and expandable etc.
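The author-name navigator line in the mock-up could be computed roughly as follows. This is only a sketch under stated assumptions: the hit structure (a title plus a list of author names), the function names and the sample records are all hypothetical, and nothing here is existing Invenio code.

```python
from collections import Counter


def matches(author, surname, initial):
    """True if 'Surname, Given' has the given surname and first initial."""
    a_surname, _, a_given = author.partition(",")
    return (a_surname.strip().lower() == surname.lower()
            and a_given.strip().lower().startswith(initial.lower()))


def cluster_by_author(hits, surname, initial):
    """Return (number of matching hits, Counter of matching name forms)."""
    counts = Counter()
    total = 0
    for _title, authors in hits:
        matched = {a for a in authors if matches(a, surname, initial)}
        if matched:
            total += 1
            counts.update(matched)  # each name counted once per hit
    return total, counts


def navigator_line(total, counts, max_clusters=2):
    """Format the clusters like the 'Results split by author names' line."""
    parts = ["All results (%d)" % total]
    for name, n in counts.most_common(max_clusters):
        parts.append("%s (%d)" % (name, n))
    if len(counts) > max_clusters:
        parts.append("[...]")
    return " ".join(parts)


# Hypothetical search hits: (title, list of author names).
hits = [
    ("On the foo and bar", ["Ellis, John", "Smith, A"]),
    ("Foo revisited", ["Ellis, John"]),
    ("Bar in practice", ["Ellis, Juanita"]),
    ("Foo and bar together", ["Ellis, John", "Ellis, Jane"]),
    ("More bar", ["Ellis, Juanita"]),
    ("Unrelated", ["Smith, A"]),
]

total, counts = cluster_by_author(hits, "Ellis", "J")
print(navigator_line(total, counts))
```

Note that a record co-authored by two matching authors falls into two clusters, so the cluster counts can add up to more than the total number of hits.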

Note that once we have processed all the fulltext files with the BibClassify keyword extractor to obtain automatic keywords according to the HEP RDF taxonomy, we can similarly offer an option to cluster the search results by keywords. This is roughly what Vivisimo does: higgs boson at Vivisimo.

Yet another example of this kind is the clustering of book searches according to the Dewey classification system. Some Invenio clients use virtual collections for this: MeIND.

All of these can be thought of as "search results navigators" that help users navigate inside, or disambiguate, search results. We can improve the current "cluster by collection" functionality of Invenio to fit all these use cases, starting with the author search.

-- TiborSimko - 26 Jun 2007

Note that I think that this is in general a wonderful idea. However, most users want something that works mostly right, most of the time, and the SPIRES algorithm should probably be the default default.

This is especially important for author searches that result not in sets of papers but in citation statistics. These searches need to present results quickly, not after several clicks. Having a disambiguation option on the page with the default (rather than minimally combined) results would be the ideal way, in my mind. In other words, take the current SPIRES output and add a disambiguation method on the same page. Users who know what they are doing are not slowed down, but others can refine based on the clustering.

-- TravisBrooks - 27 Jun 2007

It is also worth noting the results of the poll.

When you search, which do you prefer? (answered by 74% of respondents)

  • 34% To be suggested alternatives to refine your search
  • 33% To search for a precise term and get precise results only
  • 26% That your search is automatically extended to synonyms and closer terms (little noise but few hits)
  • 4% That your search is automatically extended to a wide range of synonyms and close matches (more hits but more noise)

-- SalvatoreMele - 27 Jun 2007

