Search results clustering
Current SPIRES and CDS differences in author searching
Currently there is a considerable difference in author searches between SPIRES and CDS as far as precision and recall are concerned. E.g. if you search for "Ellis, J R" in CDS, you will obtain exact results only, followed by a box with proposed similar author names:
See also: similar author names
1 Ellis, John Rolfe
746 Ellis, Jonathan Richard
1 Ellis, Juanita
(see Ellis, J R at CDS), so that end users can see potentially interesting similar (in some cases identical) author names and can check them out further if they wish.
SPIRES directly enriches the user query with similar results right from the start, which is often what users want, but which may also lead to false positives. E.g. compare the CDS and SPIRES results for "Ellis, Jacqueline": Ellis, Jacqueline in CDS, Ellis, Jacqueline in SPIRES.
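To make the strict-versus-loose distinction concrete, here is a minimal Python sketch. It is not the actual CDS or SPIRES matching code: the function names and the matching rules (exact equality for "strict", prefix-compatible given names for "loose") are assumptions chosen only to illustrate how loose matching gains recall at the cost of false positives.

```python
def parse_name(name):
    """Split 'Surname, Given Names' into (surname, [given tokens]), lowercased."""
    surname, _, given = name.partition(",")
    tokens = given.replace(".", " ").split()
    return surname.strip().lower(), [t.lower() for t in tokens]


def strict_match(query, candidate):
    """Hypothetical strict rule: names must be identical after normalisation."""
    return parse_name(query) == parse_name(candidate)


def loose_match(query, candidate):
    """Hypothetical loose rule: same surname, and each pair of given-name
    tokens must be prefix-compatible (e.g. 'j' is compatible with 'john')."""
    q_surname, q_given = parse_name(query)
    c_surname, c_given = parse_name(candidate)
    if q_surname != c_surname:
        return False
    for q, c in zip(q_given, c_given):
        if not (q.startswith(c) or c.startswith(q)):
            return False
    return True


names = ["Ellis, J R", "Ellis, John Rolfe", "Ellis, Jonathan Richard", "Ellis, Juanita"]
strict_hits = [n for n in names if strict_match("Ellis, J R", n)]
loose_hits = [n for n in names if loose_match("Ellis, J R", n)]
```

Under these assumed rules the strict search returns only "Ellis, J R", while the loose search returns all four names. A loose query for "Ellis, Jacqueline" would likewise pick up records signed "Ellis, J R", which is the kind of false positive mentioned above.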
The Invenio software comes with the CDS behaviour by default, but it
can be easily altered to mimic the SPIRES behaviour if we consider
this to be a better default. Note that there was already a question
in this direction in the recent
HEP IS poll.
Search results clustering in Inspire: a proposal
1. The strict-or-loose matching behaviour should be user configurable.
2. As for choosing the default behaviour, we can possibly come up with a
compromise between currently-somewhat-strict CDS and
currently-somewhat-loose SPIRES. One idea would be to propose
built-in clustering by author names for author searches by default,
along the lines of what we already do for clustering by collections
for any query (see higgs boson at CDS).
An ASCII sketch of how it could look. For collections:
| Query: higgs boson Found 1234 hits in 0.50 seconds.
|
| Results split by collections:
| All results (1234) Preprints (768) Books (20) Videos (350) [more]
|
| 1. On the foo and bar...
For author searches:
| Query: Ellis J Found 1234 hits in 0.50 seconds.
|
| Results split by author names:
| All results (1234) Ellis, John (895) Ellis, Juanita (333) [...]
|
| 1. On the foo and bar...
with cluster names being of course clickable and expandable etc.
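The author clustering mocked up above could be computed along these lines. This is only a sketch, not Invenio code: the record layout (a dict with an "authors" list), the function names, and the matching predicate are all hypothetical.

```python
from collections import Counter


def cluster_by_author(records, matches):
    """Count, per matching author name, how many records carry that name.

    `records` is assumed to be a list of dicts with an "authors" list;
    `matches` is any predicate deciding whether a name belongs to the query.
    """
    counts = Counter()
    for record in records:
        # A set ensures a record is counted once per distinct matching name.
        for name in {a for a in record["authors"] if matches(a)}:
            counts[name] += 1
    return counts


def render_clusters(total, counts, limit=2):
    """Build the 'Results split by author names' line from the mock-up."""
    parts = [f"All results ({total})"]
    parts += [f"{name} ({n})" for name, n in counts.most_common(limit)]
    if len(counts) > limit:
        parts.append("[...]")
    return "  ".join(parts)


records = [
    {"title": "On the foo and bar", "authors": ["Ellis, John", "Smith, A"]},
    {"title": "More on foo", "authors": ["Ellis, John"]},
    {"title": "Bar revisited", "authors": ["Ellis, Juanita", "Ellis, John"]},
]
counts = cluster_by_author(records, lambda a: a.startswith("Ellis, J"))
line = render_clusters(len(records), counts)
```

Note that a record co-authored by two matching names is counted in both clusters, so the cluster counts need not add up to the total, just as with overlapping collections.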
Note that once we have processed all the fulltext files with the
BibClassify keyword extractor to get the automatic keywords according
to the HEP RDF taxonomy, we can similarly propose an
option to cluster the search results by keywords, much like what
Vivisimo is doing (see higgs boson at Vivisimo).
Yet another example of this kind is the clustering of book searches
according to the Dewey classification system. Some Invenio clients use
virtual collections for this, e.g. MeIND.
All these can be thought of as being "search results navigators",
helping users to navigate inside, or disambiguate, search results.
We can improve the current "cluster by collection" functionality of
Invenio to fit all these use cases, starting with the author search.
--
TiborSimko - 26 Jun 2007
Note that I think that this is in general a wonderful idea. However, most users want something that works mostly right, most of the time, and the SPIRES algorithm should probably be the default default.
This is especially important for author searches that result not in sets of papers but in citation statistics. These searches need to present results quickly, not after several clicks. Having a disambiguation option on the page with the default (rather than minimally combined) results would be the ideal way, in my mind. In other words, take the current SPIRES output, and add a disambiguation method on the same page. Users who know what they are doing are not slowed down, but others can refine based on the clustering.
--
TravisBrooks - 27 Jun 2007
It is also worth noting the results of the poll.
When you search, which do you prefer? (answered by 74% of respondents)
- 34% To be suggested alternatives to refine your search
- 33% To search for a precise term and get precise results only
- 26% That your search is automatically extended to synonyms and closer terms (little noise but few hits)
- 4% That your search is automatically extended to a wide range of synonyms and close matches (more hits but more noise)
--
SalvatoreMele - 27 Jun 2007