System Design: BibClassify
1. Introduction
BibClassify is a utility used to extract semantic information from documents and
2. Use cases
Currently, there are two ways to use BibClassify:
- standalone file mode: BibClassify is run from the sources to extract keywords from documents and output them as text.
- daemon mode: on an Invenio installation, BibClassify updates the MARC representation of records by adding keywords.
3. Workflow
4. Mock-up screenshots
Several keywords tag cloud representations
Single keywords are represented in grey and composite keywords in blue.
Keywords sorted alphabetically without index
Keywords sorted by weight (number of occurrences)
Keywords sorted by number of related documents
Thoughts and suggestions
Several parameters can lead to different results in terms of tag cloud representations:
- Mix single and composite keywords.
- Show index (Is it relevant to show the index or is the size of the font sufficient to give an idea of the importance?)
- Number of keywords to show (the examples show 20 single keywords and 20 composite keywords)
Another problem to solve is how to represent keywords coming from different sources (human/machine, different ontologies). A human will of course not count the number of occurrences of a keyword in the document. Several options are possible:
- Give the 'human' keywords the biggest importance
- Only show the 'human' keywords when there are some
- Allow the human to manually specify the weight of the keyword (8 levels are in the example)
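The last option can be sketched as follows. This is only an illustration: the linear scaling of machine counts onto the 8 levels, the rule that a human-assigned level overrides the machine one, and the names `to_level` and `cloud_levels` are all assumptions, not part of BibClassify.

```python
LEVELS = 8  # number of weight levels, as in the mock-up example


def to_level(count, max_count, levels=LEVELS):
    """Scale an occurrence count linearly onto 1..levels (assumed scheme)."""
    if max_count <= 0:
        return 1
    # Integer ceiling of count * levels / max_count, clamped to the range.
    level = (count * levels + max_count - 1) // max_count
    return max(1, min(levels, level))


def cloud_levels(machine_counts, human_levels):
    """Merge machine keywords (occurrence counts) with human keywords (levels).

    Human-assigned levels take precedence when a keyword has both.
    """
    max_count = max(machine_counts.values(), default=0)
    cloud = {kw: to_level(n, max_count) for kw, n in machine_counts.items()}
    cloud.update(human_levels)
    return cloud
```

With this scheme, both kinds of keyword end up on the same 1 to 8 scale, so the tag cloud can pick a font size from the level alone.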
Finally, the BibClassify tag cloud representation could be used to draw tag clouds for sets of documents. As the weight, we could use either the number of documents having the keyword or the sum of the occurrences of the keyword across all documents.
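The two candidate weights for a set of documents can be compared with a small sketch. It assumes each document is available as a mapping from keyword to occurrence count; the function names are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Weight = number of documents in which the keyword occurs."""
    weights = Counter()
    for keywords in docs:
        weights.update(kw for kw, n in keywords.items() if n > 0)
    return weights


def total_occurrences(docs):
    """Weight = sum of the keyword's occurrences over all documents."""
    weights = Counter()
    for keywords in docs:
        weights.update(keywords)  # adds the per-document counts
    return weights
```

The first weight favours keywords that are widespread in the set, while the second can be dominated by a single document that repeats a keyword many times.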
5. Architecture
6. API
- Create an end-user view of BibClassify: this task involves writing a new view of the record that shows lists and/or tag clouds of keywords related to the document.
- Acronym extraction: a prototype for the extraction of acronyms exists. Develop this prototype further and validate the results of the extraction. Design and develop a way for these results to be used within Inspire (e.g. creation of acronym lists, relations with other modules).
- Develop a test platform for BibClassify that will compare the results produced by BibClassify and the keywords chosen by humans.
- Field code: determine field code of an article using the association of keywords with field codes in the taxonomy (need to include new relation in taxonomy)
- Core papers: determine whether an article is hep core from core tags in taxonomy (to be defined)
For both of these suggestions, create the core and field properties. The core property could be boolean-like, while the field property could contain the field.
<Concept rdf:about="http://cern.ch/thesauri/HEPontology.rdf#Composite.baryonhybrid">
<prefLabel xml:lang="en">baryon: hybrid</prefLabel>
<compositeOf rdf:resource="http://cern.ch/thesauri/HEPontology.rdf#baryon"/>
<compositeOf rdf:resource="http://cern.ch/thesauri/HEPontology.rdf#hybrid"/>
<core />
<field>hp-th</field>
</Concept>
BibClassify could then display some statistics in the text output in different ways such as: "Document has xx core keywords and yy keywords in the zz field."
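A minimal sketch of how such a statistic could be derived from the proposed properties is shown below. Several things here are assumptions: the enclosing rdf:RDF element, the use of the SKOS namespace as the default namespace, the second concept in the sample, and the function names `read_properties` and `summarize`. None of them is taken from the BibClassify code.

```python
import xml.etree.ElementTree as ET

# Assumption: the taxonomy's default namespace is SKOS; adjust if it differs.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical taxonomy fragment using the proposed core and field properties.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://www.w3.org/2004/02/skos/core#">
  <Concept rdf:about="http://cern.ch/thesauri/HEPontology.rdf#Composite.baryonhybrid">
    <prefLabel xml:lang="en">baryon: hybrid</prefLabel>
    <core />
    <field>hp-th</field>
  </Concept>
  <Concept rdf:about="http://cern.ch/thesauri/HEPontology.rdf#baryon">
    <prefLabel xml:lang="en">baryon</prefLabel>
    <field>hp-th</field>
  </Concept>
</rdf:RDF>"""


def read_properties(rdf_text):
    """Map each concept label to a (is_core, field) pair."""
    root = ET.fromstring(rdf_text)
    props = {}
    for concept in root.iter("{%s}Concept" % SKOS):
        label = concept.findtext("{%s}prefLabel" % SKOS)
        is_core = concept.find("{%s}core" % SKOS) is not None
        field = concept.findtext("{%s}field" % SKOS)  # None when absent
        props[label] = (is_core, field)
    return props


def summarize(keywords, props, field):
    """Build the statistics line for the keywords found in one document."""
    found = [props[kw] for kw in keywords if kw in props]
    core = sum(1 for is_core, _ in found if is_core)
    in_field = sum(1 for _, f in found if f == field)
    return ("Document has %d core keywords and %d keywords in the %s field."
            % (core, in_field, field))
```

Treating the mere presence of the empty core element as "true" matches the boolean-like reading of the property suggested above.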
Let's use our one-letter abbreviations for the field codes. We may assign more than one field code (fc) to a concept, and the count should then be split accordingly: the combination mgt should yield m=1/3, g=1/3, t=1/3, at least in the beginning. It may later be necessary to assign percentages to the field codes according to the correlation between keywords and field codes in the database. A keyword appearing more than once in keyword combinations should count only once. We will probably assign field codes only to main keywords.
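The equal split described above could look like the following sketch. The input shapes (keyword combinations as tuples of main keywords, and a mapping from main keyword to its string of one-letter field codes) and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def field_code_scores(combinations, codes_by_keyword):
    """Split one unit per main keyword equally among its field codes."""
    # A keyword that appears in several combinations counts only once.
    main_keywords = set()
    for combination in combinations:
        main_keywords.update(combination)

    scores = defaultdict(Fraction)
    for keyword in main_keywords:
        codes = set(codes_by_keyword.get(keyword, ""))
        if not codes:
            continue  # no field code assigned to this keyword
        share = Fraction(1, len(codes))
        for code in codes:
            scores[code] += share
    return dict(scores)
```

For example, a keyword tagged mgt adds 1/3 to each of m, g and t. Replacing the equal share with per-code percentages would only change how `share` is computed.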
- HEP taxonomy: currently the taxonomy registers both the composite and compositeOf properties. These two relations are each other's inverse, and BibClassify uses only the compositeOf property. Removing the composite property would eliminate any risk of contradictions, reduce the size of the taxonomy file by more than 20% and reduce the time needed to process the file by up to 30%.
- For composite keywords containing a particle name, allow the latter to be inserted into the other single keyword (e.g. "radioactive Z0 decay").
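Regarding the taxonomy item above: because composite is the inverse of compositeOf, it can be recomputed whenever it is needed, which is why storing it is redundant. A minimal sketch, assuming the compositeOf relation has already been loaded into a mapping from composite keyword to its components (the function name is hypothetical):

```python
from collections import defaultdict


def invert_composite_of(composite_of):
    """Derive the composite relation from compositeOf.

    composite_of maps a composite keyword to its component keywords;
    the result maps each component to the composites it takes part in.
    """
    composite = defaultdict(set)
    for composite_keyword, components in composite_of.items():
        for component in components:
            composite[component].add(composite_keyword)
    return dict(composite)
```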
-- BenoitThiell - 15 Jan 2009
-- AnnetteHoltkamp - 17 Jan 2009