System Design: BibClassify

1. Introduction

BibClassify is a utility used to extract semantic information from documents and

2. Use cases

Currently, there are two ways to use BibClassify:

  • standalone file mode: run from the sources, BibClassify extracts keywords from documents and outputs them as text.
  • daemon mode: on an Invenio installation, BibClassify updates the MARC representation of records by adding keywords.

3. Workflow

4. Mock-up screenshots

Several keyword tag cloud representations

Single keywords are represented in grey and composite keywords in blue.

Keywords sorted alphabetically without index


01-alphabetically.png

Keywords sorted by weight (number of occurrences)


02-weight.png

Keywords sorted by number of related documents


03-number_of_records.png

Thoughts and suggestions

Several parameters can lead to different results in terms of tag cloud representations:

  • Mixing single and composite keywords.
  • Showing the index (is it relevant to show the index, or is the font size sufficient to give an idea of the importance?)
  • Number of keywords to show (the examples show 20 single keywords and 20 composite keywords).

Another problem to solve is how to represent keywords coming from different sources (human/machine, different ontologies). A human will of course not count the number of occurrences of a keyword in the document, so 'human' keywords have no occurrence-based weight. Several options are possible:

  • Give the 'human' keywords the highest importance.
  • Only show the 'human' keywords when there are some.
  • Allow the human to manually specify the weight of the keyword (the example uses 8 levels).

Finally, the BibClassify tag cloud representation could be used to draw tag clouds for sets of documents. As the weight, we could use either the number of documents having the keyword or the sum of the occurrences of the keyword across the documents.
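The two candidate weights for a set of documents can be compared with a small sketch. The input format (one dict of keyword occurrence counts per document) and the function name `aggregate_weights` are assumptions made for illustration, not part of BibClassify.

```python
from collections import Counter


def aggregate_weights(documents):
    """Compute two candidate tag cloud weights for a set of documents.

    `documents` is a list of dicts mapping keyword -> number of occurrences
    in that document (a hypothetical input format).
    Returns (doc_frequency, total_occurrences).
    """
    doc_frequency = Counter()      # number of documents having the keyword
    total_occurrences = Counter()  # sum of occurrences over all documents
    for keyword_counts in documents:
        for keyword, count in keyword_counts.items():
            if count > 0:
                doc_frequency[keyword] += 1
                total_occurrences[keyword] += count
    return doc_frequency, total_occurrences


# Made-up sample data, only to exercise the function.
docs = [
    {"baryon": 3, "hybrid": 1},
    {"baryon": 2},
    {"hybrid": 5, "decay": 1},
]
doc_frequency, total_occurrences = aggregate_weights(docs)
```

With this sample, "baryon" and "hybrid" both appear in two documents, but "hybrid" has the larger occurrence sum, so the two weights can order keywords differently.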

5. Architecture

6. API

7. BibClassify development ideas

  • Create an end-user view of BibClassify: this task involves writing a new view of the record that shows lists and/or tag clouds of keywords related to the document.

  • Acronym extraction: a prototype for the extraction of acronyms exists. Further develop this prototype and validate the results of the extraction. Design and develop a way to use these results within Inspire (e.g. creation of acronym lists, relations with other modules).

  • Develop a test platform for BibClassify that compares the results produced by BibClassify with the keywords chosen by humans.

  • Field code: determine the field code of an article using the association of keywords with field codes in the taxonomy (this requires adding a new relation to the taxonomy).

  • Core papers: determine whether an article is HEP core from core tags in the taxonomy (to be defined).

For both of these suggestions, create the core and field properties. Core could be boolean-like, while the field property could contain the field.

<Concept rdf:about="http://cern.ch/thesauri/HEPontology.rdf#Composite.baryonhybrid">
  <prefLabel xml:lang="en">baryon: hybrid</prefLabel>
  <compositeOf rdf:resource="http://cern.ch/thesauri/HEPontology.rdf#baryon"/>
  <compositeOf rdf:resource="http://cern.ch/thesauri/HEPontology.rdf#hybrid"/>
  <core />
  <field>hp-th</field>
</Concept>

BibClassify could then display some statistics in the text output in different ways such as: "Document has xx core keywords and yy keywords in the zz field."
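A minimal sketch of how such a statistics line could be produced, assuming the core and field properties have already been read from the taxonomy into plain dicts. The dict keys (`label`, `core`, `field`) and the function `keyword_statistics` are hypothetical and do not reflect BibClassify's actual data structures.

```python
def keyword_statistics(keywords, field):
    """Build the summary line for one document.

    `keywords` is a list of dicts with the hypothetical keys
    'label', 'core' (boolean-like) and 'field' (field code or None).
    """
    core_count = sum(1 for kw in keywords if kw.get("core"))
    field_count = sum(1 for kw in keywords if kw.get("field") == field)
    return (
        "Document has {0} core keywords and {1} keywords in the {2} field."
        .format(core_count, field_count, field)
    )


# Made-up sample keywords, only to exercise the function.
keywords = [
    {"label": "baryon: hybrid", "core": True, "field": "hp-th"},
    {"label": "decay", "core": False, "field": "hp-th"},
    {"label": "detector", "core": True, "field": None},
]
summary = keyword_statistics(keywords, "hp-th")
```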

Let's use our one-letter abbreviations for the field codes. We may assign more than one field code (fc) to a concept, and the count should be split accordingly: the combination mgt should yield m=1/3, g=1/3, t=1/3, at least in the beginning. It may later be necessary to assign percentages to the field codes according to the correlation between keywords and field codes in the database. A keyword appearing more than once in keyword combinations should count only once. We will probably assign field codes only to main keywords.
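The equal split described above can be sketched as follows. The mapping from keyword to a string of one-letter field codes is an assumed input format, and `field_code_scores` is a hypothetical helper. Using the keyword as a dict key ensures that each keyword counts only once.

```python
from collections import defaultdict
from fractions import Fraction


def field_code_scores(keyword_field_codes):
    """Split each keyword's single count equally among its field codes.

    `keyword_field_codes` maps keyword -> string of one-letter field codes,
    e.g. "mgt" (a hypothetical input format). Each keyword appears once as a
    dict key, so it contributes a total of 1 no matter how often it occurs.
    """
    scores = defaultdict(Fraction)  # Fraction() is 0
    for keyword, codes in keyword_field_codes.items():
        unique_codes = sorted(set(codes))
        if not unique_codes:
            continue  # keyword without field codes contributes nothing
        share = Fraction(1, len(unique_codes))
        for code in unique_codes:
            scores[code] += share
    return dict(scores)


# Made-up assignment of field codes to two keywords.
scores = field_code_scores({"baryon": "mgt", "hybrid": "t"})
```

Exact fractions are used here so that the 1/3 shares add up without rounding. Percentages derived from database correlations, as suggested above, would replace the equal `share`.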

  • HEP taxonomy: currently the taxonomy registers both the composite and compositeOf properties. These two relations are each other's inverse, and BibClassify uses only the compositeOf property in the taxonomy. Removing the composite property would remove any risk of contradictions, decrease the size of the taxonomy file by more than 20% and reduce the time to process the file by up to 30%.

  • For composite keywords containing a particle name, allow the latter to be inserted into the other single keyword (e.g. "radioactive Z0 decay").
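As a starting point for the test platform idea listed above, the comparison between BibClassify output and human-chosen keywords could be expressed as precision and recall. This is only a sketch under assumptions: the function `compare_keywords` is hypothetical, and keywords are matched by exact label after lowercasing, which ignores synonyms and composite keyword structure.

```python
def compare_keywords(extracted, human):
    """Compare extracted keywords with human-chosen ones.

    Returns (precision, recall, common), where precision is the share of
    extracted keywords also chosen by humans, and recall is the share of
    human keywords that were extracted.
    """
    extracted_set = {kw.strip().lower() for kw in extracted}
    human_set = {kw.strip().lower() for kw in human}
    common = extracted_set & human_set
    precision = len(common) / len(extracted_set) if extracted_set else 0.0
    recall = len(common) / len(human_set) if human_set else 0.0
    return precision, recall, sorted(common)


# Made-up keyword lists, only to exercise the function.
precision, recall, common = compare_keywords(
    ["baryon", "hybrid", "decay", "Z0"],
    ["Baryon", "decay", "quark"],
)
```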

-- BenoitThiell - 15 Jan 2009 -- AnnetteHoltkamp - 17 Jan 2009

Topic revision: r9 - 2009-02-03 - BenoitThiell
 