System Design: BibKnowledge
1. Introduction
The BibKnowledge module provides web tools for cataloguers to manage various knowledge bases, authority files, and taxonomies or thesauri. BibKnowledge contains information for standardisation and record quality checking. Typical examples: (1) The field "author institute" is often written as "Odd University Strange Research Lab" though it is officially (canonically) known as "StrangeLab of the Odd University". (2) If the field "author email" contains "@strange.odd.edu", the author institute must be "StrangeLab of the Odd University". (3) Taxonomy knowledge bases contain information about broader term - narrower term pairs.
Currently some KBs live in the database (e.g. for output formatting) and some live as text files (e.g. for record checking). The new design standardizes this so that the knowledge bases are maintained in the database and exported as text files (so that standalone record checking programs can use them).
Note that in the SPIRES way of doing things, many KBs live in separate SPIRES databases, some of which have not yet been considered for import to INSPIRE (conf, inst, experiments, abbrev.). We need to understand how to usefully import this data, but it might end up as collections in Invenio, like Inst. The BibKnowledge module will use these as dynamic knowledge bases (see below).
The BibKnowledge module has a notion of KB dependencies, e.g. editing a value for some field in one KB should check other KBs and propagate this edit where needed, warn about conflicts, etc.
The BibKnowledge module will provide nice APIs and export interfaces for other Invenio modules to access knowledge bases (e.g. BibFormat, WebSubmit, BibCheck, BibEdit, etc).
Types of knowledge bases
- "map_from" "map_to": this is the typical case, the knowledge base is essentially a list of left side - right side pairs, like Genf -> Geneva or "Odd University Strange Research Lab" -> "StrangeLab of the Odd University". The abbreviation for this type is kbr (for reference).
- "authority only": this kind of knowledge base only lists the canonical values. Example: "Geneva", "StrangeLab of the Odd University". It is a special case of "map_from" "map_to", where left side and right side are identical. The abbreviation for this type is kba (for authority).
- dynamic: these knowledge bases are "authority only" knowledge bases that are built dynamically using a search expression. Example: if the author institute is stored in field 100__u, a dynamic knowledge base that uses this field, returns all the values of 100__u. The abbreviation for this type is kbd (for dynamic).
- taxonomy (or ontology): an RDF file can be uploaded into invenio and used as a knowledge base.
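The relationship between the first two types can be pictured with a small Python sketch. It uses plain dictionaries in place of real knowledge bases and reuses the examples above; the helper names `authority_as_mapping` and `normalise` are hypothetical and are not part of BibKnowledge.

```python
# Illustrative only: plain Python structures standing in for knowledge bases.

# kbr: "map_from" -> "map_to" pairs.
kbr = {
    "Genf": "Geneva",
    "Odd University Strange Research Lab": "StrangeLab of the Odd University",
}

# kba: canonical values only.
kba = ["Geneva", "StrangeLab of the Odd University"]


def authority_as_mapping(values):
    """View an "authority only" list as a mapping whose two sides are identical."""
    return {value: value for value in values}


def normalise(value, mapping):
    """Return the canonical form of value, or value unchanged if it is unknown."""
    return mapping.get(value, value)


print(normalise("Genf", kbr))
print(normalise("Geneva", authority_as_mapping(kba)))
```

Treating a kba as an identity mapping means the same lookup code can serve both types.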
2. Use cases
Use case 1: Making a "map_from" "map_to" knowledge base. A cataloguer wants to modify the knowledge base of known experiments. The web interface permits this and asks if the changes to the entry are to be propagated into other knowledge bases. It would detect any circular reference problems leading to conflicts.
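One possible way to detect the circular references mentioned in use case 1 is to follow the "map_from" -> "map_to" chain and stop when a phrase repeats. The sketch below assumes the pairs from all relevant knowledge bases have been merged into a single dictionary; the function name `find_cycle` is hypothetical, and the design does not prescribe this algorithm.

```python
def find_cycle(mappings, start):
    """Follow map_from -> map_to links from start.

    Returns the cycle as a list of phrases (first phrase repeated at the end),
    or None if the chain ends without repeating. Entries that map to
    themselves (the "authority only" case) are treated as chain ends.
    """
    path = []
    seen = set()
    current = start
    while current in mappings and mappings[current] != current:
        if current in seen:
            # The chain came back to a phrase already visited.
            return path[path.index(current):] + [current]
        seen.add(current)
        path.append(current)
        current = mappings[current]
    return None


# Made-up data: A -> B -> C -> A is a conflict, Genf -> Geneva is not.
merged = {"A": "B", "B": "C", "C": "A", "Genf": "Geneva", "Geneva": "Geneva"}
print(find_cycle(merged, "A"))
print(find_cycle(merged, "Genf"))
```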
Use case 2: Uploading a taxonomy. The HEP keyword taxonomy managed by DESY is currently being edited via text files and converted into RDF via a script. The resulting RDF file can be uploaded to BibKnowledge.
Use case 3: Using knowledge bases with search. There is a knowledge base managing abbreviations, e.g. 'LHC' for 'Large Hadron Collider'. A user searches for 'LHC'. The search engine asks BibKnowledge for recommended lookup terms, receives 'Large Hadron Collider' back, and either enriches the query or proposes the term to the user in a "did you mean?" way. This is effectively a left side -> right side lookup in a "map_from" "map_to" knowledge base.
Use case 4: Working in conjunction with the input interface. An inputter is entering the affiliations on a paper. She types "Menlo Park" and several institutions from that town ("SLAC", "KIPAC Menlo Park", "SLAC,SSRL") are proposed as correct affiliations. However, the paper says "T. Brooks (SLAC/SPIRES, Menlo Park, CA)". This appears to be a new affiliation, so she enters "SLAC/SPIRES" and a dialog box pops up saying "this is a new affil., should we add it to INST DB?" She says yes, and a new record is created in inst., flagged for further checking by the inst maintainer later on. The inst maintainer looks at the new inst, realizes that it is the same as "SLAC", and removes the inst from the inst file as well as changing the affiliation on the paper (alternatively, she could have decided it was new, added address, email, and url information to the inst record, and confirmed the entry).
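The lookup in use case 3 could look roughly like the sketch below. The abbreviation knowledge base is modelled as a plain dictionary, the function names are hypothetical, and the `(term OR "expansion")` query syntax is an assumption made for illustration rather than the search engine's actual syntax.

```python
# Stand-in for an abbreviations knowledge base (map_from -> map_to).
abbreviations = {"LHC": "Large Hadron Collider"}


def suggest_terms(query, kb):
    """Return the map_to value for every word of query that is a map_from key."""
    return [kb[word] for word in query.split() if word in kb]


def enrich_query(query, kb):
    """Replace each known abbreviation by an OR of itself and its expansion."""
    parts = []
    for word in query.split():
        if word in kb:
            # Assumed query syntax, for illustration only.
            parts.append('(%s OR "%s")' % (word, kb[word]))
        else:
            parts.append(word)
    return " ".join(parts)


print(suggest_terms("LHC luminosity", abbreviations))
print(enrich_query("LHC luminosity", abbreviations))
```

`suggest_terms` corresponds to the "did you mean?" variant, `enrich_query` to silently widening the query.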
3. Workflow
All knowledge base manipulation operations are done through a web interface of CDS Invenio. Tools that interface with knowledge bases (BibEdit web interface, enrichment scripts) should access the knowledge bases through the API shown below.
3.1 Editing
"map_from" "map_to"
The user is authenticated and authorised to use the BibKnowledge editor.
The cataloguer selects a knowledge base or creates a new one. When creating a knowledge base, the cataloguer should use a descriptive name, since the knowledge bases are exported with that name (see below).
The cataloguer enters new "map_from" "map_to" pairs or changes existing ones. The editor assists in the following ways:
- for any "map_from" phrase, the interface shows phrases close to it (TBD).
- the interface can show if the "map_from" phrase appears in other knowledge bases.
- the interface can show the matching "map_to"'s that are already there in the database.
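How "phrases close to it" is computed is still to be decided. As one possibility only, the sketch below uses `difflib.get_close_matches` from the Python standard library, and also shows a simple check for the second bullet. The function names, the data layout (a dictionary of knowledge base name to mappings), and the similarity cutoff of 0.6 are all assumptions.

```python
from difflib import get_close_matches


def close_phrases(phrase, known_phrases, limit=5):
    """Return up to limit known phrases that are similar to phrase."""
    return get_close_matches(phrase, known_phrases, n=limit, cutoff=0.6)


def kbs_containing(phrase, kbs):
    """Return the names of the knowledge bases that have phrase as a map_from."""
    return sorted(name for name, mappings in kbs.items() if phrase in mappings)


# Made-up knowledge bases, keyed by name.
kbs = {
    "institutes": {
        "Odd University Strange Research Lab": "StrangeLab of the Odd University",
    },
    "cities": {"Genf": "Geneva"},
}

known = [key for mappings in kbs.values() for key in mappings]
print(close_phrases("Odd University Strange Research Laboratory", known))
print(kbs_containing("Genf", kbs))
```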
Dynamic knowledge bases
The cataloguer can define a search expression that generates a list of canonical values (an "authority only" knowledge base) for future use.
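As a rough illustration of what a dynamic knowledge base yields, the sketch below collects the distinct values of one field over a set of records. The record layout (a dictionary of field tag to list of values), the function name `dynamic_kb_values`, and the reduction of the search expression to a case-insensitive substring filter are simplifying assumptions; a real implementation would run the search expression through the search engine.

```python
# Made-up records: field tag -> list of values.
records = [
    {"100__u": ["StrangeLab of the Odd University"]},
    {"100__u": ["CERN", "StrangeLab of the Odd University"]},
    {"245__a": ["A record without an author institute"]},
]


def dynamic_kb_values(records, field, searchwith=""):
    """Return the sorted distinct values of field, optionally filtered."""
    values = set()
    for record in records:
        for value in record.get(field, []):
            # An empty searchwith matches every value.
            if searchwith.lower() in value.lower():
                values.add(value)
    return sorted(values)


print(dynamic_kb_values(records, "100__u"))
print(dynamic_kb_values(records, "100__u", searchwith="odd"))
```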
Taxonomy
Uploading an RDF file.
3.2 Exporting
Each knowledge base can be exported dynamically via a URL. See "Getting knowledge base items by the Web Interface" below.
3.3 Using the exported files
The exported "map_from" "map_to" and "authority only" knowledge bases are used by bibcheck Clisp programs.
4. Mock-up screenshots
[Screenshot: Adding a new knowledge base]
[Screenshot: Adding data in a knowledge base]
[Screenshot: Checking]
5. Architecture
For historical reasons, the data is stored in the following database tables: fmtKNOWLEDGEBASES, fmtKNOWLEDGEBASEMAPPINGS.
Functions for accessing the data are listed below.
6. API
API for utilising knowledge bases
from invenio.bibknowledge import *
add_kb(kb_name="Untitled", kb_type=None) Add a new knowledge base. kb_type is None for mapFrom-mapTo, 'd' for dynamic, 't' for taxonomy
update_kb_attributes(kb_name, new_name, new_description) Change the name or description of a knowledge base.
delete_kb(kb_name)
add_kb_mapping(kb_name, key, value) Add a new mapFrom (key)- mapTo (value) rule
remove_kb_mapping(kb_name, key) Remove a rule
kb_exists(kb_name) Check if the kb exists
get_kbs_info(kbtype="", kbsname="") Returns all matching kbs as a list of dictionaries {id, name, description, kbtype}
get_kba_values(kb_name, searchname="", searchtype="s") Get values of an authority type knowledge base (kba). searchtype can be 's' (substring) or 'e' (exact).
Example: get_kba_values("journal names", "American") returns a list of all journals whose name contains "American": ["American Physics Journal", ...]
get_kbr_keys(kb_name, searchkey="", searchvalue="", searchtype='s') Get the mapFrom sides of a kbr knowledge base. Limit search by searchkey (mapFrom) or searchvalue (mapTo). searchtype as above.
get_kbr_values(kb_name, searchkey="", searchvalue="", searchtype='s') Get the mapTo sides of a kbr knowledge base. Parameters as above.
get_kbr_items(kb_name, searchkey="", searchvalue="", searchtype='s') As above but returns a list of dictionaries, each with a 'key' and a 'value' entry. Example: get_kbr_items("journal names", searchkey="Am") returns [{'key': 'American Ph. J.', 'value': 'American Physics Journal'}, ...]
get_kbd_values(kbname, searchwith="") Get values from a dynamic knowledge base. If searchwith is given, limit the search to only those matching it.
get_kb_item(path_to_kbfile, path_to_xlst, searchwith="") Returns an array of items from a taxonomy file, based on XSLT. If searchwith is defined, only items matching it are returned.
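The search parameters of the kbr functions can be pictured with an in-memory stand-in. This is not Invenio code: `kbr_items_sketch` is a hypothetical name, the knowledge base is a plain dictionary with made-up journal entries, and requiring both searchkey and searchvalue to match when both are given is an assumption about how the two filters combine.

```python
def kbr_items_sketch(mappings, searchkey="", searchvalue="", searchtype="s"):
    """Mimic the described get_kbr_items search on a plain dictionary.

    searchtype 's' means substring match, 'e' means exact match.
    An empty search string matches everything.
    """
    def matches(needle, text):
        if not needle:
            return True
        if searchtype == "e":
            return needle == text
        return needle in text

    return [
        {"key": key, "value": value}
        for key, value in mappings.items()
        if matches(searchkey, key) and matches(searchvalue, value)
    ]


# Made-up content for a "journal names" knowledge base.
journal_names = {
    "American Ph. J.": "American Physics Journal",
    "Eur. Ph. J.": "European Physics Journal",
}

print(kbr_items_sketch(journal_names, searchkey="Am"))
print(kbr_items_sketch(journal_names, searchkey="Am", searchtype="e"))
```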
Getting knowledge base items by the Web Interface
"kb_export" enables one to export the whole KB or just some values of it. It is called as follows:
http://yourhost/kb/export?kbname=XXXX
or
http://yourhost/kb/export?kbname=XXXX&term=YYYY
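A client script might assemble these URLs as sketched below, using the standard library to escape knowledge base names that contain spaces. The helper name `kb_export_url` is hypothetical; only the path and the `kbname` and `term` parameters come from the description above, and the sketch does not fetch anything since the response format is not specified here.

```python
from urllib.parse import urlencode


def kb_export_url(host, kbname, term=None):
    """Build the export URL for a whole KB, or for the items matching term."""
    params = {"kbname": kbname}
    if term:
        params["term"] = term
    return "http://%s/kb/export?%s" % (host, urlencode(params))


print(kb_export_url("yourhost", "journal names"))
print(kb_export_url("yourhost", "journal names", term="American"))
```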
Knowledge Base Enumeration
CDS
The current kb's used at CERN are in
http://doc.cern.ch/uploader/KB/PREPRINTS
INSPIRE
For INSPIRE we should start with the minimum set of KBs necessary to reproduce existing SPIRES functionality. As we begin to need more, we will incorporate CDS kbs and/or new ones.
- Institutions: A dynamic KB defined against a 980a = INSTITUTIONS
- Conferences: Ditto...980a = CONFERENCES (I'm working on this import, but it should be straightforward)
- Title Abbrevs: A static kb I will generate from our titles db (note this is not context specific, and in phase III we can work to improve this)
- Journal Titles: A static kb I will generate from our coden db
- There will be a second kb here that defines which journals are peer reviewed
- There are also some assoc. kbs here that work with citation extraction
- Experiments: Not sure about this one...could create a collection or a flat file, the database is rarely used live and not well maintained...
- There will be a second kb here that defines which EXPS match which Collaborations (possibly...this is not in SPIRES now...can wait)
- Author Names: Not yet...
- desy keyword collection: RDF
- possibly a list of allowed collections as a static kb
- FC abbreviations (B for Accelerators, ...) - Kirsten
- DESY has a list for report number syntax depending on institute (possibly useful) - Kirsten
There may be others, but none that leap out at me. If we imagine that at the start it looks like the above, we won't be far wrong.
Notes
Oct 23 2008: New features implemented. Screenshots below.
The knowledge base management view in Admin Area > BibFormat Admin > Manage Knowledge Bases. Please notice searching by a keyword and the "Add/Configure" buttons at the bottom.
Searched for Geneva and selected this knowledge base. Only entries with Geneva are shown. Please notice the export links at the bottom of the page.
Configuring a dynamic knowledge base. This knowledge base will contain the values of field 100__u matching "Paris".
Exporting a dynamic knowledge base by clicking the export link. The export results are shown in the text window.
Editing a taxonomy knowledge base. The export link (that simply downloads the existing RDF file) is shown on the right.