CernBibCheck

1. Introduction

This page gives a technical description of BibCheck as it is currently used at CERN. It does not discuss the checking and enrichment procedures for which CERN uses the program.

2. Usage

BibCheck is used mostly for batch-oriented record checking and enrichment.

It is typically invoked as follows:

   $ bibcheck exa.cfg exa.dat

Here:

  • exa.cfg is the configuration file specifying which checks are to be performed on the input data file. (See an example below.)

  • exa.dat is the input data file, currently in ALEPH sequential format, which is almost identical to the Text MARC format of Invenio (see an example).
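
To make the input format more concrete, here is a minimal Python sketch that splits one input line into its parts. The layout it assumes (a nine-digit record identifier, a five-character tag-plus-indicators column, an optional ALEPH-style L column, then $$-prefixed subfields) is an assumption not spelled out on this page, and parse_marc_line is a hypothetical helper, not part of BibCheck.

```python
import re

# Assumed layout: "<9-digit id> <5-char tag+indicators> [L ]<content>".
# This layout is an assumption for illustration, not taken from BibCheck.
LINE_RE = re.compile(r"^(\d{9}) (.{5}) (?:L )?(.*)$")


def parse_marc_line(line):
    """Split one line into (record_id, tag, subfields).

    Subfields are (code, value) pairs; a line without any "$$" marker is
    returned as a single pair with an empty code.
    """
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    record_id, tag, content = match.groups()
    if "$$" not in content:
        return record_id, tag, [("", content)]
    chunks = [c for c in content.split("$$")[1:] if c]
    return record_id, tag, [(c[0], c[1:]) for c in chunks]


print(parse_marc_line("000000001 245__ $$aSome title$$bA subtitle"))
```

The same function accepts a line with the extra L column, which is one way of treating the two formats as nearly interchangeable.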

The program does not require any interaction with humans during its execution. It outputs three files:

  • exa.dat_append is a Text MARC file ready to be uploaded in the append mode (record enrichment).

  • exa.dat_correct is a Text MARC file ready to be uploaded in the correct mode (record correction).

  • exa.dat_warnings is a text file containing warnings that BibCheck could not resolve by itself and that need human attention. (Example: the program detected a subfield that is not permitted.)

All the necessary human interaction is therefore performed at the end of the program execution by perusing the warnings file.
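
The naming convention above can be captured in a few lines of Python. This is only a sketch: expected_outputs is a hypothetical helper, and it assumes the three files are written next to the input file, which this page does not state explicitly.

```python
from pathlib import Path

# Suffixes taken from the output file names listed above.
OUTPUT_SUFFIXES = {
    "append": "_append",      # upload in append mode (enrichment)
    "correct": "_correct",    # upload in correct mode (correction)
    "warnings": "_warnings",  # needs human attention
}


def expected_outputs(dat_path):
    """Map each output kind to the path BibCheck is expected to write."""
    dat = Path(dat_path)
    return {kind: dat.with_name(dat.name + suffix)
            for kind, suffix in OUTPUT_SUFFIXES.items()}


for kind, path in expected_outputs("exa.dat").items():
    print(f"{kind:8} -> {path}")
```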

3. Implementation

The current implementation is written in Common Lisp and is about 1500 lines of code. (One advantage of Common Lisp is that the code is compiled to machine code rather than to bytecode as in Python, so it can run much faster, which matters for potentially expensive operations such as duplicate-record checking.)

The code has not been committed to Invenio CVS yet because it lacks a MARCXML parser. (Writing one would be an easy introductory task for new programmers in the project, but it was never a high enough priority, since CERN has been using ALEPH.)

The configuration language consists of Lisp S-expressions. It has grown in features over time, which is why some parts may look ugly (e.g. conditions expressed via lists). It could be enhanced with a nicer syntax for expressing conditions that check several mutually dependent fields.
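
Because the configuration is plain S-expressions, it is easy to read from other languages as well. The following Python sketch is a minimal reader for the subset of syntax that appears in the example configuration (lists, double-quoted strings, integers, symbols, quote marks, and semicolon comments). It is an illustration under those assumptions, not the reader BibCheck uses, and the names Symbol, tokenize, read_form, and parse_config are hypothetical.

```python
import re


class Symbol(str):
    """A Lisp symbol or keyword, kept distinct from a string literal."""


def tokenize(text):
    """Split configuration text into parentheses, quotes, strings, and atoms."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch == ";":  # comment: skip to end of line
            while i < len(text) and text[i] != "\n":
                i += 1
        elif ch in "()'":
            tokens.append(ch)
            i += 1
        elif ch == '"':  # string literal; a backslash escapes the next character
            i += 1
            chars = []
            while text[i] != '"':
                if text[i] == "\\":
                    i += 1
                chars.append(text[i])
                i += 1
            i += 1
            tokens.append(("string", "".join(chars)))
        else:  # symbol, keyword, or integer
            start = i
            while i < len(text) and not text[i].isspace() and text[i] not in "()'\";":
                i += 1
            tokens.append(("atom", text[start:i]))
    return tokens


def read_form(tokens, pos):
    """Read one form starting at tokens[pos]; return (value, next_pos)."""
    token = tokens[pos]
    if token == "(":
        items, pos = [], pos + 1
        while tokens[pos] != ")":
            item, pos = read_form(tokens, pos)
            items.append(item)
        return items, pos + 1
    if token == "'":  # quoted data: return the quoted form unchanged
        return read_form(tokens, pos + 1)
    if token == ")":
        raise SyntaxError("unexpected ')'")
    kind, value = token
    if kind == "string":
        return value, pos + 1
    if re.fullmatch(r"-?\d+", value):
        return int(value), pos + 1
    return Symbol(value), pos + 1


def parse_config(text):
    """Return the list of top-level forms found in the text."""
    tokens = tokenize(text)
    forms, pos = [], 0
    while pos < len(tokens):
        form, pos = read_form(tokens, pos)
        forms.append(form)
    return forms


rules = parse_config('(check-field-subfield-order ("260" "bac") ("773" "pvycn"))')
action, *entries = rules[0]
print(action, entries)
```

Each top-level form comes back as a Python list whose first element is the check action and whose remaining elements are the per-field entries, which mirrors the general structure of a checking action.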

4. Musings on BibCheck-in-Lisp and its role in Inspire

An important difference from the usual human-driven enrichment tools is that the program assumes no human interaction. While it would be possible to build interactivity into the program, it does not seem advantageous, as we are going for a full-scale enrichment application written in Python.

One obvious option is therefore to set the existing Lisp code aside, take some good ideas from it, and reimplement all its functionality in Python, with the aim of using the forthcoming Pythonic toolkit we shall design. This would give us a brand new BibCheck-in-Python module that could be used both in batch-only mode and in human-driven mode.

Alternatively, we could keep BibCheck-in-Lisp as it is, improve its configuration language a little as suggested above, and let it serve as a back-end checker for the forthcoming Web UI front-end enrichment application written in Python. The Python application would handle the interaction with the user and would call the Lisp back-end code when necessary. To achieve this, it would be sufficient to define clean, language-independent APIs between the front-end and back-end programs.
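
One simple language-independent interface is the one the batch tool already offers: a command line plus output files. The Python sketch below shows a front end calling a back-end checker in that way. It is an illustration only: run_backend_checker is a hypothetical name, and the assumption that warnings appear one per line in the _warnings file is not confirmed by this page.

```python
import subprocess
from pathlib import Path


def run_backend_checker(command, cfg_path, dat_path):
    """Run the checker as a subprocess and return its warnings as lines.

    `command` is the argument list that starts the back end, for example
    ["bibcheck"]; it is passed in so that the front end does not depend on
    the language the back end is written in.
    """
    subprocess.run([*command, str(cfg_path), str(dat_path)], check=True)
    warnings_file = Path(f"{dat_path}_warnings")
    if not warnings_file.exists():
        return []
    text = warnings_file.read_text(encoding="utf-8")
    # Assumption: one warning per non-empty line.
    return [line for line in text.splitlines() if line.strip()]
```

A front end would call run_backend_checker(["bibcheck"], "exa.cfg", "exa.dat") and present the returned lines to the user. A richer interface (for example structured warnings) would need a format agreed between both sides.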

The options are to be studied further.

Note from -- TravisBrooks - 18 Oct 2007

The last option is, I think, the best. It really reproduces much of the functionality of the tools we have. About half of the existing tools are simply dumb checkers, like BibCheck, that create lists of records that need some sort of work. To port these tools successfully, the only things needed are: BibCheck, a nice front end for perusing the warnings, and a nice record editor and batch change interface. See ComparisonSlacFermilabDesyCernEnrichmentScripts for more detail about what I would desire here...

5. Available checks (example configuration)

;;; -*- Mode: LISP; Syntax: COMMON-LISP; -*-
;;;
;;; exa.cfg 
;;;
;;;    This is a configuration file example for nchkall, the new
;;;    database consistency checking program.  The program will
;;;    perform checks on the ALEPH sequential files according to
;;;    checking rules specified in this file.
;;;
;;; Example of usage:
;;;
;;;    $ nchkall exa.cfg exa.dat
;;;
;;; General structure of how to specify checking action:
;;;
;;;    ;; this is a comment for check-action-1
;;;    (check-action-1
;;;      (field-1 param-1-1 param-1-2 ...)
;;;      (field-2 param-2-1 param-2-2 ...)
;;;       ...
;;;      (field-i param-i-1 param-i-2 ...))
;;;
;;; For detailed help on each action, see below.

;;; check-identical-records (f1-tag f1-sf-code f2-tag f2-sf-code ...)
;;
;;       1) Check each record in the file with respect to field
;;       instances f1-tag/f1-sf-code, f2-tag/f2-sf-code, etc.  Emit
;;       a warning if two or more records have the same values.
;;       Useful to detect potential duplicate records.
 
(check-identical-records
 ("100" $$a "245" $$a "260" $$c))

;;; check-field-instance f-tag n1 n2 :if-not-exists str1
;;                                   :must-not-exist-when-exists str2
;;                                   :mandatory-subfields str3
;;                                   :optional-subfields str4
;;                                   :exclusive-mandatory-subfields str5
;;                                   :exclusive-optional-subfields str6
;;                                   :inclusive-mandatory-subfields str7
;;                                   :may-not-exist-when-exists str8
;;
;;       1) Check number of field instances f-tag.
;;          Must be between n1 and n2 times (inclusive), except for cases 3 and 4 below.
;;          Otherwise an error is reported.
;;       2) If the field does not exist, and n1>0 and :if-not-exists str is present,
;;          then introduce str value, except for cases 3 and 4 below.
;;       3) If :must-not-exist-when-exists str is present, and f-tag exists, then check for
;;          the existence of field str.  It must not exist.
;;          Otherwise an error is reported.
;;       4) If :may-not-exist-when-exists str is present, and the other field str exists,
;;          then the current field f-tag may be absent (or must occur between n1 and n2 times).
;;          No error is reported if it does not exist.
;;       5) Check for mandatory subfields in the field instance.
;;          All subfields passed in :mandatory-subfields must be present.
;;          Otherwise an error is reported.
;;       6) Check for exclusive mandatory subfields in the field instance.
;;          Exactly one of the subfields passed in :exclusive-mandatory-subfields
;;          string must be present in the record, not the others.
;;          Otherwise an error is reported.
;;       7) Check for exclusive optional subfields in the field instance.
;;          At most one of the subfields passed in :exclusive-optional-subfields
;;          string may be present in the record, not more.
;;          Otherwise an error is reported.
;;       8) Check for inclusive mandatory subfields in the field instance.
;;          At least one of the subfields passed in :inclusive-mandatory-subfields
;;          string must be present in the record.
;;          Otherwise an error is reported.
;;       9) Check for all present subfields in the field instance.
;;          Must be among the :mandatory-subfields, :optional-subfields,
;;          :exclusive-mandatory-subfields, :exclusive-optional-subfields,
;;          or :inclusive-mandatory-subfields strings.
;;          Otherwise an error is reported.
         
(check-field-instance
 ("037" 0 1 :mandatory-subfields "a")
 ("041" 1 10 :mandatory-subfields "a")   
 ("088" 0 10 :mandatory-subfields "a")
 ("100" 1 1 :must-not-exist-when-exists "110" :mandatory-subfields "a" :optional-subfields "u")
 ("110" 1 1 :must-not-exist-when-exists "100")
 ("111" 0 1 :optional-subfields "ancdf9g")
 ("245" 0 1 :mandatory-subfields "a" :optional-subfields "b")
 ("246" 0 10 :optional-subfields "bgi")
 ("250" 0 1 :optional-subfields "a")
 ("260" 1 1 :mandatory-subfields "abc" :if-not-exists "$$asine loco$$bsine nomine") 
 ("300" 1 1 :mandatory-subfields "a" :if-not-exists "$$amult. p")
 ("490" 0 10 :optional-subfields "av")
 ("502" 0 10 :mandatory-subfields "a")
 ("520" 0 100 :mandatory-subfields "a") 
 ("541" 0 100 :inclusive-mandatory-subfields "af") 
 ("65017" 0 10 :if-not-exists "XX" :mandatory-subfields "a2")
 ("6531" 0 100 :mandatory-subfields "a9")
 ("690C" 0 100 :mandatory-subfields "a")
 ("693" 0 10 :mandatory-subfields "ae")
 ("694" 0 100 :mandatory-subfields "a")
 ("695" 0 10 :mandatory-subfields "a9")
 ("700" 0 9999 :mandatory-subfields "a" :optional-subfields "eu")
 ("710" 0 1 :mandatory-subfields "a" :exclusive-mandatory-subfields "59" :exclusive-optional-subfields "xy")
 ("720" 1 1 :mandatory-subfields "a" :may-not-exist-when-exists "710")
 ("8560" 0 1 :mandatory-subfields "f")
 ("8564" 0 10 :mandatory-subfields "u" :optional-subfields "y")
 ("909C1" 0 10 :mandatory-subfields "a")
 ("909C4" 0 10 :mandatory-subfields "p" :optional-subfields "vycuna")
 ("909CA" 0 10 :mandatory-subfields "u")
 ("909CS" 1 1 :mandatory-subfields "sw" :optional-subfields "ay")
 ("909CY" 1 1 :mandatory-subfields "a")
 ("BAS" 1 10 :mandatory-subfields "a")
 ("LKR" 0 10 :mandatory-subfields "bn" :optional-subfields "k"))

;;; check-field-subfield-content f-tag sf-code n1 n2 :content predicate
;;                                                   :end-with-digits num
;;                                                   :remove-before-testing str
;;                                                   :fill-leading-zeros predicate
;;                                                   :if-not-exists str
;;
;;    1) Check length of f-tag sf-code content in the given record.
;;       If it exists, its length must be between n1 and n2 characters (inclusive), otherwise an
;;       error is reported.  Except when :fill-leading-zeros is true, in which case we try to
;;       pad the existing content with zeros on the left so that a width of at least n1 is reached.
;;    2) Check type of f-tag sf-code content, if :content predicate is present.
;;       Predicate may be: standard, graphical, alphabetic, uppercase, lowercase, alphanumeric, numeric, latin1.
;;       Otherwise an error is reported.
;;    3) If :end-with-digits is present, test whether the content ends with that number of digits.
;;       Otherwise an error is reported.
;;    4) If the subfield does not exist, and :if-not-exists is a non-empty string, then create the
;;       subfield with that content.
;;   ad 2-3) If :remove-before-testing characters are given, those characters are removed
;;      from the content before testing.  For example, the year field may contain ``1992-'',
;;      so the dash is a good candidate for :remove-before-testing.
 
(check-field-subfield-content
 ("BAS" $$a 2 2 :content "numeric")
 ("LKR" $$b 9 9 :content "numeric" :fill-leading-zeros t)
 ("909CT" $$a 4 4 :content "numeric")
 ("260" $$c 4 20 :end-with-digits 4 :remove-before-testing "?-")
 ("041" $$a 3 3 :content "lowercase" :if-not-exists "eng")
 ("909C4" $$y 4 4 :content "numeric")
 ("909CS" $$a 0 2)
 ("909CS" $$s 1 1)
 ("909CS" $$w 6 6 :content "numeric")
 ("909CS" $$y 4 5 :end-with-digits 4)
 ("909CY" $$a 4 4 :content "numeric"))

;;; check-field-subfield-order r f-tag sf-order
;;
;;   Check whether subfields of fields F-TAG appear in the order
;;   specified by SF-ORDER.  If not, then reorder them.  Subfields
;;   present in field F-TAG and not present in SF-ORDER are appended
;;   at the end.  Example of SF-ORDER for the publication information
;;   tag 773 is `pvycn'.

(check-field-subfield-order
 ("260" "bac")
 ("269" "bac")
 ("773" "pvycn")
 ("LKR" "bnk"))

;;; check-field-subfield-content-for-tex-problems f-tag sf-code
;;
;;   Check some TeX-related errors in field f-tag sf-code.  At the
;;   moment it checks for unpaired dollar signs and curly braces and
;;   reports them.

(check-field-subfield-content-for-tex-problems
 ("245" $$a)
 ("520" $$a))

;;; check-field-subfield-content-via-kba f-tag sf-code kba-filename
;;
;;   Check the content of field tag f-tag and subfield-code sf-code by
;;   comparing the content to the lines of the authority knowledge base
;;   file kba-filename (whose lines are of the form ``good'').
;;   1) When no f-tag field is present in the record, nothing happens.
;;   2) If the field is present and matches some KB value, then nothing happens.
;;   3) If no match in KB was found, then the incident will be reported.

(check-field-subfield-content-via-kba
 ("041" $$a "~/private/work/nchkall/kbs/SISC-lang.kba")
 ("65017" $$a "~/private/work/nchkall/kbs/SISC-su.kba")
 ("693" $$a "~/private/work/nchkall/kbs/SISC-ac.kba")
 ("693" $$e "~/private/work/nchkall/kbs/SISC-ex.kba"))

;;; check-field-replace-subfield-content-via-regexp f-tag sf-code regexp replacement
;;
;;   Check the content of field tag f-tag and subfield-code sf-code
;;    and try to match the regular expression regexp.
;;    1) If a match is found, replace the value by replacement and
;;       correct the fields.
;;    2) If no match is found, nothing happens.

(check-field-replace-subfield-content-via-regexp
 ("773" $$p "\\s(and|for|in|of)\\s" " ")
 ("300" $$a "([^\\s])p$" "\\1 p") ; for Cath
 ("260" $$c "\\.+$" "") ; for Sandrine
)

;;; check-field-replace-field-content-when-other-fields-exist f-tag f-alephseq other-fields-presence-mode &rest other-field-cases
;;
;;   Check whether fields and subfields specified in OTHER-FIELD-CASES
;;   list are present as OTHER-FIELD-PRESENCE-MODE says.  If yes, then
;;   replace each field F-TAG with value parsed from F-ALEPHSEQ.  The
;;   OTHER-FIELD-PRESENCE-MODE may be 'none', 'all', or 'any'.  This
;;   rule is useful for example to ensure that BASE number is equal to
;;   13 when PR $$c exists.

(check-field-replace-field-content-when-other-fields-exist
 ("BAS" "$$a13" "any" '("773" $$c) '("LKR" $$k))
 ("BAS" "$$a11" "none" '("773" $$c) '("LKR" $$k)))

;;; check-field-append-new-field-instance-when-other-fields-have-certain-values f-tag f-alephseq other-field-value-cases-boolean-operator &rest other-field-value-cases
;;
;;   Check whether fields and subfields have values as specified in
;;   OTHER-FIELD-VALUE-CASES list, joined by
;;   OTHER-FIELD-VALUE-CASES-BOOLEAN-OPERATOR.  If yes, then create
;;   and append new instance of F-TAG with a value parsed from
;;   F-ALEPHSEQ.  Note that a special value of "*" in
;;   OTHER-FIELD-VALUE-CASES matches any value.  The
;;   OTHER-FIELD-VALUE-CASES-BOOLEAN-OPERATOR may be 'none', 'all', or
;;   'any'.  This rule is useful for example to ensure that 690C
;;   indicator THESIS gets created and appended when 502 exists or a
;;   TALK indicator gets created and appended when LKR contains the
;;   word presentation or meeting.

(check-field-append-new-field-instance-when-other-fields-have-certain-values
 ("690C" "$$aTHESIS" "and" '("502" $$a "*"))
 ("690C" "$$aTALK" "or" '("LKR" $$k "presentation") '("LKR" $$k "meeting")))

;;; check-field-remove-subfields-when-other-subfields-exist f-tag sf-codes other-sf-presence-mode other-sf-codes
;;
;;    Check each instance of field F-TAG. If OTHER-SF-CODES exist
;;    according to OTHER-SF-PRESENCE-MODE, then remove SF-CODES.  The
;;    OTHER-SF-PRESENCE-MODE may be 'none', 'all', or 'any'.
;;    Useful for example to remove all publication information like
;;    volume and year if we don't know pages.

(check-field-remove-subfields-when-other-subfields-exist
 ("773" "vyn" "none" "c"))

;;; check-field-replace-subfield-content-by-other-subfield-content f-tag sf-code other-f-tag other-sf-code
;; 
;;   Replace the content of field F-TAG subfield SF-CODE by the content
;;   of field OTHER-F-TAG subfield OTHER-SF-CODE, if it exists.

(check-field-replace-subfield-content-by-other-subfield-content
 ("260" $$c "773" $$y))

;;; check-field-replace-subfield-content-via-kbr f-tag sf-code kbr-file action-when-not-found value-to-add-when-not-found
;;
;;   Check the content of field tag f-tag and subfield-code sf-code by
;;   comparing the content to the keys of referential knowledge base file
;;   kbr-file. (i.e. of the form ``bad---good'')
;;   1) When no f-tag field is present in the record, nothing happens.
;;   2) If the field is present and matches some KB key, then replace the content by the corresponding value.
;;   3) If no match in KB was found, then behaviour depends on action-when-not-found:
;;       a) 'REPORT' means that the incident will be reported.
;;       b) 'IGNORE' means that nothing happens.
;;       c) 'ADD' means to add value-to-add-when-not-found, that is inserted at the
;;          beginning of the field.

(check-field-replace-subfield-content-via-kbr
 ("041" $$a "~/private/work/nchkall/kbs/SISUC-lang.kb" ignore "")
 ("260" $$a "~/private/work/nchkall/kbs/SISC-implace.kbr" ignore "")
 ("260" $$b "~/private/work/nchkall/kbs/SISUC-univ.kb" ignore "")
 ("65017" $$a "~/private/work/nchkall/kbs/SISC-su.kbr" ignore "XX")  
 ("693" $$a "~/private/work/nchkall/kbs/SISUC-ac.kb" ignore "Accelerator? ")
 ("693" $$e "~/private/work/nchkall/kbs/SISUC-ex.kbr" ignore ""))

;;; check-field-replace-subfield-content-strings-from-kbr f-tag sf-code kbr-file
;;
;;   Check the content of field tag f-tag and subfield-code sf-code
;;   by comparing its content to the phrases in knowledge base file
;;   kbr-file.  When a match is found for a knowledge base item, the
;;   action substitutes the present word by a new one indicated in the
;;   kbr-file.  Useful to expand/collapse abbreviations.
;;   BEWARE: it does substring replacement, so it does not check for
;;   word boundaries.

(check-field-replace-subfield-content-strings-from-kbr
 ("773" $$p "~/private/work/nchkall/kbs/SISC-abbrev.kbr")
 ("710" $$g "~/private/work/nchkall/kbs/SISC-coarticles.kbr"))

;;; check-field-replace-field-content-via-kbrs f-tag ((fm-tag sfm-code kbrm-file kbm-matchtype) ...)
;;
;;   Check one or more fields of a record (not equal to F-TAG) in
;;   knowledge bases defined by KBS-RULES (see below).  If match is found,
;;   then modify the field F-TAG content of the record R by the subfields
;;   defined in the KB.  Otherwise stay silent.
;;   KBS-RULES rules are list of forms (fm-tag sfm-code kbrm-filename
;;   kbm-matchtype).  Various knowledge bases (kbrm-filename) are tried one
;;   by one, and inside each knowledge base the various fm-tag instances
;;   are tried one by one, and inside each fm-tag instance the keys of the
;;   kbrm-filename knowledge base are tried one by one, until a match is
;;   found.  kbm-matchtype may be 'MATCH-AT-START', 'MATCH-WORD-AT-START',
;;   'MATCH-ANYWHERE', 'MATCH-EXACTLY'.  All but `MATCH-WORD-AT-START'
;;   check for substrings; the `MATCH-WORD-AT-START' does additional check
;;   for word boundary.

(check-field-replace-field-content-via-kbrs
 ("260" '("088" $$a "~/private/work/nchkall/kbs/SISC-rnim.kbr" match-word-at-start)
        '("8560" $$f "~/private/work/nchkall/kbs/SISUC-emim.kb" match-anywhere))
 ("690C" '("BAS" $$a "~/private/work/nchkall/kbs/SISC-baim.kbr" match-exactly)))

;; before applying the RN->IM KB, the RNs must be formatted:
(check-field-replace-subfield-content-via-regexp
 ("088" $$a "(?i)preprint" "")
 ("088" $$a "([\\s_\\/\\.,\\\\\(\\)]+)" "-")
 ("088" $$a "^\\-+" "")
 ("088" $$a "\\-$" "")
 ("088" $$a "([a-zA-Z])(\\d)" "\\1-\\2")
 ("088" $$a "([a-zA-Z])(\\d)" "\\1-\\2")
 ("088" $$a "(\\d)([a-zA-Z])" "\\1-\\2")
 ("088" $$a "\\-\\-+" "-"))

;; exceptions:
(check-field-replace-subfield-content-via-regexp
 ("088" $$a "^DPHN\\-" "SACLAY-DPHN-")
 ("088" $$a "^DPHPE\\-" "SACLAY-DPHPE-"))

;;; check-field-transform-subfield-content-unless-kba f-tag sf-code string-operation kba-filename
;;
;;   Check the content of record R field tag F-TAG and subfield-code SF-CODE
;;      and try to apply the string operation STRING-OPERATION such as string-upcase.
;;      1) Do this for all unless subfield value does not contain terms matching
;;         the knowledge base KBA-FILENAME.
;;      2) If no match is found, nothing happens.

(check-field-transform-subfield-content-unless-kba 
 ("088" $$a "string-upcase" "~/private/work/nchkall/kbs/SISC-rn-case.kba"))

;(check-field-print-max-subfield-value
; ("CAT" $$c))

;;COMMAND 11 FORMAT THE AU;;
(check-field-replace-subfield-content-via-regexp
 ("100" $$a "[\\.\\&]+$" "")
 ("700" $$a "[\\.\\&]+$" ""))

;;; -*- end of file -*-
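
Steps 5 to 9 of check-field-instance above translate fairly directly into code. The following Python sketch is one possible reading of those rules, not the Lisp implementation. The function check_subfields and its parameter names are hypothetical, and skipping step 9 when no subfield lists are configured is an assumption.

```python
def check_subfields(present, mandatory="", optional="", exclusive_mandatory="",
                    exclusive_optional="", inclusive_mandatory=""):
    """Return a list of problems for one field instance.

    `present` is the string of subfield codes found in the instance; the
    other arguments are strings of codes, as in the configuration keywords.
    """
    problems = []
    # Step 5: every mandatory subfield must be present.
    missing = [c for c in mandatory if c not in present]
    if missing:
        problems.append("missing mandatory: " + "".join(missing))
    # Step 6: exactly one of the exclusive mandatory subfields.
    if exclusive_mandatory:
        found = [c for c in exclusive_mandatory if c in present]
        if len(found) != 1:
            problems.append("need exactly one of: " + exclusive_mandatory)
    # Step 7: at most one of the exclusive optional subfields.
    found = [c for c in exclusive_optional if c in present]
    if len(found) > 1:
        problems.append("at most one of: " + exclusive_optional)
    # Step 8: at least one of the inclusive mandatory subfields.
    if inclusive_mandatory and not any(c in present for c in inclusive_mandatory):
        problems.append("need at least one of: " + inclusive_mandatory)
    # Step 9: every present subfield must appear in one of the lists.
    allowed = (mandatory + optional + exclusive_mandatory
               + exclusive_optional + inclusive_mandatory)
    if allowed:  # assumption: no lists configured means no restriction
        unknown = sorted(set(present) - set(allowed))
        if unknown:
            problems.append("not permitted: " + "".join(unknown))
    return problems


# Rule for field 710 from the example configuration.
rule_710 = dict(mandatory="a", exclusive_mandatory="59", exclusive_optional="xy")
print(check_subfields("a5x", **rule_710))
print(check_subfields("a59z", **rule_710))
```

In a batch run, a non-empty result for any field instance would become an entry in the warnings file.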

6. Example use cases

See the Asana list (also have a look at archived tickets): Data Cleanup, Data Curation

  • Linking 980 fields to 773. When a paper is given a pub-note in the 773 field, the 773p subfield should be checked to determine whether the journal is one that Inspire considers "Published". The journals are listed here: https://twiki.cern.ch/twiki/bin/view/Inspire/CatalogingRulesJournals If it is, and the record has no 980a value "Published", one should be added to the record.
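
The rule above could be expressed as follows. This Python sketch assumes a simplified in-memory record (a dict mapping a tag to a list of subfield dicts) and a placeholder journal list. Neither reflects the actual Inspire data model or the real list of published journals, and needs_published_flag is a hypothetical name.

```python
def needs_published_flag(record, published_journals):
    """Return True if a 980 $$a "Published" value should be added."""
    journals = {field.get("p") for field in record.get("773", [])}
    if not journals & set(published_journals):
        return False  # no pub-note in a journal considered published
    collections = {field.get("a") for field in record.get("980", [])}
    return "Published" not in collections


# Placeholder journal name, for illustration only.
PUBLISHED = {"Example J."}

record = {
    "773": [{"p": "Example J.", "v": "1", "y": "2014"}],
    "980": [{"a": "Other"}],
}
if needs_published_flag(record, PUBLISHED):
    record.setdefault("980", []).append({"a": "Published"})
print(record["980"])
```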

7. Currently implemented rules

See rules.cfg on GitHub.

Topic revision: r11 - 2014-10-07 - KirstenSachs
 