System Design: BibExport

1. Introduction

The BibExport tool exports selected information from selected bibliographic records in a selected format on a periodic basis. In other words, cataloguers configure what should be exported, when, and where; the BibExport daemon then runs their queries automatically in the background and produces output text files for later perusal.

2. Use cases

2.1. Exporting for accelerator field checking

We want to export the accelerator and experiment fields for all records written by some collaboration, in order to either (i) proof-check them manually in Emacs or (ii) load them into CernBibCheck to be corrected automatically.

2.2 Exporting for accent checking

We want to export all titles of multimedia records in order to check which UTF-8 accented letters were used there. We know that for some period in the past the submission system was misconfigured and accepted titles containing UTF-8 graphical glyphs, which the site convention does not permit.
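
A check of this kind could start from a simple inventory of the non-ASCII characters found in the exported titles. The following is only a minimal sketch: the function names and the sample titles are made up for illustration, and it assumes the titles are already available as Python strings.

```python
import unicodedata
from collections import Counter


def non_ascii_inventory(titles):
    """Count every non-ASCII character found in the given titles."""
    counts = Counter()
    for title in titles:
        for char in title:
            if ord(char) > 127:
                counts[char] += 1
    return counts


def describe(counts):
    """Return (character, Unicode name, category, count), most frequent first."""
    return [
        (char, unicodedata.name(char, "UNKNOWN"), unicodedata.category(char), n)
        for char, n in counts.most_common()
    ]


if __name__ == "__main__":
    # Made-up sample titles, not taken from any real collection.
    sample_titles = ["Caf\u00e9 talk", "Beam \u2605 dynamics"]
    for row in describe(non_ascii_inventory(sample_titles)):
        print(row)
```

The Unicode category makes it easy to separate accented letters (categories starting with "L") from graphical symbols (categories starting with "S"), which is the distinction the site convention cares about.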

2.3 Exporting for checking LaTeX

We want to periodically export all preprint titles and check their LaTeX syntax, e.g. whether braces and dollar signs are properly paired. The exported files are passed on to CernBibCheck, which will warn the cataloguers about any bad titles it finds.
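
To illustrate what "properly paired" could mean in practice, here is a minimal sketch of such a check. It is not CernBibCheck code; the function name and the exact rules (escaped characters are skipped, dollar signs must come in an even number) are assumptions made for this example.

```python
def check_latex_pairing(title):
    """Return a list of human-readable problems found in one title."""
    problems = []
    depth = 0      # current nesting depth of { ... }
    dollars = 0    # number of unescaped dollar signs seen
    i = 0
    while i < len(title):
        char = title[i]
        if char == "\\":
            i += 2  # skip the escaped character, e.g. \$ or \{
            continue
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth < 0:
                problems.append(
                    "closing brace without opening brace at position %d" % i)
                depth = 0
        elif char == "$":
            dollars += 1
        i += 1
    if depth > 0:
        problems.append("%d unclosed opening brace(s)" % depth)
    if dollars % 2:
        problems.append("odd number of unescaped dollar signs")
    return problems
```

A title with an empty problem list passes; anything else would be reported to the cataloguers.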

2.4 Exporting for exporting

We want to export theses in some format (say NLM) to some third party (say Google Scholar) once per month.

Example 1: create a Google Sitemap every day for these four public collections.

Example 2: on the first of every month, export all records that were modified during the last month and match these search criteria in NLM format. The output is split into files containing no more than 1000 records each, compressed via gzip, and placed in a location from which Google Scholar would fetch them. The output files would be organized like this:

* all exportable records:

    /export/googlescholar/all-index.html   - links to parts below
    /export/googlescholar/all-part1.xml.gz - first batch of 1000 records
    /export/googlescholar/all-part2.xml.gz - second batch of 1000 records
    ...
    /export/googlescholar/all-partM.xml.gz - last batch of 1000 records

* records modified in the last month:

    /export/googlescholar/lastmonth-index.html   - links to parts below
    /export/googlescholar/lastmonth-part1.xml.gz - first batch of 1000 records
    /export/googlescholar/lastmonth-part2.xml.gz - second batch of 1000 records
    ...
    /export/googlescholar/lastmonth-partN.xml.gz - last batch of 1000 records

(Note that this use case is a bit different from the ones cited above, so the BibExport architecture should be nicely pluggable to allow this.)
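
The splitting and compression described in Example 2 could be sketched as follows. This is an illustration only: the function name is hypothetical, the records are assumed to arrive as already formatted XML strings, and any wrapping root element that a real NLM export needs is left out.

```python
import gzip
import html
import os


def write_batches(records, output_directory, prefix, batch_size=1000):
    """Write records into gzip-compressed parts plus an index page.

    `records` is a list of already formatted XML strings, one per record.
    Returns the list of part file names that were written.
    """
    os.makedirs(output_directory, exist_ok=True)
    part_names = []
    for start in range(0, len(records), batch_size):
        part_name = "%s-part%d.xml.gz" % (prefix, len(part_names) + 1)
        path = os.path.join(output_directory, part_name)
        with gzip.open(path, "wt", encoding="utf-8") as part:
            part.write("\n".join(records[start:start + batch_size]))
            part.write("\n")
        part_names.append(part_name)

    # The index page simply links to every part written above.
    index_path = os.path.join(output_directory, "%s-index.html" % prefix)
    with open(index_path, "w", encoding="utf-8") as index:
        index.write("<html><body>\n")
        for name in part_names:
            escaped = html.escape(name)
            index.write('<a href="%s">%s</a><br/>\n' % (escaped, escaped))
        index.write("</body></html>\n")
    return part_names
```

Calling it once with prefix "all" and once with prefix "lastmonth" would produce the two groups of files listed above.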

2.5 Exporting for automatic correction/update

We want to run checks automatically that select certain records and update certain fields without manual work.

Example 1: all documents that have a conference code should also get the meeting information imported from the conferences file. Search for records with 111g but without 111f and export all records found; then bibcheck or some external script could fetch the conference information, populate 111a-f, and re-upload the records. No human intervention is needed here, but the output/input could be checked periodically.

Example 2: all documents that have a collaboration note (710g) should also have an experiment (693e). In most cases the experiment can be determined by searching the experiments file for the text of the collaboration note (CN); in these cases it should be a straight replacement with no human intervention. In other cases the export may need to produce a list for a human to review.

Example 3: find all documents with 773y (i.e. a publication year) but without 773a. Export these in a form ready to send to CrossRef for DOI lookup. CrossRef will reply with a listing of journal information and DOIs, which needs to be parsed and sent automatically into the changeset queue for approval (probably automated approval).
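
All the selections in this use case follow the same pattern: a record qualifies when some subfields are present and others are absent. A minimal sketch of that pattern follows. The record representation (a dict mapping a tag plus subfield code such as "111g" to a list of values) and the function names are assumptions for illustration, not the Invenio record structure.

```python
def has_subfield(record, code):
    """True if the record has at least one non-empty value for e.g. '111g'."""
    return any(value.strip() for value in record.get(code, []))


def select_records(records, required, missing):
    """Return records having every `required` subfield and none of `missing`."""
    return [
        record for record in records
        if all(has_subfield(record, code) for code in required)
        and not any(has_subfield(record, code) for code in missing)
    ]


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        {"001": ["1"], "111g": ["C07-06-25"]},
        {"001": ["2"], "111g": ["C07-06-25"], "111f": ["2007"]},
    ]
    # Conference code present, meeting information absent.
    print(select_records(records, required=["111g"], missing=["111f"]))
```

The same call with required=["710g"], missing=["693e"] or required=["773y"], missing=["773a"] would cover the other two selections.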

3. Workflow

3.1. BibExport daemon

The BibExport daemon will be run periodically as a BibSched task, say every day. It will do:

bibexport-daemon-workflow.png

4. Mock-up screenshots

Here are some mock-up screenshots for further discussion, intended to clarify both user interface as well as desired functionality.

4.1. Overview of jobs

JobOverview.png

4.2. Edit job settings

EditJobSettings.png

4.3. Edit job queries

JobQueries.png

4.4. Edit query

EditQuery.png

4.5. Manual job running: results screen

Results.png

4.6. History of jobs

JobHistory.png

5. Architecture

5.1 bibexport

The bibexport CLI daemon resembles our bibindex/bibrank daemons; it can 'clone' their functionality to a large extent. To recap, the bibexport daemon would be run periodically via bibsched like this:

  $ bibexport -u admin -s 24h

It would consult a table, say expEXPORT, listing all the configured exporting jobs, run those that are due, and update their last run time, exactly as the indexer/ranker does.
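
The "run if needed, then update the last run time" logic could look roughly like the sketch below. The job rows are plain dicts standing in for rows of the expEXPORT table; the keys 'name', 'period' and 'last_run' and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def run_due_jobs(jobs, now, run_job):
    """Run every job whose period has elapsed and record the new run time.

    `jobs` is a list of dicts with the hypothetical keys 'name',
    'period' (a timedelta) and 'last_run' (a datetime or None).
    Returns the names of the jobs that were run.
    """
    ran = []
    for job in jobs:
        last_run = job["last_run"]
        # A job that has never run is always due.
        if last_run is None or now - last_run >= job["period"]:
            run_job(job)
            job["last_run"] = now
            ran.append(job["name"])
    return ran


if __name__ == "__main__":
    jobs = [{"name": "sitemap", "period": timedelta(days=1), "last_run": None}]
    print(run_due_jobs(jobs, datetime.now(), run_job=lambda job: None))
```

In the real daemon the rows would be read from and written back to the database instead of being mutated in memory.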

The bibexport daemon would have a pluggable architecture enabling people to write various exporters for the use cases cited above. This is similar to how people could write various external authentication methods within the same framework. The first exporters we shall start with are Google Sitemap and Google Scholar.

Every exporter would have an INI-style config file, something simple like the citation.cfg file that we use for the citation ranking. For example, for Google Sitemap we would have something along the lines of:

  [exporter]
  name = google_sitemap
  export_style = google_sitemap
  output_directory = /foo/bar
  collection1 = Theses
  changefreq1 = weekly
  maxrecords1 = 10000
  collection2 = Articles
  changefreq2 = daily
  maxrecords2 = 230000

and for Google Scholar we would have:

  [exporter]
  name = google_scholar
  export_style = google_scholar
  output_directory = /foo/bar/baz
  split_by_records = 1000
  period = month
  records_with_fulltext_only = True
  collection1 = CERN Theses
  collection2 = CERN Notes
  collection3 = CERN Videos

with all the needed parameters etc.
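
Reading such a file is straightforward with the standard configparser module. The sketch below is only an illustration: the sample is an abridged copy of the Google Scholar config above, and the function name and the keys of the returned dict are assumptions.

```python
import configparser
import re

# Abridged copy of the Google Scholar example config.
SAMPLE_CONFIG = """
[exporter]
name = google_scholar
export_style = google_scholar
output_directory = /foo/bar/baz
split_by_records = 1000
period = month
records_with_fulltext_only = True
collection1 = CERN Theses
collection2 = CERN Videos
"""


def load_exporter_config(text):
    """Parse one exporter config and return a plain dict of settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["exporter"]
    # Gather collection1, collection2, ... in numeric order.
    numbered = []
    for key, value in section.items():
        match = re.fullmatch(r"collection(\d+)", key)
        if match:
            numbered.append((int(match.group(1)), value))
    return {
        "name": section["name"],
        "export_style": section["export_style"],
        "output_directory": section["output_directory"],
        "split_by_records": section.getint("split_by_records", fallback=None),
        "fulltext_only": section.getboolean(
            "records_with_fulltext_only", fallback=False),
        "collections": [value for _, value in sorted(numbered)],
    }


if __name__ == "__main__":
    print(load_exporter_config(SAMPLE_CONFIG))
```

Per-collection options such as changefreq1 and maxrecords1 in the Google Sitemap config could be gathered the same way.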

The bibexport business logic for every concrete job would then be basically determined by the export_style, so in case of google_sitemap, bibexport would simply call Greg's code to do whatever job is needed.
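
One possible way to dispatch on export_style is a small plugin registry. The sketch below is an assumption about how this could be done, not the agreed design; all names are hypothetical and the plugin bodies are placeholders that only return a message.

```python
EXPORTERS = {}


def exporter(style):
    """Decorator that registers a function as the exporter for `style`."""
    def register(func):
        EXPORTERS[style] = func
        return func
    return register


@exporter("google_sitemap")
def export_google_sitemap(config):
    # Placeholder: a real plugin would call the existing sitemap code.
    return "sitemap written to %s" % config["output_directory"]


@exporter("google_scholar")
def export_google_scholar(config):
    # Placeholder: a real plugin would search, format and split records.
    return "scholar export written to %s" % config["output_directory"]


def run_export(config):
    """Dispatch one configured job to the plugin named by export_style."""
    style = config["export_style"]
    try:
        plugin = EXPORTERS[style]
    except KeyError:
        raise ValueError("no exporter registered for style %r" % style)
    return plugin(config)
```

With this shape, adding a new exporter means writing one decorated function; the daemon itself does not change.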

Writing or understanding exporter plugins would require Python knowledge; but to tweak the exporters for an Invenio site's needs, people would simply need to edit these cfg files and/or enable them in BibExport Admin.

5.2 BibExport Admin

A BibExport Admin interface would be created to allow people to easily add/remove/configure exporting tasks. Again, the architecture could resemble existing Invenio modules.

6. API

7. Database

Database.png

7.1 queries

This table will contain data about queries. In the current design all the output fields are stored in one column (e.g. separated by commas), but another option is to use a separate table with columns for the fields, indexes and subfields.

7.2 jobs

Contains data about the jobs.
- running_mode indicates whether the job runs daily, monthly, yearly or manually
- start_date_time is the date and time when the job should run next
- output_directory holds the path where the output files with the results will be copied (necessary only if we keep this functionality; if we provide the results only for download, it is not needed)
- user_id identifies the user who created the job
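
After a job has run, start_date_time has to be moved forward according to running_mode. A minimal sketch of that calculation follows; the string values "daily", "monthly", "yearly" and "manually" and the function name are assumptions, since the stored representation of running_mode is not fixed here.

```python
import calendar
from datetime import datetime, timedelta


def next_start(start_date_time, running_mode):
    """Return the next start time, or None for jobs that only run manually."""
    if running_mode == "manually":
        return None
    if running_mode == "daily":
        return start_date_time + timedelta(days=1)
    if running_mode == "monthly":
        # December rolls over into January of the following year.
        year = start_date_time.year + start_date_time.month // 12
        month = start_date_time.month % 12 + 1
    elif running_mode == "yearly":
        year, month = start_date_time.year + 1, start_date_time.month
    else:
        raise ValueError("unknown running mode: %r" % running_mode)
    # Clamp the day so that e.g. 31 January moves to the end of February.
    last_day = calendar.monthrange(year, month)[1]
    return start_date_time.replace(
        year=year, month=month, day=min(start_date_time.day, last_day))


if __name__ == "__main__":
    print(next_start(datetime(2008, 1, 31, 2, 0), "monthly"))  # 2008-02-29 02:00:00
```

Note that clamping makes the day drift (a job started on the 31st settles on the 28th or 29th after its first February), so a real implementation might prefer to store the intended day of month separately.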

7.3 query_results

Holds the results for every query that is executed.
- result contains the result itself. The result is several lines of text, where every line represents a field of a record that matches the search criteria (the field is among the output fields).
- query_id is a foreign key to a query
- status holds a value indicating whether the query ran successfully or an error occurred
- status_message holds a message clarifying the status code (e.g. describing the error if there is one)
- number_of_records_found can be removed if we decide that it is not necessary to show the number of records found

7.4 job_results

Holds the results of the jobs.
- job_id is a foreign key to a job
- run_date_time holds the time when the job was run
- status and status_message, as in query_results, hold the status of the entire job

7.5 job_query and job_query_results

These tables link jobs to their queries, and job results to the results of every query in the job.

8. Classes

Classes.png

8.1 User Interface

The user interface will follow the same pattern used in the other Invenio modules. WebInterfacePages defines the set of pages for the user interface. Template holds the HTML code for the pages. Webinterface utils contains other methods needed for displaying the content.

8.2 Data objects

Job, Query, JobResult and QueryResult store the data about the jobs and their results.
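
As a rough illustration, these data objects could mirror the database columns described in section 7. The sketch below is an assumption, not the class diagram: the field types, the integer status code, the search_criteria field and the run_date_time spelling are guesses made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Query:
    id: int
    search_criteria: str  # assumed field name
    output_fields: List[str] = field(default_factory=list)


@dataclass
class Job:
    id: int
    running_mode: str
    start_date_time: Optional[datetime]
    user_id: int
    output_directory: Optional[str] = None
    queries: List[Query] = field(default_factory=list)


@dataclass
class QueryResult:
    query_id: int
    status: int  # assumed to be a numeric status code
    status_message: str = ""
    result: str = ""
    number_of_records_found: Optional[int] = None


@dataclass
class JobResult:
    job_id: int
    run_date_time: datetime  # assumed column spelling
    status: int
    status_message: str = ""
    query_results: List[QueryResult] = field(default_factory=list)
```

Keeping the queries and query results as lists on Job and JobResult reflects the job_query and job_query_results link tables.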

8.3 DatabaseManager

Manages storing and retrieval of data from the database.

8.4 JobScheduler

A class that will periodically check for jobs that have to be run and run them. The logic for running the jobs and saving the results can be encapsulated in an additional class. Part of the functionality for searching and conversion to Aleph format can be reused from the websearch and bibedit modules.

Topic revision: r12 - 2008-10-03 - TiborSimko
 