System Design: BibExport
1. Introduction
The BibExport tool enables cataloguers to export selected information from selected bibliographic records, in a chosen format, on a periodic basis. In other words, cataloguers configure what should be exported, when, and where, and the BibExport daemon automatically performs their queries in the background, producing output text files for later perusal.
2. Use cases
2.1. Exporting for accelerator field checking
We want to export the accelerator and the experiment fields for all records written by some collaboration, in order to be able to (i) proof-check them manually in Emacs, or (ii) load them into CernBibCheck to be corrected automatically.
2.2 Exporting for accent checking
We want to export all titles of multimedia records in order to check which UTF-8 accented letters were used there, because we positively know that in the past there was a window of time when the submission system was misconfigured and accepted titles with UTF-8 graphical glyphs, which are not permitted by the site convention.
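As an illustration, here is a minimal sketch of the accent check itself; the input format (a UTF-8 text file with one exported title per line) and the file name are assumptions:

# -*- coding: utf-8 -*-
# Sketch: report which non-ASCII characters occur in an exported file of
# titles (assumed UTF-8, one title per line) and how often.
import io
from collections import Counter

def find_non_ascii_chars(path):
    """Return a Counter mapping each non-ASCII character to its frequency."""
    counts = Counter()
    with io.open(path, encoding='utf-8') as titles:
        for line in titles:
            counts.update(ch for ch in line if ord(ch) > 127)
    return counts

if __name__ == '__main__':
    # 'multimedia-titles.txt' is a hypothetical export file name
    for char, freq in find_non_ascii_chars('multimedia-titles.txt').most_common():
        print(u'U+%04X %s occurs %d time(s)' % (ord(char), char, freq))

A cataloguer could then eyeball the resulting frequency list for glyphs that violate the site convention.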
2.3 Exporting for checking LaTeX
We want to periodically export all preprint titles in order to check their LaTeX syntax, i.e. to see whether braces and dollar signs are properly paired, etc. The exported files are passed on to CernBibCheck, which will warn the cataloguers about any bad titles it finds.
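For instance, the core of such a pairing check could look like this minimal sketch; escaped delimiters such as \{ and \$ are handled only crudely:

def latex_delimiters_balanced(title):
    """Sketch: check that braces are properly nested and that unescaped
    dollar signs come in pairs in a LaTeX title."""
    depth = 0        # current brace nesting depth
    in_math = False  # currently inside $...$?
    escaped = False  # previous character was a backslash
    for ch in title:
        if escaped:                 # skip \{, \}, \$, \\ etc.
            escaped = False
            continue
        if ch == '\\':
            escaped = True
        elif ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth < 0:           # closing brace without an opener
                return False
        elif ch == '$':
            in_math = not in_math
    return depth == 0 and not in_math

CernBibCheck would then flag any title for which this returns False, e.g. 'Search for $Z' or 'Decays of {B^0 mesons'.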
2.4 Exporting for third parties
We want to export theses in some format (say NLM) to some third party (say Google Scholar) once per month.
Example 1: please create a Google Sitemap every day for these four public collections.
Example 2: every first of the month, please export all records modified during the last month and matching these search criteria, in NLM format, in such a way that the output is split into files containing not more than 1000 records each, compressed via gzip (as sketched below), and placed in a location from which Google Scholar would fetch them. The output files would be organized like this:
* all exportable records:
/export/googlescholar/all-index.html - links to parts below
/export/googlescholar/all-part1.xml.gz - first batch of 1000 records
/export/googlescholar/all-part2.xml.gz - second batch of 1000 records
...
/export/googlescholar/all-partM.xml.gz - last batch of 1000 records
* records modified in the last month:
/export/googlescholar/lastmonth-index.html - links to parts below
/export/googlescholar/lastmonth-part1.xml.gz - first batch of 1000 records
/export/googlescholar/lastmonth-part2.xml.gz - second batch of 1000 records
...
/export/googlescholar/lastmonth-partN.xml.gz - last batch of 1000 records
(Note that this use case is a bit different from the ones cited above,
so the BibExport architecture should be nicely pluggable to allow
this.)
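To make the splitting and compression concrete, here is a minimal Python sketch; the helper name is hypothetical, and a real exporter would wrap the parts in a proper NLM XML envelope rather than plainly concatenating fragments:

import gzip
import os

def export_batches(records, outdir, prefix, batch_size=1000):
    """Write records (an iterable of serialized record strings) into
    gzipped parts of at most batch_size records each, plus an HTML
    index linking to the parts."""
    part_names = []
    batch = []

    def flush():
        if not batch:
            return
        name = '%s-part%d.xml.gz' % (prefix, len(part_names) + 1)
        with gzip.open(os.path.join(outdir, name), 'wt') as part:
            part.write('\n'.join(batch))
        part_names.append(name)
        del batch[:]

    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            flush()
    flush()  # write the final, possibly smaller, batch

    # index page linking to all parts
    with open(os.path.join(outdir, '%s-index.html' % prefix), 'w') as index:
        index.write('<html><body>\n')
        for name in part_names:
            index.write('<a href="%s">%s</a><br/>\n' % (name, name))
        index.write('</body></html>\n')

Called once with all exportable records (prefix 'all') and once with last month's modifications (prefix 'lastmonth'), this would produce exactly the layout above.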
2.5 Exporting for automatic correction/update
We want to run checks that automatically select certain records and update certain fields.
Example: all documents that have a conference code should also get the meeting information imported from the conferences file. Search for 111g without 111f; all records found should be exported. Then e.g. bibcheck or some external script could fetch the conference information, populate 111a-f, and reupload the records. There is no need for human intervention here, but the output/input could be checked periodically.
Example: all documents that have a collaboration note (710g) should also have an experiment (693e). In most cases the experiment can be determined by searching the experiments file for the text of the CN; in these cases it should be a straight replacement with no human intervention. In other cases a list may need to be created for a human to review.
Example: find all documents with a 773y (i.e. a publication year) but without a 773a. Export these in a fashion ready to send to CrossRef for DOI lookup. CrossRef will reply with a listing of journal information plus DOIs, which needs to be parsed and sent automatically into the changeset queue for approval (or probably automated approval).
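A minimal sketch of the 773y-without-773a selection, assuming Invenio's perform_request_search() and get_fieldvalues() helpers and its field:/regex/ search syntax; the tab-separated output is a placeholder, not CrossRef's actual submission format:

# Sketch: select records having 773__y but lacking 773__a and dump
# recid, year and journal name for a later CrossRef DOI lookup.
from invenio.search_engine import perform_request_search, get_fieldvalues

def export_records_for_doi_lookup(outpath):
    with_year = set(perform_request_search(p='773__y:/.+/'))  # has a publication year
    with_doi = set(perform_request_search(p='773__a:/.+/'))   # already has a 773__a
    with open(outpath, 'w') as out:
        for recid in sorted(with_year - with_doi):
            years = get_fieldvalues(recid, '773__y')
            journals = get_fieldvalues(recid, '773__p')
            out.write('%d\t%s\t%s\n' % (recid,
                                        years[0] if years else '',
                                        journals[0] if journals else ''))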
3. Workflow
3.1. BibExport daemon
The BibExport daemon will be run periodically as a BibSched task, say every day. On each run it will consult the list of configured export jobs, run those that are due, and record the results.
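In outline, the task body could look like this; fetch_due_jobs(), run_job() and mark_job_run() are hypothetical helpers, not existing Invenio API:

import datetime

def task_run_core():
    """Sketch of the BibSched task body: run every configured export job
    that is due and record the outcome."""
    now = datetime.datetime.now()
    for job in fetch_due_jobs(now):              # jobs with start_date_time <= now
        status, message = run_job(job)           # execute the job's queries/exporter
        mark_job_run(job, now, status, message)  # store the outcome in job_results
    return True                                  # tell BibSched the task succeeded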
4. Mock-up screenshots
Here are some mock-up screenshots for further discussion, intended to clarify both the user interface and the desired functionality.
4.1. Overview of jobs
4.2. Edit job settings
4.3. Edit job queries
4.4. Edit query
4.5. Manual job running: results screen
4.6. History of jobs
5. Architecture
5.1 bibexport
The bibexport CLI daemon resembles our bibindex/bibrank daemons; it can 'clone' their functionality to a large extent. To recap, the bibexport daemon would be run periodically via bibsched like this:
$ bibexport -u admin -s 24h
and would consult a table, say expEXPORT, listing all the configured export jobs; it would run them if needed, update their last run times, etc., exactly as the indexer/ranker does.
The bibexport daemon would have a pluggable architecture enabling people to write various exporters for the various use cases cited above. This is similar to how people can write various external authentication methods within the same framework. Examples of exporters we shall start with: Google Sitemap, Google Scholar.
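One possible shape for this pluggability, as a minimal sketch; the registry, decorator and class names are assumptions, not existing Invenio API:

# Sketch: exporters register themselves under their export_style name,
# and bibexport dispatches each job to the matching exporter class.
EXPORTERS = {}

def register_exporter(style):
    """Class decorator registering an exporter under its export_style."""
    def wrap(cls):
        EXPORTERS[style] = cls
        return cls
    return wrap

@register_exporter('google_sitemap')
class GoogleSitemapExporter(object):
    def __init__(self, config):
        self.config = config  # the parsed INI configuration for this job

    def run(self):
        # this is where e.g. the sitemap-generating code would be called
        pass

def run_export_job(config):
    """Instantiate and run the exporter named by the job's export_style."""
    style = config['exporter']['export_style']
    EXPORTERS[style](config).run()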
Every exporter would have an INI-style config file, something simple like the citation.cfg file that we use for the citation ranking. For example, for Google Sitemap we would have something along the lines of:
[exporter]
name = google_sitemap
export_style = google_sitemap
output_directory = /foo/bar
collection1 = Theses
changefreq1 = weekly
maxrecords1 = 10000
collection2 = Articles
changefreq2 = daily
maxrecords2 = 230000
and for Google Scholar we would have:
[exporter]
name = google_scholar
export_style = google_scholar
output_directory = /foo/bar/baz
split_by_records = 1000
period = month
records_with_fulltext_only = True
collection1 = CERN Theses
collection2 = CERN Notes
collection3 = CERN Videos
with all the needed parameters etc.
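A minimal sketch of how bibexport might read such a file and gather the numbered collectionN/changefreqN/maxrecordsN keys; the convention that the numbering stops at the first gap is an assumption:

import configparser

def load_exporter_config(path):
    """Parse an exporter .cfg file; return the [exporter] section as a
    dict plus a list of per-collection settings."""
    cfg = configparser.ConfigParser()
    cfg.read(path)
    section = cfg['exporter']
    collections = []
    n = 1
    while 'collection%d' % n in section:
        collections.append({
            'name': section['collection%d' % n],
            'changefreq': section.get('changefreq%d' % n),  # None if absent
            'maxrecords': section.get('maxrecords%d' % n),
        })
        n += 1
    return dict(section), collections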
The bibexport business logic for every concrete job would then be basically determined by the export_style; in the case of google_sitemap, bibexport would simply call Greg's code to do whatever job is needed.
In order to write or understand exporter plugins, Python knowledge would be needed; but in order to tweak the exporters for every Invenio site's needs, people would simply need to edit these cfg files and/or enable them in the BibExport Admin.
5.2 BibExport Admin
A BibExport Admin interface would be created to allow people to easily add/remove/configure exporting tasks. Again, the architecture could resemble existing Invenio modules.
6. API
7. Database
7.1 queries
This table will contain data about queries. In the current design, all the output fields are stored in one column (e.g. separated by commas), but another option is to use a separate table with columns for the fields, indexes and subfields.
7.2 jobs
Contains data about the jobs.
- running_mode - indicates whether the job runs daily, monthly, yearly or manually
- start_date_time - the date and time when the job should next run
- output_directory holds the path where the output files with the results will be copied (necessary only if we keep this functionality; if we provide the results only for download, it is not needed)
- user_id identifies the user who created the job
7.3 query_results
Holds the results for every query that is executed.
- result - contains the result itself. The result is several lines of text, where every line represents a field of a record that matches the search criteria (the field being among the output fields).
- query_id is a foreign key to a query
- status holds a value indicating whether the query ran successfully or an error occurred.
- status_message holds a message clarifying the status code (e.g. describing the error if there is one).
- number_of_records_found can be removed if we decide that it is not necessary to show the number of found records.
7.4 job_results
Holds the results of the jobs.
- job_id is a foreign key to a job
- run_date_time holds the time when the job was run
- status and status_message, as in query_results, hold the status of the entire job
7.5 job_query and job_query_results
These tables link jobs to their queries, and job results to the results of every query in the job.
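For concreteness, here is a sketch of how two of these tables might be declared, assuming MySQL and Invenio's run_sql() helper as in other modules; the table names, column types and the 'exp' prefix are assumptions following the field lists above:

from invenio.dbquery import run_sql  # Invenio's SQL helper

# Sketch of the jobs table (section 7.2)
run_sql("""CREATE TABLE IF NOT EXISTS expJOB (
  id               INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name             VARCHAR(50) NOT NULL,
  running_mode     ENUM('daily','monthly','yearly','manual') NOT NULL,
  start_date_time  DATETIME NOT NULL,  -- when the job should next run
  output_directory TEXT,
  user_id          INT UNSIGNED NOT NULL,
  PRIMARY KEY (id)
) ENGINE=MyISAM""")

# Sketch of the query_results table (section 7.3)
run_sql("""CREATE TABLE IF NOT EXISTS expQUERY_RESULT (
  id             INT UNSIGNED NOT NULL AUTO_INCREMENT,
  query_id       INT UNSIGNED NOT NULL,       -- FK to the queries table
  result         LONGTEXT,                    -- one output field per line
  status         TINYINT NOT NULL DEFAULT 0,  -- 0 = OK, non-zero = error
  status_message TEXT,
  PRIMARY KEY (id)
) ENGINE=MyISAM""")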
8. Classes
8.1 User Interface
The user interface will use the same pattern as the other modules in Invenio. WebInterfacePages defines the set of pages for the user interface. Template holds the HTML code for the pages. Webinterface utils contains other methods necessary for displaying the content.
8.2 Data objects
Job, Query, JobResult and QueryResult store the data about the jobs and their results.
8.3 DatabaseManager
Manages storing and retrieval of data from the database.
8.4 JobScheduler
A class that will periodically check for jobs that have to be run, and run them. The logic for running the jobs and saving the results can be encapsulated in an additional class. Part of the functionality for searching and conversion to ALEPH format can be reused from the websearch and bibedit modules.
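A minimal sketch of the scheduling decision itself; the names follow sections 7.2 and 8.4, and the month/year arithmetic is deliberately simplified:

import datetime

class JobScheduler(object):
    """Sketch: given a job's running_mode and its current scheduled time,
    compute when it should run next."""

    PERIODS = {
        'daily':   datetime.timedelta(days=1),
        'monthly': datetime.timedelta(days=30),   # simplification
        'yearly':  datetime.timedelta(days=365),  # simplification
    }

    def next_run(self, job):
        if job.running_mode == 'manual':
            return None  # manual jobs only run when triggered by hand
        return job.start_date_time + self.PERIODS[job.running_mode]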