Workflow for adding collaboration author lists
Travis' New Idea
Why are we doing this this way?
Why not instead:
1) Ask collaborations to maintain an author and affil. ID in their database. We don't care what it is, only that they assign a unique, persistent alphanumeric code to each author and affil. For most collaborations this will be a key in their DB.
2) Ask them to embed this in their XML
3) Upon XML ingestion we (Marg.) figure out the INSPIRE IDs for each author, and store the colln ID in the HEPNAMES or INST entry.
4) Next time most authors and affs are found instantly via the collaboration's key. Any new ones are understood and added at that time.
No communication back and forth with the collaboration is needed. We only maintain what we need in the XML; they need not suck info back up. If they want us to export to APS or ORCID, we can. Or we can give them our IDs, or ORCID IDs.
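A rough sketch of steps 3 and 4, in case it helps the discussion. Everything here is an assumption for illustration: the function name, the idea of keying on (collaboration, collaboration key), and the plain dict standing in for whatever we would really store in the HEPNAMES entry.

```python
def resolve_authors(collaboration, authors, known_ids):
    """Split an incoming author list into matched and unresolved authors.

    collaboration: name of the collaboration (hypothetical string label)
    authors:       list of (collaboration_key, author_name) pairs from the XML
    known_ids:     dict mapping (collaboration, collaboration_key) -> INSPIRE ID;
                   a stand-in for the keys stored in HEPNAMES, not the real store
    """
    matched = {}      # collaboration_key -> INSPIRE ID, found instantly
    unresolved = []   # authors someone still has to identify by hand
    for key, name in authors:
        inspire_id = known_ids.get((collaboration, key))
        if inspire_id is None:
            unresolved.append((key, name))
        else:
            matched[key] = inspire_id
    return matched, unresolved
```

On the first ingestion everything lands in the unresolved list; once those are worked through and stored, later lists should only leave the genuinely new authors for manual attention.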
Why is this stupid, please tell me...?
Summary
We are asking collaborations to generate from their database files called authors.xml and embed them in the tarball to arXiv. This can solve several problems:
- Parsing of long coll. author lists is made efficient and happens immediately
- Can get author IDs from collns.
- Can get affiliation data from collns.
However there are several pitfalls, and these are the problems as I see them:
- Avoid making one-off deals with a collaboration. With LaTeX we got caught in one-off deals with collaborations, and it is too hard to maintain the long tail. Better to have one standard and stick to it, for both the XML format and the workflow.
- Make sure that our XML examples are consistent and validate with standard XML tools
- Make sure that they meet the needs of all collaborations
- If we ask authors to go to effort, we must have the infrastructure in place to follow through and deliver as soon as they go to the effort, otherwise they will stop making the effort.
- Make sure that we are set up to extract XML files and parse them as soon as we download from arXiv (first SPIRES, then INSPIRE)
- CERN and DESY colls currently send us the XML files directly, so arXiv harvesting for them is not essential in the immediate future (AH)
- If we want IDs for authors and affils to come out of the collaboration databases, then we must give them a simple way to obtain this information from us one time.
- Affiliation ID: Prefer a SPIRES Aff ID, but will take domain name or other ext. ID. Most important to have full address as well
- When we give them back an XML file to enter into their DB, it must have the information they need, as well as what we need.
- We must not duplicate effort
- Make sure we each understand what everyone else is doing, including collaboration DB maintainers.
- Make sure we don't build things that will soon be irrelevant when Claim Papers/INSPIRE disambiguation comes online (i.e. how necessary is it to embed the IDs in the colln. papers, if we soon have input running through Henning's algorithms?)
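On the point above about our XML examples validating with standard tools: the cheapest first check is well-formedness, which Python's standard library can do. This is only a sketch; the function name is made up, and it does not check against any schema or DTD (that would need a validating tool such as xmllint, and an agreed format to validate against).

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise.

    This checks well-formedness only (matching tags, one root element),
    not conformance to any particular author-list format.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

Running every sample file we hand out through a check like this before publishing it would catch the most embarrassing class of errors.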
Proposed workflow
- Collaboration starts
- Collaboration gets sample XML from us
- Collaboration generates XML for their current author lists that matches sample
- INSPIRE modifies file
- We (Margaret) insert author IDs into this file for all authors
- Also inserts standard affiliation names
- Need to agree on format here
- File must retain original data as well, so that it can be re-uploaded to Collaboration database.
- should colln include internal keys or other markers to make the info transfer easier?
- perhaps a webservice they could consume that resolves name+aff -> author id?
- Collaboration uses new file to update their database
- This is tricky, and can cause some difficulty for the collns. We propose to offer them a choice
- suck back up the IDs from us, and send them on to us; then they can preserve ownership of the lists and pass them to anyone.
- we offer the service of maintaining the list for them. In this case there is extra work in that we store the original, and diff every time a new one comes in, flag for addition to Author ID etc. The infrastructure for this isn't really there yet, but could be, using a lot of manual effort. This manual effort is what we already put in, and it allows us to perform a useful service for a collaboration that might not be able to otherwise...
- Every new paper submitted to arXiv contains the XML file generated for that paper, named "authors.xml"
- SPIRES (and then INSPIRE) immediately detects *authors.xml in arXiv tarball.
- harvester sends result to XSLT -> SPIRES input -> directly into Mike's author subfile
- Mike's author subfile feeds SPIRES (may need special handling (flattening) for lists that are too big for SPIRES)
- Mike's subfile feeds INSPIRE the full (non-flattened) list.
- SPIRES detects new authors/affils in the XML file, assigns INSPIRE-id, feeds this info back to collaboration (AH)
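For the option in the workflow above where we maintain the list ourselves and diff each new one against the stored original, the diff itself could be as simple as the following. It is a sketch under assumptions: both lists are taken to be already parsed into dicts keyed by some stable per-author key (e.g. a collaboration key, if they supply one), and the function name is hypothetical.

```python
def diff_author_lists(old, new):
    """Compare two author lists keyed by a stable per-author key.

    old, new: dicts mapping author key -> (name, affiliation).
    Returns three sorted lists of keys: added, removed, changed.
    """
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    # Same key in both lists, but name or affiliation differs
    changed = sorted(k for k in new if k in old and new[k] != old[k])
    return added, removed, changed
```

The added keys are the ones to flag for a new author ID; the changed ones (typically a moved affiliation) still need a human to look at them. Without a stable key the diff would have to match on names, which is where most of the manual effort goes today.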
TODO
- Modify arXiv harvester to look for and use authors.xml -Mike
- Documentation for collaborations (and others who might use this) -Heath/Margaret/Suenje - then review by all
- Create web page describing the procedure
- Sample XML files
- Consider the case of single authors/small lists -???
- This procedure will not work in general for small lists, "for the collaboration" papers and other similar items.
- We can ask authors to embed something else, somewhere else, or to claim papers on INSPIRE when that interface is built (6 months?) or ?
- Whatever we do here, we need to make sure that we are ready and able to do it before telling authors
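For the first TODO item (harvester looks for authors.xml), the detection step might look like this. A sketch only: it assumes the arXiv source arrives as a tar archive held in memory, matches any member whose name ends in authors.xml, and uses a made-up function name. It is not a description of the existing harvester.

```python
import io
import tarfile


def find_author_xml(tar_bytes):
    """Return the bytes of the first *authors.xml member, or None if absent.

    tar_bytes: the downloaded tarball as bytes (compression auto-detected).
    """
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            # Only regular files; ignore directories, links, etc.
            if member.isfile() and member.name.lower().endswith("authors.xml"):
                return tar.extractfile(member).read()
    return None
```

A None result just means the paper falls back to the current handling, so the harvester change can go in before any collaboration starts sending files.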
--
TravisBrooks - 14-Apr-2010