New module needed on ingestion. Simple requirements:
Generate a score for how relevant the document is to HEP. If the score is lower than x, the record goes to the delete pile; if the score is in the middle, it goes to manual categorization; if the score is high, it goes to core.
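A minimal sketch of the three-way routing, assuming the score is a number and that two cut-offs exist. The names `route`, `low` and `high` are hypothetical (`low` plays the role of x above), and the actual threshold values are not decided in this note.

```python
def route(score, low, high):
    """Return the destination pile for a relevance score.

    low and high are assumed cut-offs, to be calibrated later.
    """
    if score < low:
        return "delete"   # junk: goes to the delete pile
    if score < high:
        return "manual"   # midrange: needs manual categorization
    return "core"         # high: goes straight to core
```

For example, with `low=0.3` and `high=0.7`, a score of 0.5 would be sent to manual categorization.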
The delete pile is a collection of all the junk records, but records are deleted only as needed, lowest score first. It may be that some higher-scoring records are kept permanently.
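One way the delete pile could behave, sketched with a min-heap so that the lowest-scoring records come out first. The class name `DeletePile`, its methods, and the `keep_above` cut-off for permanently kept records are all assumptions made for illustration, not part of any existing module.

```python
import heapq


class DeletePile:
    """Junk records, deleted only on demand and lowest score first."""

    def __init__(self, keep_above=None):
        self._heap = []               # entries are (score, record_id)
        self.keep_above = keep_above  # assumed cut-off for permanent keeps

    def add(self, record_id, score):
        heapq.heappush(self._heap, (score, record_id))

    def purge(self, count):
        """Delete up to `count` records, never touching protected ones."""
        removed = []
        while self._heap and len(removed) < count:
            score, record_id = self._heap[0]
            if self.keep_above is not None and score >= self.keep_above:
                break  # everything left scores high enough to be kept
            heapq.heappop(self._heap)
            removed.append(record_id)
        return removed
```

Calling `purge(n)` only when space is needed matches the "deleted only as needed" requirement; records at or above `keep_above` simply stay in the pile.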
Score is generated from the following, in order of importance to implement, user:
- SystemDesignBibClassify output->more core keywords, higher score (3+ core kws should get into core)
- SystemDesignRefExtract output-> references N or more existing core papers
- Affiliations (not clean affs, just string matching). In particular, if the record might go in an institutional repository (SLAC, CERN, DESY, Fermilab), we should make sure it goes at least to manual check
- Future alert service for other repositories? Would be nice to get fulltext links back for non-OA journal articles
- Papers from core arXiv categories should get an automatic core tag. Their scores could be used for calibrating our score algorithm. Some journals, like PRD, should go into manual check even with low scores.
- Any other good signals here? This can be improved while we are in production.
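The signals listed above could be combined roughly as follows. This is only a sketch under stated assumptions: the `Signals` container, the weights, the thresholds `low` and `high`, and the reference count `n_refs` are all hypothetical, and the real inputs would come from the SystemDesignBibClassify and SystemDesignRefExtract outputs rather than from hand-filled fields.

```python
import re
from dataclasses import dataclass, field

# Plain string matching on raw affiliation text, as the note suggests.
INSTITUTE_RE = re.compile(r"\b(SLAC|CERN|DESY|Fermilab)\b", re.IGNORECASE)
MANUAL_JOURNALS = {"PRD"}  # journals that always get at least a manual check


@dataclass
class Signals:
    core_keywords: int = 0   # number of core keywords found
    core_refs: int = 0       # references to existing core papers
    affiliations: list = field(default_factory=list)
    journal: str = ""
    core_arxiv: bool = False


def compute_score(s, n_refs=5):
    """Assumed weighting: 3+ core keywords give the top score of 1.0."""
    keyword_part = min(s.core_keywords, 3) / 3.0
    ref_part = 0.5 if s.core_refs >= n_refs else 0.0
    return max(keyword_part, ref_part)


def classify(s, low=0.3, high=0.7):
    """Return 'core', 'manual' or 'delete' for one record."""
    if s.core_arxiv:
        return "core"  # automatic core tag
    score = compute_score(s)
    if score >= high:
        return "core"
    if score >= low:
        return "manual"
    # Low score: still force a manual check for institutional
    # affiliations and for selected journals.
    if any(INSTITUTE_RE.search(a) for a in s.affiliations):
        return "manual"
    if s.journal in MANUAL_JOURNALS:
        return "manual"
    return "delete"
```

The overrides are applied after the numeric score so that the scores of core arXiv papers can still be computed separately with `compute_score` and used for calibration.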
If the score is midrange, the record goes to a manual interface that simply shows the user the paper, the abstract, and a link to the fulltext, asks for a score, and then acts accordingly.
--
TravisBrooks - 20 Jun 2009