BibCategorize

A new module is needed at ingestion. The requirements are simple:

Generate a score for how relevant the document is to HEP. If the score is lower than x, the record goes to the delete pile; if the score is in the middle, it goes to manual categorization; if the score is high, it goes to core.
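The three-way routing above could be sketched roughly as follows. The cutoff values and the names `LOW_CUTOFF`, `HIGH_CUTOFF` and `route` are assumptions made for illustration; the notes only name a lower cutoff "x" and do not fix a score scale.

```python
# Hypothetical cutoffs on an assumed 0..1 score scale.
# The notes only mention a lower cutoff "x"; both values are placeholders.
LOW_CUTOFF = 0.3
HIGH_CUTOFF = 0.7


def route(score, low=LOW_CUTOFF, high=HIGH_CUTOFF):
    """Return the pile a record goes to for a given relevance score."""
    if score < low:
        return "delete"   # junk: goes to the delete pile
    if score < high:
        return "manual"   # midrange: goes to manual categorization
    return "core"         # high: goes straight to core
```

For example, `route(0.5)` would send a record to manual categorization under these placeholder cutoffs.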

The delete pile is a collection of all the junk records. Records are deleted only as needed, lowest score first. Some higher-scoring records may be kept permanently.
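One possible way to pick records for deletion, lowest score first, is sketched below. The function name, the `(record_id, score)` tuple layout and the optional `keep_at_or_above` floor are all assumptions; the notes do not say how permanently kept records would be identified.

```python
def pick_for_deletion(delete_pile, how_many, keep_at_or_above=None):
    """Choose up to `how_many` record ids to delete, lowest score first.

    delete_pile: iterable of (record_id, score) pairs (assumed layout).
    keep_at_or_above: hypothetical floor; records scoring at or above it
        are never chosen, modelling the "permanently kept" records.
    """
    deletable = [
        (record_id, score)
        for record_id, score in delete_pile
        if keep_at_or_above is None or score < keep_at_or_above
    ]
    # Lowest-scoring junk is deleted first.
    deletable.sort(key=lambda item: item[1])
    return [record_id for record_id, _ in deletable[:how_many]]
```

Because records are only deleted as needed, the caller would pass the number of records it actually has to free rather than emptying the pile.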

The score is generated from the following, in order of importance to implement:

  • SystemDesignBibClassify output: the more core keywords, the higher the score (3+ core keywords should get into core)
  • SystemDesignRefExtract output: the document references N or more existing core papers
  • Affiliations (not clean affiliations, just string matching). In particular, if the document might go in an institutional repository (SLAC, CERN, DESY, Fermilab), we should make sure it goes at least to manual check
    • Future alert service for other repositories? It would be nice to get fulltext links back for non-OA journal articles
  • Core arXiv papers should get an automatic core tag. Their scores could be used for calibrating our scoring algorithm. Papers from some journals, like PRD, should go to manual check even with low scores.
  • Any other good signals? The list can be improved once we are in production
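A minimal sketch of how the signals listed above might be combined is given here. Everything numeric is an assumption: the weights, the cutoffs, the value chosen for N (`MIN_CORE_REFS`), and the journal string used for PRD. The `Record` class is a hypothetical stand-in for whatever the keyword and reference extraction steps actually emit.

```python
from dataclasses import dataclass, field

# Institutions named in the notes; matched by plain substring, not clean affiliations.
INST_REPOSITORY_NAMES = ("SLAC", "CERN", "DESY", "Fermilab")
MANUAL_CHECK_JOURNALS = ("Phys.Rev.D",)  # hypothetical spelling for PRD
MIN_CORE_REFS = 5                        # hypothetical value for "N"
LOW_CUTOFF, HIGH_CUTOFF = 0.3, 0.7       # hypothetical cutoffs


@dataclass
class Record:
    core_keywords: int = 0                             # count of core keywords found
    core_refs: int = 0                                 # references to existing core papers
    affiliations: list = field(default_factory=list)   # raw affiliation strings
    is_core_arxiv: bool = False
    journal: str = ""


def categorize(rec):
    """Return (score, pile) for a record, applying the overrides from the notes."""
    # Three or more core keywords alone saturate the score, so they reach core.
    keyword_part = min(rec.core_keywords, 3) / 3
    # Assumed weight of 0.5 once N core papers are referenced.
    refs_part = 0.5 * min(rec.core_refs / MIN_CORE_REFS, 1.0)
    score = min(1.0, keyword_part + refs_part)

    if rec.is_core_arxiv:
        return score, "core"      # automatic core tag; score kept for calibration
    if score >= HIGH_CUTOFF:
        return score, "core"
    if score >= LOW_CUTOFF:
        return score, "manual"

    # Low score: affiliation or journal matches still force a manual check.
    aff_hit = any(name in aff for aff in rec.affiliations
                  for name in INST_REPOSITORY_NAMES)
    if aff_hit or rec.journal in MANUAL_CHECK_JOURNALS:
        return score, "manual"
    return score, "delete"
```

Returning the score even for automatically tagged core arXiv papers keeps those scores available for calibrating the cutoffs later, as the notes suggest.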

If the score is midrange, the record goes to a manual interface that simply shows the user the paper, the abstract, and a link to the fulltext, asks for a score, and then acts accordingly.

-- TravisBrooks - 20 Jun 2009

Topic revision: r2 - 2009-06-22 - AnnetteHoltkamp
 