AnalysisOperations post-mortem for 2009 data running

Notes from operation side:

Note1: little data

  • User activity was not a visible load; CRAB/CrabServer operations and support went on transparently
  • Almost all lessons in this area are still the ones from OctX

Note2: but much MC

  • MC samples were felt to be of proper size and available as needed; largely a success in this area
  • somewhat too many samples, with too many versions, and it is difficult to know which one to use
    • not an immediate problem for AnalysisOperations
    • meta-information in DBS does not appear adequate yet; the information is basically limited to what is in the dataset name, and people work from dataset lists in twikis and mails (see the sketch below)
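
To see why this bites: essentially everything a user can learn about a sample without reading twikis is encoded in the three fields of the dataset path, /PrimaryDataset/ProcessedDataset/DataTier. A minimal Python sketch of the parsing people effectively do by eye (the dataset name below is a made-up example, not a reference to a real sample):

<verbatim>
# Minimal sketch: in practice the only structured metadata a dataset
# reliably carries is its three-field name. The example path is hypothetical.

def parse_dataset_name(path):
    """Split a CMS dataset path into its three conventional fields."""
    parts = path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError("expected /PrimaryDataset/ProcessedDataset/DataTier")
    primary, processed, tier = parts
    return {"primary": primary, "processed": processed, "tier": tier}

print(parse_dataset_name("/MinBias/Summer09-STARTUP3X_V8D-v1/GEN-SIM-RECO"))
# -> {'primary': 'MinBias', 'processed': 'Summer09-STARTUP3X_V8D-v1',
#     'tier': 'GEN-SIM-RECO'}
</verbatim>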

Note3: CRAB support load still too high:

  • Daily support for problems on the crabFeedback list is too much
    • we still have no idea how effective new people can be before they have ~1 year of experience
  • most questions we get are not solved by pointing to documentation
    • good: documentation and tutorials are good and people use them; they seldom ask the basics
    • bad: few threads are closed quickly; often support needs to reproduce, debug, and pass the problem to the developers
    • still working with CRAB 2.6.6; much is hoped from the new 2.7.x for error handling and messaging (in any case, 2.6.x is only getting critical fixes)

Note4: users relied on CAF

  • since data was scarce, ntuples could be made on the CAF and replicated around by "all sorts of means"
  • fear that in a one-day-of-beam, few-days-of-no-beam pattern the habit may continue
  • it will be painful to wean users off it in a rush under the pressure of more data
  • many people use the grid daily; we know it works; we want to make sure everyone can do it
  • use cases appeared for moving data from the CAF to T2s (MC so far; expect data and calibration next)
  • the CAF is by design separated from the distributed infrastructure, but it is hard to explain to users that "data cannot move between CERN disks and offsite T2 disks"
  • grid access to the CAF would allow users to use the same /store/result instance (we only have 1 FTE overall to operate the service)

Note5: CAF Crab Server

  • not really "like the grid one, but with AFS"
  • requires dedicated testing/validation and support (which we lack), not to mention development
  • it hides not the "unreliable grid" but the highly reliable local batch system, so it needs to be no less robust. On the grid the problem is to make the grid invisible; here it is to make the CRAB server invisible.

Note6: transition control:

Changes in offline (SL5, Python 2.6, ...) are often declared done once CMSSW is out, but the transition in CRAB and the overall infrastructure needs to be better taken into account. We needed to rush out a new CRAB release to cope with SL5; grid integration issues were solved on the last day (with a recipe from operations) and led to "horror code", and DPM sites are still not working.
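
As an illustration of the "better taken into account" point, one cheap mitigation is a preflight check that refuses to run in an environment the release was never validated for, instead of failing mid-workflow. A hedged Python sketch; the version sets and the check itself are assumptions for illustration, not actual CRAB code:

<verbatim>
# Sketch of an environment guard for a submission tool during an
# OS/Python transition. The supported sets below are illustrative
# assumptions, not the real CRAB validation matrix.
import platform
import sys

SUPPORTED_PYTHON = [(2, 4), (2, 6)]   # assumed validated interpreter versions
SUPPORTED_OS_HINTS = ("el4", "el5")   # SL4/SL5 kernel release markers

def check_environment():
    """Fail fast, before any grid interaction, on unvalidated platforms."""
    py = tuple(sys.version_info[:2])
    if py not in SUPPORTED_PYTHON:
        raise RuntimeError("Python %d.%d not validated for this release" % py)
    release = platform.release()      # e.g. "2.6.18-194.el5"
    if not any(hint in release for hint in SUPPORTED_OS_HINTS):
        raise RuntimeError("OS release %r not validated" % release)

try:
    check_environment()
    print("environment OK for submission")
except RuntimeError as err:
    print("refusing to submit: %s" % err)
</verbatim>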

Note7: long lead time to get new CRAB servers in at CERN (get hardware, configure, ...)

We need good advance planning, no sudden changes, and robust servers. CRAB server failures are not transparent: one takes ~half of the CMS analysis workload down with it. UPS?

Metrics and usage

CAF: far from saturation

CAF Crab Server

Concerns, Lessons, Plans

data distribution largely a success

  • but we have not dealt with cleanup and working with "full sites" yet
    • are space bookkeeping tools adequate? Is there any developer effort?

daily CRAB support effort must be brought down

  • we cannot offer expert advice and debugging on all currently implemented features
  • we will make a list of what we are comfortable with and know how to use, and support only that
  • could use better separation of the CMSSW/DBS/CRAB/GRID parts (true for us, for users, and possibly for developers)
    • clean segmentation allows us to "attach problems to a specific area" (see the sketch after this list)
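
A toy illustration of what segmentation buys at the support desk: if each layer has distinguishable failure signatures, a first-pass triage can attach an incoming crabFeedback report to an area before an expert ever looks at it. The keyword lists below are invented placeholders, not an existing AnalysisOperations tool:

<verbatim>
# Toy first-pass triage: tag a support report with the layer it most
# likely belongs to. The areas match the document; the keyword lists
# are made-up placeholders for illustration.

AREA_KEYWORDS = [
    ("CMSSW", ["cmsRun", "ParameterSet", "edm::"]),
    ("DBS",   ["dbs", "dataset", "query"]),
    ("CRAB",  ["crab.cfg", "crab -submit", "server_name"]),
    ("GRID",  ["proxy", "globus", "lcg"]),
]

def triage(report):
    """Return the first area whose signature appears in the report."""
    text = report.lower()
    for area, keywords in AREA_KEYWORDS:
        if any(k.lower() in text for k in keywords):
            return area
    return "UNKNOWN"  # needs a human; ideally a rare outcome

print(triage("job aborted: could not get a valid proxy"))  # -> GRID
</verbatim>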

need to become able to operate in "watch it while it runs and figure out overall patterns" mode

(a piece of) the CAF should be integrated with the distributed infrastructure, so that it can be used with the same tools running the same workflows

  • submit with the same CRAB server
  • use the same /store/result instance
  • access to PhEDEx transfers

development of new features must be championed by the users who need them

  • each extra "thing" generates workload forever.

-- StefanoBelforte - 25-Jan-2010
