Workflow Team Meeting - Nov 6 4PM CERN time & US Meeting Tues Nov 4 at 1PM FNAL time

Vidyo Link

Attending

  • Tues Meeting : Jen, Sean and Ian
  • FNAL: Luis, Jen, Dave, Seanchan
  • CERN : Alan, Julian the Terrible

Personnel

EU

Oct 30 -> Nov 6 Xavier
Nov 7 -> Nov 13 ?

US

Oct 22 -> Oct 30 Ian
US shifters have decided it works better for their schedule to commit to 10 hrs a week of monitoring. We will give it a try and see how it goes!

  • Julian will be in Colombia 25th Nov - 25th Dec; he will be working and plans on being online.
  • Luis will be in Columbia from Dec 20 through New Year; he will get us exact dates soon.

News

  • spend more time on sites not working - talk about it at US meeting
  • anybody who can make the CompOps meeting on Monday should attend:
https://twiki.cern.ch/twiki/bin/viewauth/CMS/CompOpsMeeting
  • Monitoring scripts should move to vocms049 too? how is migration going?
    • in principle migration should be OK but we are missing git on the machine
    • scripts are over there and running and no issues.
  • WMAgentScripts big changes:
    • please pull the latest version
    • closeout script won't close unless PhEDEx files == DBS files
    • WorkflowPercentage with different options
    • http://jbadillo.web.cern.ch/jbadillo/closeout.html
      • dbsF = dbsFiles, phdF = PhEDEx files; if those 2 numbers do not match, the workflow will not close out
      • if a WF stays stuck for more than a day in that state we should debug it. Put the numbers in the closeout report so we can see whether it made progress.
  • compaction of database causing issues with wmstats
    • move of cmsweb to virtual machines + couch issues causing issues in monitoring on Wed
    • Not a problem today, but it was painful this morning EU time; we can't run without compaction for too long. Diego is commissioning a new machine for the backend; it is a real (physical) machine.
    • latency in virtual machines is going to be tough on us.
    • What has been moved to SL6
      • all of cmsweb - they all moved to virtual machines
        • everything has been moving slowly, causing the ErrorHandler to crash everywhere
        • it takes 1 month to get physical machines; there is a big struggle with CERN IT
        • we are going to get delays everywhere getting machines up, but eventually it will be moved to physical machines? PhEDEx subscriptions were slow yesterday
        • everything was slow yesterday; maybe we should report it in the web-interface HyperNews: PhEDEx dashboard, requestmanager and WMStats were slow, as were assigning and aborting
        • We don't know when we will get physical machines
        • give the green light to get 4 physical machines and then run a comparison between RAID 0 and RAID 1; we need either a recovery procedure or robust machines.
      • dqm gui machines haven't moved
  • Jobs not reporting properly to Dashboard - seems to be working now
    • Taskname has a unicode string, and it isn't reporting properly. WMBS insertion wasn't working. Do we need a new request manager for testbed?
  • the T0 needs T2_CH_CERN_T0 back for the cosmics run - do not assign anything there
      • there were jobs running on Monday.
  • What is coming soon from PPD: see slides
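
The closeout rule from the WMAgentScripts item above (no closeout unless the DBS and PhEDEx file counts agree, and debug anything stuck for more than a day) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual closeout script: the DatasetCounts record, its field names, and the three helper functions are hypothetical, and the real script may track the mismatch time differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DatasetCounts:
    """Hypothetical record mirroring the dbsF / phdF columns of the closeout report."""
    dataset: str
    dbs_files: int                             # dbsF: files known to DBS
    phedex_files: int                          # phdF: files known to PhEDEx
    first_mismatch: Optional[datetime] = None  # assumed: when the mismatch was first seen


def can_close_out(counts: DatasetCounts) -> bool:
    # The workflow may close out only if both catalogues agree on the file count.
    return counts.dbs_files == counts.phedex_files


def needs_debugging(counts: DatasetCounts, now: datetime,
                    limit: timedelta = timedelta(days=1)) -> bool:
    # Flag a dataset whose counts have disagreed for longer than the limit.
    if can_close_out(counts) or counts.first_mismatch is None:
        return False
    return now - counts.first_mismatch > limit


def report_line(counts: DatasetCounts) -> str:
    # Put both numbers in the report so progress is visible between runs.
    status = "OK" if can_close_out(counts) else "MISMATCH"
    return (f"{counts.dataset} dbsF={counts.dbs_files} "
            f"phdF={counts.phedex_files} {status}")
```

Printing report_line for each output dataset on every run would show whether phdF is catching up with dbsF before anyone starts debugging.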

Site support

  • FNAL will need to be

EU shift notes

  • have we come up with a time that works for EU shifters to have a meeting?

US Shift notes

  • Ian still can't connect to the SL6 machines
    • Jen will send Ian that part of the .config file
  • Ian having problems authenticating as cmst1 on CERN machines. - Julian is working on this
    • Ian needs to send email to Alan and Ivan to get added
  • Ian will be watching late afternoon each day - covering early Asia shift
  • Sean - the later in the day the better, so late afternoon or evening, which also covers part of the Asia shift!

Agent Issues

  • ErrorHandler on submit1 going down repeatedly - now working
  • 98 is almost drained
  • 112 is totally drained (can be given up); it is shut down for now and hidden from wmstats.
  • vocms85 is also being drained
  • submit2 is reporting to testbed rather than backfill so is not showing up as healthy in the monitoring
    • there are 2 old workflows that were on submit2 before it was upgraded. Do we need to just force complete them? WMStats is showing jobs running but the agent isn'
    • Explanation: they were acquired in last wmstats view, but they were force-completed before first redeployment.
  • vocms216 rebooted, having some condor troubles, Farrukh is working on that.

Redeployment plan

  • Redeployment plan
    • Production Pool:
      production SL6 mc SL5
      cmssrv217 (up/new version)
      218 (up/new version)
      219 (up/new version)
      vocms216 (up/new version)
      201 (up/new version)
      235 (up/new version)
      cmssrv98(drain - about to abandon)
      reproc_lowprio SL5 step0 SL5
      vocms202 (up/new version)
      234 (up/new version)
      85 (up - will be abandoned)
      cmssrv112 (abandoned)
      vocms237 (up/new version- will be abandoned)
    • Global Pool
      backfill SL6
      submit1 (up/new version)
      submit2 (up/new version)
    • cmssrv112 abandoned (shut down)
    • Draining cmssrv98 (to give away)

Workflows

ReDigi

  • pdmvserv_JME-Summer12DR53X-00185_00334_v0_RD_141024_190329_5720 = ACDC failed with "ACDC is not supported for WMBSMergeBySize"
    ./WorkflowPercentage.py pdmvserv_JME-Summer12DR53X-00185_00334_v0_RD_141024_190329_5720
    /ZJetToMuMu_Pt-230to300_TuneEE3C_8TeV_herwigpp/Summer12DR53X-PU_RD1_RD_START53_V7N-v1/AODSIM
    Input events: 100000
    Output events: 76710
    /ZJetToMuMu_Pt-230to300_TuneEE3C_8TeV_herwigpp/Summer12DR53X-PU_RD1_RD_START53_V7N-v1/AODSIM match: 76.71 %
    /ZJetToMuMu_Pt-230to300_TuneEE3C_8TeV_herwigpp/Summer12DR53X-PU_RD1_RD_START53_V7N-v1/DQM
    Input events: 100000
    Output events: 100000
    /ZJetToMuMu_Pt-230to300_TuneEE3C_8TeV_herwigpp/Summer12DR53X-PU_RD1_RD_START53_V7N-v1/DQM match: 100.0 %
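
The match figures in the WorkflowPercentage.py output above are consistent with output events divided by input events. A minimal sketch, assuming that is how the figure is computed (the function name match_percentage is hypothetical and is not taken from WMAgentScripts):

```python
def match_percentage(input_events: int, output_events: int) -> float:
    """Return output events as a percentage of input events, to 2 decimals."""
    if input_events <= 0:
        # Avoid dividing by zero for a workflow with no recorded input.
        raise ValueError("input_events must be positive")
    return round(100.0 * output_events / input_events, 2)


# Numbers taken from the notes above.
aodsim = match_percentage(100000, 76710)
dqm = match_percentage(100000, 100000)
```

With the numbers from the notes, the AODSIM dataset comes out at 76.71 and the DQM dataset at 100.0, matching the script output.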

miniaod's

Rereco

Store Results

MonteCarlo

SL6 testing/backfill

RelVal Andrew

-- JenniferAdelmanMcCarthy - 2014-11-04
Topic revision: r6 - 2014-11-06 - JenniferAdelmanMcCarthy
 