Workflow Team Meeting - Sept 25 4PM CERN time

Vidyo Link

Attending

  • FNAL: Jen, Seangchan, Dave, John, Luis
  • CERN: Julian, Andrew, Alan

Personnel

Sep 20 -> Sep 25 Sara
Sep 25 -> Oct 2 Sara
  • Dave is down to one last Tues of Jury Duty! Yeah!!!!

News

  • Hyper-urgent stuff is running. Upg2023SHCAL14DR is in the tails; the ACDCs do seem to be giving us a little more stats.

  • New stuff in the issues page: http://jbadillo.web.cern.ch/jbadillo/reproc/issues.html
    • Workflows acquired on a site but input not subscribed.
    • If a block is open, WQ doesn't pull its location; it will check PhEDEx if the block is complete.
      • If you see WFs here, look at the WF, grab the input dataset, and check PhEDEx to see if a subscription is pending; if not, subscribe the dataset to _DISK.
      • Julian will work on making a tool to check this automatically so we don't have to look at these things manually.
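
A rough sketch of the manual check described above, just to make the decision explicit. This is not Julian's tool. The function name, the dataset names and the idea of passing in a set of datasets with pending subscriptions are all assumptions made for illustration; a real tool would have to fill that set by querying PhEDEx.

```python
def check_input(input_dataset, pending_subscriptions):
    """Decide what to do for a WF acquired on a site whose input is not subscribed.

    input_dataset: name of the WF's input dataset (hypothetical example names below).
    pending_subscriptions: set of dataset names with a pending subscription
        (assumed to have been looked up in PhEDEx beforehand).
    """
    if input_dataset in pending_subscriptions:
        # A subscription is already pending, so there is nothing to request.
        return "wait"
    # No pending subscription: the dataset needs to be subscribed to _DISK.
    return "subscribe to _DISK"


# Hypothetical dataset names, for illustration only.
pending = {"/ExampleA/Run/AOD"}
for dataset in ["/ExampleA/Run/AOD", "/ExampleB/Run/AOD"]:
    print(dataset, "->", check_input(dataset, pending))
```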

Site support

  • There are xrootd issues at sites because they are still working out AAA and the redirector.
  • xrootd is critical for SAM tests now, so if it is down for any length of time the SAM test will turn red; if it stays down for a few days, the site will go to the waiting room and into drain.
  • Before a site goes into drain, it may have been failing xrootd for several days.
  • Bari - they were in a scheduled downtime and haven't been out of drain in over a week. Site readiness is at 33%; SAM test and HammerCloud issues are still happening, so it is still in drain.

Sara's notes

Agent Issues

  • Need to work on testing the SL6 agent harder
    • cmssrv217 finished some test workflows, we can deploy 218 and 219
    • CERN virtual SL6-WMAgent machines are not doing well when they reach a certain load.
      • We see condor submit errors because we cannot run condor_q. Do we need bigger virtual machines, or to bring down the number of jobs? If neither works, we will have to request physical machines.
    • Production agents are halfway between the priority of the, when we put in the backfill stick to Tier 1s
  • WorkQueue stuck and needs a bit of debugging.
    • Alan found something weird going on between central couch and WorkQueue
    • Local WQs are having the same errors

Redeployment plan

  • agents are up to date, just waiting for more SL6 testing to get going and then deploy the rest of the SL6 machines

Workflows

  • Processing string - the plan was that they would pick some date (which was supposed to be yesterday); anything from that date on would use a processing string and anything before then would not have it. Dave would prefer it to be split by campaign. If processing string = none, do it the old way; otherwise we take the new string.
    • There are some "old" (as in the last month or so) WFs that have processing strings in them.
    • We need to come up with a clear policy for when to use the processing string and when we are doing things the old way.
      • We need to talk to MCM about how they want the policy set: Dave will reply to the e-mail saying that they are postponing it, and ask for clarification as to how we will know that we are using it.
      • Julian will make a list of WFs that have a processing string in the schema and are already in the system.
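
A minimal sketch of the "processing string = none, do it the old way" rule, and of the kind of list Julian mentioned. This is only an illustration under assumptions: the schema is taken to be a plain dictionary, the key names `ProcessingString` and `RequestName` and the example values are assumed, and a missing or empty value is treated the same as "none".

```python
def uses_processing_string(schema):
    """True if the schema carries a real processing string, False for the old way."""
    value = schema.get("ProcessingString")  # assumed key name
    # Missing, empty, or a literal "none" all mean: do things the old way.
    return value not in (None, "", "none", "None")


def workflows_with_processing_string(requests):
    """Names of the WFs whose schema has a processing string."""
    return [r["RequestName"] for r in requests if uses_processing_string(r)]


# Hypothetical request schemas, for illustration only.
requests = [
    {"RequestName": "wf_old", "ProcessingString": None},
    {"RequestName": "wf_new", "ProcessingString": "EXAMPLE_STRING"},
    {"RequestName": "wf_missing"},
]
print(workflows_with_processing_string(requests))
```

Keeping the rule in one small function means a date-based policy and a campaign-based policy would differ only in how the schema gets its processing string, not in how the list is built.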

ReDigi

miniaod's

  • A workflow with duplicates was resubmitted by PPD, and invalidation requests were sent, without them telling us that this is what the request was. Anyway, now we can run the miniaod on the data, then invalidate and reject the remaining workflows.
    • Dave will check and submit the new miniaods, then we will reject old workflows and invalidate/delete old outputs.

T1's summary

Rereco

  • nothing... literally

Store Results

  • Discussion with Luis on getting LogArchives. Luis already replied by e-mail during the meeting.

MonteCarlo

  • running smoothly

RelVal Andrew

  • there are not enough timestamps in the wmagent stage out code: https://cmslogbook.cern.ch/elog/Workflow+processing/17008
  • out-of-office e-mails from Antonio Perez-Calero Yzquierdo are annoying
  • can we remove the Release Validation elog and replace it with a wmagent problems elog?
  • workflow status changes that are not in the github diagram
  • what is the memory limit at FNAL?
  • are workload summary plots showing cpu time or wall time? how are these plots made?
  • how do we get EOS space for the relval account?

-- JenniferAdelmanMcCarthy - 24 Sep 2014

Topic revision: r4 - 2014-09-25 - JenniferAdelmanMcCarthy