Reprocessing and Production Team Meeting - June 16, 4PM CERN time, 9AM FNAL time


Vidyo Link

Attending

  • FNAL: Jen, Gaston, Matteo
  • US: Allie
  • CERN : Dima, Paola, Sebastian, Alan
  • Korea :

Personnel

  • Jen Vacation July 25-29
  • SeangChan June 2-July 5
  • Jorge June 13-24

News - Dima

  • We need to be checking the stuck workflows pages

Top issues affecting production

  • wrong config for MiniAOD
  • wallclock timeouts - https://ggus.eu/index.php?mode=ticket_info&ticket_id=122064
    • some of the jobs were landing on worker nodes that didn't have the right CMS setup, and the file that they were [...]; this was fixed in WMCore
    • we need to figure out which sites these jobs are landing on; the files were being held
      • we need to look at the condor logs in the agents. There is a GitHub issue open on this. You can get the worker node names from the dashboard, but that isn't working right now either
    • one aborted workflow was still running some jobs after it was aborted and archived, so wallclock failures on aborted jobs can be ignored
    • The way we know which site a job is running at is JobStatusLite, which has a polling cycle. If a job starts and ends within the same minute, the agent doesn't know which site it ran at. This needs further investigation; Alan is checking whether this is really an issue.
  • Exit code 71101: unknown submit failures. Jean-Roch has disabled overflow for now, so the workflows are at least restricted to the couple of sites in the whitelist: https://cms-logbook.cern.ch/elog/Workflow+processing/24694
    • this will make debugging and recovery easier.
  • assign.py script issues.
    • JR found a couple of issues; Matteo will look at them later today.
    • exception on taskChainMerge: Alan and Matteo will look into it. JR and Alan are not in agreement.
    • we can use it for everything but mergeTaskChain ACDCs
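
A minimal sketch of the job-status polling question raised under the wallclock item above: if the site of a job is only recorded when a poll catches the job running, a job that starts and finishes between two polls never gets a site. This is an illustration only, not the agent's actual code. The `Job` class, the function name, the site names, and the 60-second interval are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    site: str   # hypothetical site name, for illustration only
    start: int  # seconds since the start of the simulation
    end: int


def sites_seen_by_poller(jobs, poll_interval, horizon):
    """Return {job name: site} for every job caught running at a poll time."""
    seen = {}
    for t in range(0, horizon + 1, poll_interval):
        for job in jobs:
            # The poller only learns the site if the job is running right now.
            if job.start <= t < job.end:
                seen[job.name] = job.site
    return seen


jobs = [
    Job("long_job", "T2_EXAMPLE_A", start=10, end=400),
    # Starts after the poll at t=60 and ends before the poll at t=120.
    Job("short_job", "T2_EXAMPLE_B", start=70, end=110),
]

seen = sites_seen_by_poller(jobs, poll_interval=60, horizon=600)
missed = [job.name for job in jobs if job.name not in seen]

print("site known for:", seen)
print("site unknown for:", missed)
```

In this toy model the short job is never observed. Whether the real agent behaves this way is exactly what still needs to be investigated.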

Site support -

  • UCSD, TIFR, Bari, UCL - into the waiting room
    • UCSD out as of yesterday
    • UERJ, Rome - out
  • Ioannina in the morgue
  • Gaston is going to help track down the wallclock issue

Transfers - Sebastian

  • link between CERN and Nebraska is fully commissioned now
  • new version of PhEDEx is up and transfers are running, working through the backlog at CERN
  • tuning agent configurations
  • JR has a list of files that need to be checked

Workflows

ReDigi

MiniAOD

TaskChains

StepChain

  • NA

Rereco

Store Results

MonteCarlo

Agent Issues

  • 219 is crashing; we are talking to Yuyi about the cause. Couch was down twice in the last 48 hours, so we need to keep a close eye on it.

Agent redeployment

RequestMgr2 Migration

  • still have a problem where the API returns an internal server error; it is being debugged

Merging Scripts

  • assign.py script - being debugged today, can't assign things until it's fixed.
  • need to stop using assignProdTaskChain and assignWorkflow

RelVal - Andrew

AOB

-- JenniferAdelmanMcCarthy - 2016-06-15

Topic revision: r3 - 2016-06-16 - JenniferAdelmanMcCarthy
 