Workflow Team Meeting - Dec 11 4PM CERN time

Vidyo Link https://indico.cern.ch/event/358187/

Attending

  • Wed EU Operators Meeting:
  • Thurs Meeting:
    • FNAL: Jen, Luis, Dave, SeangChan, Ian, Sean
    • CERN : Dima, Andrew, Alan, Julian

Personnel

Dec 4 -> Dec 10 Xavier
Dec 11 -> Dec 17 Sara
  • EU - note the schedule doesn't go on past next week. Who do we have shifting over the Holidays? We will have the Christmas Production going so we need to have people watching the system!
    • Julian will ask.
  • Holiday vacation plans
    • Julian will be in Colombia 25th Nov - 25th Dec - Working plans on being online. Jan 1-7
    • Luis will be in Colombia from Dec 20 through New Year; he will get us exact dates soon
    • Jen will be in MN Dec 26-Jan 2 and will have limited internet access
    • Ian & Sean will be around but nothing prolonged
    • Alan Dec 13-New Years Brazil - if you need him you have to call Alan's Grandmother, he will post number in Twiki ;)
    • Seangchan will take a couple days off, but not telling us when ;)
    • Dima will be in FL next week, but will be around over the Holidays

News

  • News on Christmas production? Expect it to hit right as CERN shuts down - Dima
    • Information from PPD. We should expect the following requests:
      • Upgrade - DigiReco
        • need to put on disk TP2023SHCALGS
        • injected maybe by end of week
      • PF calibration with CMSSW_7_3 - DigiReco
      • 5 datasets of 2013A data - DigiReco
        • needs to be staged in from tapes. One of them /PPJet/Run2013A-v1/RAW
        • this is REAL data. It has been over a year since we did real data, so we are going to have to be really careful with this. We may have problems staging data because permissions have changed. If we can get the dataset and a test workflow to run, let's try it and see what happens. Keep expectations LOW!!! Where we write data has also changed, so we could have stageout problems. The dbs2/3 transition has also happened during this time. The sooner we can get something reasonable to run as backfill the better.
      • RelVals 7_4
      • 200M Phys14DR for TSG - mostly for trigger group.
      • 250M of RunIIFall14GS - GEN-SIM
    • We got some information on input datasets, is that all of them? Have the ones we got already been checked to make sure they are on _DISK, valid, complete, no duplicates, subscribed ready to go?
      • We don't have all the datasets yet. We should make sure everything is ready to go, and that it is all at FNAL_DISK in case there are issues at other sites.
    • Both WF and Site support will be on skeleton staff so keep
    • Do we know what time per event we will be looking at? Don't know yet, most likely not something nice.
      • Still don't know. Keep in mind whatever they give us is best case, and we should double it to get a reasonable estimate.
    • What is the drop dead date for the outputs? Which of these samples need to be done by what date? Do we go all the way to 95% or is 75% good enough for each sample?
      • Nothing that urgent, so we will see. Upgrade will of course be most urgent.
  • Problems after cmsweb upgrade :
    • assignment script is giving :
      • urllib2.HTTPError: HTTP Error 503: Service Temporarily Unavailable
    • closeout script gives: (randomly)
    • Both solved by now, CMSWEB upgrade issues.
    • Problem isolated to 2 different backends. The backends that were giving problems have been disabled, so we are down to 2 backends instead of 4, but the problems have disappeared.
    • services are still really slow, Tues had lots of mismatch info in agent
    • Who do we bug to get this fixed? this is more than a little annoying it's making it almost impossible to get through everything!
    • need to confirm cert forever, when you access cmsweb you need to install the grid cert - Julian will do this
    • still not totally solved but getting better
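
While cmsweb is still slow, scripts that hit the intermittent 503 above could wrap their calls in a retry. This is only a minimal sketch, not what the assignment or closeout scripts actually do: `call_with_retry` and `fetch_url` are hypothetical names, the retry count and delays are guesses, and it uses Python 3 `urllib` rather than the `urllib2` seen in the traceback. Real cmsweb calls would also need the grid cert, which is left out here.

```python
import time
import urllib.error
import urllib.request


def call_with_retry(fetch, retries=5, delay=2.0, sleep=time.sleep):
    """Call fetch(); on HTTP 503, wait and retry with a doubling delay."""
    for attempt in range(retries):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            # Only retry "Service Temporarily Unavailable"; re-raise any
            # other HTTP error, and give up after the last attempt.
            if err.code != 503 or attempt == retries - 1:
                raise
            sleep(delay * 2 ** attempt)


def fetch_url(url):
    """Hypothetical fetch helper (no grid cert handling)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()
```

A caller would pass its own request as a callable, e.g. `call_with_retry(lambda: fetch_url(some_url))`.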

Site support

  • there are still about 7 sites (including MIT, Wisconsin, IFCA, ...) that are not being used in our production, where do we stand on this?
    • CERN Luis should be looking at it. Julian has MIT and Nebraska running, he will test the rest of the jobs.
  • Alan - two opportunistic sites Luis doesn't have permissions to do so. Asked John about it and he hasn't replied yet. T3_US_SDSC and T3_US_NERSC
    • Luis and John will work on it. Alan will talk to the right people at CERN

EU shift notes

  • Wed meetings do we have notes?
    • Was cancelled, Julian sent them an email with a few points to take care of.
  • Jasper - not going to be doing shifts after Jan 1
  • In order to do 1 wk full time we need 4 people, and we have 2 people. We need 10 hrs a week of monitoring and keeping things up and running. Given they have other duties, the goal is to get them to agree to 10 hrs a week each. If one keeps components up and the other tracks issues, it would be enough to keep their commitment up and keep them in the loop.
  • we need to talk to Christoph about getting more people.

US SHIFT notes

  • No notes

Agent Issues

  • Where do we stand on the resubmission from duplication issues:
    • Looks like there are still some WF's in assignment-approved and acquired

Redeployment plan

  • Log collect jobs aren't working properly, pretty much all global pool, I found cases for all T1's while debugging other issues today
    • Who can we assign to look at why log collect jobs are failing?
    • Have we mapped Alan's DN, happened Tues, but all jobs still failing. Who do you need to camp with to get this working? Has error message changed? Alan. Nicklo and Ivan.
      • Julian will pull together what we can to get the right people working. Julian will follow up on the GGUS ticket
      • Production pool is working fine, issue only in global pool
  • Let's drain submit 1 & 2 so we can upgrade condor. Dave will touch base with Krista to see what needs to happen before we do the upgrade
  • we have a patch re stageout that needs to be deployed everywhere. It has already been deployed everywhere including RelVal. On Tues, there will be a downtime to remove the endpoint, so we need to make sure this is everywhere.
  • Who is in charge of fixing while Alan is gone? SeangChan will be the one patching if there is a problem.

Workflows

ReDigi

  • Phys14DR keeping some sites busy -- especially RAL and KIT still highest priority
  • miniaods in super tails, next highest priority
  • Preparing for upgrade digi-reco - when will this hit the fan? Dima can you give us an update?

  • WF's with Product not found error: Dima???? The list is growing and we aren't getting anything back to the requestors on them. They can't fix the issue if they don't know it exists!
    • Product not found... add this to the list of WF's to return. Note these have been on the list for 3 wks. Let's get these back!
    • alahiff_BTV-Phys14DR-00006_00018_v0__141129_234223_67
    • pdmvserv_BTV-Phys14DR-00026_00041_v0__141112_153935_4359
    • pdmvserv_HIG-Phys14DR-00006_00031_v0__141111_144303_536
    • pdmvserv_BTV-Phys14DR-00010_00018_v0__141110_011817_633
    • pdmvserv_B2G-Spring14miniaod-00087_00075_v0__141030_150442_6231
    • pdmvserv_MUO-Spring14miniaod-00019_00084_v0__141030_155616_5795
    • pdmvserv_EXO-Spring14miniaod-00215_00077_v0__141030_153350_2462
    • pdmvserv_SUS-Spring14miniaod-00058_00072_v0__141030_100336_4528
    • pdmvserv_EXO-Spring14miniaod-00192_00068_v0__141030_093231_949
    • pdmvserv_EXO-Spring14miniaod-00163_00067_v0__141030_093040_4447
    • jen_a_BTV-Phys14DR-00003_00041_v0__141126_052704_5345
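
To see how long each of these has been sitting, the submission date can be read off the workflow name. A rough sketch, assuming every request name ends in `_<YYMMDD>_<HHMMSS>_<number>` as the names listed above appear to; `submitted_at` and `age_in_days` are hypothetical helpers, not part of any existing tool.

```python
from datetime import datetime


def submitted_at(workflow_name):
    """Parse the submission timestamp from the end of a request name.

    Assumption: the name ends in _<YYMMDD>_<HHMMSS>_<number>.
    """
    parts = workflow_name.rsplit("_", 3)
    return datetime.strptime(parts[1] + parts[2], "%y%m%d%H%M%S")


def age_in_days(workflow_name, now):
    """Whole days between submission and the given time."""
    return (now - submitted_at(workflow_name)).days


name = "pdmvserv_BTV-Phys14DR-00026_00041_v0__141112_153935_4359"
meeting = datetime(2014, 12, 11, 16, 0)  # this meeting, Dec 11 4PM
print(name, age_in_days(name, meeting), "days")
```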

  • WF's not 100% despite no errors other than Log collect:
    • https://cms-logbook.cern.ch/elog/Workflow+processing/18058
    • alahiff_HIG-Phys14DR-00003_00032_v0__141129_233127_8117
    • alahiff_SUS-Phys14DR-00029_00032_v0__141129_233141_4544
    • alahiff_TSG-Phys14DR-00018_00032_v0__141129_233155_8179
    • alahiff_TSG-Phys14DR-00020_00032_v0__141129_233202_8153
    • alahiff_TSG-Phys14DR-00023_00032_v0__141129_233233_73
    • alahiff_TSG-Phys14DR-00021_00032_v0__141129_233210_2211

miniaod's

  • WF's with no errors but not at 100% - again this issue has been sitting since last week!
    • pdmvserv_EXO-Spring14miniaod-00163_00067_v0__141030_093040_4447 80%
    • pdmvserv_EXO-Spring14miniaod-00192_00068_v0__141030_093231_949 86%
    • pdmvserv_SUS-Spring14miniaod-00058_00072_v0__141030_100336_4528 89%
    • pdmvserv_EXO-Spring14miniaod-00215_00077_v0__141030_153350_2462 86%
    • pdmvserv_MUO-Spring14miniaod-00019_00084_v0__141030_155616_5795 100% error
    • https://cms-logbook.cern.ch/elog/Workflow+processing/17637
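
The News section asks whether 95% or 75% is good enough per sample. A small sketch of checking the percentages listed above against whichever cutoff is chosen; `below_threshold` is a hypothetical helper, and the dictionary simply copies the four values from these notes (the workflow at 100% error is left out).

```python
def below_threshold(completion, threshold):
    """Return the workflow names whose completion % is under threshold."""
    return sorted(name for name, pct in completion.items() if pct < threshold)


# Completion percentages copied from the list above.
completion = {
    "pdmvserv_EXO-Spring14miniaod-00163_00067_v0__141030_093040_4447": 80,
    "pdmvserv_EXO-Spring14miniaod-00192_00068_v0__141030_093231_949": 86,
    "pdmvserv_SUS-Spring14miniaod-00058_00072_v0__141030_100336_4528": 89,
    "pdmvserv_EXO-Spring14miniaod-00215_00077_v0__141030_153350_2462": 86,
}

for cutoff in (95, 75):
    print(cutoff, len(below_threshold(completion, cutoff)))
```

With a 95% cutoff all four would still count as incomplete; with 75% none would.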

Rereco

Store Results

MonteCarlo

SL6 testing/SL5 Decommissioning

  • Seems we are still having issues getting LogCollect files to work on SL6. Who can we assign to look into this?

RelVal Andrew

  • Made a pull request; needs to talk to Seangchan and Alan. They are holding the merge until validation is done and will then do the pull after Tues.
  • We need to configure WMAgent to inject them into PhEDEx instance

AOB

Topic revision: r4 - 2014-12-11 - JenniferAdelmanMcCarthy