Indico link: https://indico.cern.ch/conferenceDisplay.py?confId=254661

Attending

Jen, John, Edgar, Diego, Xavier, Sara, Sunil

Personnel

  • Coming off shift - Xavier + Sunil
  • Going on shift - Sara
  • Jen will be late on Tues
  • US Holiday on Thursday. Expect very limited coverage.

Site Issues

Sites for MC production

  • One Russian site not available
  • we can move all Disk entries to down/skip from the table
    • John is still trying to understand why we should not put something in skip instead of down
    • Edgar will change the script so that sites listed as skip are seen as down in SiteDB
  • T2_GR_Ioannina & T2_AT_Vienna - need to be commissioned
    • Edgar doesn't have a lot of time to help with the commissioning
    • do we have instructions listed so that John can do this? John has a basic list and will try and talk to Edgar when stuck
    • let's get this on a twiki linked off the Workflow Team Main twiki
  • is it possible to leave notes in the text file explaining why things are in a given status, or will that mess something up?
    • maybe have a file in dashboard with an extra column for notes
    • keeping notes in Edgar's head is not a good solution, as the rest of us cannot read his mind ;-)
    • let's start a thread on the monitoring list on this

Site in SSB        Not MC   WR
T1_CH_CERN         X        -
T1_RU_JINR         X        -
T1_RU_JINR_Disk    X        -
T1_US_FNAL_Disk    X        -
T2_GR_Ioannina     X        -
T2_MY_UPM_BIRUNI   X        X
T2_PK_NCP          X        X
T2_TR_METU         X        X

Site in MC        WN     Status   Notes
T0_CH_CERN        2000   down     leave down
T1_UK_RAL_Disk    2000   down     remove from twiki; it should only exist for PhEDEx
T1_TW_ASGC        1500   drain    working in opportunistic mode
T2_BR_UERJ        200    drain    network problems
T2_FI_HIP         1500   drain    take out, see how it goes
T2_IN_TIFR        200    drain    keep in drain as long as possible - everything is an issue
T2_IT_Bari        2000   drain    take out
T2_RU_IHEP        1000   drain    in waiting room
T2_RU_INR         100    drain    network problem
T2_RU_JINR        1500   drain    network problem
T2_RU_RRC_KI      0      drain    network problem
T2_RU_SINP        100    drain    network problem
T2_TR_METU        200    drain    in waiting room
T2_AT_Vienna      400    drain    need to start commissioning - John
T2_FR_GRIF_IRFU   0      skip     shares with LNR nodes, so as long as we are using that we don't need this
T2_KR_KNU         300    skip     needs re-commissioning
T2_RU_PNPI        10     skip     in waiting room
T2_UA_KIPT        500    skip     too small to make it work
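
The two ideas discussed above for the site list (an extra column for notes, and treating skip as down) could be sketched roughly as follows. This is only an illustration, assuming a whitespace-separated text file with the columns shown in the table (site, WN, status, free-text notes); the names SiteEntry, parse_site_line and effective_status are hypothetical and are not taken from Edgar's script.

```python
from dataclasses import dataclass


@dataclass
class SiteEntry:
    site: str
    wn: int
    status: str
    notes: str = ""


def parse_site_line(line: str) -> SiteEntry:
    """Parse one 'site WN status [notes...]' line (assumed file layout)."""
    # Split into at most four fields; everything after the status is notes.
    parts = line.split(None, 3)
    if len(parts) < 3:
        raise ValueError(f"expected site, WN and status in: {line!r}")
    site, wn, status = parts[:3]
    notes = parts[3].strip() if len(parts) > 3 else ""
    return SiteEntry(site, int(wn), status, notes)


def effective_status(entry: SiteEntry) -> str:
    """Assumption: a site listed as 'skip' is reported downstream as 'down'."""
    return "down" if entry.status == "skip" else entry.status
```

Keeping the notes as the last field means existing lines without notes still parse, so the file format would not have to change for sites that need no comment.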

Agents

  • 202 - back in production
  • 234 & 112 set to drain for upgrades
  • still draining
  • 235 still draining - it has some stuck requests; we need to run the stuck requests script on it and get it working
  • what happens in binding for 2nd level tasks?

Workflows

  • IN2P3 workflows - where are we in re-doing these?
    • Done!
  • Step0 requests that don't have Step1's. Do we know what happened? how can we detect this sooner?
    • need to talk to Vincenzo
  • Low efficiency of jobs at FNAL. There was conflicting information from the agent and the dashboard on what the issues were: one said things were going OK, the other was complaining
    • how do we decide "who to believe"?
    • is everything documented on how to find this issue and debug it down to getting it reported properly?
      • Burt didn't know there was a problem until the meeting this morning; the issues Edgar was seeing are not showing up in standard monitoring
    • we need to find a good place to look for efficiency and add it to our monitoring; right now it is not there
    • Edgar will send the links he used to find the problem to Xavier and Xavier will implement it in the monitoring.
  • Oli wants better accountability of all the drops in numbers of jobs, aborts/failures
    • in general we are good about reporting when workflows are aborted, but it would be good to summarize things into a list of workflows having issues and what is causing them.
    • we have fallen out of the habit of summarizing things in shift reports, making it difficult to track what is going on when we have multiple issues hitting and different people dealing with parts of them. Let's try to get back into a summary so there is less question about who is supposed to be doing what.
    • see Action items
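
As a starting point for the efficiency monitoring discussed above, a per-site check could look something like the sketch below. It assumes job records are available as dictionaries with site, cpu_seconds and wall_seconds keys, and takes efficiency to be total CPU time over total wall-clock time; the record layout, the function names and the 0.5 threshold are all hypothetical and do not describe the links Edgar used.

```python
from collections import defaultdict


def site_efficiency(jobs):
    """Return {site: total CPU time / total wall time} for sites with wall time > 0."""
    cpu = defaultdict(float)
    wall = defaultdict(float)
    for job in jobs:
        cpu[job["site"]] += job["cpu_seconds"]
        wall[job["site"]] += job["wall_seconds"]
    return {site: cpu[site] / wall[site] for site in wall if wall[site] > 0}


def low_efficiency_sites(jobs, threshold=0.5):
    """List sites whose aggregate efficiency is below the (assumed) threshold."""
    efficiency = site_efficiency(jobs)
    return sorted(site for site, value in efficiency.items() if value < threshold)
```

Aggregating per site before dividing avoids letting many very short jobs dominate the result, which a plain average of per-job ratios would do.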

IEEE Paper

Draft Outline #1

  • Introduction (Why we need to run so many simulations, why we need to do a re-reconstruction of the data) (Edgar/Jen)
  • How we ran prior to 2011
    • ProdAgent vs WMAgent (Diego/Alan) (Focus on differences and improvements)
    • Reprocessing and Production (Jen/Xavier) (How this was handled with ProdAgent and why the need to move to another framework)
  • How we ran with WMAgent (after 2011)
    • WMAgent/ReqMgr/WorkQueue (Diego/Edgar/Alan) General comment on how it works
    • PREP/ReqMgr interaction (Vincenzo?)
    • Organization of the workflow team and operations around it (Edgar)
  • Achievements
    • Events reconstructed (L3s)
    • Usage of the grid (Edgar/Jen/L3s)
  • Conclusions / Outlook (Edgar/Jen)

Action Items

  • Recovery workflows - Jen - ongoing
    • Jen and Diego will talk about how we can make this go faster. Diego has time to look at this this week!
  • updating missing lumis - Jen - done
  • Making new shift schedule - Xavier - done
  • GitHub issue for the required OS parameter (REQUIRED_OS = "rhel5", "rhel6", "any") - done
  • JOHN: Discuss sites in drain and skip - also include new sites? - commission Warsaw (Edgar)- Done
  • we need to add a daily report on Workflow stats
    • How many workflows running, pending, waiting, stuck -
      • Jen - come up with a template report
      • Edgar - please comment on workflow statuses. I feel like we are not always communicating which workflows are in a waiting status and for what issues
      • Diego - how hard would it be to have a "manual switch" that we can set on workflows for "waiting"? If there is a group of workflows where we are waiting to hear back from a site or the requesters before closing out, we could put them in waiting, so that things in "complete" really are ready to be closed or need to be looked at.
  • Diego - Can we have the script you wrote for finding stuck workflows?
    • Diego will put it in a public place so we can add it to svn
    • Is it documented yet?
      • need to pull documentation out of the e-log and put it on the twiki
  • Problems with dbsTest.py https://cmslogbook.cern.ch/elog/Workflow+processing/8656
    • Edgar have you looked at this yet?
    • not solved yet; Edgar will look at it. The script is made to look at DAS run by run and is slow. Maybe we need to think about splitting the script
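
For the daily workflow stats report in the action items above, a first template could be as simple as the sketch below. It assumes the input is a list of (workflow name, status) pairs and uses the four statuses named in the action item; where the pairs come from, the function name daily_report and the "other" bucket are assumptions made for illustration.

```python
from collections import Counter

# Statuses named in the action item; anything else is counted as "other".
REPORT_STATUSES = ("running", "pending", "waiting", "stuck")


def daily_report(workflows):
    """Build a plain-text summary from (workflow_name, status) pairs."""
    counts = Counter(status for _name, status in workflows)
    lines = ["Daily workflow report"]
    for status in REPORT_STATUSES:
        lines.append(f"  {status}: {counts.get(status, 0)}")
    other = sum(n for status, n in counts.items() if status not in REPORT_STATUSES)
    lines.append(f"  other: {other}")
    return "\n".join(lines)
```

Printing zero counts explicitly keeps the report the same shape every day, which makes it easier to spot changes when comparing shift reports.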

AOB

  • None!
Topic revision: r3 - 2013-07-02 - JenniferAdelmanMcCarthy
 