https://indico.cern.ch/conferenceDisplay.py?confId=254673

Attendance

  • Jen, Luis, Seang Chan, Edgar, Julian, John, Andrew

Shifts

  • Sep 17 --> Sep 24 Sara
  • Sep 24 --> Oct 1 Sara
  • Is Julian shadowing?
    • Julian is not so sure about the meaning of "shadowing", but he is here.
  • Our shift schedule only goes until Oct 8th. Has anybody started the new shift schedule?
  • Edgar will be on vacation from Friday until Monday, 7 October.

Issues from last week

  • Thanks to everyone for the twikis and e-mail discussions on git!
    • Has anybody transferred to the twikis the additional information that Jacob needed to get things working? I haven't looked.
    • Once I realized that git was not in fact on any of the production machines, I was able to successfully get things checked out and running on lxplus.
    • Is git installed on 112 and 98? I haven't tried there yet either.
      • talk to Krista about getting git over there
  • condor_status is a much less expensive command
  • condor_overview.py is hanging on one of the workflows. The script does not currently work with late binding, which is why it is crashing:
      [vocms201] /afs/cern.ch/user/j/jen_a/WmAgentScripts > ./condor_overview.py
      Traceback (most recent call last):
        File "./condor_overview.py", line 89, in ?
          EnteredCurrentStatus=int(array[4])
      ValueError: invalid literal for int(): /data/srv/wmagent/v0.9.69a/install/wmagent/JobCreator/JobCache/pdmvserv_TOP-Summer11Leg-00013_00008_v0__130916_040247_1427/MonteCarloFromGEN/JobCollection_176487_15/job_2963898/condor.181857.38.log
    • Is this script supposed to be reporting accurately now with late binding? The answer is no. Edgar is going to have Julian look at it, and they will try to get it working again.
  • 201, 216 and 235
    • Couch down on 201 & 216 over the weekend, but monitoring showed they were OK. The fact that the couch was down was masked by the AnalyticsDataCollector working on a long query
    • AnalyticsDataCollector showing as down for long periods all week
      • This issue is a side effect of a long Couch query that currently takes 5-6 hrs to complete. If you see the collector listed as down, PLEASE DO NOT JUST RESTART IT! Instead, go to the machine, run the status command, and make sure couch is up and happy. Tail the logs and check whether there are actual errors. If you restart the collector in the middle of this long query, it will have to start the query all over again, resetting the elapsed time to 0. If you are truly unsure what to do, just e-log it and SeangChan will look at it as soon as he can. He and Luis are looking at solutions to this problem, but in the meantime we need to go back to running the status command and tailing logs on these machines to determine whether there is really a problem.
    • Higher priority workflows are getting stuck behind lower priority workflows in the agent. Can somebody explain what is going on? Here is an example:
      • When I was debugging stuck workflows I noticed that some workflows that actually have a higher priority are sitting with jobs queued and nothing pending/running, and have been that way for days
        • pdmvserv_L1T-UpgFall13-00005_00001_v0__130913_114848_5196 running-closed MonteCarlo 110000
        • pdmvserv_L1T-UpgFall13-00002_00001_v0__130913_114839_4729 running-closed MonteCarlo 110000
      • These two workflows have "lots of jobs" running despite having a lower priority of 63000
        • pdmvserv_SUS-Summer11Leg-00004_00009_v0__130920_120325_163
        • pdmvserv_TOP-Summer11Leg-00013_00008_v0__130916_040247_1427
  • There was some discussion last week in e-log about the fact that numbers between dashboard, condor and WmStats are not anywhere near the same and there isn't agreement on which numbers to believe. We need to come to some resolution here.
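
The ValueError in the condor_overview.py traceback above comes from calling int() on a field that holds a log path rather than a timestamp. Below is a minimal sketch of a more tolerant parse. It assumes the script splits each row of condor output into whitespace-separated fields and reads EnteredCurrentStatus from index 4. The helper names and the row layout are hypothetical, and the sketch only keeps the script from crashing; it does not make the reported numbers correct under late binding.

```python
def parse_entered_status(fields, index=4):
    """Return the field at `index` as an int, or None if it is missing
    or not an integer (for example, a log path sitting in that column)."""
    try:
        return int(fields[index])
    except (IndexError, ValueError):
        return None


def split_rows(lines, index=4):
    """Separate rows that parse cleanly from rows that do not.

    Returns (good, skipped): `good` holds (fields, timestamp) tuples,
    `skipped` holds the raw lines that could not be parsed, so they can
    be logged and inspected instead of aborting the whole report.
    """
    good, skipped = [], []
    for line in lines:
        fields = line.split()
        timestamp = parse_entered_status(fields, index)
        if timestamp is None:
            skipped.append(line)
        else:
            good.append((fields, timestamp))
    return good, skipped
```

Collecting the skipped rows rather than silently dropping them makes it easier to see which jobs produce the unexpected field while Edgar and Julian look into the script.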

Site problems

Sites for Production

Site in MC        Slots  Status  Notes               Issues
T2_FR_GRIF_IRFU   235    skip    commissioned        all checks passed
T2_RU_PNPI        176    skip    to be commissioned  presently under maintenance

Drain list

T2_BR_UERJ
T2_IN_TIFR
T2_RU_INR
T2_RU_JINR
T2_RU_RRC_KI
T2_RU_SINP
T2_TR_METU
T2_UK_SGrid_Bristol

Other issues

Action Items

  • Write twiki disk/tape separation T1_IT_CNAF. Edgar
  • Recovery workflows - Jen - suspend
    • The first 2 workflows are completely through, and now we are waiting for people to really look and make sure that there are no show stoppers before we do the other 50.
    • Guillemo is bothering JeanRoc about whether people have actually looked at the data
  • A new state for completed and already dealt with ACDC. Issue is created. - closed
  • Solve the problem of how to use a non-production scram architecture (waiting for Alan to come back) - closed
  • Updating documentation on scripts with github now that we aren't using svn anymore
    • Documentation needs to be updated and everyone needs to start ramping up on github - done

AOB

  • Need to add to WMAgent: get a core dump if a job fails. Andrew will create an action item for this.
  • Do we want IFCA in drain? No, we will wait until the glidein guys fix it.
  • Andrew's ACDC doesn't have an input step specified; not sure what is going on. Jobs are moving to failed. Seang Chan will follow up on e-log.
Topic revision: r5 - 2013-09-24 - JenniferAdelmanMcCarthy