Week of 111017

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Massimo, Yuri, John, Jamie, Maria G, Maarten, Lukasz, Andrea, Dirk, va, Jakob, Oliver, Maria D);remote(Joel, Tore, Onno, Jhen Wei, Giovanni, Rob, Rolf, Tiju, Dmitri, Lisa).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • data taking, T0 reprocessing and HLT reconstruction in progress.
    • T1 sites
      • IN2P3-CC:
        • LFC outage of ~2h on Friday afternoon due to an Oracle DB problem; the FR cloud was temporarily set offline in production, Savannah:124066. Recovered at ~5pm CET.
        • Transfer failures on Sunday: "dftp_copy_wait: Conn. timeout and FIRST_MARKER_TIMEOUT". GGUS:75352 in progress; the SAM test for SRM also failed. SRM overload, downtime (dCache) till 3pm. Recovered. Still many SRM-related stage-out job failures on Monday morning.
      • PIC: many file transfer failures on Saturday: GGUS:75341, verified. PNFS overload: LHCb was heavily requesting tape files, and the accumulating srm-bring-online requests affected the PNFS server (GGUS:75344). No more failures on Sunday.
      • NIKHEF transfer failures: failed to contact the remote SRM on Saturday. GGUS:75343 and GGUS:75345 verified. Job failures: missing BDII entry. Fixed.
      • Taiwan-LCG2 production job failures on Sunday: missing installation. GGUS:75354 solved after the experts adjusted the field in the central DB.
      • INFN-CNAF missing files on SCRATCHDISK causing transfer failures on Sunday. GGUS:75356.
      • RAL-LCG2 transfer failures: "DESTINATION error: SRM_FAILURE or SRM_ABORTED". GGUS:75361 in progress.
    • T2 sites
      • several GGUS tickets submitted, some of them resolved.

  • CMS reports -
    • LHC / CMS detector
      • data taking ongoing
    • CERN / central services
      • CMSR Oracle instance spontaneous reboot problem, GGUS:74993, kept open to follow up with increased logging information
    • T0 and CAF:
      • cmst0 : very busy processing data !
    • T1 sites:
      • [T1_TW_ASGC]: file access problems, MinBias MC files used for Pileup mixing not accessible by processing jobs, problem solved, problematic disk servers have been put offline GGUS:75258
      • [T1_TW_ASGC]: general file access problem, suspicion is some staged files are corrupt when coming from tape, GGUS:75377
      • [T1_IT_CNAF]: JobRobot problems (part of our Nagios checks), GGUS:75351
      • [T1_FR_CCIN2P3]: overload in dCache affecting transfers from/to IN2P3, nothing to be done right now?

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping at CERN
      • Reprocessing at T1 sites and T2 sites
    • T0
    • T1 sites:
      • CERN : (GGUS:75374) during the last week we never got, on average, our LSF share at CERN: why?
      • PIC : (GGUS:75344) : problem with SRM
      • RAL : problem with one disk server.
    • T2 sites:
      • IN2P3 : reconfiguration of CE is done

Sites / Services round table:

  • ASGC: Investigating the corrupted files. Restaging (reloading from tape) fixes the problem.
  • BNL:
  • CNAF: Problems with one CE (out of several). Nevertheless the site is operational (confirmed by CMS)
  • FNAL: On Saturday the site came back online at 6:30.
  • IN2P3: nta
  • KIT: Downtime tomorrow, 9-12 UTC (Oracle upgrade affecting, among other services, FTS)
  • NDGF: ntr
  • NLT1: Slow access to tape. Vendor call open.
  • PIC: ntr
  • RAL: Downtime cancelled
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Dashboards: ntr
  • Databases: ntr

AOB: (MariaDZ)

  1. Could ALICE support please re-assign GGUS:74373 to themselves as per internal diary note of 2011/10/13, if they agree.
  2. There was a complaint in email by LHCb (Joel) about wrong assignments by the SNOW 2nd Line Support.
  • GGUS:75374 was too laconic for general-purpose supporters to work out the right SNOW Assignment Group. The SNOW Assignment Group is visible in the GGUS diary, so one can comment (in GGUS) if it is wrong. (Recommendation for GGUS ticket submitters.)
  • GGUS:75373 had the right keyword 'LFC' in its subject but was wrongly assigned due to the occurrence of 'VOBOX' in the detailed description. We asked the PES experts for help with the internal re-assignment, as nobody else has the privileges to do it. In this sense, the GGUS free-access-for-all-supporters model is more flexible. (Recommendation for SNOW deployers.) A minimal illustration of this keyword-routing pitfall follows below.
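
For illustration only, the sketch below shows in Python how keyword-based auto-assignment can go wrong when the ticket subject and the detailed description are scanned with equal weight, and how giving subject keywords priority avoids it. The keyword-to-group mapping, group names and ticket text are assumptions made up for this example; they are not the actual SNOW routing rules.

<verbatim>
# Minimal sketch of the keyword-routing pitfall described above.
# The keyword-to-group mapping and the ticket text are illustrative
# assumptions only; they are NOT the actual SNOW routing rules.
KEYWORD_TO_GROUP = {
    "LFC": "Grid Catalogue Support",   # hypothetical assignment groups
    "VOBOX": "VOBox Support",
}

def route_naive(subject, description):
    """Scan subject and description with equal weight; when several keywords
    match, the last one checked wins, so 'VOBOX' in the body can override
    'LFC' in the subject."""
    text = (subject + " " + description).upper()
    group = "General 2nd Line Support"
    for keyword, candidate in KEYWORD_TO_GROUP.items():
        if keyword in text:
            group = candidate  # later matches overwrite earlier ones
    return group

def route_subject_first(subject, description):
    """Give keywords found in the subject priority over the description."""
    for field in (subject, description):
        for keyword, candidate in KEYWORD_TO_GROUP.items():
            if keyword in field.upper():
                return candidate
    return "General 2nd Line Support"

if __name__ == "__main__":
    subj = "LFC registration failures for LHCb"
    desc = "Jobs started from the VOBOX cannot register replicas in the catalogue."
    print(route_naive(subj, desc))          # -> VOBox Support (mis-assigned)
    print(route_subject_first(subj, desc))  # -> Grid Catalogue Support
</verbatim>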

Tuesday:

Attendance: local(Elena, Massimo, Luca, Ulrich, John, Mattia, Maarten, Maria DZ, Alessandro, Jakob);remote(Joel, Tiju, Dmitry, Xavier, Tore, Gonzalo, Rolf, David, Rob, Jhen Wei, Brian).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • data taking, T0 reprocessing and HLT reconstruction in progress.
    • T1 sites
      • NIKHEF: Timeouts for transfers to T2s in other clouds, caused by a network cut by the Internet provider, GGUS:75388. NIKHEF declared a downtime. The problem is fixed.
      • PIC - SARA transfer problem GGUS:75413.
    • T2 sites
      • ntr

  • CMS reports -
    • LHC / CMS detector
      • data taking ongoing
    • CERN / central services
      • CMSR Oracle instance spontaneous reboot problem, GGUS:74993, kept open to follow up with increased logging information
    • T0 and CAF:
      • cmst0 : very busy processing data !
    • T1 sites:
      • [T1_TW_ASGC]: general file access problem, suspicion is some staged files are corrupt when coming from tape, GGUS:75377
      • [T1_IT_CNAF]: JobRobot problems (part of our Nagios checks), GGUS:75351, ongoing (site not blacklisted, just marked not ready; production jobs are running fine)
      • [T1_FR_CCIN2P3]: overload in dCache affecting transfers from/to IN2P3, nothing to be done right now? GGUS:75397

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping at CERN
      • Reprocessing at T1 sites and T2 sites
    • T0
    • T1 sites:
      • CERN : (GGUS:75374) during the last week we never got, on average, our LSF share at CERN: why?
      • PIC : (GGUS:75344) : problem with SRM
      • RAL : problem with one disk server.
      • IN2P3: (GGUS:75382) reallocation of free space from LHCb_MC_DST and LHCb_MC_M-DST
      • PIC: (GGUS:75383) reallocation of free space from LHCb_MC_DST and LHCb_MC_M-DST
      • SARA: (GGUS:75384) reallocation of free space from LHCb_MC_DST and LHCb_MC_M-DST
      • Gridka: (GGUS:74915) reallocation of free space from LHCb_MC_DST and LHCb_MC_M-DST
    • T2 sites:
      • IN2P3 : reconfiguration of CE is done.

Sites / Services round table:

  • ASGC: ntr
  • BNL:
  • CNAF:
    • CE problem still there (JobRobot uses the faulty one; production correctly uses the remaining 2, which are OK)
    • tomorrow downtime (ATLAS StoRM end-point) from 14:00 to 16:00 UTC
    • Thursday downtime (tape library) from 6:00 to 10:00 UTC
  • FNAL:
  • IN2P3:
  • KIT:
    • GPFS failure early this morning. Now solved
    • Yesterday DB upgrade OK
    • Firewall changes on October 24 (5:00 to 6:00 UTC): during that period GGUS could be unreachable
  • NDGF: ntr
  • NLT1: Big network problem (with a network provider).
  • PIC: ntr
  • RAL:
    • Unclear problem with transfers involving EOS (number of streams = 1)? They will open a ticket.
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Dashboards: ntr
  • Databases: ATLAS offline DB crash (high load) around 4 AM UTC. The service was re-established quickly. Under investigation.
  • Grid services:
    • LFC problem solved. Measures to prevent this situation from happening again are now in place
    • Last LCG CEs at CERN will be retired during the Christmas break

AOB:

Wednesday

Attendance: local(Elena, Massimo, Luca, Jamie, John, Edward, Jakob, Maria DZ);remote(Gonzalo, Lisa, Jhen Wei, Gareth, Rolf, Ron, Giovanni, Tiju, Rob, Pael).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • data taking, T0 reprocessing and HLT reconstruction in progress.
    • T1 sites
      • TAIWAN: Production jobs were failing at several WNs with "Exception caught by pilot" errors, GGUS:75436. The WNs were set offline. Thanks
      • PIC - SARA transfer problem GGUS:75413. The SARA team is investigating. Thanks. Transfers are very slow (300 Mbps). This is very likely an LHCOPN issue; CERN is involved. Timeouts are also seen for transfers PIC->GRIF-LAL, GGUS:75429.
    • T2 sites
      • ntr

  • CMS reports -
    • LHC / CMS detector
      • data taking ongoing (90 m tests)
    • CERN / central services
      • CMSR Oracle instance spontaneous reboot problem, GGUS:74993, kept open to follow up with increased logging information
    • T0 and CAF:
      • cmst0 : very busy processing data !
    • T1 sites:
      • [T1_TW_ASGC]: general file access problem, suspicion is some staged files are corrupt when coming from tape, GGUS:75377, possible fix being tested
      • [T1_IT_CNAF]: JobRobot problems (part of our Nagios checks), GGUS:75351, ongoing (site not blacklisted, just marked not ready; production jobs are running fine)
      • [T1_FR_CCIN2P3]: overload in dCache affecting transfers from/to IN2P3, nothing to be done right now? GGUS:75397

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping at CERN
      • Reprocessing at T1 sites and T2 sites
    • T0
    • T1 sites:
      • CERN : (GGUS:75374) during the last week we never got, on average, our LSF share at CERN: why?
      • PIC : (GGUS:75462) srm / FTS error
    • T2 sites:

Sites / Services round table:

  • ASGC: CMS job problem under investigation
  • BNL:
  • CNAF: The JobRobot problem is not understood (but quite benign). In any case the service will be moved to a more powerful node to compensate for some high CPU load.
  • FNAL: NTR
  • IN2P3: NTR
  • KIT: NTR
  • NDGF:
  • NLT1: dCache problem now solved
  • PIC: NTR
  • RAL: DB problem affecting FTS and LFC. Failover was OK for the LFC, but human intervention was needed to restart FTS properly
  • OSG: One BDII (out of 2) had a problem. This should have been invisible to users.

  • CASTOR/EOS: EOS interventions: tomorrow (10:00-11:00 CET) on EOSCMS (transparent); Next Monday (24-OCT 10:00-11:00 CET) EOSATLAS (15' downtime)
  • Dashboards: NTR
  • Databases: NTR
  • Network: the problem PIC/SARA (since it is routed via CERN) was also debugged at CERN. Now (14:00 UTC) seems to be gone.

AOB: (MariaDZ)

  • Please comment in Savannah:123393 on the request to make GGUS tickets publicly available for searches.
  • Today's GGUS release was problematic for the middleware Support Units as the old (but still open) GGUS tickets were created in SNOW without their diaries' contents.
  • We have received no comments from WLCG in Savannah:123890 with arguments in favour of or against making the ToP (Type of Problem) field mandatory.
  • We have received no comments from WLCG shifters in Savannah:120505 about the request for GGUS to generate an email notification to the GGUS ticket submitter. These last 2 points were requested 2 weeks ago (see https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek111003#Thursday AOB).

Thursday

Attendance: local(Elena, Steve, Massimo, Ulrich, Mattia, Maarten, Jakob, Maria DZ, Luca, John, Gavin);remote(Lisa, Tiju, Alessandro, Joel, Ulf, Jhen Wei, Ronald, Gareth, Rolf, Gonzalo, Rob, Andreas, Giovanni).

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • HLT reconstruction in progress.
    • T1 site
      • TAIWAN: Problem with one disk server. Fixed quickly. Thanks
      • INFN-T1: one stuck FTS job is blocking new transfers. GGUS:75524.
      • CERN: CERN FTS monitor doesn't show active transfers for ATLAS (only in READY state) GGUS:75526.
      • Transfers PIC->GRIF-LAL GGUS:75429. CERN was involved; there is no problem on the CERN side. PIC sees the problem on the FR side. The ticket has been re-assigned to GRIF.
    • T2 sites
      • ntr

  • CMS reports -
    • LHC / CMS detector
      • Totem / Alpha run ongoing
    • CERN / central services
      • CMSR Oracle instance spontaneous reboot problem, GGUS:74993, kept open to follow up with increased logging information, still waiting for a solution
    • T0 and CAF:
      • cmst0 : very busy processing data !
      • Tier-0 head node vocms15.cern.ch was not reachable this morning, GGUS:75510, fixed
    • T1 sites:

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping at CERN
      • Reprocessing at T1 sites and T2 sites
    • T0
      • LFC issue: There were a bit more than 12 million requests sent to lfclhcbro, spread almost equally over the 2 servers (lfclhcbro01 got more bursts). On lfclhcbro02, no more than 50 of the 80 configured threads were in use at any time. On lfclhcbro01, a few bursts took 70 threads (out of 80). So we were never out of threads, and the connect time should therefore have been almost instantaneous. There were no special errors apart from 2 sets of errors: 1) 23 sessions were not closed by the client and timed out after 60 seconds (requests spread over the 2 servers and over the day, no burst); 2) around 18:30 there were disconnections between lfclhcbro02 and the DB servers. Maybe the DB people could comment. (A small sketch of the thread-pool reasoning follows after this report.)
    • T1 sites:
    • T2 sites:
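
For illustration, the sketch below reproduces in Python the thread-pool reasoning of the LFC report above: if the peak number of concurrent sessions stays below the configured pool of 80 threads per node, connections never have to wait for a thread, so connect times should stay near-instantaneous. The session data and the helper function are illustrative assumptions, not real LFC logs or tools.

<verbatim>
# Minimal sketch of the thread-pool reasoning in the LFC report above:
# if the peak number of concurrent sessions stays below the configured
# thread pool size, incoming connections never have to wait for a thread.
# The session data below are illustrative, not real LFC log entries.

THREADS_CONFIGURED = 80  # per front-end node, as quoted in the report

def peak_concurrency(sessions):
    """sessions: iterable of (start, end) times in seconds.
    Return the maximum number of sessions open at the same instant,
    using a sweep over start/end events."""
    events = []
    for start, end in sessions:
        events.append((start, +1))
        events.append((end, -1))
    peak = current = 0
    for _, delta in sorted(events):
        current += delta
        peak = max(peak, current)
    return peak

if __name__ == "__main__":
    # Hypothetical burst: 70 overlapping sessions, then a quieter period.
    sessions = [(0.0, 5.0)] * 70 + [(10.0, 10.5)] * 20
    peak = peak_concurrency(sessions)
    print("peak concurrency: %d of %d configured threads" % (peak, THREADS_CONFIGURED))
    if peak < THREADS_CONFIGURED:
        print("never out of threads -> connect times should stay near-instantaneous")
</verbatim>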

Sites / Services round table:

  • ASGC: The CMS problem is being investigated
  • CNAF: The CMS JobRobot problem is solved. ATLAS FTS problem: probably an overload effect, worked around by increasing the number of slots; this is a temporary measure because ATLAS would like to go back to the standard settings and fix the root cause of the problem
  • FNAL: NTR
  • IN2P3: A 3-hour dCache outage is scheduled for next Tuesday morning because of a hardware upgrade of the main server. No jobs using dCache will be able to run during the intervention.

  • KIT: LHCb ticket in hand
  • NDGF: The Finnish site is going into downtime now due to OPN maintenance (2h). On Saturday the Slovenian resources will be unavailable for a few hours (power intervention)
  • NLT1: 2 disk servers restarted (scratch space for ATLAS)
  • PIC: NTR
  • RAL: NTR
  • OSG: NTR

  • CASTOR/EOS: Today at 12:00 UTC EOSATLAS emergency intervention to fix an xrd3cp problem (transparent)
  • Dashboards: NTR
  • Grid services:
    • A VOMS problem (a new CA not automatically accepted), causing problems for UK and Brazil users, was identified. Workaround: add the CAs by hand. Developers contacted.
    • LSF workaround in place (LHCb quota problems). Discussing a proper solution with Platform
    • LFC problems: please submit a ticket

AOB: (MariaDZ) All 17 test ALARM tickets seem to have worked well following yesterday's release. Comments, if any, in Savannah:123788 please.

Friday

Attendance: local(Elena, Joel, Jamie, Jacek, Ignacio, Mattia, Uli, Jacob, Maarten);remote(Elisabeth, Ulf, Lisa, Rolf, Gonzalo, Sunrise, Onno, Giovanni, Shu-Ting, Xavier, + 2 others (anon, sunrise), Gareth ).

Experiments round table:

  • ATLAS reports -
  • Physics
    • data taking (alpha run)
    • HLT reconstruction in progress.
  • T0/Central services
    • ntr
  • T1 site
    • INFN-T1 -> PIC: reduced efficiency for transfers between the two sites, GGUS:75550. CERN moved the CERN-CNAF link onto the same CERN router as the CERN-PIC link. In progress. [ Gonzalo - AFAIK the problem is solved, based on the info received; will have to check on this. ] Checked, and the problem is solved: CNAF-PIC transfers have been OK since Edoardo Martelli confirmed that the problem was fixed at CERN
    • INFN-T1: one stuck FTS job is blocking new transfers. GGUS:75524. Two problematic channels with long queues. In progress. [ Giovanni - the job shown in the ticket finished "dirty". Ale - please clarify. Giovanni - it finished dirty. Ale - 3 jobs were in the Active state for several days in 3 channels. The problem ATLAS observed is that, since these jobs were stuck, ATLAS was not able to submit any more transfers on those channels (see the slot-accounting sketch after this report). ]
    • CERN: CERN FTS monitor doesn't show active transfers for ATLAS (only in READY state) GGUS:75526. Solved. Thanks
    • Transfers PIC->GRIF-LAL GGUS:75429. In progress.
  • T2 sites
    • ntr
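
For illustration, the sketch below models in Python the per-channel slot accounting behind the stuck-transfer problem above: jobs stuck in the Active state keep holding their slots, so the channel stops accepting new transfers until the slots are freed or the slot limit is raised (the temporary workaround applied at CNAF). The channel name, slot count and job names are assumptions for this example, not real FTS configuration.

<verbatim>
# Minimal sketch of per-channel slot accounting: a transfer stuck in the
# Active state keeps holding its slot, so once all slots of a channel are
# occupied by stuck jobs no new transfer can start until the slots are
# freed or the limit is raised. All names and numbers are illustrative.

class Channel:
    def __init__(self, name, slots):
        self.name = name
        self.slots = slots   # maximum number of concurrently Active transfers
        self.active = []     # transfers currently holding a slot

    def submit(self, transfer):
        """Start the transfer if a slot is free; otherwise it stays queued."""
        if len(self.active) < self.slots:
            self.active.append(transfer)
            return True
        return False

    def finish(self, transfer):
        """A completed transfer releases its slot."""
        self.active.remove(transfer)

if __name__ == "__main__":
    channel = Channel("CNAF-EXAMPLE", slots=3)   # hypothetical channel
    for stuck_job in ("job-1", "job-2", "job-3"):
        channel.submit(stuck_job)                # stuck jobs never call finish()
    print(channel.submit("new-transfer"))        # False: all slots held by stuck jobs
    channel.slots += 2                           # temporary workaround: raise the limit
    print(channel.submit("new-transfer"))        # True, but the stuck jobs still hold slots
</verbatim>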


  • CMS reports -
  • LHC / CMS detector
    • Data taking ongoing
  • CERN / central services
    • CMSR Oracle instance spontaneous reboot problem, GGUS:74993, kept open to follow up with increased logging information, still waiting for a solution
  • T0 and CAF:
    • cmst0 : very busy processing data !
  • T1 sites:
    • [T1_TW_ASGC]: general file access problem, suspicion is some staged files are corrupt when coming from tape, GGUS:75377, possible fix seems to have failed
    • [T1_FR_CCIN2P3]: overload in dCache affecting transfers from/to IN2P3. GGUS:75397. Ongoing (but not as severe as it was). [ IN2P3 - request to CMS: 2 tickets are waiting for a reply from the CMS side, GGUS:75391 and GGUS:75397. ]
    • [T1_DE_KIT]: Job Robot failing, https://savannah.cern.ch/support/index.php?124147 [ Xavier - please use GGUS, as no one from us is looking at Savannah. If the problem still exists, open a new GGUS ticket referring to the Savannah one. ]


  • LHCb reports -
  • Experiment activities
    • Reconstruction and stripping at CERN
    • Reprocessing at T1 sites and T2 sites
  • T0
    • LFC issue: (GGUS:75533)
    • CERN : after discussion we are now running more than 3500 jobs. Thanks to the LSF people for the new tuning.
  • T1 sites:
    • GRIDKA : (GGUS:74584) : jobs look stalled !!!
    • SARA : (GGUS:75573) : dCache problem, fixed this morning and closed. [ Onno - we had to restart 3 pool nodes that had trouble restoring files from tape. After restarting, dCache on those nodes was OK. It looks like a communication problem between the head node and these nodes. If it happens again we will investigate further. ]
  • T2 sites:


Sites / Services round table:

  • ASGC: ntr
  • BNL
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: nta
  • KIT: Early this morning a dCache instance was offline for 2h as /var was full and the service had to be restarted. LHCb was affected. Joel - could this explain why jobs were stalled this morning? - Yes.
  • NDGF: ntr
  • NLT1: nta
  • PIC: ntr
  • RAL: Had a problem for 1h yesterday afternoon with the CMS CASTOR instance: the job manager stopped working and 1-2 SAM tests failed
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Dashboards: ntr
  • Databases: Next week we will patch all INT DBs with the latest security patches from Oracle, to prepare for patching production during the next TS. Problems with devdb11, which is down today due to problems with the VMs used for DBs; hopefully fixed today
  • Grid services: This morning we saw a pile-up of grid OPS jobs that did not get scheduled, so CERN showed as degraded in SLS. Proxies expired for the pending jobs. Maybe related to changes to the fair-share system

AOB: (MariaDZ) ALARMers who receive GGUS notifications by SMS or other email derivatives, please read and comment in Savannah:124169 if you have any formatting requests.

-- JamieShiers - 14-Sep-2011
