Week of 100614

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Harry(chair), Jaroslava, Laurence, Ulrich, Eva, Lola, PeterK, Andrea, JanI, Jamie, MariaG, MariaDZ, Alessandro, Jean-Philippe, Gavin, Dirk, Roberto, Simone); remote(Jon+Catalin(FNAL), Joel(LHCb), Gonzalo(PIC), Michael(BNL), Gang(ASGC), Angela(KIT), Gareth(RAL), Rob(OSG), Rolf(IN2P3), Onno(NL-T1)).

Experiments round table:

ATLAS reports:

CERN-PROD_SCRATCHDISK: the issue was solved on Saturday morning, thanks! https://gus.fzk.de/ws/ticket_info.php?ticket=58904

Overloaded BDIIs at several sites over the weekend resulted in failing jobs with problems retrieving inputs: Australia-Atlas, JINR, Nikhef, INFN-T1. This has been a recurring issue over the past two months. What is the underlying problem? Laurence Field asked for more details - which sites and which clients. Alessandro reported CERN, RAL and INFN, and that worker nodes querying their local BDII were timing out. Peter Kreuzer thought CMS is also seeing this in some SAM tests at INFN and at some Tier-2s. Laurence thought the client timeout was hard-wired at 15 seconds but he will follow this up together with Alessandro.
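To make the timeout discussion above concrete, here is a minimal sketch (assuming the python-ldap module and a placeholder site BDII hostname, not any of the sites named above) of the kind of LDAP query a worker-node client makes against a site BDII, with an explicit client-side time limit of 15 seconds:

    # Minimal sketch: query a site BDII with an explicit client-side time limit.
    # The hostname is a placeholder; port 2170 and base "o=grid" are the usual
    # Glue 1.3 BDII conventions.
    import ldap

    BDII_URI = "ldap://site-bdii.example.org:2170"   # hypothetical site BDII endpoint
    TIMEOUT = 15                                      # seconds, matching the client timeout discussed above

    conn = ldap.initialize(BDII_URI)
    conn.set_option(ldap.OPT_NETWORK_TIMEOUT, TIMEOUT)   # fail fast if the connect itself hangs
    try:
        # search_st() enforces a client-side time limit on the whole search
        results = conn.search_st("o=grid", ldap.SCOPE_SUBTREE,
                                 "(objectClass=GlueCE)", ["GlueCEUniqueID"],
                                 timeout=TIMEOUT)
        print("BDII returned %d CE entries" % len(results))
    except ldap.TIMEOUT:
        print("BDII query timed out after %d s - server probably overloaded" % TIMEOUT)
    except ldap.SERVER_DOWN:
        print("Could not reach the BDII at all")
    finally:
        conn.unbind_s()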

Dear Tier-1s, please take the opportunity this week to perform the FTS upgrade. The LHC is in a commissioning week, so there is not much data to take and no data to export from ATLAS to the Tier-1s (except functional tests).

CMS reports:

T0 Highlights: 1) Cosmics. 2) Data taking during machine development. 3) CMS DB issue from last week: the T0 database service (TOAST) is OK now, after restarting DBS and moving it away from the DB node that TOAST is using. 4) Today: intervention on the storage manager DB at Point 5. Afterwards this affected the CMS T0 (SM host down?); the T0 team changed settings and it is working again.

T1 Highlights: 1) Two GGUS TEAM tickets opened to INFN-T1 (https://gus.fzk.de/ws/ticket_info.php?ticket=58920 and https://gus.fzk.de/ws/ticket_info.php?ticket=58987; they may or may not be the same issue), both related to SW configuration: admins at INFN-T1 made a general fix and asked CMS to confirm things are fine now; to be followed up. 2) BDII not visible issue at INFN-T1 on Saturday 12 June, covered in the second ticket above: https://gus.fzk.de/ws/ticket_info.php?ticket=58987

T2 Highlights: 1) MC production as usual. 2) BDII not visible issues at the two CMS Lisbon T2s: see output from the Dashboard Site Status Board: http://dashb-ssb.cern.ch/templates/cache/bdii_log.html#T2_PT_NCG_Lisbon and http://dashb-ssb.cern.ch/templates/cache/bdii_log.html#T2_PT_LIP_Lisbon

ALICE reports:

GENERAL INFORMATION: Pass 1 reconstruction activities are ongoing together with two analysis trains. No MC production activities during the weekend. In terms of raw data transfers, very low activity for the moment.

T1 sites: CNAF: on Sunday morning the experts detected a problem at the local ce07 (CREAM): connections to the service were being refused at submission time. The site admin was immediately informed and took action within a few hours (therefore there is no GGUS ticket).
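As a simple way to reproduce the "connection refused" symptom reported above, the following is a hedged sketch of a plain TCP probe against a CREAM CE service port; the hostname is a placeholder (not the real CNAF node) and port 8443 is assumed to be the CREAM submission port:

    # Minimal sketch: TCP connectivity probe against a CREAM CE port.
    import socket

    CE_HOST = "ce07.example.infn.it"   # hypothetical CREAM CE host
    CE_PORT = 8443                     # assumed CREAM service port for this sketch

    try:
        sock = socket.create_connection((CE_HOST, CE_PORT), timeout=10)
        print("TCP connection to %s:%d succeeded" % (CE_HOST, CE_PORT))
        sock.close()
    except socket.timeout:
        print("Connection attempt timed out")
    except socket.error as err:
        # "Connection refused" here would match the symptom described above
        print("Connection failed: %s" % err)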

FZK: also on Sunday morning a restart of all local services on both VOBOXes was necessary. ALICE experts are investigating why this is needed from time to time at several VOBOXes.

T2 sites: Clermont: on Saturday night experts detected wrong information reported by the local information system of the CREAM-CE: a large number of ALICE jobs appeared in status running although the experiment had stopped submitting new agents almost 24 hours earlier. The issue was reported to the ALICE expert at the site during the weekend; this expert confirmed this morning that the problem is gone. The site is back in production.

Cagliari and CyberSar-Cagliari: both sites are out of production. The local AliEn user proxy expired on both local VOBOXes. The responsible person has been informed; waiting for action.
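For reference, a minimal sketch of checking the remaining lifetime of a proxy on a VOBOX is shown below; it assumes the conventional default proxy location /tmp/x509up_u<uid> and the Python "cryptography" module (in practice one would normally just run voms-proxy-info, this is only a self-contained illustration):

    # Minimal sketch: read the proxy certificate and report how long it remains valid.
    import os
    from datetime import datetime
    from cryptography import x509
    from cryptography.hazmat.backends import default_backend

    proxy_path = "/tmp/x509up_u%d" % os.getuid()   # conventional default proxy location

    with open(proxy_path, "rb") as f:
        pem = f.read()

    # The proxy certificate is the first PEM block in the file
    cert = x509.load_pem_x509_certificate(pem, default_backend())
    remaining = cert.not_valid_after - datetime.utcnow()

    if remaining.total_seconds() <= 0:
        print("Proxy expired at %s UTC" % cert.not_valid_after)
    else:
        print("Proxy still valid for %s" % remaining)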

LHCb reports:

Experiment activities: Running several MC productions at a low profile. Merging production. GGUS (or RT) tickets:

T1 site issues: CNAF: no shared area variable defined (GGUS:58985). IN2P3: SRM endpoint not available on Saturday; SAM tests confirm this outage (GGUS:58994).

T2 site issues: Shared area issues at BG05-SUGrid (GGUS:59015) and IL-TAU-HEP (GGUS:59007); see the sketch below.
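The shared-area tickets above typically boil down to the VO software-area variable being absent or broken on the worker nodes. A minimal sketch of such a check, assuming the conventional VO_LHCB_SW_DIR environment variable used on gLite worker nodes:

    # Minimal sketch: verify that the LHCb shared software area is defined and present.
    import os

    sw_dir = os.environ.get("VO_LHCB_SW_DIR")

    if not sw_dir:
        print("VO_LHCB_SW_DIR is not defined - shared area misconfigured (as in GGUS:58985)")
    elif not os.path.isdir(sw_dir):
        print("VO_LHCB_SW_DIR points to %s, which does not exist" % sw_dir)
    else:
        print("Shared area available at %s" % sw_dir)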

Sites / Services round table:

FNAL: Up to date with FTS.

ASGC: Will migrate to FTS 2.2.4 tomorrow.

KIT: Migrated FTS last week. Now planning to go to SRM 2.5. Had a hardware problem with their CMS dCache head node during the weekend - a 4-hour downtime.

RAL: Had BDII issues over the weekend (had missed an upgrade) and were failing ops SAM tests. Ready to perform the FTS upgrade and proposing Wednesday. Installation problems with new disk servers last week for ATLAS and CMS - fixed by Friday. Gareth queried whether the SAM to Nagios switchover is still scheduled for 15 June. Jamie replied this date had been chosen to mesh with an MB meeting that has now been postponed. There are differences in the test results (the algorithms differ, of course) that MB members would like to fully understand, so there is no new date yet. Gareth also queried whether access to the results database will change - Harry to follow up.

OSG: Problems in the OSG-GGUS interface on the OSG side over the weekend but no updates were lost.

IN2P3: The LHCb srm crash outage was due to a known bug which is protected by an autorestart and the service was back before the ticket was created.

NL-T1: Will upgrade their FTS tomorrow.

CERN CEs: Planning to migrate four lcg-CEs that currently submit to SLC4 so that they submit to SLC5 as soon as possible, and to update the CREAM-CE to release 3.2.6. This will take some time as the CEs need to be drained of jobs.

CERN CASTOR: Adding 128 TB to the ATLAS Tier 3 disk pool as requested by B. Panzer.

AOB: Simone reported that Friday's high OPN traffic between CERN and RAL was caused by an ATLAS user on CERN LSF batch reading data from RAL. He was surprised that CERN worker nodes are on the OPN, as he did not think other Tier-1s were configured that way. Gareth pointed out that if the traffic were not on the OPN, the GPN would have been overloaded.

Tuesday:

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

CERN VOMS: The host certificate of voms.cern.ch will be updated on Wednesday 16th June 10:00 CEST. If you have the lcg-vomscerts package installed on your service then you must have updated to version 5.9.0-1 of this package by this time.
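A minimal sketch of verifying the installed lcg-vomscerts version on an RPM-based node is shown below; the version comparison is a naive numeric split rather than a full RPM comparison:

    # Minimal sketch: check whether the installed lcg-vomscerts RPM is at least 5.9.0.
    import subprocess

    REQUIRED = (5, 9, 0)

    try:
        out = subprocess.check_output(["rpm", "-q", "--qf", "%{VERSION}", "lcg-vomscerts"])
        version = tuple(int(x) for x in out.decode().strip().split("."))
        if version >= REQUIRED:
            print("lcg-vomscerts %s is recent enough" % ".".join(map(str, version)))
        else:
            print("lcg-vomscerts %s is older than 5.9.0 - update before the certificate change"
                  % ".".join(map(str, version)))
    except subprocess.CalledProcessError:
        print("lcg-vomscerts does not appear to be installed on this node")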

AOB:

Wednesday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 08-Jun-2010
