Week of 090323

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).

GGUS Team / Alarm Tickets during last week

Weekly VO Summaries of Site Availability

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Jean-Philippe, Maarten, Roberto);remote(Angela/FZK, Gareth/RAL, Alessandro).

Experiments round table:

  • ATLAS (Alessandro) - a few problems during the weekend: on the Panda Monitor machine voatlas21, 2 processes were automatically killed for overload; the problem is currently with the Panda Monitor developers; many "critical" errors on the PIC VOBOX - not really critical, as they are due to subscriptions to non-existing datasets - the message is to be changed; SRM server down at MPPMU; pretty smooth otherwise.

Gareth is worried about raising an alarm ticket to push Tier1s to upgrade FTS in case of proxy delegation errors; this was also discussed in Prague. However, since a site that does not upgrade could suffer significant unavailability, Alessandro proposes to send a team ticket about the FTS upgrade now; Maarten will check how to have the FTS release pushed from PPS to Production (easier for sites to take); if sites do not upgrade and many errors are seen, an alarm ticket will be sent.

  • ALICE -

  • LHCb (Roberto) - quiet week: low-level Monte Carlo and random analysis; several sites banned as unable to upload data from WNs to SE; because of the problem seen with some sites a couple of weeks ago, LHCb is running lcg_util transfers with a single stream, but this is very bad from a performance point of view (0.3 MB/s from WN to SE); will try to test the sites one by one outside of the Dirac framework to see whether there is some network misconfiguration. It could also be due to a wrong setting of GLOBUS_TCP_PORT_RANGE on the WNs. Roberto will give the list of problematic sites (GGUS 46946).
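A possible way to check the GLOBUS_TCP_PORT_RANGE setting mentioned in the LHCb report is sketched below. It is only an illustration, not the procedure used by LHCb or the sites: it assumes the variable holds two port numbers in the form "min,max" (a space as separator is also tolerated here), and the function names parse_port_range and check_worker_node are hypothetical.

```python
import os


def parse_port_range(value):
    """Parse 'min,max' (space also tolerated as separator) into (min, max)."""
    parts = value.replace(",", " ").split()
    if len(parts) != 2:
        raise ValueError(f"expected two ports, got {value!r}")
    low, high = (int(p) for p in parts)  # int() raises ValueError on non-numbers
    if not (1 <= low <= high <= 65535):
        raise ValueError(f"invalid port range {low}-{high}")
    return low, high


def check_worker_node(env=os.environ):
    """Return a one-line diagnosis of GLOBUS_TCP_PORT_RANGE in the given environment."""
    raw = env.get("GLOBUS_TCP_PORT_RANGE")
    if raw is None:
        return "GLOBUS_TCP_PORT_RANGE is not set"
    try:
        low, high = parse_port_range(raw)
    except ValueError as exc:
        return f"GLOBUS_TCP_PORT_RANGE is malformed: {exc}"
    return f"GLOBUS_TCP_PORT_RANGE allows {high - low + 1} ports ({low}-{high})"


if __name__ == "__main__":
    # Run on a WN to see what the job environment actually contains.
    print(check_worker_node())
```

Such a check only confirms that the variable is present and well-formed on the WN; whether the site firewall actually opens the same range still has to be verified with the site.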

Sites / Services round table: RAL: the large I/O activity from MCDISK reported last week was activity on the LAN (SE to WN), which is why Alessandro did not notice it in his monitoring. RAL had a one-hour unscheduled downtime today because of a fibre channel problem on the Oracle RAC servers for Castor; the problem for CMS took a bit longer to fix, but everything should be OK now.

AOB: (MariaDZ) It's show time folks! In https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru one reads: "Next round in 2009: Maria to remind the VOs on Mar 23rd. The tests will run in the week of Mar 30th, the GDB being planned for April 8th. Related ticket savannah #107452." Any questions, let's discuss them tomorrow in this meeting. Alarm tests should start right after CHEP'09.

Tuesday:

Attendance: local(Miguel, Eva, Alessandro, Jean-Philippe, Roberto, Olof, MariaD, Ignacio);remote(Luca/CNAF, Angela/FZK, Gareth/RAL).

Experiments round table:

  • ATLAS (Alessandro) - Monday was quiet - Reprocessing will start on Thursday - problem in FZK due to Munich and Freiburg (ticket issued) - GRIF/LAL MCDISK full: need to understand why - Stephane restarted *aod subscriptions; this will increase the data rate on MCDISK.

  • ALICE -

  • LHCb (Roberto) - load on LFC at CERN increased because of the retries on failed uploads: one LFC node has been added by FIO - one file access issue in Lyon: file not yet online after 2 days - SRM problem at CERN (load?): uploads failing, ticket submitted, Shaun investigating.

Sites / Services round table:

Databases (Eva): DB down in Taiwan for several weeks: remove from replication - replication to RAL affected by power cut - replication to CNAF will be stopped next week for the scheduled site downtime.

The new version of FTS is going from PPS to Production (patches #2760/2761). Sites should start installing it.

AOB: (MariaDZ) As discussed with Luca, if CNAF is on scheduled downtime during the whole of the ALARMS testing week, the site should be omitted from this round of tests. I have added this common-sense phrase to the 'testing rules' https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru (point 4). About CNAF rejecting LHCb alarm tests in the last round (early March) because it expected the email to be signed with Roberto's certificate: the documentation https://gus.fzk.de/pages/ggus-docs/PDF/1560_Alarm_Ticket_Process.pdf (sections 3.1.2 and 3.2.1) explains:

  • The Authorised ALARMER signs with his/her certificate (section 3.1.2.)
  • The email is sent to a specific site alarm mail address and signed with the GGUS certificate. (section 3.2.1.).
This document, included in https://gus.fzk.de/pages/docu.php#8 is also linked from 'testing rules' https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru (point 5).

  • RAL (Gareth): major outage due to 2 power glitches: nothing available at RAL now - looked at the BDII problem reported earlier: it seems to be similar to the one reported by NIKHEF; Maarten suggests installing the latest BDII RPM from the PPS repository; several sites have already installed it.

  • CNAF (Luca): CNAF on scheduled downtime next week - queues will be stopped this Friday afternoon - jobs should be able to complete - power off on Monday morning - some services like mail will stay up - optimistically, the Tier1 may be back on Thursday.

Wednesday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

  • LHCb -

Sites / Services round table:

LHC VOMS Service
On March 31st at 09:00 UTC a routine host certificate change will happen on lcg-voms.cern.ch. The intervention will be transparent and last only a few minutes. As usual, by this time ALL VOMS-aware services must have deployed the /etc/grid-security/vomsdir/*.lsc file method or have upgraded the lcg-vomscerts package to version 5.3.0-1. (As before, the .lsc method is not available to the WMProxy.) A standard GOCDB "at risk" has been entered.
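For sites preparing the .lsc method, the sketch below shows how such a file could be generated. It is a minimal illustration under stated assumptions, not the official deployment procedure: it assumes the usual two-line .lsc layout (subject DN of the VOMS server host certificate, then the DN of the issuing CA), the helper names lsc_text and write_lsc are hypothetical, and the host name and DNs in the example are placeholders, not the real lcg-voms.cern.ch values. The target directory is left to the caller.

```python
from pathlib import Path
import tempfile


def lsc_text(host_dn, ca_dn):
    """Return .lsc content: server certificate subject DN, then its issuer DN."""
    for dn in (host_dn, ca_dn):
        if not dn.startswith("/") or "\n" in dn:
            raise ValueError(f"not a single-line slash-style DN: {dn!r}")
    return f"{host_dn}\n{ca_dn}\n"


def write_lsc(directory, host, host_dn, ca_dn):
    """Write <directory>/<host>.lsc and return its path."""
    path = Path(directory) / f"{host}.lsc"
    path.write_text(lsc_text(host_dn, ca_dn))
    return path


if __name__ == "__main__":
    # Placeholder values only; take the real DNs from the service announcement.
    with tempfile.TemporaryDirectory() as tmp:
        p = write_lsc(
            tmp,
            "voms.example.org",
            "/DC=org/DC=example/CN=voms.example.org",
            "/DC=org/DC=example/CN=Example CA",
        )
        print(p.name)
        print(p.read_text(), end="")
```

Because the file records DNs rather than the certificate itself, a routine host certificate change with unchanged DNs should not require any update on services using this method.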

AOB:

Thursday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

  • LHCb -

Sites / Services round table:

AOB:

Friday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

  • LHCb -

Sites / Services round table:

AOB:

-- JamieShiers - 19 Mar 2009

Topic revision: r8 - 2009-03-24 - SteveTraylen