-- HarryRenshall - 05 Jan 2009

Week of 090105

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

Site Date Duration Service Impact Report Assigned to Status

GGUS Team / Alarm Tickets during last week

  • Tickets: see https://gus.fzk.de/ws/ticket_search.php and select "VO" - this will give all tickets, including team & alarm - or use the following link:
    • In the GGUS Escalation reports every Monday
  • Probably due to the year-end holidays the escalation reports were unavailable today, so here is a manual selection of GGUS tickets per experiment for the period 19/12/2008 - 05/01/2009:
    • Alice: 0 tickets
    • Atlas: 30 tickets
    • CMS: 4 tickets
    • LHCb: 1 ticket

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Olof, Eva, Harry, Nick, Jean-Philippe, Julia, Simone, Andre, MariaDZ); remote(Fabio, Gareth, Jeff).

elog review:

Experiments round table:

ALICE (reported by HR): Over the Christmas break the ALICE WMS at CERN became unable to schedule jobs sufficiently fast (both gLite 3.0 and 3.1, but with different symptoms), thus losing production time.

ATLAS (SC): Positive news. The only activity during the break was a major reprocessing of the most valuable part (0.5 TB out of 2 TB) of last autumn's cosmic data. Eight Tier 1 sites were validated for this (all except ASGC and NL-T1), plus 5 US Tier 2 sites and CERN, and the production ran smoothly. Merging of AODs and DPDs is complete and distribution (from the merge sites) has started. Merging of ESDs is ongoing. JT asked how to identify merge jobs; the answer is that they are far more I/O intensive than reprocessing jobs.

CMS (AS): Nothing special to report.

Sites round table:

IN2P3 (FH): They switched to using SRMv2 and found some CMS jobs putting a high load on their dCache SRM servers by issuing srmLs commands on subdirectories, and therefore switched back to SRMv1 (which has no srmLs command). Andrea followed this up and reported: "I understand that the problem you mentioned at the daily WLCG operations meeting was the one experienced at IN2P3 before December 20th, when your SRMv2 was overloaded by many srmLs calls coming from CMS jobs. As Farida knows very well, and as was discussed extensively in the CMS Hypernews, the cause was that the srmls command was used to check whether the directory in which to stage out the files existed, but without the -recursive_depth=0 option. That was equivalent to an 'ls' in UNIX. The ProdAgent code has now been changed to use that option, which is equivalent to 'ls -d' and is much lighter on the SRM server. As soon as CMS starts using the ProdAgent with the updated code, it should be safe again to use SRMv2."
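
As a rough illustration of the change described above (a sketch only: the SURL and the use of Python's subprocess module are illustrative assumptions; only the srmls command and its -recursive_depth=0 option come from the report), the two invocations differ as follows:

   # Sketch contrasting the two srmls calls described above.
   # The SURL is hypothetical; only the -recursive_depth=0 option is taken from the report.
   import subprocess

   surl = "srm://se.example.in2p3.fr:8443/pnfs/in2p3.fr/data/cms/store"  # hypothetical endpoint

   # Without the option the server expands the directory contents, like "ls" in UNIX (heavy on the SRM server).
   subprocess.run(["srmls", surl], check=False)

   # With -recursive_depth=0 only the directory entry itself is returned, like "ls -d" (much lighter).
   subprocess.run(["srmls", "-recursive_depth=0", surl], check=False)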

RAL (GS): There will be an outage of the castoratlas instance tomorrow to reconfigure it in preparation for the next ATLAS 10-million files test which should start next week.

ASGC (by email from JS): Earlier this morning one of the RAID subsystems serving the Oracle RAC raised an overheat alarm, causing intermittent I/O access errors. Operators reset the controller and the front-end cluster service was able to resume normally, but the database serving CASTOR and LFC (both access the same partition, mapped to the same LUN on that controller) could not.

The error code 'ORA-00600: internal error code, arguments: [3020]' remains the same as observed earlier, mainly because the DB service shut down unexpectedly when three of the cluster nodes went offline at the same time.

Operators have already performed recovery procedures, and escalation to an expert has proceeded (more than 12 hours after the first event was observed in the monitoring interface). We hope the service will be able to resume shortly.

An SD (scheduled downtime) has been added for two services (srm and srm2 of CASTOR) to avoid confusion on the dashboard. We will keep you posted on any update.

Services round table:

Databases (ED): Since 1 January there has been a problem with CMS streams replication due to big transactions which are exhausting the memory of the DB servers. Replication to ASGC is down with a lock corruption.

FIO (US): They are seeing a big increase in the load on the CERN MyProxy servers due to ALICE renewing submit-server proxies on each job submission. This will be changed to a 10-minute interval.

VOMS (MDZ): 2 minor VOMS problems over the break - one on Dec 23 fixed in a few hours and one yesterday transparent to users.

AOB:

Tuesday:

Attendance: local(Eva, Harry, Jan, Jean-Philippe, Simone, Olof, Ulrich, Julia, MariaDZ, Nick, Kors);remote(Jeff, Michael, Gareth, Brian).

elog review:

Experiments round table:

ALICE (by email from L.Betev):

  1. Despite the partial overload of the WMSes at CERN, the production was very successful, with ~4K jobs in parallel over the 3 weeks of the Christmas holidays.
  2. Maarten Litmaath (always online) warned us of the overload and provided quick information on the situation with the WMSes at CERN. With Patricia Mendez we were able to reconfigure the sites to use less loaded WMSes, so the loss of production time was very small.
  3. Thanks to the expert information from both Patricia and Maarten, we were also able to identify a couple of improvement points for the submission framework, which will be implemented and deployed shortly.
  4. One of the biggest issues is that most of the sites rely on the WMS at CERN, making it a critical point for ALICE Grid operations. A further post-mortem will be performed in the coming week.

LHCb:

The attached plot summarizes the activities during the Christmas period:

  • Red and green: dummy productions, ~35k jobs (most marked as failed; see http://lblogbook.cern.ch/Operations/1114)
  • Yellow: user jobs, ~12k
  • Purple: SAM jobs, ~5k

jobs0809xmas.png

The dummy productions were extended as necessary during the holidays, so a relatively small number of jobs were executed. Jobs that managed to run show a reasonable CPU/wall-clock ratio, so the real problem can be traced to difficulties getting pilots running. This is more evident in the following plot, which shows the status of submitted pilots over the Christmas period and gives an indication of the throughput of the activities.

pilotStatus0809xmas.png

With the central task queues saturated we observed only a small throughput (with ~8 WMSes in use). The large number of pilots in the 'Deleted' status arises when the status of a pilot cannot be determined after several days, e.g. if a WMS becomes unavailable. Only after 29 December does a steady ramp-up of Done pilots (see last plot) show a more stable submission. From a look at the DIRAC logs there seemed to be a large number of list-match failures. It would be nice to know whether there were known problems (BDII, network) that could explain instabilities on the RB side during the first week of the Christmas break (21-29 December).

cumulativepilots0809xmas.png

ATLAS (SC): Functional tests are running smoothly, with one communications problem between FZK and NDGF (put down to networking, as their links to other sites are working). Both sites have been contacted by the ATLAS expert on call. There will be a post-mortem analysis of the reprocessing exercise tomorrow. Jeff then queried the information that SARA had not been validated to participate in the reprocessing. Simone reported that the site had failed validation tests before Christmas, possibly due to pilot jobs not coping with space tokens at two different subsites. To be confirmed at the post-mortem analysis.

Sites round table:

RAL: Brian reported they had made a configuration change to one of the disk pools serving reprocessing jobs, which should improve their efficiency. Some jobs were timing out waiting to stage data in from tape due to an overly tight garbage-collection strategy. Gareth reported that the castoratlas changes mentioned yesterday had been completed with a slight overrun, and they were now just waiting to move to a new version of SRM.

NL-T1 (JT): There will be router changes at SARA on 12 January causing several short network interrupts.

Services round table:

Databases (ED): The problem with CMS streams replication reported yesterday has now been fixed.

AOB:

Wednesday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Services round table:

AOB:

Thursday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Services round table:

AOB:

Friday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Services round table:

AOB:

Topic attachments:

  • jobs0809xmas.png (PNG, 53.0 K, uploaded 2009-01-06 11:29 by RobertoSantinel)
  • pilotStatus0809xmas.png (PNG, 31.5 K, uploaded 2009-01-06 11:35 by RobertoSantinel)
  • cumulativepilots0809xmas.png (PNG, 39.6 K, uploaded 2009-01-06 11:38 by RobertoSantinel)