Week of 090706

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Jamie, MariaG, Daniele, Sophie, David, Julia, Patricia, Stephane, Simone, MariaDZ, Gang, Andrea, Dirk);remote(Michael/BNL, Angela/GridKA, Roberta/CNAF).

Experiments round table:

  • ATLAS - (Stephane) - cosmics data taking finished; data distribution will continue until Wednesday due to some delay in T0 processing. ATLAS experienced problems with several of the Muon sites (Roma, Napoli and Munich) for about one day in total. The experiment is discussing with the sites concerned how to improve their problem resolution procedures. A ticket concerning file access problems (#5007) was created and followed up at T0. ATLAS was able to get access to the data from a file copy at BNL. The reason for the failure at T0 was tracked down to a faulty disk server. ATLAS asked whether the service was also alerted by their own monitoring. Sophie: the service responsible was also notified by the sysadmin team. Last Friday a problem with about 1000 files at IN2P3 was fixed - most of them have been recovered from other sites. RAL expect to be back in production by the end of today. Michael: how much more data is going to be distributed by ATLAS? Stephane: will find out and inform the T1 sites.

  • CMS reports - (Daniele) - the "no-ticket" period from CMS to ASGC ended today; CMS will now restart the normal follow-up procedure with the site. Otherwise services were running with high quality (no other tickets). Only a glitch in the automated email notification from Savannah was found and fixed. Other minor site issues are listed in detail on the CMS twiki.

  • ALICE - (Patricia) - a smooth weekend also for ALICE. The only issues, with a few inaccessible VOboxes, are being followed up directly with the sites.

Sites / Services round table:

  • GridKA/Angela: investigating ALICE problems with the GridKA CREAM CE - local submission works.
  • RAL/Gareth: RAL outages ended according to schedule (CASTOR at midday, batch at 14:00). Only a few remaining issues (3 disk servers from ATLAS MC). The new tape robot is still being filled with media, expected to become available tomorrow. ATLAS: can we write? Gareth: yes, files will stay on disk until the tape is available.
  • CNAF: following up on CE problems - expected to be solved by an upgrade to a new version of INFN grid s/w (later this week).
  • ASGC/Gang: tape drives are online since last Thursday, but no traffic seen yet. The site is cross-checking with the experiments.
  • T0/Sophie: reported on an SRM intervention to clean up remaining problems from the Oracle "BigID" problem.
  • T0/MariaG: finished the migration of the validation clusters to RH5 and will continue with the production instances. On Wednesday there will be a transparent intervention to add redundant power supplies on network switches. Sophie: what is the downtime for the RH5 upgrade? MariaG: some 30 minutes.
  • T0/David: Geant intervention planned for tonight at 6am, which will interrupt several circuits - expected to be transparent.

AOB:

  • MariaDZ reported on discussions with the OSG responsibles about the notification procedure (e.g. Alert tickets). It was suggested to raise proposals and remaining issues, e.g. at one of the next LCG MBs.

Tuesday:

Attendance: local(Julia, Ricardo, Sophie, Harry, Andrea, Jamie, MariaG, MariaDZ, Daniele, Roberto, Patricia);remote(Michael, Angela, Jeremy, Gareth, Tiziana).

Experiments round table:

  • ATLAS - (email from Stephane) ATLAS is now fully using RAL (except for the currently scheduled downtime). I have no news about the missing ATLASMCDISK disk servers, but my colleagues might have some (later update by Gareth: disk servers back but CASTORATLAS down). When reprocessing of the cosmic data taken in the last two weeks starts, TAIWAN will, like any other T1, get the reprocessed data.

  • CMS reports - Tickets to ASGC have now been unpostponed, so there are currently 5 open tickets for the site. The situation at the Tier 1s is very good apart from RAL, which is in shadow downtime. Nebraska and Florida are now in a better state. A minor glitch in the dashboard resolved itself.

  • ALICE - was overloading a WMS at CERN. There is an issue at sites with more than one CE: AliEn was only counting the waiting jobs from one CE and therefore sending too many jobs. This will be solved with a new AliEn module being tested at CERN today.

  • LHCb reports - MC generation of their 10^9 events has resumed after a weekend pause to fix the full disks on the SE used to store logs. Preparing for the FEST production tomorrow, although there is an issue with SAM installation jobs failing and preventing installation of the requested versions of the LHCb application software. Small issue with a WMS at PIC not responding - a GGUS ticket has been submitted.
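
The multi-CE counting issue reported by ALICE above can be sketched as follows. This is a minimal illustration only; the function and field names are invented and do not come from the AliEn code base:

```python
# Hypothetical sketch of the AliEn job-counting issue described above:
# at a site with several CEs, counting waiting jobs from only one CE
# makes the site look emptier than it is, so too many jobs are sent.

def jobs_to_submit_buggy(site_ces, target_waiting):
    """Buggy behaviour: only the first CE's waiting jobs are counted."""
    waiting = site_ces[0]["waiting"]
    return max(0, target_waiting - waiting)

def jobs_to_submit_fixed(site_ces, target_waiting):
    """Fixed behaviour: waiting jobs are summed over all CEs at the site."""
    waiting = sum(ce["waiting"] for ce in site_ces)
    return max(0, target_waiting - waiting)

ces = [{"name": "ce01", "waiting": 80}, {"name": "ce02", "waiting": 70}]
print(jobs_to_submit_buggy(ces, 100))  # 20 more jobs, on top of 150 already waiting
print(jobs_to_submit_fixed(ces, 100))  # 0 - the site is already over target
```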

Sites / Services round table:

* RAL: Network instabilities this morning caused various interruptions. All services and the ATLAS MC disk servers are back, but the CASTORATLAS instance is down at the moment.

* INFN: The CE problem (large number of zombie globus processes) has been solved by upgrading to globus 1.0.15.

* ASGC: Services are back in the computer centre. Water damage has been found in some patch panels, causing communications instabilities; these will need to be replaced (next week). The site is now passing the ATLAS functional tests.

* GRIDPP: One of the Tier 2s at Imperial College (London) was closed as planned.

AOB:

  • Proposal for rest of week (due to GDB and WLCG STEP'09 Post-Mortem workshop):
    • Retain the call tomorrow (Wednesday)
    • Tentatively skip the calls Thursday and Friday (unless there are major problems)
    • In both cases, sites and experiments are invited to add entries directly into this page (well) prior to the 'meeting'.

Wednesday

Attendance: local(Daniele, Sophie, Michele, Jamie, Roberto, Patricia, Dirk);remote(John Kelly/RAL, Stephane/ATLAS, Angela/GridKA, Michael/BNL).

Experiments round table:

  • ATLAS - (Stephane) - functional tests outage due to certificate issues (tests stopped from yesterday afternoon to this morning). Related to this, a problem with the ASGC FTS server, which seemed to be still using an old proxy after the VObox upgrade earlier this week; the problem has now been fixed by ASGC. ATLAS is suffering from a problem with their central catalog, which is still under study.

  • CMS reports - (Daniele) quiet day with a few tickets closed. New tickets for PIC concerning time-outs on test transfers and failing CMS-specific SAM tests. A discussion was started on how to properly handle T2 downtimes on the status board, to steer MC production and inform shifters. A new CMS s/w release is coming up. Currently a few sites do not appear in the BDII and the production agents cannot send jobs there (Andrea Sciaba is investigating the issue).

  • ALICE - (Patricia) this morning ALICE started a new MC cycle: old jobs have been killed and the new production is ramping up now. Testing with virtual machines as worker nodes shows some communication issues with the VObox - being followed up between ALICE and T0/FIO.

  • LHCb reports - (Roberto) MC activity is ongoing and a new FEST round has started (transfer and reco at the T1s). Two new problems, for CERN and RAL (with local tickets, as GGUS was down). CERN: the LHCb space token was affected by new disk servers which moved into production yesterday (LHCb has lost one day of production time). The problem was tracked down to a wrong gridmap configuration and the disk servers have been taken out again. Sophie: not clear why only LHCb was affected, as ALICE and ATLAS were using a similar configuration in tests. RAL: LHCb saw many jobs stuck in staging status - Shaun is already investigating why.
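
For context on the gridmap configuration mentioned in the LHCb/CERN item above: a grid-mapfile maps certificate DNs to local accounts, and a wrong mapping on a disk server denies or misdirects access for the affected VO. A generic entry looks like the following (the DN and account name here are invented for illustration; the actual misconfiguration at CERN is not documented in these minutes):

```text
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdoe/CN=123456/CN=John Doe" lhcbprd
```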

Sites / Services round table:

  • RAL/John: CASTORATLAS showed DB performance problems under load, which may be resolved now.
  • T0/Michele: network intervention from 5-21, with a disconnect between Paris and RAL. RAL will be disconnected; NL will be rerouted. GGUS ticket 50116.
  • GridKA/Angela: problem with CREAM for ALICE: a bug was found in the auth process; a gLite LCAS upgrade should fix this problem.
  • BNL/Michael: distributed an email note about the HPSS upgrade on 14 July. Michael also announced the ATLAS farm upgrade to SL5, planned for 28 July.
  • ASGC/Gang: the ATLAS DDM issue with FTS is fixed now. The problem lasted 1-2 hours.

AOB:

Thursday

No call - WLCG STEP'09 Post-Mortem workshop

Friday

No call - WLCG STEP'09 Post-Mortem workshop

Sites / Services round table:

  • ASGC (by email): briefing on the status of the tape service: the earlier h/w error code 'B582 (Column 2 Fiducial missing during calibration)' was actually due to a f/w issue and has been resolved by the contract vendor patching to the latest release this morning. Four tape servers have resumed full function and tape logging has started parsing new records, syncing the tape usage logs at the tape servers. Two old-generation tape drives are still showing abnormalities, but manual operation with SCSI commands passes normally. Although only a limited number of drives (4 LTO4) are online right now, data migration/recall should be able to complete successfully, possibly with long durations. Work on the remaining three drives will continue next week.

AOB:

-- JamieShiers - 06 Jul 2009
