WLCG Tier1 Service Coordination Minutes - 19 May 2011

Attendance

Action list review

Release update

Data Management & Other Tier1 Service Issues

WLCG Baseline Versions

Status of open GGUS tickets

The meeting will not take place this time. The e-group wlcg-service-coordination is asked to comment on issues offline. Are the "Type of Problem" values in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCriticalServices#GGUS_Type_of_Problem_field necessary and sufficient for TEAM and ALARM tickets? Comments to Maria Dimou please.

Review of recent / open SIRs and other open service issues

Conditions data access and related services

Database services

  • Experiment reports:
    • ATLAS:
      • The first instance of the ATLAS offline database (ADCR) crashed on Sunday (08.05). The crash was caused by an internal database error and is now under investigation. Services remained available on the surviving nodes while the instance restarted and were relocated back to instance one after it came back into operation.
      • The ATLAS offline DB (ADCR) hung three times, during which the service was not available: Monday 16th between 16:25 and 17:10, Monday 16th between 21:50 and 23:30, and Tuesday 17th between 01:50 and 02:40. No data was lost except for uncommitted data. All incidents were caused by an unusual reaction of ASM to a broken disk (itstor737 disk 3). ASM did not properly initiate a rebalance operation during the first incident and was affected by some problems during the second and third. After the incidents a normal rebalance finished and we were trying to forcibly evict the problematic disk. An SR has been opened on this issue.
      • On 18th of May around 11:20 the ADCR DB experienced another disk failure during a rebalancing operation, which did not finish. The decision was taken to switch the DB over to the standby cluster. The switchover completed successfully after several minor issues and the DB was operational again at 13:05. IN2P3 reported that AMI applications were not able to reach the DB. It turned out that the DB was not visible outside CERN. We requested that the port on the firewall be opened, and this was done the next day (19th of May in the morning).

  • Site reports:
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | | |
| BNL | - CPU April 2011 and OS kernel patches deployed in VOMS and Conditions database clusters. - Applied Streams patch (9232517) in Conditions Database. | Apply CPU April 2011 to the LFC and FTS cluster and the standby database cluster. |
| CNAF | | |
| KIT | | |
| IN2P3 | | |
| NDGF | Nothing to report | None |
| PIC | Nothing to report | Planning to apply April CPU Patch in two weeks' time. No exact date yet. |
| RAL | April CPU (+ recommended patches) has been applied on CASTOR DB and 3D. Started testing of the new HW that will be used for the Data Guard configuration. | |
| SARA | Nothing to report | On the 24th of May: upgrade to 10.2.0.5 and application of CPU April 2011 and all other recommended patches. |
| TRIUMF | | |

AOB

-- JamieShiers - 05-May-2011

Topic revision: r3 - 2011-05-19 - DawidWojcik