WLCG Tier1 Service Coordination Minutes - 20 May 2010

Attendance

Site Name(s)
CERN Flavia, Roberto, Andrea V, Eva, Patricia, Harry, Maarten, Jamie, Alessandro, Julia, Simone, Maria, MariaDZ, Jean-Philippe, Maria Alandes, Maite, Tim, Nicolo, Manuel, Luca, Jacek
ASGC Felix Lee
BNL Carlos Fernando Gamboa
CNAF Alessandro Cavalli
FNAL Jon
KIT Angela Poschlad
IN2P3  
NDGF  
NL-T1 Ron
PIC Gonzalo
RAL  
TRIUMF Andrew
GridPP Jeremy
GGUS Guenter

Experiment Name(s)
ALICE  
ATLAS Elisabeth Gallas, John DeStefano, Rod Walker, Kors
CMS Pepe Flix (CMS/PIC), Peter Kreuzer, Rapolas Kaselis, Dave Dykstra
LHCb  

Summary of / Actions from Meeting on Alarm Chain

  • Downtimes and Experiment Calendars

Downtime calendar

  • Presentation from Peter Kreuzer on the CMS solution. (Slides on agenda). Is there a need for a common solution?
  • In the discussion it was noted that ATLAS already has a similar solution.
  • Julia - try to gather common issues / requirements: many of the problems are with the information sources.

Deployment / Rollout Issues

glexec

| Site | "/ops/Role=pilot" job + glexec test | glexec capability in BDII |
| ASGC | | |
| BNL | | |
| CERN | lcg-CE OK, CREAM fails | only CREAM |
| CNAF | | |
| FNAL | OK for CMS | OK |
| IN2P3CC | | |
| KIT | lcg-CE OK, CREAM fails | 1 lcg-CE |
| NDGF | n/a | n/a |
| NIKHEF | lcg-CE OK, CREAM fails | only lcg-CE |
| PIC | 11 tests OK, 1 problem | OK |
| RAL | OK (configuration fixed) | OK |
| SARA | | |
| TRIUMF | | |

Data Management & Other Tier1 Service Issues

Storage systems: status, recent and planned changes (please update)

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.9-5 (All); SRM 2.9-3 (all) | None | None planned |
| ASGC | CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tapeserver); SRM 2.8-2 | none | none |
| BNL | dCache 1.9.4-3 | none | none |
| CNAF | CASTOR 2.1.7-27 (ALICE); SRM 2.8-5 (ALICE); StoRM 1.5.1-3 (ATLAS, CMS, LHCb, ALICE) | ? | StoRM upgrade to latest version (foreseen for 17/5), date to be agreed (done?) |
| FNAL | dCache 1.9.5-10 (admin nodes); dCache 1.9.5-12 (pool nodes) | none | none |
| IN2P3 | dCache 1.9.5-11 with Chimera | none | none |
| KIT | dCache 1.9.5-15 (admin nodes); dCache 1.9.5-5 - 1.9.5-15 (pool nodes) | none | Change of authentication method on ATLAS dCache instance planned. Preparation ongoing - date will be discussed with ATLAS. Migration of ALICE SRM service to new machine. No date available yet. |
| NDGF | dCache 1.9.7 (head nodes); dCache 1.9.5, 1.9.6 (pool nodes) | none | Upgrade to 1.9.8 on head nodes and some pool nodes on Tuesday (2010-05-25) |
| NL-T1 | dCache 1.9.5-16 with Chimera (SARA), DPM 1.7.3 (NIKHEF) | Migrated dCache head node services to new hardware | Due to the replacement of hard disks, starting from May 25th data will be migrated to other disk pools and there will be reduced throughput. We expect this to take about a week. |
| PIC | dCache 1.9.5-17 | 19/05/2010: Deployment of secondary SRM in "standby mode" that should improve SRM service resilience. Tape protection still disabled to allow CMS to access files on tape with the dcap protocol. Still waiting both for the dCache patch that allows tape protection setting per VO and for the CMSSW debugging of gsidcap access. In contact with dcache.org for testing the former; no news on the latter. | none |
| RAL | CASTOR 2.1.7-27 (stagers); CASTOR 2.1.8-3 (nameserver central node); CASTOR 2.1.8-17 (nameserver local node on SRM machines); CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers); SRM 2.8-2 | none | None, but an outage is planned during the morning of Tuesday 1st June (during the LHC technical stop) for a network change. |
| TRIUMF | dCache 1.9.5-17 with Chimera namespace | ? | ? |

Other Tier-0/1 issues

CASTOR news

dCache news

  • dCache 1.9.5-19 was released on May 14, fixing the problem of the unresponsive SRM observed in 1.9.5-18
  • dCache 1.9.5-20rc1 was released on May 18, supporting the experimental feature of protecting staging based on Storage Classes

StoRM news

DPM news

  • DPM 1.7.4 has been certified, awaiting staged roll-out

LFC news

  • LFC 1.7.4 has been certified, awaiting staged roll-out

FTS

  • FTA 2.2.4 patch in certification + pilot test at CERN

  • Gonzalo - APEL: new attributes were added to the GLUE schema in which the scaling reference for WNs is published (Savannah bug 51176). APEL did not implement this as written in the document. The bug is closed (expired?) - is this in production?

  • MariaA - the bug was closed automatically after being ready for review for a long time. The patch is in production; this should be visible at the end of the bug. It was released in gLite 3.1 update 62 on March 11. It is not in gLite 3.2, nor was it publicised to sites.

WLCG Baseline Versions & gLite Releases

  • MariaA - the first EMI release is not expected for some months. gLite releases will continue for some time. WLCG will set and agree priorities for what goes into these releases. Will present a list at the next meeting for the next gLite release. Can also present the status of things for rollout.

  • Flavia - baseline services updated wrt FroNTier and Squid.

  • John - one of the concerns has been the ATLAS method of assigning failover sites for T2s. CMS doesn't do this for Squid.
  • Dave - ATLAS recommends that T2/T3 sites have another site as backup for Squid proxies. If there is a failure, it will not be noticed, so it is not clear that this is a good H/A strategy. If the main site goes down, more resources will be used without this being noticed.
  • Rod - not noticing when things go wrong is part of redundancy! SAM tests would pick this up.

  • Rod - Squids required at T2s for CMS and ATLAS. Can WLCG take this on? Ale - trying to follow from technical point of view.

  • Should preferably be a joint ATLAS + CMS request to MB.
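
The concern raised above about Squid failover going unnoticed can be illustrated with a minimal Python sketch. It assumes an ordered list of proxy names (primary first) and a caller-supplied `fetch` callable; both are hypothetical and do not reflect the actual ATLAS or CMS Frontier client configuration. The point is only that the fallback is reported rather than silently absorbed.

```python
import logging

logger = logging.getLogger("squid_failover")


def fetch_with_failover(proxies, fetch):
    """Try each proxy in order and report which one served the request.

    `proxies` is an ordered list of proxy names, primary first.
    `fetch` is a caller-supplied callable that takes a proxy name and
    either returns data or raises OSError. Both are hypothetical.
    Returns a (data, proxy_used, failed_over) tuple.
    """
    errors = []
    for index, proxy in enumerate(proxies):
        try:
            data = fetch(proxy)
        except OSError as exc:
            # Remember the failure and move on to the next proxy.
            errors.append((proxy, str(exc)))
            continue
        failed_over = index > 0
        if failed_over:
            # Make the fallback visible instead of silently succeeding.
            logger.warning("primary proxy failed, using backup %s: %s",
                           proxy, errors)
        return data, proxy, failed_over
    raise OSError("all proxies failed: %r" % (errors,))
```

The `failed_over` flag is what a site-level check (for example a SAM-style test, as Rod suggested) could inspect to detect that a backup proxy is in use.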

Conditions data access and related services

COOL, CORAL and POOL

  • A new release of COOL, CORAL and POOL (LCGCMT_56g) was prepared for ATLAS last week. The main motivation for this new release was the upgrade to version 2.7.14 of the frontier_client library. The new library is linked to libexpat.so.0 instead of libexpat.so.1, fixing an inconsistency between the libexpat.so versions used by different libraries needed by ATLAS, which was the likely cause of some failures recently observed in ATLAS jobs (for instance in conditions POOL file access via gfal at SARA). The release notes are available on https://sftweb.cern.ch/persistency/releases
  • A patched version of the OCCI library version 11.2 for 64-bit linux has been received from Oracle Support. The patch fixes the SLC5/SELinux related bug (applications fail with "cannot restore segment prot after reloc" if SELinux is enabled). The problem is now completely fixed in the Oracle 11g client, as three fixes had already been received in previous months for the same bug affecting the OCI 32/64bit and OCCI 32bit libraries. As a reminder, the OCCI library is used by ROOT and some CMS applications, but is not needed by CORAL (which uses OCI instead). The new client libraries have been installed on AFS as /afs/cern.ch/sw/lcg/external/oracle/11.2.0.1.0p2 and are ready to be included in one of the next LCG AA releases.
  • Possible improvements to the ATLAS data management infrastructure from the use of a different GUID format for POOL files are being investigated. Switching to time-ordered GUIDs could simplify the partitioning of the file catalogs and the handling of old files, but the implications and side effects of these changes still need to be evaluated more carefully.
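
As an illustration of why time-ordered GUIDs could help with catalog partitioning, here is a minimal Python sketch. The layout used (48-bit millisecond timestamp followed by 80 random bits) is an assumption chosen for the example and is not the POOL GUID format; the function names are hypothetical too. With the timestamp in the leading bits, sorting GUIDs as strings follows creation order, and a partition label can be derived from the GUID alone.

```python
import os
import time
import uuid


def time_ordered_guid(timestamp_ms=None, random_bytes=None):
    """Build a 128-bit GUID whose leading 48 bits are a millisecond timestamp.

    Hypothetical layout for illustration only; it is not the POOL format.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    if random_bytes is None:
        random_bytes = os.urandom(10)
    if len(random_bytes) != 10:
        raise ValueError("random_bytes must be exactly 10 bytes")
    # 6 bytes of big-endian timestamp followed by 10 random bytes.
    raw = timestamp_ms.to_bytes(6, "big") + random_bytes
    return str(uuid.UUID(bytes=raw)).upper()


def creation_time_ms(guid):
    """Recover the millisecond timestamp from a GUID built above."""
    return int.from_bytes(uuid.UUID(guid).bytes[:6], "big")


def partition_key(guid):
    """Map a GUID to a monthly partition label such as '2010-05' (UTC)."""
    seconds = creation_time_ms(guid) / 1000
    return time.strftime("%Y-%m", time.gmtime(seconds))
```

Old files could then be found by range scans on the GUID itself, which is the kind of simplification mentioned above; whether this fits the existing catalogs is exactly what still needs to be evaluated.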

Frontier

Database services

  • Experiment reports:
    • ALICE:
      • ALICE production databases are planned to be patched during the LHC maintenance period starting on 31st of May.
      • 2 new schemas to be added to the replication setup.
    • ATLAS:
      • ATLAS integration database INTR and production archive database (ATLARC) have been patched with the latest security and other recommended patches from Oracle (PSU 10.2.0.4.4). ATLARC database has been successfully migrated to new RAC9 hardware.
      • ATLAS integration database INT8R will be migrated to new hardware and patched with the latest Oracle security patch and recommended updates on 27th May.
      • ATLAS production databases (ATONR, ATLR) are planned to be patched during the LHC maintenance period starting on 31st of May.
    • CMS:
      • All 4 CMS test, development and integration databases have been patched on 11th May with the latest security and other recommended patches from Oracle (PSU 10.2.0.4.4). At the same time the INT9R database has been successfully migrated to new RAC9 hardware.
      • CMS production databases (CMSONR, CMSR, CMSARC) are planned to be patched during the LHC maintenance period starting on 31st of May.
    • LHCb:
      • LHCb production databases (LHCBONR, LHCBR) are planned to be patched during the LHC maintenance period starting on 31st of May.

  • Site reports:

| Site | Status, recent changes, incidents, ... | Planned interventions |
| CERN | Patch to fix the high memory consumption by the queue monitor processes applied on the test environment. April security patch and recommended updates being applied on test, development and integration databases. First migrations to new hardware successfully completed (d3r, int9r and atlarc). | April security patch and recommended updates to be applied in production during the next LHC technical stop (end of May). |
| ASGC | Problems found with incremental backups. The situation is now back to normal. Reason unknown. | April security patch to be applied mid June (after testing it in the testbed). |
| BNL | | Planning to apply PSU patches to Cond. DB and LFC_FTS the week of 24-28. |
| CNAF | Migration of the ATLAS database completed on the 19th of May. April PSU applied on ATLAS database. | Recommended patches to be applied on ATLAS database and April PSU patch to be applied on LHCb database (by end of May/early June). |
| KIT | | April security patches scheduled for last week of June. |
| IN2P3 | On 11th May, new AMI schema added to the ATLAS AMI streams setup. | April PSU foreseen for the beginning of June (waiting for merge patches). |
| NDGF | | 20th May 09:00-11:00 CET: Firmware upgrade on ATLAS DB storage controllers (transparent). 27th May (tentative, will confirm ASAP): April PSU on ATLAS conditions DB (transparent). |
| PIC | Contention observed for TAGS database under a high load of updates. Adjusting parameters in order to solve this. | April 2010 PSU planned for 25/5 (FTS & LFC) and for 27/5 (ATLAS, TAGS and LHC). Change of MTU for the interconnect cards planned for the same days as the patching. |
| RAL | | April PSU scheduled for the next few weeks: OGMA (ATLAS) Tuesday 25th May from 10:00am till 12:00 pm. LUGH (LHCb) Thursday 27th May from 10:00am till 12:00 pm. SOMNUS (LFC, FTS) Wednesday 2nd June from 10:00am till 12:00 pm. |
| SARA | | Plan to apply April security patches. No date yet. |
| TRIUMF | | Plan to apply April PSU within the next 2 weeks. |

  • Note that the bug affecting the apply process failover during rolling interventions has not been fixed yet. For this reason, apply processes must be stopped before the interventions.
  • Database weekly reports:
    • Zbigniew has sent an email with the instructions.
    • Which sites have already deployed them? RAL, SARA, KIT and PIC
    • Partitioning is used. NDGF does not have a license to use partitioning. Any other site? No.
  • Licenses:
    • Request from RAL progressing
    • Support for 2006 licenses will be covered by CERN. Details to be confirmed.
    • BNL and KIT are also interested in new licenses. Eva will send the information around.

AOB

-- JamieShiers - 13-May-2010

Topic revision: r30 - 2010-06-30 - AndreaValassi
 