WLCG Tier1 Service Coordination Minutes - 29 April 2010

Attendance

Site Name(s)
CERN Tim, Roberto, Patricia, Harry, Andrea, Luca, Maite, MariaDZ, Maarten, Manuel, Maria, Jamie, Gavin, Zbyszek, Jean-Philippe, Simone, Eva, Alessandro, Flavia, Tony
ASGC  
BNL  
CNAF AlessandroCavalli
FNAL Jon
KIT Angela
IN2P3  
NDGF Vera, Jon
NL-T1  
PIC Gonzalo
RAL  
TRIUMF  
OSG Kyle
GridPP Jeremy

Experiment Name(s)
ALICE  
ATLAS Kors, Dario, John
CMS Daniele
LHCb  

Discussion on LHC Schedule

  • Basic message from Roger Bailey - will try to hold these technical stops as per schedule. Even if there is a problem with the LHC, these stops are likely to be needed anyway. An issue this time was the full maintenance of the LHC elevators - this is done by an external company, so it is very hard to shift!

  • Simone - winter shutdown? First estimate was 11 weeks but 2nd iteration likely to be shorter. Switch off ~10th December, come back end Jan / beginning Feb. TBC.

  • Tim - in the event of any changes, e.g. if one is moved forward by one week, what sort of notice would be given? A: it will be very difficult not to have these stops, even if we have to stop for other reasons. If there is an unscheduled stop of ~one week just before a technical stop we might change it, but as this affects the whole accelerator schedule it is very hard to change.

  • Andrew - very useful to have this information!

  • Jamie - schedule service interventions through this meeting and try to limit to a) max 3 and b) 30% of total capacity.

  • Kors - technical stops relevant for export of data but not otherwise for Tier1 / Tier2 sites. Should coordinate site downtimes at any time.

  • Daniele - would not like to see a lot of sites down with short notification - if you plan downtimes please go through this meeting with the maximum notice possible, so we can foresee the impact and if necessary ask for the downtime not to take place.

  • MariaDZ - can ask for this to be programmed into GOCDB (see the sketch after this list).

  • Tim - and for Tier0?

  • Conclusion - schedule T1/T2 interventions independently of technical stops
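
As a concrete illustration of the GOCDB point above, here is a minimal sketch of how upcoming scheduled downtimes for a site could be listed programmatically. The endpoint URL, the query parameters and the XML field names below are assumptions for illustration, not a description of the actual GOCDB interface deployed at the time.

```python
# Minimal sketch: list scheduled downtimes from a GOCDB-style programmatic
# interface. Endpoint, parameters and XML field names are assumptions.
import urllib.request
import xml.etree.ElementTree as ET

GOCDB_PI = "https://goc.example.org/gocdbpi/public/"  # hypothetical endpoint

def upcoming_downtimes(site):
    """Fetch and print the downtimes declared for one site."""
    url = f"{GOCDB_PI}?method=get_downtime&topentity={site}&ongoing_only=no"
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    for dt in root.findall("DOWNTIME"):
        print(dt.findtext("SEVERITY"),
              dt.findtext("START_DATE"),
              dt.findtext("END_DATE"),
              dt.findtext("DESCRIPTION"))

if __name__ == "__main__":
    upcoming_downtimes("RAL-LCG2")
```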

Review of interventions foreseen during LHC stop

Site Name(s)
CERN  
ASGC  
BNL  
CNAF  
FNAL  
KIT  
IN2P3  
NDGF  
NL-T1  
PIC  
RAL  
TRIUMF  

Review of WLCG alarm chain

  • Jamie - need to be confident of the end-to-end alarm chain. Need to understand why things break and discuss with the relevant service providers why things have gone wrong.

  • MariaDZ - restart regular pre-GDB alarm chain tests?

  • Tim - a loop-back test so that it can be automated? (See the sketch after this list.)

  • Action on MariaDZ to organize a meeting with the relevant people to review recent problems
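
A minimal sketch of the kind of automated loop-back test mentioned above, assuming for illustration that the test alarm is raised by e-mail and that the acknowledgement arrives in an IMAP mailbox. All addresses, credentials and subject conventions below are placeholders; the production alarm chain goes through GGUS alarm tickets and site-specific procedures.

```python
# Minimal sketch of an automated loop-back test of an alarm chain, assuming
# the test alarm is raised by e-mail and acknowledged into an IMAP mailbox.
# All addresses, credentials and subject conventions are placeholders.
import imaplib
import smtplib
import time
import uuid
from email.message import EmailMessage

ALARM_ADDR = "alarm-test@example.org"               # placeholder alarm address
MAILBOX = ("imap.example.org", "tester", "secret")  # placeholder credentials

def send_test_alarm():
    """Send a uniquely tagged test alarm and return the tag."""
    tag = uuid.uuid4().hex[:8]
    msg = EmailMessage()
    msg["From"] = "wlcg-test@example.org"
    msg["To"] = ALARM_ADDR
    msg["Subject"] = f"TEST ALARM {tag} - please acknowledge"
    msg.set_content("Automated loop-back test of the alarm chain.")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
    return tag

def wait_for_ack(tag, timeout=3600, poll=60):
    """Poll the mailbox until an acknowledgement quoting the tag arrives."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        imap = imaplib.IMAP4_SSL(MAILBOX[0])
        imap.login(MAILBOX[1], MAILBOX[2])
        imap.select("INBOX")
        status, data = imap.search(None, f'(SUBJECT "{tag}")')
        imap.logout()
        if status == "OK" and data[0].split():
            return True
        time.sleep(poll)
    return False

if __name__ == "__main__":
    tag = send_test_alarm()
    start = time.time()
    if wait_for_ack(tag):
        print(f"Alarm {tag} acknowledged after {time.time() - start:.0f} s")
    else:
        print(f"Alarm {tag} NOT acknowledged within the timeout")
```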

glexec deployment status

glexec was tested successfully at CERN, NIKHEF, KIT and PIC. Other sites either did not have glexec on their WNs or, as in the case of RAL, it did not work during this test.

RAL - just in process of configuring now - will continue over next couple of weeks.

FNAL - CMS is already using it, so it is known to be working for them (Jon)

OSG sites - there was some trouble submitting to them - to be followed up.
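
A minimal sketch of the kind of identity-switch check the sites above performed. The glexec installation path, the payload proxy location and the environment variable values are assumptions; on a real WN they depend on the local installation and on how the pilot framework stores the payload proxy.

```python
# Minimal sketch of a glexec identity-switch check on a worker node.
# The glexec path and proxy locations below are assumptions.
import os
import subprocess

GLEXEC = "/usr/sbin/glexec"                 # assumed installation path
PAYLOAD_PROXY = "/tmp/payload_proxy.pem"    # assumed payload proxy location

def glexec_identity():
    """Run 'id' under glexec and return its output, i.e. the mapped identity."""
    env = dict(os.environ,
               GLEXEC_CLIENT_CERT=PAYLOAD_PROXY,   # credential to map
               GLEXEC_SOURCE_PROXY=PAYLOAD_PROXY)  # proxy handed to the payload
    result = subprocess.run([GLEXEC, "/usr/bin/id"],
                            env=env, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"glexec failed: {result.stderr.strip()}")
    return result.stdout.strip()

if __name__ == "__main__":
    pilot = subprocess.run(["/usr/bin/id"], capture_output=True, text=True)
    print("Pilot identity  :", pilot.stdout.strip())
    print("Payload identity:", glexec_identity())
```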

Data Management & Other Tier1 Service Issues

Storage systems: status, recent and planned changes

CERN
  Status: CASTOR 2.1.9-5 (all instances); SRM 2.9-3 (all instances)
  Recent changes: upgraded all instances to CASTOR 2.1.9-5 and SRM 2.9-3

ASGC
  Status: CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tapeserver); SRM 2.8-2
  Recent changes: none
  Planned changes: none

BNL
  Status: dCache 1.9.4-3
  Recent changes: none
  Planned changes: none

CNAF
  Status: CASTOR 2.1.7-27 (ALICE); SRM 2.8-5 (ALICE); StoRM 1.5.1-3 (ATLAS, CMS, LHCb, ALICE)
  Recent changes: 26/4: 1-day transparent intervention on the SAN; reboot of the tape library. 28/4: TSM server upgraded to 6.3 to solve a potential, non-blocking issue; GPFS upgrade of all StoRM back-ends.
  Planned changes: StoRM upgrade to the latest version (foreseen for 17/5), date to be agreed (next LHC stop?)

FNAL
  Status: dCache 1.9.5-10 (admin nodes); dCache 1.9.5-12 (pool nodes)
  Recent changes: none
  Planned changes: none

IN2P3
  Status: dCache 1.9.5-11 with Chimera
  Recent changes: ?
  Planned changes: ?

KIT
  Status: dCache 1.9.5-15 (admin nodes); dCache 1.9.5-5 to 1.9.5-15 (pool nodes)
  Recent changes: tape library re-alignment this afternoon (already reported)
  Planned changes: adding disk space to LHCb, gradually adding more space during the coming days. This may result in temporary bottlenecks because empty disks have higher preference and activity will skew towards these nodes.

NDGF
  Status: dCache 1.9.7 (head nodes); dCache 1.9.5, 1.9.6 (pool nodes)
  Recent changes: none
  Planned changes: none

NL-T1
  Status: dCache 1.9.5-16 with Chimera (SARA); DPM 1.7.3 (NIKHEF)
  Planned changes: on May 10th the dCache head node services will be moved to new hardware

PIC
  Status: dCache 1.9.5-17
  Recent changes: on 27/04/2010 upgraded from 1.9.5-15 to 1.9.5-17 and also temporarily disabled tape protection to allow CMS to access files on tape with the dcap protocol (see the dcap sketch after this table). Waiting both for the dCache patch that allows setting tape protection per VO and for the CMSSW debugging of gsidcap access.
  Planned changes: none

RAL
  Status: CASTOR 2.1.7-27 (stagers); CASTOR 2.1.8-3 (nameserver central node); CASTOR 2.1.8-17 (nameserver local node on SRM machines); CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers); SRM 2.8-2
  Recent changes: none
  Planned changes: none

TRIUMF
  Status: dCache 1.9.5-17 with Chimera namespace
  Recent changes: dCache upgrade
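
Below is a minimal sketch of the dcap access that the temporary change at PIC enables for CMS: copying one file out of dCache with the dccp client. The door host, port and PNFS path are placeholders, not real PIC endpoints.

```python
# Minimal sketch of reading a file from dCache via the dcap protocol with
# the dccp client. Door host, port and PNFS path are placeholders.
import subprocess

DCAP_URL = "dcap://dcap-door.example.org:22125/pnfs/example.org/data/cms/somefile.root"
LOCAL_COPY = "/tmp/somefile.root"

def dcap_copy(src, dst):
    """Copy one file out of dCache using dccp."""
    result = subprocess.run(["dccp", src, dst],
                            capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"dccp failed: {result.stderr.strip()}")

if __name__ == "__main__":
    dcap_copy(DCAP_URL, LOCAL_COPY)
    print("copied", DCAP_URL, "->", LOCAL_COPY)
```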

Other Tier-0/1 issues

  • CNAF: 28/4: moved the ATLAS LFC database to new hardware.
  • CNAF: ATLAS conditions database to be moved to new hardware (date to be agreed with ATLAS).
  • CNAF: FTS database to be moved to new hardware (probably next LHC stop).

CASTOR news

CASTOR SRM 2.9-3 was released (and deployed) and is available in the Savannah release area. Release notes and upgrade instructions are available.

dCache news

Two major issues have been fixed in dCache 1.9.5-18, to be released this week:
  • GridFTP(2) and small files (e.g. at BNL): a fix has been tested at BNL and seems to have solved the issue (see the sketch after this list).
  • For quite some time sites have reported that in a running system, all of a sudden, members of a particular CA were no longer accepted by dCache and only a restart could reactivate the CA.
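
As referenced in the first item, here is a minimal sketch of a reproduction test for the small-file issue: write a tiny file and push it to a GridFTP door with globus-url-copy. The destination URL is a placeholder, a valid grid proxy is assumed, and whether this actually reproduces the BNL problem depends on the door configuration.

```python
# Minimal sketch of a small-file GridFTP transfer test using globus-url-copy.
# The destination URL is a placeholder; a valid grid proxy is assumed.
import os
import subprocess
import tempfile

DEST = "gsiftp://gridftp-door.example.org/pnfs/example.org/data/test/smallfile"

def transfer_small_file(size_bytes=10):
    """Create a tiny local file and push it to the GridFTP door."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * size_bytes)          # deliberately tiny payload
        src = f.name
    try:
        result = subprocess.run(
            ["globus-url-copy", f"file://{src}", DEST],
            capture_output=True, text=True)
        return result.returncode == 0, result.stderr.strip()
    finally:
        os.unlink(src)

if __name__ == "__main__":
    ok, err = transfer_small_file()
    print("transfer OK" if ok else f"transfer FAILED: {err}")
```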

StoRM news

LFC news

LFC 1.7.4 is in certification; it fixes a bug seen by BNL and implements some bulk lookup methods.
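
For illustration, a minimal sketch of the kind of one-LFN-at-a-time replica lookup that bulk methods are meant to reduce, using the lcg-lr client. The LFC host, VO and LFNs below are placeholders.

```python
# Minimal sketch: per-LFN replica lookup against an LFC using lcg-lr.
# The LFC host, VO and LFNs are placeholders for illustration.
import os
import subprocess

LFC_HOST = "lfc.example.org"
LFNS = ["lfn:/grid/atlas/user/example/file1",
        "lfn:/grid/atlas/user/example/file2"]

def replicas(lfn):
    """Return the replica SURLs registered in the LFC for one LFN."""
    env = dict(os.environ, LFC_HOST=LFC_HOST)
    result = subprocess.run(["lcg-lr", "--vo", "atlas", lfn],
                            env=env, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"lcg-lr failed for {lfn}: {result.stderr.strip()}")
    return result.stdout.split()

if __name__ == "__main__":
    for lfn in LFNS:
        for surl in replicas(lfn):
            print(lfn, "->", surl)
```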

FTS

The problem seen from time to time at TRIUMF, where source files at TRIUMF are deleted as a result of an attempted transfer to a remote site, is being investigated. It is not clear whether the problem is at the FTS or at the SRM level.
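
For debugging a case like this, here is a minimal sketch of submitting and following a single transfer with the FTS 2.x command-line clients, e.g. to capture the state in which the source file disappears. The FTS endpoint and the SURLs are placeholders, and the terminal state strings checked for are assumptions about the status output.

```python
# Minimal sketch: submit one transfer to FTS 2.x and poll its status with the
# glite-transfer command-line clients. Endpoint and SURLs are placeholders;
# the terminal state strings are assumptions about the status output.
import subprocess
import time

FTS = "https://fts.example.org:8443/glite-data-transfer-fts/services/FileTransfer"
SRC = "srm://srm-source.example.org/atlas/datafile"   # placeholder source SURL
DST = "srm://srm-dest.example.org/atlas/datafile"     # placeholder destination SURL

def submit():
    """Submit one transfer job; the job ID is printed on stdout."""
    out = subprocess.run(["glite-transfer-submit", "-s", FTS, SRC, DST],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def status(job_id):
    """Return the current overall state of the job."""
    out = subprocess.run(["glite-transfer-status", "-s", FTS, job_id],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

if __name__ == "__main__":
    job = submit()
    while True:
        state = status(job)
        print(job, state)
        if any(s in state for s in ("Finished", "Done", "Failed", "Canceled")):
            break
        time.sleep(30)
```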

CERN would like to decommission myproxy-fts as soon as possible and is waiting for the green light from the experiments. ATLAS and CMS have already agreed.

Experiment issues

  • ATLAS: understanding the problem mentioned above at TRIUMF is the biggest concern.
  • LHCb: a problem is seen at some dCache sites (PIC and IN2P3), where the ID of an SRM BringOnline request is lost. To be investigated.

WLCG Baseline Versions

Conditions data access and related services

Database services

AOB

-- JamieShiers - 27-Apr-2010
