---+ WLCG Tier1 Service Coordination Minutes - 20 May 2010

%TOC%

---+++ Attendance

| *Site* | *Name(s)* |
| CERN | Flavia, Roberto, Andrea V, Eva, Patricia, Harry, Maarten, Jamie, Alessandro, Julia, Simone, Maria, MariaDZ, Jean-Philippe, Maria Alandes, Maite, Tim, Nicolo, Manuel, Luca, Jacek |
| ASGC | Felix Lee |
| BNL | Carlos Fernando Gamboa |
| CNAF | Alessandro Cavalli |
| FNAL | Jon |
| KIT | Angela Poschlad |
| IN2P3 | |
| NDGF | |
| NL-T1 | Ron |
| PIC | Gonzalo |
| RAL | |
| TRIUMF | Andrew |
| GridPP | Jeremy |
| GGUS | Guenter |

| *Experiment* | *Name(s)* |
| ALICE | |
| ATLAS | Elisabeth Gallas, John DeStefano, Rod Walker, Kors |
| CMS | Pepe Flix (CMS/PIC), Peter Kreuzer, Rapolas Kaselis, Dave Dykstra |
| LHCb | |

---++ Summary of / Actions from Meeting on Alarm Chain

   * Downtimes and experiment calendars.
   * Review of the e-mail addresses used for notification when ALARM tickets are opened to the Tier0 [[https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#ALARM_tickets_quality][(see ACTION_4 in dedicated meeting)]].
   * Decision: update the e-groups but make no further change.
   * Details in: https://savannah.cern.ch/support/?114483#comment2

---++ Downtime calendar

   * Presentation from Peter Kreuzer on the CMS solution (slides on agenda). Is there a need for a common solution?
   * In the discussion it emerged that ATLAS already has a similar solution.
   * Julia - will try to gather common issues / requirements: most of the problems are with the information sources.

---++ Deployment / Rollout Issues

---+++ glexec

| *Site* | *"/ops/Role=pilot" job + glexec test* | *glexec capability in BDII* |
| ASGC | | |
| BNL | | |
| CERN | lcg-CE OK, CREAM fails | only CREAM |
| CNAF | | |
| FNAL | OK for CMS | OK |
| IN2P3CC | | |
| KIT | lcg-CE OK, CREAM fails | 1 lcg-CE |
| NDGF | n/a | n/a |
| NIKHEF | lcg-CE OK, CREAM fails | only lcg-CE |
| PIC | 11 tests OK, 1 problem | OK |
| RAL | OK (configuration fixed) | OK |
| SARA | | |
| TRIUMF | | |
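As a cross-check of the table above, the basic gLExec identity switch can be exercised directly on a worker node. The sketch below is illustrative only: the glexec path and the proxy file locations are assumptions, and the exact environment-variable contract should be checked against the gLExec documentation used at each site.

<verbatim>
import os
import subprocess

# Hedged sketch: paths below (glexec location, proxy files) are assumptions
# and vary per site; consult the gLExec documentation for the exact contract.
GLEXEC = "/opt/glite/sbin/glexec"          # typical gLite location (assumption)
PILOT_PROXY = "/tmp/x509up_pilot"          # proxy of the pilot job (assumption)
PAYLOAD_PROXY = "/tmp/x509up_payload"      # proxy of the payload user (assumption)

env = dict(os.environ)
env["GLEXEC_CLIENT_CERT"] = PAYLOAD_PROXY   # credential used to map the payload user
env["GLEXEC_SOURCE_PROXY"] = PAYLOAD_PROXY  # proxy handed over to the switched identity
env["X509_USER_PROXY"] = PILOT_PROXY        # the pilot's own credential

# Run "id" under the mapped identity; a changed uid/gid means the switch worked.
result = subprocess.run([GLEXEC, "/usr/bin/id"], env=env,
                        capture_output=True, text=True)
print(result.returncode, result.stdout.strip())
</verbatim>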
---++ Data Management & Other Tier1 Service Issues

---++++ Storage systems: status, recent and planned changes (please update)

| *Site* | *Status* | *Recent changes* | *Planned changes* |
| CERN | CASTOR 2.1.9-5 (all)<br>SRM 2.9-3 (all) | None | None planned |
| ASGC | CASTOR 2.1.7-19 (stager, nameserver)<br>CASTOR 2.1.8-14 (tapeserver)<br>SRM 2.8-2 | none | none |
| BNL | dCache 1.9.4-3 | none | none |
| CNAF | CASTOR 2.1.7-27 (ALICE)<br>SRM 2.8-5 (ALICE)<br>StoRM 1.5.1-3 (ATLAS, CMS, LHCb, ALICE) | ? | !StoRM upgrade to latest version (foreseen for 17/5), date to be agreed (done?) |
| FNAL | dCache 1.9.5-10 (admin nodes)<br>dCache 1.9.5-12 (pool nodes) | none | none |
| !IN2P3 | dCache 1.9.5-11 with Chimera | none | none |
| KIT | dCache 1.9.5-15 (admin nodes)<br>dCache 1.9.5-5 - 1.9.5-15 (pool nodes) | none | Change of authentication method on the ATLAS dCache instance planned; preparation ongoing, date to be discussed with ATLAS.<br>Migration of the ALICE SRM service to a new machine; no date available yet. |
| NDGF | dCache 1.9.7 (head nodes)<br>dCache 1.9.5, 1.9.6 (pool nodes) | none | Upgrade to 1.9.8 on head nodes and some pool nodes on Tuesday (2010-05-25) |
| NL-T1 | dCache 1.9.5-16 with Chimera (SARA), DPM 1.7.3 (NIKHEF) | Migrated dCache head node services to new hardware | Due to the replacement of hard disks, data will be migrated to other disk pools starting from May 25th and throughput will be reduced; this is expected to take about a week |
| PIC | dCache 1.9.5-17 | 19/05/2010: deployment of a secondary SRM in "standby mode" that should improve SRM service resilience. Tape protection is still disabled to allow CMS to access files on tape with the dcap protocol. Still waiting both for the dCache patch that allows tape protection to be set per VO and for the CMSSW debugging of gsidcap access; in contact with dcache.org for testing the former, no news on the latter. | none |
| !RAL | CASTOR 2.1.7-27 (stagers)<br>CASTOR 2.1.8-3 (nameserver central node)<br>CASTOR 2.1.8-17 (nameserver local node on SRM machines)<br>CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers)<br>SRM 2.8-2 | none | None, but an outage is planned during the morning of Tuesday 1st June (during the LHC technical stop) for a network change. |
| TRIUMF | dCache 1.9.5-17 with Chimera namespace | ? | ? |

---++++ Other Tier-0/1 issues

---++++ CASTOR news

   * CASTOR 2.1.9-6 was released on May 17: [[http://castor.web.cern.ch/castor/DIST/CERN/savannah/CASTOR.pkg/2.1.9-6/ReleaseNotes][Release notes]].

---++++ dCache news

   * dCache 1.9.5-19 was released on May 14, fixing the problem of the unresponsive SRM observed in 1.9.5-18.
   * dCache 1.9.5-20rc1 was released on May 18, supporting the experimental feature of [[http://trac.dcache.org/projects/dcache/wiki/TapeProtectionExtended][protecting staging based on Storage Classes]].

---++++ !StoRM news

---++++ DPM news

   * DPM 1.7.4 has been certified and is awaiting staged roll-out.

---++++ LFC news

   * LFC 1.7.4 has been certified and is awaiting staged roll-out.

---++++ FTS

   * The FTA 2.2.4 patch is in certification, plus a pilot test at CERN.
   * Gonzalo - APEL: there are new attributes in the GLUE schema in which the scaling reference for WNs is published (Savannah 51176). APEL did not implement this as written in the document, and the bug was closed (expired?) - is this in production?
   * MariaA - the bug was closed automatically; it had been "ready for review" for a long time. The patch is in production, which should be visible at the end of the bug: it went out in gLite 3.1 update 62 on March 11, but it is not in gLite 3.2 nor publicised to sites.

---+++ WLCG Baseline Versions & gLite Releases

   * Release report: [[https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Deployment_Status][deployment status wiki page]]
   * WLCG Baseline versions: [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions][table]]
   * MariaA - the first EMI release is not expected for some months; gLite releases will continue for some time. WLCG will set and agree the priority for what goes into these releases. A list will be presented at the next meeting for the next gLite release; the status of components in rollout can also be presented.
   * Flavia - the baseline services have been updated with respect to FroNTier and Squid.
   * John - one of the concerns has been the ATLAS method of assigning failover sites for T2 Squids; CMS does not do this.
   * Dave - ATLAS recommends that T2/T3 sites have another site as backup for their Squid proxies. If there is a failure it will not be noticed, so it is not clear this is a good H/A strategy: if the main site goes down, more resources are used at the backup site without this being noticed (see the sketch after this list).
   * Rod - not noticing when things go wrong is part of redundancy! SAM tests would pick this up.
   * Rod - Squids are required at T2s for both CMS and ATLAS; can WLCG take this on? Ale - trying to follow this from a technical point of view.
   * This should preferably be a joint ATLAS + CMS request to the MB.
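To illustrate the failover behaviour discussed in the list above: the frontier_client configuration, typically passed via the FRONTIER_SERVER environment variable, can list several Squid proxies, and the client falls back to the later entries when the first is unreachable. This is a minimal sketch only; the host names below are hypothetical and the real server/proxy URLs come from the experiment configuration.

<verbatim>
import os

# Hedged sketch of a frontier_client configuration with a backup Squid.
# Host names are hypothetical. frontier_client tries the proxyurl entries in
# order, so a failure of the first Squid silently shifts load to the backup,
# which is exactly the "unnoticed failover" concern raised in the discussion.
os.environ["FRONTIER_SERVER"] = (
    "(serverurl=http://frontier.example.org:8000/atlr)"
    "(proxyurl=http://squid.site-a.example.org:3128)"   # local T2 Squid
    "(proxyurl=http://squid.site-b.example.org:3128)"   # backup Squid at another T2
)
</verbatim>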
---++ Conditions data access and related services

---+++ COOL, CORAL and POOL

   * A new release of COOL, CORAL and POOL (LCGCMT_56g) was prepared for ATLAS last week. The main motivation for this new release was the upgrade to version 2.7.14 of the frontier_client library. The new library is linked to libexpat.so.0 instead of libexpat.so.1, fixing an inconsistency between the libexpat.so versions used by different libraries needed by ATLAS, which was the likely cause of some failures recently observed in ATLAS jobs (for instance in conditions POOL file access via gfal at SARA). The release notes are available on https://sftweb.cern.ch/persistency/releases
   * A patched version of the OCCI library version 11.2 for 64-bit Linux has been received from Oracle Support. The patch fixes the SLC5/SELinux related bug (applications fail with "cannot restore segment prot after reloc" if SELinux is enabled). The problem is now completely fixed in the Oracle 11g client, as three fixes had already been received in previous months for the same bug affecting the OCI 32/64-bit and OCCI 32-bit libraries. As a reminder, the OCCI library is used by ROOT and some CMS applications, but is not needed by CORAL (which uses OCI instead). The new client libraries have been installed on AFS as /afs/cern.ch/sw/lcg/external/oracle/11.2.0.1.0p2 and are ready to be included in one of the next LCG AA releases.
   * Possible improvements to the ATLAS data management infrastructure from the use of a different GUID format for POOL files are being investigated. Switching to time-ordered GUIDs could be useful to simplify the partitioning of the file catalogs and the handling of old files, but the implications and side effects of these changes still need to be more carefully evaluated (see the sketch below).
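As an illustration of the time-ordered GUID idea mentioned in the last bullet (not the actual POOL implementation): a time-based GUID carries its creation timestamp, so catalog entries could be partitioned or aged out using the GUID alone, without an extra lookup. A minimal Python sketch, assuming standard RFC 4122 UUIDs rather than whatever format POOL would finally adopt:

<verbatim>
import uuid
import datetime

# Random (version 4) GUIDs carry no ordering information.
random_guid = uuid.uuid4()

# Time-based (version 1) GUIDs embed a 60-bit timestamp (100 ns ticks since
# 1582-10-15), so the creation time can be recovered from the GUID itself.
timed_guid = uuid.uuid1()

GREGORIAN_EPOCH = datetime.datetime(1582, 10, 15)
created = GREGORIAN_EPOCH + datetime.timedelta(microseconds=timed_guid.time // 10)

# Partitioning by month then needs only the GUID, not a catalog lookup.
partition_key = created.strftime("%Y-%m")
print(timed_guid, "->", partition_key)
</verbatim>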
---+++ Frontier

   * The deployment status of Frontier servers and Squid caches for ATLAS can be found at https://twiki.cern.ch/twiki/bin/view/Atlas/T2SquidDeployment. We encourage sites to update this page with the correct information.
   * [[https://www.racf.bnl.gov/docs/services/frontier/meetings/minutes/][ATLAS weekly FroNTier meetings]]

---++ Database services

   * Experiment reports:
      * ALICE:
         * ALICE production databases are planned to be patched during the LHC maintenance period starting on 31st of May.
         * 2 new schemas to be added to the replication setup.
      * ATLAS:
         * The ATLAS integration database INTR and the production archive database (ATLARC) have been patched with the latest security and other recommended patches from Oracle (PSU 10.2.0.4.4). The ATLARC database has been successfully migrated to new RAC9 hardware.
         * The ATLAS integration database INT8R will be migrated to new hardware and patched with the latest Oracle security patch and recommended updates on 27th May.
         * ATLAS production databases (ATONR, ATLR) are planned to be patched during the LHC maintenance period starting on 31st of May.
      * CMS:
         * All 4 CMS test, development and integration databases were patched on 11th May with the latest security and other recommended patches from Oracle (PSU 10.2.0.4.4). At the same time the INT9R database was successfully migrated to new RAC9 hardware.
         * CMS production databases (CMSONR, CMSR, CMSARC) are planned to be patched during the LHC maintenance period starting on 31st of May.
      * LHCb:
         * LHCb production databases (LHCBONR, LHCBR) are planned to be patched during the LHC maintenance period starting on 31st of May.
   * Site reports:

| *Site* | *Status, recent changes, incidents, ...* | *Planned interventions* |
| CERN | Patch to fix the high memory consumption by the queue monitor processes applied on the test environment. April security patch and recommended updates being applied on test, development and integration databases. First migrations to new hardware successfully completed (d3r, int9r and atlarc). | April security patch and recommended updates to be applied in production during the next LHC technical stop (end of May). |
| ASGC | Problems found with incremental backups; the situation is now back to normal, reason unknown. | April security patch to be applied mid June (after testing it in the testbed). |
| BNL | | Planning to apply the PSU patches to the Conditions DB and LFC_FTS the week of 24-28. |
| CNAF | Migration of the ATLAS database completed on the 19th of May. April PSU applied on the ATLAS database. | Recommended patches to be applied on the ATLAS database and the April PSU patch to be applied on the LHCb database (by end of May / early June). |
| KIT | | April security patches scheduled for the last week of June. |
| !IN2P3 | On 11th May a new AMI schema was added to the ATLAS AMI Streams setup. | April PSU foreseen for the beginning of June (waiting for merge patches). |
| NDGF | | 20th May 09:00-11:00 CET: firmware upgrade on ATLAS DB storage controllers (transparent). 27th May (tentative, will confirm ASAP): April PSU on ATLAS conditions DB (transparent). |
| PIC | Contention observed on the TAGS database under a high load of updates; parameters are being adjusted to solve this. | April 2010 PSU apply planned for 25/5 (FTS & LFC) and for 27/5 (ATLAS, TAGS and LHC). Change of MTU for the interconnect cards planned for the same days as the patch apply. |
| !RAL | | April PSU scheduled for the next few weeks: OGMA (ATLAS) Tuesday 25th May from 10:00 till 12:00; LUGH (LHCb) Thursday 27th May from 10:00 till 12:00; SOMNUS (LFC, FTS) Wednesday 2nd June from 10:00 till 12:00. |
| SARA | | Plan to apply April security patches; no date yet. |
| TRIUMF | | Plan to apply the April PSU within the next 2 weeks. |

   * Note that the bug affecting the apply process failover during rolling interventions has not been fixed yet. For this reason, apply processes must be stopped before the interventions.
   * Database weekly reports:
      * Zbigniew has sent an email with the instructions.
      * Which sites have already deployed them? RAL, SARA, KIT and PIC.
      * Partitioning is used. NDGF does not have a license to use partitioning; any other site? No.
   * Licenses:
      * The request from RAL is progressing.
      * Support for the 2006 licenses will be covered by CERN; details to be confirmed.
      * BNL and KIT are also interested in new licenses. Eva will send the information around.

---++ AOB

-- Main.JamieShiers - 13-May-2010