Diana, CERN ROC: comment related to last week: some tickets were not associated with any alarm, which made it hard to check what was going on; a site had disappeared from SAM. Main point: if you open a ticket, please associate an alarm with it.
PPS Reports
Pilot of WMS at CNAF and CERN in progress. No major issues reported by CMS and ATLAS. The testing period agreed with the VOs expires next Wednesday. If no problems are reported, the SL4 WMS could be released on 29 May.
gLite Release news
See the agenda for details. Summary:
A rather important security fix was released last week, addressing wrong group assignment of pool accounts; sites that have upgraded to gLite 3.1 update 20 should apply this patch.
Lots of new stuff released last week to PPS (read agenda).
This Thursday to production: VOMS certificate, job priority implementation for LCG CE
EGEE Items From ROC Reports
None this week!
URGENT upgrade of CA RPMs
The EUGridPMA have announced a new set of CA rpms.
Upgrade for this release is considered to be urgent by the EGEE project.
Based on this IGTF release, new CA RPMs have been packaged for EGEE.
Please upgrade within 1 day.
SAM started a 1 day timeout (including time needed to complete this CA release procedure).
When timeout is over, SAM will throw critical errors on CA tests if old CAs are still detected.
See the following page for more details about this new EGEE CA release :
http://grid-deployment.web.cern.ch/grid-deployment/lcg2CAlist.html
The UK CA has to be updated; this is critical. This information is distributed and followed up over the OSCT mailing list.
If anyone is interested in further information being distributed in the UK regarding the Debian vulnerability and CA update please visit: http://tinyurl.com/5howus
WLCG Items
WLCG issues coming from ROC reports
None
Upcoming WLCG Service Interventions
GOG-Singapore would like to decommission their site by June 2, 2008. The hardware and services at the site will be shut down permanently. Please migrate data that is still needed by your VO before the site is disabled.
SARA: On May 21st from 9:00-14:00 CET there will be an outage of srm.grid.sara.nl due to network maintenance. This measure is necessary due to the installation of new storage hardware.
CYFRONET-IA64: We are going to shut down CYFRONET-IA64 completely at the end of May 2008. Please take care of any data you may have on our classic SE: ares03.cyf-kr.edu.pl.
BEIJING-LCG2: Our dCache SE atlasse01.ihep.ac.cn is planned to be removed from production after 20th May. Please back up your data before that date.
ATLAS Service
This is a short status of ATLAS M7 and CCRC08:
- M7 data distribution started yesterday (Sunday May 18th). Some site related issues have been reported and solved:
RAL: Cannot write (https://gus.fzk.de/ws/ticket_info.php?ticket=36526) and an internal ticket.
SARA: Cannot access space token (https://gus.fzk.de/pages/ticket_details.php?ticket=36481, submitted last Friday).
CNAF: srm down? (https://gus.fzk.de/pages/ticket_details.php?ticket=36528).
Particular attention to the SARA problem (started Friday, back online today at 12:00), which was due to problems in the dCache upgrade. From H.C. Lee:
"For SARA, the problem comes from the upgrade of dCache to 1.8.0-15p3.
It's a backward compatibility issue that has been reported to dCache team."
Overall, no current problems. Data registration happens in bursts. The system is capable of delivering very high throughput in short time periods. Data taking will continue until tomorrow morning at the latest. Some time is needed to drain the system (reprocessing jobs in queues etc.).
- Throughput tests: will start tomorrow late afternoon (at the latest Wednesday morning). More details tomorrow.
Basically all clouds will also export to T2s (confirmation is needed from the CA cloud, while ES would like to run some internal tests first, since the deployment of SRM2 at T2s is not completed yet).
ALICE Service
The most important point in ALICE operations is to bring RAL into the picture and start transfers, as RAL was not involved in the previous CCRC. This is being followed up.
CMS Service
Progress and issues in iCSA/CCRC activities are reported by mail and HNs; it is still hard to also keep them up to date in the ELOGs on a daily basis. However, the iCSA/CCRC tests are now running more stably overall, so we are catching up by fishing items out of HNs/mails and filling in the ELOGs (setting the original dates), the needed tickets, and https://twiki.cern.ch/twiki/bin/view/CMS/CCRC08-Phase2-OpsElog (bookmark and check back). Highlighted activities at the moment: analysis of the T1-T1 tests from last week; extension to non-regional T1-T2 transfer tests; production transfers with latency measurements to prepare for T1 workflows; T1 workflows consisting of (iCSA) re-processing and (CCRC) skimming at T1 sites, also exploiting non-custodial areas; final development on the monitoring side to accommodate feedback from CCRC running.
LHCb Service
Entering week 3 of CCRC. Problems with the online service after the intervention at PIC last Friday. Several DIRAC site problems. Before the weekend NIKHEF was banned due to the advertised problem with the cooling system; no EGEE broadcast was seen, due to a problem interfacing GOCDB and the CIC portal. This is being investigated. SARA was banned due to a bug in the latest version of dCache (reported by Simone); it will be fixed, and in the meantime it was manually fixed by the site. IN2P3 was also banned due to high DB load. Plans to restart at nominal rate the PIC transfers and reconstruction for the rest of the T1s.
Sophie: AFS UI at CERN will be updated this week. Please contact us through helpdesk or GGUS.
https://gus.fzk.de/ws/ticket_info.php?ticket=33220
It looks as if 2-3 problem sites have been fixed (UFlorida-IHEPA and Nebraska). We are still trying to get response from the UERJ resource, but have no further information about this resource.
The T0 FTS server is configured with 0 retries by default, while T1s have 3 retries by default. This complicates the ATLAS workflow: if a transfer fails, we try to find another source with the same file. Could we have 0 retries in all FTS servers at T1s (this affects all channels, all VOs)? What is the position of the other LHC VOs?
- Not a problem for LHCb
- Ron (SARA): I thought this could be set up per channel, per VO agent. To be checked with Gavin & co
* Answer from Gavin: The ‘retry’ count is a VO policy, so it needs to be set in the relevant VO agent config for the FTS server (the default is 3 retries separated by a minimum of 10 minutes). I know CMS’ Phedex prefers to fail fast (and see the error as early as possible), so CMS have asked T1 sites to set the retry to 0. Phedex then retries externally (i.e. with other FTS jobs for the failed files). LHCb and ALICE I think are still set to the default. See: https://twiki.cern.ch/twiki/bin/view/LCG/FtsYaimValues20 Contact fts-support@cern.ch in case of problems.
*Update June 11th* Steve should submit tickets to all FTS sites.
*Update June 13th* GGUS:37415 submitted and child tickets sent to the ROCs of each Tier1. Review in two weeks' time.
*Update June 20th* All FTS instances have responded in GGUS:37415 that the changes have been made, except for USCMS-FNAL-WC1 (GGUS:37428) and BNL-LCG2 (GGUS:37427). Both will be contacted again this week.
*Update June 30th* Steve will escalate; two U.S. sites are problematic.
*Update July 7th* BNL and Fermi have now responded that they made the configuration change. Action item to be closed after next meeting.
Steve
The T0 FTS server is configured with 0 retries by default, while T1s have 3 retries by default. This complicates the ATLAS workflow: if a transfer fails, we try to find another source with the same file. Could we have 0 retries in all FTS servers at T1s (this affects all channels, all VOs)? What is the position of the other LHC VOs? ACTION!
- Not a problem for LHCb
- Ron (SARA): I thought one could set it up per channel, per VO agent. To be checked with Gavin & co
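To illustrate why the retry count matters for a workflow that falls back to another source, here is a minimal Python sketch. It is not FTS code and does not use any FTS API: the `RetryPolicy` class, the function names and the site names are hypothetical. The only figures taken from the discussion are the two retry counts (0 at the T0, 3 at T1s) and the minimum 10-minute spacing between retries quoted for the default; the sketch ignores the duration of the transfers themselves.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class RetryPolicy:
    """Hypothetical model of a per-VO retry setting (not an FTS object)."""
    retries: int          # extra attempts after the first failure
    min_gap_minutes: int  # minimum wait between two attempts


# Values quoted in the discussion: 3 retries, at least 10 minutes apart.
DEFAULT_POLICY = RetryPolicy(retries=3, min_gap_minutes=10)
# Fail-fast setting requested for the T1 servers.
FAIL_FAST_POLICY = RetryPolicy(retries=0, min_gap_minutes=0)


def min_wait_before_failure(policy: RetryPolicy) -> int:
    """Lower bound (minutes) spent waiting between attempts before the
    final failure becomes visible to the VO, ignoring transfer time."""
    return policy.retries * policy.min_gap_minutes


def transfer_with_fallback(
    sources: List[str],
    attempt: Callable[[str], bool],
    policy: RetryPolicy,
) -> Tuple[Optional[str], int]:
    """Try each source in order, applying the retry policy per source.

    Returns the source that succeeded (or None) and the number of
    attempts made in total.
    """
    attempts = 0
    for source in sources:
        for _ in range(1 + policy.retries):
            attempts += 1
            if attempt(source):
                return source, attempts
    return None, attempts


if __name__ == "__main__":
    # Hypothetical scenario: the first replica is unavailable, the second works.
    def works_only_at_b(source: str) -> bool:
        return source == "siteB"

    for name, policy in [("default", DEFAULT_POLICY), ("fail-fast", FAIL_FAST_POLICY)]:
        result = transfer_with_fallback(["siteA", "siteB"], works_only_at_b, policy)
        print(name, result, "min wait per failed source:",
              min_wait_before_failure(policy), "minutes")
```

Under these assumptions the default policy spends four attempts and at least 30 minutes on the broken source before the fallback is tried, whereas the fail-fast policy moves to the second source after a single failed attempt.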
Summary
Next Meeting
The next meeting will be Monday, 26 May 2008 15:00 UTC (16:00 Swiss local time).
Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).