* Site Name: SN-UCAD (ROC France); GGUS Ticket number(s): 44443, 44987, 42668 https://gus.fzk.de/ws/ticket_search.php?ticket=44443 https://gus.fzk.de/ws/ticket_search.php?ticket=44987 https://gus.fzk.de/ws/ticket_search.php?ticket=42668
Reason for escalation: no answer from the site for one month, and the site is still failing SRMv2-get-SURLs (GGUS #44443), sBDII-performance (GGUS #44987) and APEL-pub (GGUS #42668)
Answer from the French ROC: as this site has never reached a sustainable production level since certification, the French ROC has decided, with the agreement of the site, to restart the whole certification process from the beginning. Consequently, the site has been put into "uncertified" status and is now out of production.
PPS Reports and Issues
Following the post-mortem of recent release issues, we would like to improve the roll-out of the BDII by adding a production site that deploys BDII updates early and reports on the results BEFORE the release officially goes out.
Pilot service of the SLC5 WN: in progress
LHCb tests on the pilot pointed out some issues with the gssklog mechanism when submitting from DIRAC3. The issue apparently arises with the newer version of VDT distributed with the WN. Under investigation.
In accordance with the plans, two production CEs are being converted to SLC5. They will be made available for production next week (19th of January).
Tier-1 sites in particular are encouraged to install one or more CREAM CEs.
CREAM CEs are only available at three sites, one of them a T1 (FZK). We would like to encourage other sites/T1s to follow; we are providing support for this. What is the status at CERN? CERN is focusing on SLC5 first; once that is ready, and depending on manpower, work on CREAM will start, with no clear estimate of when. What about the other T1s? No answer. Call to other sites to follow and start deploying CREAM CEs. Antonio: there is also the option of deploying the PPS version of the CREAM CE; what is the difference? Functionally the same, but the version in PPS works well with ICE, while the version in production does not. More discussion on Wednesday at the GDB.
Could all ROCs please contact their sites that support the Biomed VO and ask them to update their WNs with the latest version of GFAL.
Discussion: the problem is that old versions of the GFAL API (older than 1.10.6) make the LFC unresponsive; we want sites to upgrade to newer versions, and it was suggested that a SAM test be developed to detect this. Daniel Jouvenot (name supplied by John after the meeting) from the Biomed VO has already worked with SAM and should have experience with it. This is the long-term solution; in the short term the problem could be addressed on the LFC side by removing the LFC list-replica call from all LFCs. The DM team will check this option and consider producing a version of the LFC without it for Biomed only, with no official release; to be discussed with Akos.
For the record, from the SAM team leader: detailed version testing should not be SAM's responsibility; it is more a configuration-management task (being done by the new job wrappers, ongoing work). This is expected to be ready in a few months.
Long-term solution: a VO-specific SAM test (a minimal sketch of such a version check is given below).
This GFAL version dates from the end of 2007! Operations will send a broadcast, on behalf of Biomed, to all sites (or to the sites that support the Biomed VO) requesting the upgrade.
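A minimal sketch (in Python) of the kind of version check such a SAM test or broadcast follow-up could perform: it flags worker nodes whose GFAL client is older than 1.10.6. The RPM package name "GFAL-client", the OK/WARNING/ERROR strings and the exit codes are assumptions made for illustration, not the actual SAM framework or gLite packaging conventions.

```python
#!/usr/bin/env python3
# Illustrative sketch only, NOT the real SAM test.
# Assumptions not taken from the minutes: the GFAL client is installed as the
# RPM package "GFAL-client", and the simple OK/ERROR reporting shown here.
import subprocess
import sys

MINIMUM = (1, 10, 6)  # versions older than this make the LFC unresponsive

def installed_gfal_version():
    """Return the GFAL client version string from the RPM database, or ''."""
    result = subprocess.run(
        ["rpm", "-q", "--qf", "%{VERSION}", "GFAL-client"],
        stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True)
    return result.stdout.strip() if result.returncode == 0 else ""

def version_tuple(version):
    """Turn a string such as '1.10.6' into (1, 10, 6) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))

def main():
    version = installed_gfal_version()
    if not version:
        print("WARNING: GFAL client not found on this worker node")
        return 2
    if version_tuple(version) < MINIMUM:
        print("ERROR: GFAL %s is older than 1.10.6, upgrade needed" % version)
        return 1
    print("OK: GFAL %s is recent enough" % version)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

A real SAM test would publish its result through the SAM framework rather than via plain exit codes; the sketch only illustrates the version comparison itself.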
EGEE Items From ROC Reports
Central Europe: two cases concerning the lack of a procedure for how a site should set its default SE:
The EGEE SLA allows a site to be CE-only (Section 8: a site must provide at least one CE OR SE). Not having an SE prevents the site from passing the RM SAM tests, since those tests use the closest (default) SE. Also, setting up a site in this situation is not possible because YAIM requires an SE. Comment: maybe this is a problem with our interpretation of Section 8 of the SLA. Does this section say that a site can have "CE OR SE", or "CE with >=8 CPUs OR SE with >=1 TB"? If the second option is intended, then Section 8 of the SLA can be misleading.
When putting the SE into scheduled downtime, the site also has to put the CE into downtime (otherwise it will not pass the RM tests) or choose another SE from another site, for which there is no procedure.
ANSWER: Originally, for the availability reports, a site needed to have all site services (CE, SE, SRM and sBDII) - but this requirement was relaxed depending on the resources being provided. However, a “close SE” still needs to be defined for the CE tests to pass. This “close SE” does not have to be at the site. The ROC can be involved to help a site define a suitable SE.
DECH: GSI-LCG2 is down because of bugs in the 64-bit WN package, see GGUS Ticket-ID 48013. How can this situation be de-escalated?
COD: it was escalated two weeks ago and has now been de-escalated; it is not clear whether the problem is related to the 64-bit WN, but the site seems to be working on it.
SouthWest Europe: SWE will have a new site, RedIRIS, which will only host central services (top-BDII, WMS, LFC, MyProxy, etc.). This configuration will cause problems in GSTAT because some necessary variables will not be defined. Will this configuration be supported in the future? Is there a workaround for this type of site?
We hope that this configuration will be supported. We will check whether there is already a similar case. Action on the CERN ROC to check. Kai: we will go ahead, see what breaks and report on it.
Consult the links on the agenda page. The RSS feed is now working; people can subscribe.
WLCG Service Coordination
The second run of the ATLAS 5-million test starts on Wednesday (T0-T1 and T1-T1).
ATLAS Service
A problem was found during the Christmas vacation with the SAM WMS, which caused many other problems; the test framework was stuck. The problem is now understood and should be solved soon. The SAM WMS will authorize ATLAS to submit jobs through it. In addition, NIKHEF requested not to receive warnings through this WMS but a different message/exit code: a simple NOTE instead of a WARNING. This will be changed in the ATLAS tests. The change is already done in SAM and should be included in the next release, coming next week.
ALICE Service
Nothing to report
CMS Service
Tier-0 = The DataOps team kept the T0 resources mostly saturated throughout the winter break. This translated into being able to repack and prompt-reconstruct all of CRAFT completely twice, and CRUZET + BeamCommissioning three times (a 4th pass was running last week). Results were written to disk-only pools and promptly recycled as needed. Main issues: 1. some issues in the CMS T0 code (--> FIXED); 2. CERN resources behaved well except for some LSF failures on the weekend of Jan 3-4 (--> FIXED); 3. some lessons learned in handling very large datasets at the T0 (CMS-specific lessons) (--> being addressed).
MC production = Summer08 phase: physics requests account for 253 M events produced (GEN-SIM-RAW, CMSSW_2_1_7) and 208 M events reconstructed (CMSSW_2_1_8). --- Fall08 phase: MadGraph requests with CMSSW_2_1_17: 15.6 M events produced and reconstructed, plus 1 RAW workflow and 1 RECO workflow still running (only some problems with one workflow, which does not yet work even with a patched version of ProdAgent PA_0.12.9). --- Winter09 phase: FastSim requests with CMSSW_2_2_3; 45 requests were assigned to be run during the Christmas break, 44/45 DONE, the remaining one simply skipped by DataOps. Total: 342 M events produced. --- Summary of issues (breakdown of site issues only): just a couple of T2 sites had temporary issues, all fixed/bypassed.
Reprocessing at T1 sites = 1) CRAFT activities: CRAFT data AlCaRECO and skims ran at IN2P3, FZK and PIC, on the order of ~50k jobs per workflow. IN2P3 had storage/SRM-related issues over the Christmas break. There were many issues with the glideins: some were solved by the DataOps submitters, and some jobs ran, but at a somewhat limited rate: not easy. In addition, regarding the skims: 5 workflows had problems with the RECO-RAW output and needed a fix to DBS to get it sorted out. 2) Re-digi and re-reco: we also tried to move from glideins to gLite, and also had problems; jobs ran for a while, then turned out to give errors, etc. In this case, as far as I can tell, they were mostly identified as a site issue at PIC. Unfortunately, no tickets were opened by the operators (much to improve here).
Transfer system = No major issues with the transfer system. A total of 175.11 TB was transferred over the winter holidays. Just one event: the PhEDEx Castor-related export agents at CERN were not responsive on a Friday morning (I recall it being Jan 2nd); the problem resolved itself.
LHCb Service
Up to 15,000 jobs running concurrently on the grid, impressive! Sites are complaining about the usage pattern of SQLite files stored in the NFS-mounted shared area, which causes all processes to hang. People are working on accessing the files locally instead of through the NFS shared area (a sketch of this workaround is given below).
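For illustration only, a minimal Python sketch of the "access locally" workaround: the job copies the SQLite file from the shared area to node-local scratch and opens the copy, so SQLite's file locking never has to operate over NFS. The paths and function name are hypothetical; this is not the actual LHCb/DIRAC code.

```python
#!/usr/bin/env python3
# Minimal sketch of the workaround being discussed: copy the SQLite file from
# the NFS-mounted shared area to node-local scratch before opening it, so that
# SQLite locking never goes through NFS. All paths are illustrative.
import os
import shutil
import sqlite3
import tempfile

def open_local_copy(shared_db_path):
    """Copy a database from the shared area to local disk and open the copy."""
    local_dir = tempfile.mkdtemp(prefix="local-db-")
    local_copy = os.path.join(local_dir, os.path.basename(shared_db_path))
    shutil.copy(shared_db_path, local_copy)  # one sequential NFS read, then local I/O only
    return sqlite3.connect(local_copy)

if __name__ == "__main__":
    conn = open_local_copy("/nfs/shared_area/conditions.db")  # illustrative path
    print(conn.execute("PRAGMA integrity_check").fetchone())
```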
OCC to send a broadcast to sites requesting that the GFAL version be upgraded so it is no older than 1.10.6. More details about the issue can be found here: https://gus.fzk.de/ws/ticket_info.php?ticket=43994 Update 19/1/2009: Biomed GFAL version problem; Maite will send the broadcast after the meeting (it seems some sites are still on SL3 and need to upgrade the O/S as well as GFAL!) Update 26/1/2009: no broadcast seen, OCC to follow up. Update 2/2/2009: broadcast not sent, problem being followed up with the sites through GGUS. Agreement to close the item.
The Data Management team (Akos) to provide a version of the LFC without list replica (related to the old GFAL version problem reported by Biomed). Update 19/1/2009 (mail from Akos): We have examined the issue and it does not look like a security problem, but a resource limitation: the number of threads in an LFC instance limits the number of clients that can connect concurrently, and the Biomed usage pattern exceeds that limit. When the clients finish their work, the LFC becomes responsive again. The same problem would occur with other iterator-like operations, such as opendir/readdir/closedir. Removing these operations would cause old clients to fail, but it would not solve the problem, so in my opinion the upgrade of lcg_utils is the right solution. Unfortunately nobody from the Biomed community has contacted us regarding the possibility and context of a special build, so we have not progressed on that side. Update 26/1/2009: Can be closed.
Long-term solution to the old GFAL version problem reported by Biomed: develop a VO-specific SAM test to detect this, and then exclude the sites running the wrong version. Update 19/1/2009: the long-term solution could be SAM tests, or adding GFAL version collection to the job-wrapper scripts.
SAM and ATLAS (Alessandro) to get together to understand how SAM-ATLAS deals with sites with no close SE defined, and to see if this can be used in SAM-operations. Update 19/1/2009: the outcome of the get-together was: >> "Not having an SE prevents the site from passing the RM SAM tests, since those tests use the closest (default) SE." This is incorrect: the defined SE does not have to be at the site! >> "Also, setting up a site in this situation is not possible because YAIM requires an SE." Correct, but again the SE does not have to be local to the site. >> "When putting the SE into scheduled downtime, the site also has to put the CE into downtime (otherwise it will not pass the RM tests) or choose another SE from another site (for which there is no procedure)." This is correct, and the only real issue. ATLAS does not use the Replica Management tests, but believes they should be part of the ops infrastructure tests (which are more extensive). There may be a case for making the Replica Management tests non-critical, but they have been critical for two years now and most people seem happy with this.
The way for a site to change the defined SE is to modify the variable VO_OPS_DEFAULT_SE in the WNs’ site-info.def files.
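As a small illustration of where this variable lives, a hedged Python sketch that reports which close SE a site-info.def currently defines for a given VO. Only the variable name VO_OPS_DEFAULT_SE comes from the minutes; the default file path and the quoting handled below are assumptions. As usual with YAIM, a change to site-info.def only takes effect once YAIM is re-run.

```python
#!/usr/bin/env python3
# Minimal sketch: report which close SE a site-info.def defines for a VO.
# Only the variable name VO_OPS_DEFAULT_SE comes from the minutes; the default
# path below and the quoting conventions handled here are assumptions.
import re
import sys

def default_se(site_info_path, vo="ops"):
    """Return the value of VO_<VO>_DEFAULT_SE from site-info.def, or None."""
    pattern = re.compile(
        r'^\s*VO_%s_DEFAULT_SE\s*=\s*"?([^"\s]+)"?' % re.escape(vo.upper()))
    with open(site_info_path) as handle:
        for line in handle:
            match = pattern.match(line)
            if match:
                return match.group(1)
    return None

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/opt/glite/yaim/etc/site-info.def"
    print("close SE for the ops VO: %s" % (default_se(path) or "not defined"))
```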
Check of existing cases of sites hosting only core services, without site services. This is to support a new site, RedIRIS, in the SWE ROC. Update 19/1/2009: CERN ROC to check sites with only core services - no progress. Update 2/2/2009: the new SWE site RedIRIS will only host core services (BDII, WMS, etc.). Problems so far: 1) GIIS performance error due to "GIIS Old Entries Found: 6 - ERROR" - this will make the SAM gperf test fail. 2) No Grid version published: "GridVersion: *NOTE* could not find valid LCG version" - this is just a warning in GSTAT at the moment. The other tests seem to work; only the gperf error is critical. Update 12th February: Steve will take a look to understand what this is about. Update 19th February: Steve - confused, there is no RedIRIS site in gstat? http://gstat.gridops.org/gstat//SouthWesternEurope.html Update at the meeting: Kai will check. Update 3rd March: the gstat errors are caused by the WMS publishing only static information. The newly released info provider publishes dynamic information, so this will fix itself.