Site: ru-Moscow-GCRAS-LCG2, GGUS:34045, GGUS:34051, GGUS:34817. The tickets reached the last escalation step, but the site then responded: "Still problem with certificates, including users certs and RA." The RA itself has certificate problems and is preparing the paperwork for renewal. We allowed the site to wait for this in downtime, because this is not a software problem to be fixed. It is simply a wait for the CA/RA to provide new certificates.
Report from SWE COD
Australia-UNIMELB-LCG2: GGUS Ticket GGUS:34393. The site reports that its SE is full because the ATLAS VO is not removing files. Is this a problem for the ATLAS VO, or should the site reserve disk space for the OPS VO?
YerPHI: GGUS Ticket GGUS:26634. The site has been transferred to the political instance but is neither in scheduled downtime nor suspended. What is its current status?
Outcome of the meeting: CERN-ROC will again follow up with YerPhi.
Assigned to: Main.CERNROC
Due date: 2008-05-13
Description: Follow up with the YerPhi site to resolve or suspend it. Update 5th May: CERNROC to provide an update next week, once they have been CIC on duty for a week and cleaned everything up. Update 9th May: YerPhi is now in state suspended and all existing COD tickets have been closed; this item should be closed after next week's meeting. Update 19th May: this action can be closed.
State:
Closed:
Notify:
yaim-core 4.0.4, released with gLite 3.1.0 PPS Update 22, introduces a check that blocks the configuration if non-root users have read permission on the site-info file and on the directory where it is stored. This causes problems in set-ups where the permissions cannot be changed to 700 (e.g. UI installations on AFS). A bug has been opened for this (BUG:35307), and the check will be relaxed in version 4.0.5. Sites installing version 4.0.4 should be prepared to modify a yaim function as described in YaimGuide400#Known_issues.
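To make the failing condition concrete, here is a minimal sketch of an owner-only read check on the site-info file and its directory. It is a hypothetical illustration, not the yaim implementation. It assumes the rule is "no read bit for group or others" on both paths, which is inferred from the 700 requirement above. The function names `readable_by_non_owner` and `site_info_violations` are invented for this example.

```python
import os
import stat

# Read bits for the group and for "others", i.e. anyone except the owner.
NON_OWNER_READ = stat.S_IRGRP | stat.S_IROTH


def readable_by_non_owner(path: str) -> bool:
    """Return True if the group or others have read permission on path."""
    return bool(os.stat(path).st_mode & NON_OWNER_READ)


def site_info_violations(site_info: str) -> list:
    """Return which of (site-info file, its parent directory) grant read
    access beyond the owner (hypothetical rule; an empty list means pass)."""
    directory = os.path.dirname(os.path.abspath(site_info))
    return [p for p in (site_info, directory) if readable_by_non_owner(p)]
```

Running `site_info_violations("/path/to/site-info.def")` on a set-up that cannot use 700 permissions would list the offending paths. A site in that situation hits the block described in BUG:35307.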
gLite Releases
gLite 3.1 Update 18 went to production last Monday.
The majority of CE sites failed SAM tests due to a wrongly advertised LFC for the OPS VO (GGUS:35093). It is a weak point of the infrastructure that a single site can publish anything and make all sites fail the OPS tests. Are there any plans to change this?
(ROC France)
The OPS test was using lfc-lhcb.grid.sara.nl as the LFC server for OPS. This shows that the information service cannot be trusted: it is a point of failure that allows anyone to deny service to others. Would it be possible to design the grid so that nobody can break it simply by publishing something wrong?
BUG:24812 is relevant to this. Since the meeting, Judit and Steve have discussed the issue and see a way forward; they will update the ticket shortly.
WLCG Items
Upcoming WLCG Service Interventions
FZK Downtime
Due to the LFC DB migration from MySQL to Oracle, GridKa/FZK's LFC service will be down on Friday 18/04/2008 from 05:30 UTC to 20:00 UTC (the LHCb LFC will not be affected).
CERN-PROD
DB downtime at CERN-PROD on Thursday April 17th 2008, taking down FTS, SAM, GridView, VOMS and LFC.
PIC will be completely down on 1st and 2nd May for a power intervention.
ATLAS Service
Last week's functional test went quite well. During the week we also exported subdetector data (Calorimeter), 99% of it within the first 24h. These tests were performed using the newly written "plugin", which will allow us to react swiftly to sites having problems.
This week: T1-T1 FT. CNAF has indicated it is ready, but other T1s could also try (or try again if they have already tried). This week there will probably also be subdetector data (Muons) to export, as was done last week.
CMS Service
News on Development
Logfile archiving: postponed to ProdAgent v.0.9. Chained processing: implementation largely in place, still scheduled for the June release. Handling large MySQL DBs: the latest release did bring some improvement, and work is ongoing.
Data certification, Processing at the T0
CERN is very busy with RelVal production. Validated releases: CMSSW v1.8.4 and CMSSW v2.0.0_pre9. High-statistics RelVal samples could not be started at FNAL due to a problem, so CERN had to be used. The Tier-0 is unavailable due to production and is limited to the RelVal queue. The upcoming release is 2.0.0; it will take precedence over 1.1.0_pre1 if necessary. The standard set will run at CERN, and the high-statistics set will run at FNAL in parallel with the massive FastSim production.
Re-processing
Still running the never-ending CSA07 signal workflows: all requests have finished, we are waiting for more input datasets, and transfers do not seem to work as well. Soups at FNAL: work in progress. The important 1.8.4 FastSim production (AlcaReco & physics requests) has started at all T1s, including those that were down and are now in use again (e.g. FZK and CNAF). Problems are mostly at the configuration level and due to start-up, not really site issues (yet).
MC production
40k cosmics produced with CMSSW v1.7.7 are now available to physicists in the global DBS. A 10M cosmics request with CMSSW v1.8.4 has started in OSG, plus some more samples. FastSim production: all requests have been injected into ProdRequest.
Data Transfers and Integrity, DDT-2/LT status
Low transfer activity (/Prod instance) from CERN to T1 sites (only RAL and FNAL, ~3 TB out of CERN). On the T1 side, a ~1 TB tape backlog was seen at FNAL. The t1transfer pool at CERN peaked at no more than 1k files waiting to be migrated to tape.
We are running a campaign to review production transfers that did not complete within 30 days of subscription. This will help cut the tails where they serve no purpose and identify problems/bottlenecks in the production transfer system (or in the transfer tool). Much work is still needed on top of the lists it produces, though.
DDT status: we have 317 commissioned links (as of April 11th), +23 with respect to last week (!). The breakdown is as follows. All 56 T[01]-T1 crosslinks (some to be re-exercised once they are back up and running after downtimes). 162/320 (51%) T1-T2 downlinks and 93/320 (29%) T2-T1 uplinks. 6 T2-T2 links. From the "Site Commissioning" point of view, concerning link testing, 37/40 T2s have at least 1 commissioned downlink/uplink to the associated T1, and among these, 30 have at least 2 commissioned T1-T2 downlinks. In total, 93% of the previously commissioned links have already PASSED the new metric as of April 11th (2 months after the start of this DDT-2 phase).
Day-to-day details are at https://twiki.cern.ch/twiki/bin/view/CMS/DDTLinkExercising, and (NEW!) more details are visible online again on Nicolo's page: http://magini.web.cern.ch/magini/ddt.html.
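The DDT link breakdown can be cross-checked with a few lines of arithmetic. The counts and the 320 possible links per T1-T2 direction are taken from the status above. The dictionary layout is just an illustrative choice.

```python
# Commissioned link counts from the DDT-2 status as of April 11th.
links = {
    "T[01]-T1 crosslinks": 56,
    "T1-T2 downlinks": 162,
    "T2-T1 uplinks": 93,
    "T2-T2 links": 6,
}
POSSIBLE_T1_T2 = 320  # possible links in each T1-T2 direction, as reported

total = sum(links.values())
down_pct = round(100 * links["T1-T2 downlinks"] / POSSIBLE_T1_T2)
up_pct = round(100 * links["T2-T1 uplinks"] / POSSIBLE_T1_T2)
print(total, down_pct, up_pct)  # 317 51 29
```

The categories add up to the 317 commissioned links quoted, and the rounded percentages match the 51% and 29% given for downlinks and uplinks.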
ATLAS to provide details of the tests they are running. ATLAS have provided the name of the test: CE-sft-vo-swspace. This item should be closed next week. Update 5th May: a small amount is still to do, but progress has been made; revisit next week. Update 19th May: this action can be closed.