WLCG-OSG-EGEE Ops' Minutes Mon 14 Apr 2008

Attendance

EGEE

  • Asia Pacific ROC: Min Tsai
  • Central Europe ROC: Marcin
  • OCC / CERN ROC: John Shade, Antonio Retico, Steve Traylen, Maite
  • French ROC: Gilles, Pierre
  • German/Swiss ROC: Sven Hermann
  • Italian ROC: Alessandro Cavalli
  • Northern Europe ROC: Gert Svensson
  • Russian ROC: Lev
  • South East Europe ROC: Kostas Koumantaros
  • South West Europe ROC: Kai Neuffer, Gonzalo Merino
  • UK/Ireland ROC: Jeremy Coles
  • GGUS: Guenter
  • OSCT: Absent

WLCG

  • WLCG Service Coordination: Harry Renshall, Jamie Shiers

WLCG Tier 1 Sites

  • ASGC: Min Tsai
  • BNL: Absent
  • CERN site: Ignacio Reguero
  • FNAL: Absent
  • FZK: Sven Hermann
  • IN2P3: Pierre
  • INFN: Alessandro
  • NDGF: Absent
  • PIC: Gonzalo
  • RAL: Derek Ross, Matt Hodges
  • SARA/NIKHEF: Absent
  • TRIUMF: Absent

Reports Not Received

  • VOs: Alice, BioMed, LHCb
  • EGEE ROCs (Prod Sites):

Feedback on Last Week's Minutes

None were given.

EGEE Items

Grid Operator Hand Over on Duty

      Primary Team   Secondary Team
From  SWE            Italy
To    DECH           Russia

Report from Italy COD
  1. Site: ru-Moscow-GCRAS-LCG2, GGUS:34045, GGUS:34051, GGUS:34817. Reached the last escalation step, but then the site responded with: "Still problem with certificates, including users certs and RA." The RA itself has certificate problems and is preparing the paperwork to have them renewed. We allowed the site to wait for this in downtime state, because it is not a software problem to be corrected, only a wait for new certificates to be provided by the CA/RA.

Report from SWE COD
  1. Australia-UNIMELB-LCG2: GGUS Ticket GGUS:34393. Site comments that their SE is full because the ATLAS VO is not removing files. Is this a problem for the ATLAS VO, or should the site reserve disk space for the ops VO?
  2. YerPhi: GGUS Ticket GGUS:26634. Site has been transferred to the political instance but is neither in Scheduled Downtime nor suspended. What is the latest status on this?

From the meeting: CERN-ROC will again follow up with YerPhi.

Assigned to Due date Description State Closed Notify  
Main.CERNROC 2008-05-13 Follow up with YerPhi site to resolve or suspend site.

Update 5th May.
CERNROC to provide an update next week once they have been CIC on duty for a
week and cleaned everything up.


Update 9th May.
YerPhi is now in the suspended state and all existing COD tickets have been closed.
This item should be closed after next week's meeting.

Update 19th May: this action can be closed.


PPS Reports

CERN ROC
yaim-core 4.0.4, released with gLite 3.1.0 PPS Update 22, introduces a check that blocks the configuration if read permissions are given to non-root users on the site-info file and the directory where it is stored. This causes problems in set-ups where the permissions cannot be changed to 700 (e.g. installations of UI on AFS). A bug has been opened for that (BUG:35307), and the check will be softened in version 4.0.5. Sites installing version 4.0.4 should be prepared to change a function in yaim as described in YaimGuide400#Known_issues.
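For sites that want to see in advance whether they would be affected, the kind of permission check described above can be sketched as follows. This is not yaim's own code (yaim is implemented as shell functions); the function names, the demo file name and the assumption that the file and directory are owned by root are all illustrative.

```python
import os
import stat
import tempfile


def readable_by_group_or_others(path):
    """Return True if the group or others have read permission on path."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IRGRP | stat.S_IROTH))


def config_allowed(site_info_path):
    """Hypothetical strict check: refuse to proceed if either the site-info
    file or its containing directory is readable by group or others.
    Assumes both are owned by root, so owner bits are not inspected."""
    directory = os.path.dirname(os.path.abspath(site_info_path))
    return not (
        readable_by_group_or_others(site_info_path)
        or readable_by_group_or_others(directory)
    )


# Small demonstration on a temporary directory (POSIX permissions assumed).
demo_dir = tempfile.mkdtemp()
os.chmod(demo_dir, 0o700)
demo_file = os.path.join(demo_dir, "site-info.def")
open(demo_file, "w").close()

os.chmod(demo_file, 0o600)
print("mode 600:", config_allowed(demo_file))

os.chmod(demo_file, 0o644)
print("mode 644:", config_allowed(demo_file))
```

On a file system such as AFS, where the mode bits cannot be tightened in this way, a check of this form would always refuse, which is the situation reported in BUG:35307.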

gLite Releases.

gLite 3.1 Update 18 went to production last Monday.

The update contains
  • NEW: glite-MON for SL4
  • DPM 1.6.7-4
    • fix for bug #33769: incorrect pool free space after dpm-drain
    • improved ACL management for srmMkdir command
  • UI/WN/VOBOX
    • lcg-tags no longer produces Globus warnings (now suppressed)
    • voms-admin client 2.0.6-1 providing ACL support on command line
  • vdt_globus_essentials (affecting several services and notably the CE)
    • bug fix to prevent globus-job-manager processes from piling up on a CE (bug observed at CERN after SAM WMS/RB tests were enabled)
  • voms-admin server (VOMS)
    • Refactored voms-admin-ping script
    • ACL management web service (compatible with client >= 2.0.6-1)
    • Registration web service.
    • many bug fixes

Details in: http://glite.web.cern.ch/glite/packages/R3.1/updates.asp

For details of PPS releases and upcoming release see the agenda.

EGEE Items From ROC Reports

(ROC CE)
The majority of CE sites failed SAM due to a wrongly advertised LFC for the OPS VO (GGUS:35093). It is a weak point of the infrastructure that a site can publish anything and make all sites fail OPS tests. Are there any plans to change this?

(ROC France)
The OPS test was using lfc-lhcb.grid.sara.nl as the LFC server for OPS. This shows the information service cannot be trusted; it is a point of failure that allows anyone to deny service to others. Would it be possible to consider a grid where nobody could break the grid just by publishing something wrong?
  • Ticket BUG:24812 is relevant to this. Since the meeting, Judit and Steve have discussed it and see a way forward; they will update the ticket shortly.

WLCG Items

Upcoming WLCG Service Interventions

FZK Downtime
Due to the LFC DB migration from MySQL to Oracle, GridKa/FZK's LFC service will be down on Friday 18/04/2008 from 5:30 UTC to 20:00 UTC (the LHCb LFC will not be affected by this).
CERN-PROD
DB downtime at CERN-PROD taking down FTS, SAM, GridView, VOMS and LFC, Thursday April 17th 2008.
PIC
FTS down at same time as CERN-PROD DB downtime.
PIC
PIC will be completely down on 1st and 2nd of May for a power intervention.

ATLAS Service

  • Last week's functional test went quite well. During last week we also exported subdetector data (Calorimeter), 99% within the first 24h. These tests were performed using the newly written "plugin", which will allow us to react swiftly to sites having problems.

  • This week: T1-T1 FT. CNAF indicated they are ready, but other T1s could also try (or try again if they had already tried). There will probably also be subdetector data (Muons) to export this week, as was done last week.

CMS Service

News on Development
Logfiles archiving: postponed to ProdAgent v.0.9. Chained processing: implementation largely in place, still scheduled for the June release. Dealing with large MySQL DBs: some improvement did come with the latest release, still working on it.
Data certification, Processing at the T0
CERN very busy with RelVal production. Validated releases: CMSSW v1.8.4, CMSSW v2.0.0_pre9. High statistics RelVal samples could not be started at FNAL due to a problem, so CERN had to be used. Tier-0 unavailable due to production, limited to the RelVal queue. The upcoming release is 2.0.0. It will take precedence over 1.1.0_pre1 if necessary; the standard set will run at CERN, and the high statistics set will run at FNAL in parallel with massive FastSim production.
Re-processing
Still running the never-ending CSA07 signal workflows: all requests finished, waiting for more input datasets; transfers seem not to be working as well. Soups at FNAL: work in progress. The important 1.8.4 FastSim production has started: AlcaReco & physics requests, started at all T1s (also those that were down are now used, e.g. FZK and CNAF). Problems are mostly at the config level and due to start-up, not really site issues (yet).
MC production
40k cosmics data with CMSSW v1.7.7 now available to physicists in global DBS. A 10M cosmics request with CMSSW v1.8.4 has started in OSG, plus some more samples. FastSim production: all requests injected in ProdRequest.
Data Transfers and Integrity, DDT-2/LT status
Low transfer activity (/Prod instance) from CERN to T1 sites (only RAL and FNAL, ~3 TB out of CERN). ~1 TB tape backlog from T1s seen at FNAL. The t1transfer pool at CERN had peaks all within 1k max files to be migrated to tape.
Running a campaign to review production transfers which did not complete within 30 days of the subscription: it will help to cut the tails wherever useless and identify problems/bottlenecks in the production transfer system (or in the transfer tool). Much work is still needed on top of the lists provided, though.
DDT status: We have 317 commissioned links (as of April 11th), +23 wrt last week (!). The breakdown is: all 56 T[01]-T1 crosslinks (some to be re-exercised after coming back up and running following downtimes); 162/320 (51%) T1-T2 downlinks and 93/320 (29%) T2-T1 uplinks; 6 T2-T2 links. From the "Site Commissioning" point of view, concerning the link testing, 37/40 T2s have at least 1 commissioned downlink/uplink to the associated T1, and, among these, 30 have at least 2 commissioned T1-T2 downlinks. In total, 93% of the previously commissioned links have already PASSED the new metric as of April 11th (2 months after the start of this DDT-2 phase).
Day-to-day details at https://twiki.cern.ch/twiki/bin/view/CMS/DDTLinkExercising, and (NEW!) more details now visible again online at Nicolo's page: http://magini.web.cern.ch/magini/ddt.html.
LINKs
Computing meetings of the week: http://indico.cern.ch/conferenceDisplay.py?confId=31923
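As a cross-check of the DDT status figures quoted under "Data Transfers and Integrity" above, the breakdown can be re-added with a few lines of arithmetic. The snippet uses only numbers from the report; the variable names are illustrative.

```python
# Commissioned links per category, as quoted in the DDT status (April 11th).
commissioned = {
    "T[01]-T1 crosslinks": 56,
    "T1-T2 downlinks": 162,
    "T2-T1 uplinks": 93,
    "T2-T2 links": 6,
}

# Denominator quoted in the report for both T1-T2 downlinks and T2-T1 uplinks.
possible_t1_t2 = 320

total = sum(commissioned.values())
downlink_pct = round(100 * commissioned["T1-T2 downlinks"] / possible_t1_t2)
uplink_pct = round(100 * commissioned["T2-T1 uplinks"] / possible_t1_t2)

print("total commissioned links:", total)
print("T1-T2 downlinks commissioned (%):", downlink_pct)
print("T2-T1 uplinks commissioned (%):", uplink_pct)
```

The four categories sum to the 317 links reported, and the rounded percentages match the 51% and 29% given in the text.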

LHCb Service

WLCG Service Coordination

OSG Items

Action Items

Review of Open Action Items

136 & 137
Are ROCs looking into this? Yes, they are.

Assigned to Due date Description State Closed Notify  
Main.Atlas 2008-05-13 ATLAS to provide details of tests they are running. ATLAS have provided the name of the test: CE-sft-vo-swspace. This item should be closed next week.

Update 5th May: Small amount still to do but progress has been made. Revisit next week.

Update 19th May: this action can be closed

141
Close
142
Change ownership to SAM.
150
Min working on it. Site should submit a ticket though.
156
Steve to follow up.
157
Expected to be released in at least 2 months' time; close item.
158
Maite to check.
159
Close the action, Gilles has done something.

Open Action Items

Id  Submitter  Description  Creation  Due  Assigned To

Actions Closed in Last 20 Days

Id  Submitter  Description  Creation  Due  Assigned To  Closed

AOB

  • Gilles: the last CIC portal release of EGEE II is next week.
    • UI Cleanup of long menu items. Will be broadcast nearer the time.

Next Meeting

The next meeting will be Monday, dd mmm 2007 15:00 UTC (16:00 Swiss local time).

  • Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
  • The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0157610


Topic revision: r7 - 2008-05-22 - MaiteBarroso