WLCG-OSG-EGEE Ops' Minutes Thu 06 Nov 2008

Summary

The experiments requested that all sites, particularly tier-1 sites, announce sufficient down-time when they need to carry out an intervention. The down-time can always be shortened if the intervention finishes early.

Attendance

EGEE

  • Asia Pacific ROC: ShuTing Liao
  • Central Europe ROC: Malgorzata Krakowian
  • OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Steve Traylen
  • French ROC: Osman, Pierre
  • German/Swiss ROC: Angela Poschlad
  • Italian ROC: Paolo Veranesi
  • Northern Europe ROC: Vera Hansper
  • Russian ROC: Absent
  • South East Europe ROC: Absent
  • South West Europe ROC: Kai Neuffer
  • UK/Ireland ROC: Jeremy Coles
  • GGUS: Absent

WLCG

  • WLCG Service Coordination: Harry Renshall

WLCG Tier 1 Sites

  • ASGC: ShuTing Liao
  • BNL: Absent
  • CERN site: Ignacio Reguero
  • FNAL: Catalin Dumitrescu
  • FZK: Men Wei
  • IN2P3: Pierre
  • INFN: Paolo
  • NDGF: Vera Hansper
  • PIC: Kai Neuffer
  • RAL: Derek Ross, Gareth Smith
  • SARA/NIKHEF: Absent
  • TRIUMF: Absent

LHC Experiments

  • ATLAS: Alessandro di Girolamo
  • LHCb: Absent
  • CMS: Stefano Belforte
  • ALICE: Absent

Feedback on Last Week's Minutes

None was given.

EGEE Items

Grid Operator Hand Over on Duty

        Primary Team        Secondary Team
From    North Europe ROC    CERN ROC
To      ROC Italy           ROC France

  • Details of tickets reaching final step of escalation: ROC_North, ITPA-LCG2, GGUS:42015, Nothing since 16th October
    ROC NE: Will follow up.

  • JP-HIROSHIMA-WLCG: id#9165 - GGUS Ticket #41683 No response whatsoever from this site. Nothing. Not a single bit.
    Asia Pacific ROC: This is being followed up. We are still working on the issue with the site admin by email, but we forgot to update the ticket; it was updated last Saturday with the following information. Sorry for the late reply. The site MON is still blocked by the central registry. A service request was submitted on 10 October and is awaiting validation by the central registry admin. Once that is done, we can resume tracking why site accounting fails to publish normally. We would like to extend the deadline for the next escalation level by 4 days.

  • SDU-LCG2: id#9164 - GGUS Ticket #41680 Absolutely no response from the site for 30 days. Not sure why we keep wasting our time on this one. The site has been dead with the very same "File not available.Cannot read JobWrapper output, both from Condor and from Maradona." error. Maybe escalation will trigger a response. Set expiry to 3/NOV
    This will be followed up by the CERN ROC.

  • This week saw a lot of SE downtime affecting the associated CEs, especially at ELTE-HU, where the SE iSCSI interface is broken.
  • R-GMA at ELTE should be up according to the follow-up mail exchange, but is actually still down.
  • STORM front end at INFN-T1 remains unstable. The issue is acknowledged but associated errors keep popping up.

PPS Report and gLite Release News

  • As of last Monday, the VOMS pilot service is installed with the voms from PATCH:2390; voms proxies are available from it. All PPS sites are invited to re-configure their UIs to use this pilot service.
gLite 3.1 PPS Update 38 went through the deployment test and was distributed to PPS sites. This update contains:
  • PATCH:2002 Bug fixes for CREAM Client (affects UI)
  • VOMS
    • VOMS Admin Client and Server 2.0.8-1 (affects VOMS, UI, VOBOX) - PATCH:2063
    • Various bug fixes, including fixes for FQAN order, short FQANs.... - PATCH:2390
    • VO-level configuration parameter to enable short FQANs - PATCH:2072
  • PATCH:2253 New JobManager, Information Dynamic plugin and yaim utils versions for SGE
  • dcache-server and dcache-client upgrades (PATCH:2398 and PATCH:2399)
  • GFAL-client-1.10.18-3 and lcg_util-1.6.17-2 (bug fixes) (PATCH:2512, PATCH:2513)

Release notes in: https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update38
Deployment test reports in: http://www.cern.ch/pps/index.php?dir=./release/testreports/gLite3.1.0/gLite3.1.0-PPS-UPDATE38/

EGEE Items From ROC Reports

  • ROC Germany/Switzerland: BDII problems. The region experienced problems with the new (top-level) BDII release: some queries return no output. This problem did not occur with older versions. Are other sites also affected? For example, the WMS shows entries like:
    DATUM -I: [Info] fetch_bdii_ce_info(ldap-utils.cpp:567): zeus: skipped due to empty ACBR.

Angela: FZK downgraded from version 3.1.10 to 3.1.8 of the BDII and this helped.
Kai: Version 3.1.10 worked OK for PIC. Version 3.1.9 not.
Angela: We noticed that the problems were load related.
Nick: Can people please submit GGUS tickets.
Angela: DESY is in direct contact with the developers.
Steve: There are tickets flying around for this.
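
For sites wanting to check whether they are affected, the WMS log entry quoted above can be counted per CE with a short script. This is only a sketch: the pattern is modelled on the single line quoted in these minutes, and the function name and sample data are illustrative assumptions, not part of any gLite tool.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in the minutes; the real WMS
# log format may differ in detail, so treat it as an assumption.
SKIP_RE = re.compile(
    r"fetch_bdii_ce_info\([^)]*\):\s*(?P<ce>\S+): skipped due to empty ACBR"
)

def count_skipped_ces(lines):
    """Return a Counter mapping CE name -> number of 'empty ACBR' skips."""
    counts = Counter()
    for line in lines:
        match = SKIP_RE.search(line)
        if match:
            counts[match.group("ce")] += 1
    return counts

# Illustrative input: the first line is the entry from the minutes, the
# second is a made-up unrelated message that should be ignored.
sample = [
    "DATUM -I: [Info] fetch_bdii_ce_info(ldap-utils.cpp:567): zeus: skipped due to empty ACBR.",
    "DATUM -I: [Info] some other message",
]
skipped = count_skipped_ces(sample)
```

A count that rises with load on the top-level BDII would be consistent with Angela's observation that the problems were load related, and could be attached to a GGUS ticket.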

  • ROC SWE: SRM failures explained: PIC supplied details concerning one hour of SAM failures on 30-Oct-2008. ATLAS were running jobs at PIC which were reading several files via SRM, using lcgcp (up to 14k srmget/hour). This generated a high load in the SRM, which didn't service the SAM tests quickly enough.
    Solution: The ATLAS contact person has asked to change the local access protocol for reading from lcgcp to dcap (dccp). However, until the change is made, the problem could come back. As a medium/long-term solution they are considering an SRM server upgrade (x64 + more RAM for catalina), and possibly splitting the service over several servers.

WLCG Items

WLCG issues coming from ROC reports

  • AP ROC: No specific issue, but it might interest ATLAS to know that TAIWAN-LCG2 is currently working on a couple of problems:
    • Source File Preparation Problem from TAIWAN-LCG2 Storage Element (ATLASMCDISK Space Token).
    • File transfer problem at TAIWAN-LCG2_MCDISK in ASGC Cloud

  • ROC France: ATLAS pilot jobs at CCIN2P3: For several months now, ATLAS has been submitting a huge number of pilot jobs even when there is no task to be treated. Despite notifying the French ATLAS production team of this and attempting manual regulation of pilot job submission, 25% of ATLAS pilot jobs are still doing nothing once running.
    Could ATLAS Production please adapt its execution engine to automatically regulate pilot job submission according to the number of tasks in their central queue?

IN2P3: We are seeing literally thousands of pilot jobs sitting on WNs and doing nothing.
Alessandro: It shouldn't be so many. Should be of the order of 10-30. Will ask production team. May be due to errant ATLAS users. Eric at IN2P3 is aware of the percentage of resources being used by pilot jobs.
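
The regulation requested by ROC France could look roughly like the sketch below. It is not the ATLAS execution engine: the function name, its inputs and the per-cycle cap are hypothetical and serve only to illustrate tying pilot submission to the number of tasks in the central queue.

```python
def pilots_to_submit(queued_tasks, idle_pilots, max_new_pilots=50):
    """Number of new pilot jobs to submit in one submission cycle.

    queued_tasks   -- tasks waiting in the central queue (hypothetical input)
    idle_pilots    -- pilots already submitted or running without a task
    max_new_pilots -- illustrative per-cycle cap, not a real ATLAS setting
    """
    # Only cover the demand that existing idle pilots cannot absorb.
    unmet_demand = queued_tasks - idle_pilots
    # Never submit a negative number, and never exceed the per-cycle cap.
    return max(0, min(unmet_demand, max_new_pilots))
```

With an empty central queue this returns 0 however many pilots are already idle, which is the behaviour IN2P3 is asking for.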

Upcoming WLCG Service Interventions

  1. UKI-SOUTHGRID-RALPP will be down as of Thursday & including the weekend to fix air-conditioning
  2. SEE/TR-03-METU will be down as of Wednesday, also including the weekend for similar reasons
  3. CE/BMEGrid will be down for 3 days starting today
  4. In Italy, ENEA-INFO is down while they sort out what they publish as Glue sub-cluster information, and INFN-CS are down while they solve some cooling problems (two weeks)

JOHN (with SAM hat on): Interestingly, the ENEA-INFO site saw problems when the SAM switched from using RB to using WMS and so originally it was thought that the WMS had a problem. However it turned out that they were publishing no info on sub-cluster and so the WMS actually helped to find this problem.

Ad-hoc item from NDGF on publishing services in >1 site

NDGF and FNAL also raised again the issue that they cannot publish services as belonging to more than one site.
SAM: This requires a fundamental change to the SAM/GridView database. It is in the work plan.
ANGELA: Could you use a DNS alias?
VERA: I’ll relay this to the relevant people at NDGF.
CATALIN: Unfortunately DNS alias doesn’t work with SRM.
VERA: GGUS ticket number is GGUS:42341.

WLCG Service Coordination

ATLAS Service

Nothing to report.

ALICE Service

Not present.

CMS Service

Stefano: CMS is having a problem with sites not announcing long enough down-times. Sites should err on the side of caution and announce longer down-times, which can then be shortened.

LHCb Service

Not present.

OSG Items

Under particular scrutiny from Maria:
  • GGUS:41670 Problem with the OSG voms server, assigned 1 month ago, not updated since.
  • GGUS:42058 Problem to download dataset from Boston Univ. Discussed last week also, assigned Oct 8th, not updated since.
  • GGUS:42221 ATLAS transfer problem from AGLT2. Discussed last week also, assigned Oct 11th, not updated since.
  • GGUS:42646 ATLAS transfer problem from AGLT2. Seems to be closed in the OSG helpdesk system. Can the same supporters close the GGUS ticket too? It is marked urgent.
  • GGUS:42647 ATLAS transfer problem from MidwestT2. Assigned on 2008-10-22, not updated since. It is also marked urgent.

Rob: The only one still open is the final one (GGUS:42647), which is being followed up. However, we’re seeing that not all of the status updates are making it through to GGUS.
Maria: Can OSG please close the tickets in GGUS?
Rob: We would expect GGUS to do this.
Diana: But we don't know the solution. In these special cases can you do it please?
Rob: For these 3 special cases, yes.

Action Items

Newly Created Action Items

Assigned to Due date Description State Closed Notify  
Main.OCC 2007-03-05 Example Action Item 2007-03-06 SteveTraylen

Review of Open Action Items

Open Action Items

Id Submitter Description Creation Due Assigned To

Actions Closed in Last 20 Days

Id Submitter Description Creation Due Assigned To Closed

Next Meeting

The next meeting will be Monday, 10 November 2008 16:00 UTC+1 (Swiss local time).

  • Attendees can join from 15:45 UTC+1 onwards.
  • The meeting will start promptly at 16:00 UTC+1
  • The WLCG section will start at the fixed time of 16:30 UTC+1.
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0148141


Topic revision: r2 - 2008-11-07 - NickThackray
 