WLCG-OSG-EGEE Ops' Minutes Mon 02 Jun 2008

Attendance

EGEE

  • Asia Pacific ROC: Min Tsai
  • Central Europe ROC: Malgorzata Krakowian
  • OCC / CERN ROC: Antonio Retico, Steve Traylen
  • French ROC: Pierre
  • German/Swiss ROC: Clemens Koerdt
  • Italian ROC: Alessandro Cavalli
  • Northern Europe ROC: Jules Wolfrat
  • Russian ROC: Lev
  • South East Europe ROC: Kostas Koumantaros
  • South West Europe ROC: Kai Neuffer, Gonzalo Merino
  • UK/Ireland ROC: Jeremy Coles
  • GGUS: Torsten Antoni
  • OSCT: Absent

WLCG

  • WLCG Service Coordination: Harry Renshall, Jamie Shiers

WLCG Tier 1 Sites

  • ASGC: Min Tsai
  • BNL: Absent
  • CERN site: Harry Renshall
  • FNAL: Joe Kaiser
  • FZK: Sven Hermann
  • IN2P3: Pierre
  • INFN: Alessandro, Alfredo
  • NDGF: Leif
  • PIC: Gonzalo
  • RAL: Derek Ross, Matt Hodges
  • SARA/NIKHEF: Absent
  • TRIUMF: Absent

Reports Not Received

  • VOs:
  • EGEE ROCs (Prod Sites):

Feedback on Last Week's Minutes

None were given.

EGEE Items

Grid Operator Hand Over on Duty

        Primary Team    Secondary Team
From    ROC AP          ROC SWE
To      ROC RU          ROC FR

Two sites were transferred to political instances:

  1. VGTU-gLite: SRM failure on se.grid.vgtu.lt. The site is very unstable and no input has been received since 9 May.
  • Ron (NE) will follow up with the site.
  2. UKI-LT2-QMUL: RGMA failure on mon01.esc.qmul.ac.uk. No input has been received since 5 May.
  • Jeremy (UKI) will follow up with the site.

PPS Reports

See reports from the ROCs and info in https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps

  • No comments received

See information about forthcoming releases in https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases

  • No comments received

EGEE Items From ROC Reports

Maite points out that very few issues were reported. We are happy about that, but sites are welcome to give feedback if anything prevents them from providing their reports in time.

  1. ROC France: On our UIs we had problems with Python for several VOs, because those VOs use their own Python version (> 2.3.x). The UI installation provides the standard python2.3 libraries in the externals directory and sets PYTHONPATH accordingly. To use their own Python installation, VOs must therefore update the PYTHONPATH variable so that the right version of the required libraries is found first. They must also make sure that they call the right python binary.

Pierre: This happens with the latest versions of the UI, specifically the ones that install Python libraries (3.1.4/5).

Maite: We will check this off-line. Apparently CERN-PROD is also affected; this is to be followed up with the integration team and the UI administrators at CERN.
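The PYTHONPATH ordering described by ROC France can be sketched as follows. This is only an illustration, not the UI's or any VO's actual setup: the helper name `prepend_entries` and the example directories are hypothetical, and the sketch uses present-day Python syntax rather than the python2.3 shipped with the UI. It shows how a VO wrapper could place its own library directory ahead of the one provided by the UI, and how to check which interpreter is actually running.

```python
import os
import sys


def prepend_entries(current, preferred, sep=os.pathsep):
    """Return a PYTHONPATH-style string with `preferred` entries first.

    Entries already present further down the path are not duplicated,
    so the VO's own libraries are always found before the UI's.
    """
    entries = [p for p in current.split(sep) if p]
    ordered = []
    for p in list(preferred) + entries:
        if p not in ordered:
            ordered.append(p)
    return sep.join(ordered)


if __name__ == "__main__":
    # Hypothetical directories, for illustration only.
    ui_path = os.environ.get("PYTHONPATH", "")
    vo_libs = ["/path/to/vo/python/lib"]
    print("interpreter in use:", sys.executable)
    print("PYTHONPATH for the VO job:", prepend_entries(ui_path, vo_libs))
```

A job wrapper would export the resulting string as PYTHONPATH and start the VO's own python binary explicitly, rather than relying on whichever `python` is first in PATH.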

WLCG Items

Harry: News from CCRC08: ATLAS announces that the current plan is to finish transferring the M7 data and then, at about midday tomorrow, switch to the FDR2 exercise and start transferring the 6 hours' worth of generated MC data.

WLCG issues coming from ROC reports

  • none reported

Upcoming WLCG Service Interventions

  • IN2P3-CC (T1, T2 and PPS) is down on Tuesday 3rd June. Due to an Oracle maintenance, all our LFCs will be down from 8:00 to 14:00. In particular, the Central LFC for Biomed will be unreachable during this maintenance operation.

  • PIC Tier-1 will have a Scheduled Downtime next Tuesday, the 10th of June, in order to perform interventions on various grid services: SRM (dCache upgrade), CE (NFS sw area migration), FTS (head node migration), LFC-atlas (DB backend migration) and LHCb-DIRAC (IP changes). The SD and broadcast have already been filed in the GOCDB.

ATLAS Service

Request to recalculate the site availability for all the Tier-1s (even though only half of them were affected), following the problem with the SAM critical update, which did not work for the ATLAS SAM UI. The Tier-1 site availability should be recalculated for the period between the 20th and the 26th of May.

Alessandro (ATLAS): The issue was discussed further with John Shade. We applied the latest critical update to the SAM client as recommended, but the upgrade did not go smoothly. In the end we re-installed our machine last Monday. We are now requesting a recalculation of the availability because we do not want sites to be blamed for bad results caused by a broken SAM client.

Judit (SAM): Normally we do not recalculate availability. This is also a bit of a special case, because we would have to filter out part of the test results from the input data. When events like this occur for the ops availability report, we attach to the report a link to the list of SAM unavailabilities, which is normally enough for the quality assurance groups to filter out unreliable results from their metrics. Would that kind of solution be acceptable for ATLAS?

Alessandro: It would be acceptable.

The decision is made to modify the SAM unavailability list on twiki adding a section for availability of clients run by the VOs

Additional notes from the discussion: marking the affected period grey would mean recalculating the availability; for the ops report the unavailability list is attached instead. Alessandro confirmed that this is fine. Judit will add a VO-specific SAM availability page; a link from GridView was also mentioned.

Clemens: How are the SAM UIs going to be upgraded?

Judit: A validation instance of the upgraded UI will be set up this week (tomorrow); the production UI will follow next week.

Judit (SAM): We

ALICE Service

CMS Service

LHCb Service

Roberto (LHCb) reported an issue that occurred at some sites during LHCb operations and that is affecting an increasing number of sites.

Apparently the cron jobs that clean up the /tmp directory on the WNs are deleting files that still need to be accessed or used by running jobs.

A workaround on the experiment side would be to systematically touch all files still in use, so that the scripts do not remove them before the job finishes. However, we really think this is a general issue that might affect all VOs at all sites that decide to provide the working area for grid jobs via /tmp areas on the WNs.
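The experiment-side workaround mentioned above, touching the files a job still needs, might look like the following minimal sketch. It is not LHCb's or DIRAC's actual code: the function name `touch_files` is hypothetical, and how often to call it would depend on the age threshold the site's cleanup script uses, which the minutes do not state.

```python
import os


def touch_files(paths, when=None):
    """Refresh access/modification times of the files a job still needs.

    `when` is a Unix timestamp; None means "now". Paths that no longer
    exist or cannot be updated are skipped. Returns the paths touched.
    """
    touched = []
    for path in paths:
        try:
            os.utime(path, None if when is None else (when, when))
            touched.append(path)
        except OSError:
            # The file vanished or is not writable by this job: skip it.
            continue
    return touched
```

A long-running job would call this periodically on its working files, at an interval shorter than the cleanup script's age limit.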

Currently (as also reported in the observation elog https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/103) it affects ~1/4 of our jobs, which do not complete properly and whose logs are not fully available.

Original submission in: https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/349

WLCG Service Coordination

OSG Items

Action Items

Newly Created Action Items

Assigned to: JeremyColes(UKI)
Due date: 2008-06-09
Description: Follow up on reported site UKI-LT2-QMUL (transferred to Political Instance by COD on 2-Jun-08).
  • 30/6/08: Jeremy reckoned this action can be closed.
Closed: 2008-07-01

Assigned to: RonTrompert(NE)
Due date: 2008-06-09
Description: Follow up on reported site VGTU-gLite (transferred to Political Instance by COD on 2-Jun-08).
  • 30/6/08: Site is still failing SAM tests and should be suspended.
  • 3/07/08: Ron reported that the site has now reacted and fixed the situation; close this action after the next operations meeting.
Closed: 2008-07-08

Assigned to: Main.ROC_France
Due date: 2008-06-09
Description: Follow up on the following issue reported by ROC France: On our UIs we had problems with Python for several VOs, because those VOs use their own Python version (> 2.3.x). The UI installation provides the standard python2.3 libraries in the externals directory and sets PYTHONPATH accordingly. To use their own Python installation, VOs must therefore update the PYTHONPATH variable so that the right version of the required libraries is found first. They must also make sure that they call the right python binary.
  • Update 11th June: Nick will look into this.
  • Update 21st June: Waiting for Nick.
  • Update 28th July: Response from SA3: "The tarball is produced to work with SL4, so python 2.3 has to be the default. To fully support python 2.5 (for example), you need to distribute the interpreter, reconfigure the environment and, ideally, have all your language extensions recompiled against the new python API. We are looking into how to do the last part, but the first two things are up to the site or VO."
  • Update 11th August: This was raised by a VO in France. Is the answer given by SA3 OK? How do we move on from here? Helene will pass the feedback to the relevant people.
  • Update 1st September: The action can be closed; in the end the real solution was in a Savannah bug, and it was a problem with the YAIM environment.
Closed: 2008-09-01

Assigned to: JudiNovak
Due date: 2008-06-09
Description: Modify the SAM unavailability list on twiki, adding a section for availability of clients run by the VOs.
  • Update 11th June: This will be followed up with John and Judit immediately after the meeting on the 9th.
Closed: 2008-06-26

Review of Open Action Items

Open Action Items

Id Submitter Description Creation Due Assigned To

Actions Closed in Last 20 Days

Id Submitter Description Creation Due Assigned To Closed

Summary

  • LHCb reported that tmpwatch scripts and Python's tar implementation did not work well together. Files unpacked by jobs were deleted at the very first run of tmpwatch.
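One plausible reading of this summary item, which is an assumption and not stated in the minutes, is that Python's `tarfile` module restores the modification times stored in the archive, so freshly unpacked files can already look old to an age-based cleaner such as tmpwatch. Under that assumption, a job could reset the timestamps right after extraction. The helper name `extract_and_refresh` is hypothetical.

```python
import os
import tarfile
import time


def extract_and_refresh(archive_path, dest_dir):
    """Extract a tar archive, then set every extracted regular file's
    access and modification time to the current time.

    Returns the names of the regular files that were refreshed.
    """
    with tarfile.open(archive_path) as tar:
        tar.extractall(dest_dir)
        names = [m.name for m in tar.getmembers() if m.isfile()]
    now = time.time()
    for name in names:
        os.utime(os.path.join(dest_dir, name), (now, now))
    return names
```

Whether this helps depends on which timestamp the site's cleanup script checks; it does not replace keeping long-lived files fresh while the job runs.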

Next Meeting

The next meeting will be Monday, 09 Jun 2008 15:00 UTC (16:00 Swiss local time).

  • Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
  • The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0157610


Topic revision: r17 - 2008-09-01 - MaiteBarroso
 