WLCG-OSG-EGEE Ops Minutes Mon 28 Jan 2008

Attendance

EGEE

N.B. To avoid the time-consuming roll call, we will use the list provided by the concall software. Please put your affiliation in parentheses after your name when you register for the call. Also, be sure to check the minutes to ensure that your presence has been correctly noted.

  • Asia Pacific ROC: Min Tsai
  • Central Europe ROC: Marcin Radecki (CE)
  • OCC / CERN ROC: John Shade, Antonio Retico
  • French ROC: Gilles, Osman
  • German/Swiss ROC: Clemens Koerdt, Sven Hermann
  • Italian ROC: Alessandro Cavalli
  • Northern Europe ROC: Jules Wolfrat
  • Russian ROC: Alexander Kryukov, Emanouil Atanassov
  • South East Europe ROC: Kostas Koumantaros
  • South West Europe ROC: Kai Neuffer, Gonzalo Merino
  • UK/Ireland ROC: Jeremy Coles
  • GGUS: Helmut Dres
  • OSCT: Absent

WLCG

  • WLCG Service Coordination: Harry Renshall

WLCG Tier 1 Sites

  • ASGC: Min
  • BNL:
  • CERN site:
  • FNAL: Joe Kaiser
  • FZK: Sven Hermann
  • IN2P3: Pierre
  • INFN: Alessandro
  • NDGF: Leif
  • PIC: Gonzalo
  • RAL: Jeremy
  • SARA/NIKHEF: Ron
  • TRIUMF: Rod Walker

Reports Not Received

  • WLCG Tier 1s:
  • VOs:
  • EGEE ROCs (Prod Sites):
  • EGEE ROCs (PPS Sites): AP, IT, SEE, SWE

Feedback on Last Week's Minutes

UKI were actually present, although listed as absent. Chair encouraged everyone to check the minutes for accuracy before each meeting.

Chair mentioned that Jamie Shiers had requested that, if possible, Tier 1 site reps attend the CCRC'08 concall at 17:00, so we would ensure that this meeting finished in plenty of time for that. The link to the CCRC'08 conference page was given: http://indico.cern.ch/conferenceDisplay.py?confId=26923

EGEE Items

Grid Operator Hand Over on Duty

        Primary Team   Secondary Team
From    AP             CERN
To      SWE            IT

Issues and replies available on the agenda page.

CIC dashboard had been reported as unstable. Problem was an abnormally high load on the server. Problem is understood, but not fixed yet (Osman).

GOCDB should only contain nodes that are meant to be in production. No comments to Cyril's explanation on the agenda page.

PPS Reports

  • No reports received from ROCs
  • Antonio (PPS) gave a reply to a point raised by CE ROC about bugs found in the release soon after the deployment in production (issue raised in EGEE ROC reports). Part of the issue was that PPS cannot detect all bugs because it cannot test all scenarios (e.g. classic SE), and part was due to a problem in the communication process (which has been addressed). More details in the next section.
  • Release News:
    There were three releases to PPS last week, mainly dealing with software needed for CCRC08. Details available in
    https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes

"Transparent" Interventions

How to handle notifications of transparent interventions?
Rolf: when the new Scheduled Downtime procedure was introduced, a new state appeared: downtimes at risk. No definition was given. The new state might cover what we describe, but GridView treats it as normal downtime (i.e. site unavailable). Rolf underlined that coordination between the different tools is needed.
For "at risk" Jeremy found https://tycho.gocdb.eu/help/downtime/ but this does not mention the GridView issue. Also, it does not mention the word "transparent", though this might be the case.

Update: use GOCDB to declare a downtime with AT_RISK severity. This is effectively equivalent to declaring a "transparent" intervention. GridView was modified (30/1/08) so that AT_RISK is no longer treated as downtime in its availability calculations.
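
[ The effect of that change can be illustrated with a minimal sketch. It is not GridView's actual code, which these minutes do not describe; the `Downtime` record, the hour-based bookkeeping and the "OUTAGE" severity label are assumptions made for illustration. Only the rule itself, that AT_RISK is not counted as downtime, comes from the update above. -Ed. ]

```python
from dataclasses import dataclass


@dataclass
class Downtime:
    """Hypothetical downtime record; field names are illustrative only."""
    start_hour: int  # first hour covered (inclusive)
    end_hour: int    # first hour no longer covered (exclusive)
    severity: str    # "AT_RISK", or an assumed "OUTAGE" label for real downtime


def scheduled_down_hours(downtimes, period_hours):
    """Count the hours in [0, period_hours) covered by real downtime.

    AT_RISK entries are skipped, so a "transparent" intervention
    does not reduce the computed availability.
    """
    down = set()
    for d in downtimes:
        if d.severity == "AT_RISK":
            continue
        # Clip each downtime to the reporting period and merge overlaps.
        down.update(range(max(0, d.start_hour), min(period_hours, d.end_hour)))
    return len(down)


def availability(downtimes, period_hours):
    """Fraction of the period during which the site counts as available."""
    return 1 - scheduled_down_hours(downtimes, period_hours) / period_hours


day = [Downtime(0, 6, "AT_RISK"), Downtime(10, 14, "OUTAGE")]
print(scheduled_down_hours(day, 24))  # only the four OUTAGE hours are counted
```

With the old behaviour described by Rolf, the six AT_RISK hours in this example would also have been counted as unavailable.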

EGEE Items From ROC Reports

  1. (ROC CERN): SE and SRM tests failed at FNAL with timeouts after 600s. FNAL admins propose increasing the timeout to 1200s (20 minutes), and a feature request for this was submitted to Savannah @ CERN. SAM developers see no major issues with increasing the timeout. Other ROCs are requested to comment; otherwise the new timeout is approved. In parallel, FNAL admins continue to work with the SRM team on better handling of overloads.
    • Andrea Sciaba (CMS): the decision to increase the timeout should also be discussed with the VOs, because so far the timeout is a global parameter affecting all the SAM tests. It is feasible to configure SAM so that the timeout differs for each sensor, but again this should be done with an eye to the constraints imposed by the VO application on the data transfers.
    • Chair commented that impacting all SAM tests because of problems at one site during specific times of the day was a sledge-hammer approach.
    • Joe Kaiser (FNAL): we would like the item to be reported at the next meeting, and we would like to know which body has to make this decision.
    • OCC to investigate the issue further and propose a solution. (action)
  2. (ROC Central Europe): Three gLite releases in a row contained bugs resulting in SAM errors at sites just after deployment. Could we do something to avoid that?
      update 10: https://gus.fzk.de/ws/ticket_info.php?ticket=31596
      update 11: https://savannah.cern.ch/patch/?1654
      update 12: https://gus.fzk.de/ws/ticket_info.php?ticket=31802
    Two of these bugs were found in PPS but nevertheless got into the production release. Is there a problem with the release procedure between PPS and the release team?
    • Antonio (PPS): Almost all the patches released last week were deployed in production as soon as they were certified. The PPS phase was reduced to the pre-deployment testing (installation and configuration tests). The pre-deployment test is a PPS internal facility, meant to protect the pre-production from the introduction of bugs, and for the time being it does not cover all the scenarios existing in production (namely, the ClassicSE is not covered). The PPS stage of these patches (~2 weeks) was clearly not completed. This was done in order to provide the T1 sites with the software they urgently needed for CCRC08.
      Regarding the two bugs raised in pre-production, that was actually a communication flaw in the procedure, which was revealed by the high rate of releases. The flaw has been fixed, so the risk of this kind of glitch is reduced from now on.
  3. (ROC North Europe): It is not clear how sites should proceed if they want to test services before going into production, see GGUS ticket 31311.
    The suggested procedure:
      1. create entry in GOCDB
      2. put service in scheduled downtime
      3. create BDII entry
    is clearly a workaround to the problem. A different, better approach has to be studied and discussed.
    • Vera (NDGF) stated that "all these things have changed in a way that is not positive for us".
    • Antonio said that the changes were decided by the ROCs, but Jules (ROC NE) begged to differ. He said that the mismatch between GOCDB and BDII raises tickets, and that was not necessarily the intention. Observing a mismatch is OK, but claiming that the service is not OK as a result is not.
    • Helene agreed, and said that Steve & Nick had told the COD to raise tickets.
    • Chair will ask Nick to clarify things for the next meeting; there seems to be general unhappiness with the current state of affairs.
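
[ Jules's distinction between observing a mismatch and declaring a service faulty can be sketched as follows. This is an illustration only: the two host lists are assumed to have been obtained already (no GOCDB or BDII query is shown), and the function names and example hostnames are hypothetical. -Ed. ]

```python
def compare_registries(gocdb_hosts, bdii_hosts):
    """Return the hosts that appear in one registry but not the other."""
    gocdb = set(gocdb_hosts)
    bdii = set(bdii_hosts)
    return {
        "only_in_gocdb": sorted(gocdb - bdii),
        "only_in_bdii": sorted(bdii - gocdb),
    }


def report(mismatch):
    """Turn mismatches into informational notes rather than failures.

    Nothing here asserts that a service is broken; a node under test
    may legitimately appear in one registry before the other.
    """
    notes = []
    for host in mismatch["only_in_gocdb"]:
        notes.append(f"NOTE: {host} is registered in GOCDB but not published in BDII")
    for host in mismatch["only_in_bdii"]:
        notes.append(f"NOTE: {host} is published in BDII but not registered in GOCDB")
    return notes


# Hypothetical hostnames, for illustration only.
mismatch = compare_registries(
    ["ce01.example.org", "se01.example.org"],
    ["se01.example.org", "se02.example.org"],
)
for line in report(mismatch):
    print(line)
```

Whether such a note should ever become a ticket is the policy question left for Nick to clarify.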

WLCG Items

Tier1 Reports

None presented

WLCG issues coming from ROC reports

None discussed

Upcoming WLCG Service Interventions

Min (ASGC): Major intervention starting next Wednesday to move some servers from one computer room to another. Some service disruption to be expected.
Gavin (CERN): Two interventions happening tomorrow (Tuesday)
  • Upgrade to the latest CASTOR version - 2.1.6. This new version brings important bug fixes and some new interesting features. Intervention will start at 9h00 and is expected to finish by 15h00.
  • Transparent upgrade of all CERN LFC servers, scheduled from 9h00 onwards, to move them to SLC4 and upgrade LFC to version 1.6.8.
Maria (CERN): Oracle Critical Patch Update will be applied on ATLR and ATONR databases (Atlas offline and Atlas online databases). The patch is rolling, no user-visible downtime foreseen.

FTS Service Review

Nothing to report.

ATLAS Service

https://cic.gridops.org/index.php?section=vo&page=weeklyreport&view_report=1523&view_week=2008-05&view_vo=1#rapport

Alessandro (Atlas) asked for an update about the issue raised two weeks ago of sites publishing wrong storage space information. He would like information on instructions for publishing storage space. These were apparently sent to Nick by Flavia, but we need to ensure that the sites have received them.

[ This information may or may not be of interest (RAL's implementation of a publisher for CASTOR): https://www.gridpp.ac.uk/wiki/RAL_Tier1_CASTOR_Accounting -Ed. ]

Antonio (OCC, offline): Additional configuration information was provided by Flavia Donno. The action is on OCC to submit a ticket to the ROCs. (action) Some DPM problems were observed at a few sites (DPM query-conf not giving output). ATLAS would like to cross-check the info with what's published in the BDII.

ALICE Service

No report received

CMS Service

No report received

LHCb Service

https://cic.gridops.org/index.php?section=vo&page=weeklyreport&view_report=1503&view_week=2008-05&view_vo=3#rapport

GGUS ticket #31800 assigned to the operations. https://gus.fzk.de/pages/ticket_details.php?ticket=31800

Roberto: GGUS ticket opened requesting all sites to provide a detailed SRMv2 status page. Antonio (OCC) has the ticket & will forward to the ROCs (with a template from Flavia Donno). (action)

WLCG Service Coordination

Harry simply reminded the attendees about the 17:00 CCRC'08 meeting.

OSG Items

Nothing to report, other than they're working with SAM team to get some OSG information from the RSV project into SAM.

AOB

Maria Girone (CERN 3D DB): In a couple of weeks we will apply an Oracle security patch on production RACs for CMS and LHCb (it was done for Atlas). The intervention will also concern services at CNAF for LHCb LFC and will require a downtime of SARA.

Review of Action Items

The list of actions has not been updated since December, so Chair proposed to sort things out and do a full review next time. The list of actions is henceforth to be found in WlcgOsgEgeeOpsMeetingMinutes.

New Action Items from this Meeting


  # | Action | Owner | Start | Due date
  1 | Clarify "at risk" downtime & interaction with tools (esp. GridView) | John | 28/1 | Done 31/1
  2 | What to do about FNAL & SAM timeouts? | John | 28/1 | 3/2
  3 | How to handle BDII/GOCDB mismatches, and the issue of introducing new sites? | Nick | 28/1 | 3/2
  4 | Ensure instructions for publishing storage space reach sites (ATLAS) | Antonio | 28/1 | 3/2
  5 | Request all LHCb sites to provide a detailed SRMv2 status page | Antonio | 28/1 | 3/2

1. Submitted Savannah bug 33104 against GridView. They fixed the GOCDB synchronizer code (gocdb3_query.php) to handle AT_RISK downtime (intervention) correctly.

2. Piotr (Mr SAM) confirmed that site-specific timeouts are not an option. Also, modifying timeouts just for the DPM tests would take a while, and would require agreement from all VOs & ROCs (it would potentially increase the time to detect real DPM problems). One could argue that if the SRM tests are timing out after ten minutes, the SRM is probably not of much use to users at that time either. Therefore, tweaking SAM to mask the problem is not a good solution. Nevertheless, he suggested that FNAL investigate a local workaround, such as increasing the priority of ops monitoring jobs. Joe was notified of this, & we await his feedback.

3. Will be discussed by the ROC managers in Lyon next week (Tuesday 5th).

4. & 5. Currently being tracked by Antonio

  • 4 → tickets GGUS:32064 (ROC UKI), GGUS:32065 (ROC Russia), GGUS:32067 (ROC DECH), GGUS:32068 (ROC AP), GGUS:32070 (ROC France) submitted to track the issue → Action being tracked in GGUS → Close.
  • 5 → after discussions with LHCb we agreed that asking each sysadmin to set up and keep up to date pages with such a level of detail is prohibitive. On the other hand, the information from PIC seems to be already fairly exhaustive and might be considered a valid template for the rest of the T1s. We converged on the following:
    1. Flavia will maintain (centrally) these pages starting from: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCommonComputingReadinessChallenges
    2. Each site should however provide some information that would otherwise be very difficult to gather: space name setup, tokens defined, disk space allocated for each, and pool configuration (PIC style).
    3. Flavia will complement what is missing with her test results and her indirect measurements, so that LHCb has a clear picture about what is
    4. Each ROC/sysadmin should check these pages and update them if they do not reflect the real situation or lack some information.
    5. Each ROC/sysadmin is free (and welcome) to modify the content of these TWiki pages.
      As the action is being tracked in GGUS, we suggest closing it as far as the ops meeting is concerned.

Next Meeting

The next meeting will be Monday, 03 Feb 2008 15:00 UTC (16:00 Swiss local time).

  • Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
  • The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0157610


Topic revision: r8 - 2008-03-03 - SteveTraylen
 