WLCG-OSG-EGEE Ops' Minutes Mon 04 Aug 2008

Summary

Errors are being reported in GStat for the FNAL EGEE grid site (USCMS-FNAL-WC1). This is due
to CEs and SEs at the site being published in different site BDIIs. Discussions are under way
to decide whether this is a valid site configuration.
(https://gus.fzk.de/pages/ticket_details.php?ticket=34338).

The FTS development team have been asked to write suitable Information Providers to publish the
FTM end-points.

A check will be made on whether the "KCA" CA needs to remain in the set of CA RPMs which are
distributed to the EGEE grid.

Attendance

EGEE

  • Asia Pacific ROC: Min Tsai
  • Central Europe ROC: Malgorzata Krakowian
  • OCC / CERN ROC: Maite Barroso, Nick Thackray, Steve Traylen
  • French ROC: Cyril l'Orphelin, Helene Cordier, Pierre-Emmanuel Brinette, David Bouvet
  • German/Swiss ROC: Absent
  • Italian ROC: Absent
  • Northern Europe ROC: Gert Svensson
  • Russian ROC: Absent
  • South East Europe ROC: Kostas Koumantaros
  • South West Europe ROC: Gonzalo Merino
  • UK/Ireland ROC: Jeremy Coles
  • GGUS: Torsten Antoni

WLCG

  • WLCG Service Coordination: Harry Renshall

WLCG Tier 1 Sites

  • ASGC: Min Tsai
  • BNL: Absent
  • CERN site: Absent
  • FNAL: Catalin Dumitrescu
  • FZK: Absent
  • IN2P3: David Bouvet
  • INFN: Absent
  • NDGF: Absent
  • PIC: Gonzalo
  • RAL: Derek Ross
  • SARA/NIKHEF: Absent
  • TRIUMF: Absent

LHC Experiments

  • ATLAS: Alessandro di Girolamo
  • LHCb: Roberto Santinelli
  • CMS: Absent
  • ALICE: Absent

Feedback on Last Week's Minutes

None was given.

EGEE Items

Grid Operator Hand Over on Duty

       Primary Team         Secondary Team
From   ROC Central Europe   ROC North Europe
To     ROC France           ROC Asia Pacific

Report from Central Europe COD:

  1. Request to ROCs to remind people not to write ticket comments in languages other than English.
  2. GGUS ticket GGUS:34338 is assigned to GStat and was last updated on 2008-07-02. Some COD tickets depend on it. GStat is requested to provide an update.
    ACTION on OCC to follow up on this.

Report from North Europe COD:

  1. The SAMAP tests put some sites in critical error for not yet having the new CA rpms. Normal SAM tests give a warning for this now, as noted in the lcg rollout list.
    ACTION on OCC to follow this up off-line.

PPS Reports

  • UK/I ROC: No UKI PPS sites appear in the site reports area.
    Nick will look into this.

gLite Release News

Now in PPS:
  • gLite3.1 PPS Update34 has successfully passed deployment testing and will be released to the PPS within the next few days. This update contains:
    • DPM and LFC 1.6.11 (see details in PATCH:1987)
    • dCache 1.8.0-15p5 with new YAIM module for configuration

Soon in Production

  • gLite3.1 Update28 in preparation. This update has been delayed due to issues with the release process but will be released within the next few days. The release contains:
    • glite-CONDOR_utils for lcg-CE (PATCH:1856)
    • New version of gsoap plugin with a vulnerability fix (affecting LB, WMS, UI, WN, VOBOX, CE) (PATCH:1846)
    • Several bug fixes on WMS and clients (PATCH:1780)
    • New Short Lived Credential Service (SLCS), which allows users to obtain a short-lived personal certificate based on a Shibboleth AAI identity (PATCH:1693)
    • MyProxy version 1.6.1-7 (fixes build issue related to globus flavour, already deployed in production) (PATCH:1978)
    • Various improvements on lcg-extra-jobmanagers (CE) (PATCH:1942)
    • GFAL and lcg_util update with new function gfal_removedir and several bug fixes
    • FTS SL4 release (32 and 64 bit)

ACTION on Nick to find the probable date of the release of the CREAM CE into production.

Experience of countries/regions with the WMS?

Reminder to sites/ROCs to send us one paragraph on your WMS experience for compilation.

In the UK we are still trying to understand when to move to relying on the WMS and how many we require. What are the experiences of other countries/regions? Here is some background from a recent GridPP meeting:

"The RAL WMS lcgwms01 (SL3 host with gLite-WMS-2.4.9-0 and glite-LB-2.3.5-0) became heavily loaded on 22nd and user throughput suffered as a result. The underlying problem was not understood as the service returned to normal without a clear intervention required. This prompted SL to comment on WMS and RB availability in the UK. He noted 5 RBs (3 RAL; 1 Glasgow and 1 IC). He was only aware of the 1 WMS instance at RAL. As of today, the default server in Glasgow is a gLite 3.1 WMS instance (RB to be removed at the end of July and possibly replaced with another WMS). RAL maintains one test instance on SL4 – to be moved to production after further testing. IC has PPS-glite-WMS.i386 3.1.8-1. This WMS is stable with 20-30,000 jobs a day not causing a problem. NGS has an unadvertised WMS hosted at RAL. Grid Ireland run a WMS and has seen “quite a few issues” while working with users to get their apps working via it. Throughput performance of the WMS is good.

Stephen recently noticed that YAIM will soon be configuring UIs to work with service discovery (WMS and LBs will be discoverable through the information system using appropriate UI commands): https://savannah.cern.ch/bugs/?31211"

EGEE Items From ROC Reports

  • France: CA Update 1.24 has once again not been followed up properly, as the repositories were not updated along with the SAM tests update. Sites have been in Warning state. Proposal: ask that the CA Update Procedure is followed. When a delay occurs, the 7-day SAM countdown should be reset to 7 days.
    Several of the integration team left at the same time so there are new people in the team. They found it difficult to carry out the procedure following the documentation. The documentation will be updated to fix this.
    ACTION on OCC to pass this issue on to the owners of the process.

Ad-hoc points on the GOC DB

  • ATLAS asked for an update on the development for adding a "country" column to the table of down-times.
    Gilles said that this will be included in the update to the GOC DB which is to be released on Wednesday.

  • ATLAS noted that sometimes in the evenings it is difficult to access the GOC DB.
    UK/I ROC: What time of the day exactly?
    ATLAS: Will get the exact times when the problem is seen and submit a ticket.
    ACTION on ATLAS to submit a GGUS ticket with details of the issue.

WLCG Items

WLCG issues coming from ROC reports

  • None.

End points for FTM service at tier-1 site

There is a request to know what the FTM endpoints at the Tier-1 sites are.
We can collect these manually now, but how should the list be kept up-to-date?
ACTION on OCC to request the FTS developers to write suitable information providers.
ACTION on OCC to send out an e-mail to request this information from those tier-1 sites which haven't already provided it.

The list of FTM end-points we have so far is:

ASGC: http://w-ftm01.grid.sinica.edu.tw/transfer-monitor-report/
FZK: http://ftm-fzk.gridka.de/transfer-monitor-report/
IN2P3: http://cclcgftmli01.in2p3.fr/transfer-monitor-report/
FNAL: https://cmsfts3.fnal.gov:8443/transfer-monitor-report/
      https://cmsfts3.fnal.gov:8443/transfer-monitor-gridview/
PIC: http://ftm.pic.es/transfer-monitor-report/
RAL: Endpoint is still under test.

Upcoming WLCG Service Interventions

  1. Reminder that tomorrow, 5-Aug-2008, PIC will have a scheduled downtime from 6:00 to 18:00 UTC. The main services (CE and SE) will be affected.

WLCG Operations Review

See the report at https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek080728

ALICE Service

No-one from ALICE was present at the meeting.

ATLAS Service

  1. ATLAS events in August: Kors - We will organize a last Jamboree before LHC turn-on on Thursday August 28 and a preliminary agenda can be found at: http://indico.cern.ch/conferenceDisplay.py?confId=38738 We would really appreciate it if representatives of at least all Tier-1's but also of the major Tier-2's will be there, but of course everybody is welcome. The Friday we will use for tutorials and training but we can also organize some extra meetings if needed. The Monday through Wednesday of that same week there will be an Analysis workshop with a focus on tools and development. We have reserved the IT Aud. for that whole week and a video link will be set up also.
    Jeremy (UK) asked who should be contacted about the agenda.
    ACTION on Alessandro to send an e-mail to Maite and Jeremy with details.


  2. Xavi (by email): We will organize a Tutorial and Training session on the 29th of August, just the day after the ATLAS Tier-1&2&3 Jamboree. Preliminary agenda can be found at: http://indico.cern.ch/conferenceDisplay.py?confId=38864 I especially encourage potential future shifters, the current shift crowd and site contacts to attend. We will have tutorials for the fundamental services and systems in ATLAS, and also a special monitoring training session based on the ATLAS dashboards (especially interesting for site contacts, as one can see if a site is performing well, either in data management or in simulation production, which is very useful to spot, track and debug problems).

  3. Massimo & Johannes - as discussed in the last month and presented in the last two ADC weekly meetings, we are going to have an ATLAS analysis workshop on August 25-27 at CERN. The outline of the session is online on indico ( http://indico.cern.ch/conferenceDisplay.py?confId=38560 ). We feel that the workshop will be a good opportunity to consolidate the successful experience in grid analysis of our Ganga and pAthena and continue to build on that. We insist on the "workshop" format because we feel that the three days of the event will be best used in technical discussion (with few formal presentations).

ATLAS will be in near-to-continuous cosmic data taking from now on.

CMS Service

Apologies received.

LHCb Service

  1. An EGEE broadcast was sent today about the new VOMS "pilot" role, which must be configured at every site. This role is intended for running generic pilots and will be used only to submit through a CE and run glexec.

  2. LHCb stressed the importance of Savannah bug http://savannah.cern.ch/bugs/?39641 (user proxy mix-up for job submissions too close in time), which should be escalated at the EMT.
    ACTION on Nick to escalate this at the EMT.

  3. SAM test results when the experiment framework changes: we migrated the SAM suite for the CE from DIRAC2 to DIRAC3 and would like to advertise (a posteriori) that most of the bad results for this service are due to that. What is the recommended procedure to exclude these test results from the final site availability computation?
    Answer from SAM: It is better to discuss this offline with them, but either they set the test as 'non-critical' (a priori), or they come to us and say that from day X to day Y they would like test Z to be 'non-critical' (a posteriori), but before the end of the month (before the sites' availability is calculated). We are discussing how to deal with this particular situation; we have already implemented other mechanisms to deal with cases where tests are submitted and fail due to SAM problems.
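
For illustration only, the a posteriori option described in the answer from SAM could be modelled as below. This is a simplified sketch and not SAM's actual availability algorithm: it assumes one result per test per day and counts a day as available when every critical test passed. The function name, the test name "job-submit" and the dates are hypothetical.

```python
from datetime import date


def availability(results, critical_tests, overrides=()):
    """Fraction of days on which all critical tests passed.

    results:   iterable of (day, test_name, passed)
    overrides: iterable of (test_name, first_day, last_day); within such a
               period the test is treated as non-critical (hypothetical model)
    """
    def is_critical(test, day):
        if test not in critical_tests:
            return False
        return not any(t == test and first <= day <= last
                       for t, first, last in overrides)

    day_ok = {}
    for day, test, passed in results:
        ok = passed or not is_critical(test, day)
        day_ok[day] = day_ok.get(day, True) and ok
    if not day_ok:
        return None
    return sum(day_ok.values()) / len(day_ok)


# Hypothetical month fragment: the test fails on two days of a framework migration.
results = [
    (date(2008, 7, 1), "job-submit", True),
    (date(2008, 7, 2), "job-submit", False),
    (date(2008, 7, 3), "job-submit", False),
    (date(2008, 7, 4), "job-submit", True),
]
without_override = availability(results, {"job-submit"})
with_override = availability(
    results, {"job-submit"},
    overrides=[("job-submit", date(2008, 7, 2), date(2008, 7, 3))],
)
```

In this model the override has to be registered before the monthly figure is computed, which matches the "before the end of the month" condition in the answer.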

Ad-hoc discussion on VOMS "issues" during the week

  • Steve reported that the upgrade of VOMS today went fine.

  • However, the UK CA certificate was also updated and much confusion was caused as VOMS sent out an e-mail to the affected users telling them that their certificate was no longer valid: the e-mail was referring to the old certificate but this wasn't obvious.

OSG Items

No-one from OSG was present at the meeting.

AOB

  • Why is the "experimental" CA, KCA, added to the IGTF distribution? Answer: it is "trusted" by LCG and has always been there.
    FNAL stated that US CMS do not use KCA.
    ACTION on Steve to find out if we can drop this CA from the list.

Action Items

Newly Created Action Items

Assigned to Due date Description State Closed Notify  
Main.OCC 2008-09-01 Follow up on GGUS:34338. Concerns gstat sanity error at FNAL.

Update 8th August: Solved. As was suggested in April, the CEs should not be present in the GOCDB. The GOCDB contains a list of EGEE site BDIIs, and services not under those site BDIIs are not at the site as far as GOCDB/gstat or EGEE is concerned. Steve.

Close this action after next meeting.

Closed: 2008-08-11
Main.OCC 2008-08-18 SAMAP is giving critical errors rather than warnings when sites do not update their CA RPMs 7 days prior to the deadline for update.

Update 25th August SAMAP will follow-up "later"
Update 8/9/08: Nick will follow-up.
Update 13th October The tool development team has fixed the bug.

Closed: 2008-10-17
Main.OCC 2008-08-25 Find the probable release date of the CREAM CE.

Update 25th August: This will be released in the next update to gLite 3.1 - within 1-2 weeks.
Update 1st September: After the update at today's meeting, this action can be closed: the EMT decided to delay the deployment of the CREAM CE (the certified patch). This is because a WMS that is not ICE-enabled could accidentally match the CREAM CE and cause a submission failure. While waiting for the ICE-enabled WMS to be deployed, as a workaround, the CREAM CE will be released with GlueServiceStatus = ‘Production’, to be changed again later. One issue is the old, unsupported version of the WMS on SL3. As these instances will not be integrated with ICE, they would fail to submit once the CREAM CE is advertised again in real production mode. To size this issue up, we would like to get from the WLCG EGEE Operations Meeting an estimate of the number of old SL3 WMS instances still in production.

Closed: 2008-09-01
Main.OCC 2008-08-18 Make the owners of the CA RPM release process aware of the issues raised by ROC France.

Update August 19th: Maite has some news?

Update 1st September: SAM agrees to extend the 7-day period in this specific case, i.e. when the CA RPMs are not put in the repository within the 1 day scheduled for this. Technically it is feasible and already implemented. See the diagram and explanations here:
https://twiki.cern.ch/twiki/bin/view/LCG/SAMSensorsTests#CE_sft_caver

In short, the diagram shows that it is possible to configure:
- time-stamp from which countdown of timeout starts
- delay of warning
- timeout before sites will get CRIT error
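
As a rough illustration of how the three configurable values listed above could interact, here is a minimal sketch. It is not the SAM implementation: the class, the field names and the default values are hypothetical, and the only figure taken from these minutes is the 7-day period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CaCountdown:
    # Hypothetical names; they mirror the three items in the list above.
    start: datetime                                   # time-stamp from which the countdown starts
    warning_delay: timedelta = timedelta(days=0)      # delay before a warning is raised
    critical_timeout: timedelta = timedelta(days=7)   # timeout before a site gets a CRIT error


def ca_status(cfg: CaCountdown, now: datetime, site_updated: bool) -> str:
    """Return OK, WARN or CRIT for a site at time `now`."""
    if site_updated:
        return "OK"
    elapsed = now - cfg.start
    if elapsed >= cfg.critical_timeout:
        return "CRIT"
    if elapsed >= cfg.warning_delay:
        return "WARN"
    return "OK"


# Countdown started on the announcement date (dates are made up).
original = CaCountdown(start=datetime(2008, 8, 1))
# Repository was populated 3 days late, so the start time-stamp is moved.
reset = CaCountdown(start=datetime(2008, 8, 4))

check_time = datetime(2008, 8, 8)
status_original = ca_status(original, check_time, site_updated=False)
status_reset = ca_status(reset, check_time, site_updated=False)
```

Moving only the start time-stamp is what resets the countdown when the RPMs reach the repository late; the warning delay and the timeout stay unchanged.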

Update 10/9/08: Although Nick doesn't understand the text, he said that the ticket can be closed (SAM implemented what was asked).

Closed: 2008-09-12
Main.ATLAS 2008-08-18 ATLAS to submit a GGUS ticket detailing the problems of slow response of the GOC DB seen in the evenings.

Update 19th August - Slow response no longer obvious, will close and reopen if need be.

Closed: 2008-08-19
Main.OCC 2008-08-18 Submit a request to the FTS developers to provide suitable information providers for publishing the FTM end-points.
Update 8th August 2008: bug now submitted, GGUS:39906.

The action can be closed.

Closed: 2008-08-11
Main.Alessandro_di_Girolamo 2008-08-18 Send details to Maite (Maria . Barroso . Lopez @ cern . ch) and Jeremy Coles (j . coles @ rl . ac . uk) of who to contact regarding the agenda of the ATLAS events at CERN during the week 25-29 August.

Closed: 2008-08-19
NickThackray 2008-08-11 On the request of LHCb, escalate the bug BUG:39641 at the EMT.

The action can be closed.

Closed: 2008-08-11
SteveTraylen 2008-08-20 Check if KCA is still needed in the lcg-CA CA set.

Was raised at this week's LCG MB. Fermilab representatives are checking internally if it is still needed.

Update 19th of August: The LCG MB meets today and this will hopefully be resolved.

Update 25th August: The KCA will soon be officially approved as a trusted CA. Also, it is being used by the CDF VO. Therefore, KCA will remain in the list of CAs.

Closed: 2008-08-27

Review of Open Action Items

Open Action Items

Id  Submitter  Description  Creation  Due  Assigned To

Actions Closed in Last 20 Days

Id  Submitter  Description  Creation  Due  Assigned To  Closed

Next Meeting

The next meeting will be Monday, 11 August 2008 16:00 UTC+2 (16:00 Swiss local time).

  • Attendees can join from 15:45 UTC+2 onwards.
  • The meeting will start promptly at 16:00 UTC+2.
  • The WLCG section will start at the fixed time of 16:30 UTC+2.
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0157610


Topic revision: r12 - 2008-10-17 - NickThackray