WLCG-OSG-EGEE Ops' Minutes Mon 03 Mar 2008

Attendance

EGEE

  • Asia Pacific ROC: Min
  • Central Europe ROC: Marcin
  • OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Steve Traylen, Maite, Maria, Harry, Roberto, Alessandro, Flavia.
  • French ROC:
  • German/Swiss ROC: Sven Hermann, Clemens
  • Italian ROC: Paolo
  • Northern Europe ROC: Gert
  • Russian ROC: Lev
  • South East Europe ROC: Kostas
  • South West Europe ROC: Kai
  • UK/Ireland ROC: Derek, Matt, Catalin
  • GGUS: Torsten
  • OSCT: Absent

WLCG

  • WLCG Service Coordination: Harry

WLCG Tier 1 Sites

  • ASGC: Min
  • BNL: Absent
  • CERN site: Ulrich
  • FNAL:
  • FZK: Sven Hermann, Torsten
  • IN2P3:
  • INFN:
  • NDGF: Leif
  • PIC: Kai
  • RAL: Derek, Matt, Catalin
  • SARA/NIKHEF:
  • TRIUMF:

Reports Not Received

  • VOs:
  • EGEE ROCs (Prod Sites): SouthWest Europe

Feedback on Last Week's Minutes

None were given.

EGEE Items

Grid Operator Hand Over on Duty

        Primary Team    Secondary Team
From    CERN            CE
To      Italy           SWE

  • Issues from the CERN ROC
    • The CEs at CERN-PROD were under heavy load and generated many alarms for the COD. However, the CEs are behaving as expected: when overloaded, they disappear from the information system so as not to receive more jobs. The monitoring tools should be able to detect this condition by correlating SAM and gstat results. A warning about the overload could possibly be sent to the site, but not an error report.
      • An action item will be created by the OCC.
    • The next COD should not open tickets for alarms about "RGMA-host-cert-valid failed on LRZ-LMU". This is a bug in the SAM tests (BUG:32497).
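The correlation suggested above could look roughly like the following sketch. It is illustrative only: the `CEObservation` record, its field names and the `load_high` indicator are assumptions made for this example, not part of SAM or gstat. The idea is that a failed SAM test on a CE that has withdrawn itself from the information system under load produces a warning rather than an error.

```python
from dataclasses import dataclass


@dataclass
class CEObservation:
    """One monitoring snapshot for a CE (hypothetical structure)."""
    host: str
    sam_ok: bool      # did the SAM CE tests pass?
    published: bool   # does gstat still see the CE in the information system?
    load_high: bool   # hypothetical load indicator supplied by the site


def classify(obs):
    """Return 'ok', 'warning' or 'error' for a single CE observation."""
    if obs.sam_ok:
        return "ok"
    # SAM failed, but the CE withdrew itself because of load:
    # expected behaviour, so warn the site instead of raising an alarm.
    if not obs.published and obs.load_high:
        return "warning"
    # Any other SAM failure is treated as a genuine error.
    return "error"
```

With such a rule the COD would only see alarms for CEs that fail SAM tests while still published, or that are unpublished without being overloaded.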

PPS Reports and Issues.

  • Test of 64-bit WNs in PPS: The natively compiled 64-bit WNs have reached the pre-production phase and are ready to be deployed. We want to deploy them in PPS in the way most convenient for the VOs to use them. Last week we sent two messages:
    • To the VOs (through the EIS team), asking for expressions of interest and for suggestions and feedback on possible testing scenarios. We received a reply from LHCb and are now working to address their requirements. Are other VOs interested in being involved in a pre-production activity with 64-bit WNs?
      • Comments in meeting: LHCb ships all the middleware in the LCG area anyway, but would still like to check this out. It is expected to work, but the check is still worth doing. LHCb would like a CE advertised at CERN in such a way that it is not in production, e.g. via GlueServiceStatus.
      • ATLAS will check if someone is interested.
    • To the PPS sites, asking for sites willing to dedicate 64-bit machines to this deployment and testing activity. So far only CERN and one site in Baltic grid have volunteered to pilot their 64-bit deployment through PPS. Are other sites willing to support this testing platform?
  • PPS Release News:
    1. gLite 3.1.0 PPS Update 19 passed the pre-deployment testing and is now being deployed in PPS:
      • WN 3.1 for SL4 64-bit
      • glite-LSF_utils
      • lcg-vomscerts-4.8.0, which adds the next certificate for biomed + egeode
      • new version of lcg-ManageVOTag fixing bug GGUS:31848
      Release notes are in PPSReleaseNotes_310_PPS_Update19.
    2. gLite 3.1.0 PPS Update 20 was released to PPS and is going through the pre-deployment test. The update introduces the MONBOX on the 3.1 baseline (for SLC4).

EGEE Items From ROC Reports

  • [Question] As the classic SE will be phased out soon, are there plans to continue development of that MW service? There are other developments built on the classic SE like http://www.isgtw.org/?pid=1000820 (bridging the islands of SRM and SRB). What is the current status of the future SRM2 interface for SRB (ASGC)? (FhG SCAI)
    • There is an SRM 2.2 interface to SRB but the status is unknown. Flavia is waiting for an endpoint.
    • Min: There is an alpha/beta endpoint. Min will follow up and produce some links or a status update.

gLite Release News.

An update to gLite (3.1 Update 15) will be released very soon (today), containing the new certificate of the VOMS server for the VOs biomed and egeode.

  • There was discussion in the meeting about the need to do this, given that YAIM now supports trusting a VOMS server by DN only.
  • Consequently this is hopefully the last time this is done.

Operations Tools downtimes this week

SAM
Downtime will begin at: 07:45h UTC, 4th March (08:45h Geneva time)
Downtime will end at: 10:45h UTC, 4th March (11:45h Geneva time)
Affected services are: GRIDVIEW, SAM and FCR
GOC DB
GOCDB was down on 28/02 (announced by the CIC portal team). There was no announcement from GOCDB about this failure, nor about the return to service (from ROC France).

Extra items on CIC Portal.

  • A cache is now in place on the CIC portal to survive a GOCDB going down.

Heinz Stockinger still blocked at some sites

Heinz Stockinger is still blocked at some sites and has asked if these sites can grant him access again. The list of CEs is:
  • ce00.hep.ph.ic.ac.uk
  • ce01.marie.hellasgrid.gr
  • ce01.tier2.hep.manchester.ac.uk
  • ce02.esc.qmul.ac.uk
  • ce02.tier2.hep.manchester.ac.uk
  • ce05.pic.es
  • ce06.pic.es
  • ce07.pic.es
  • dgc-grid-40.brunel.ac.uk
  • dgc-grid-44.brunel.ac.uk
  • egee-ce1.gup.uni-linz.ac.at
  • grid002.jet.efda.org
  • gw-2.ccc.ucl.ac.uk
  • helmsley.dur.scotgrid.ac.uk
  • mars-ce2.mars.lesc.doc.ic.ac.uk
  • serv03.hep.phy.cam.ac.uk
  • svr016.gla.scotgrid.ac.uk
  • t2ce02.physics.ox.ac.uk
  • t2ce03.physics.ox.ac.uk

Please pass this on to the sites and ask whether, if possible, he can be permitted to access the resources again.

WLCG Items

WLCG recommendation: DPM and filesystem choice

It has been proven that the ext3 filesystem performs far worse than the XFS filesystem for file deletion operations. In particular, deleting 2048 files of 1.5 GB each takes 5 seconds on XFS and 90 minutes on ext3. Therefore, I think we should recommend that sites running DPM migrate from ext3 to XFS, if possible. In fact, running XFS does not have any adverse effect, only benefits.
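Taken at face value, the figures above give a ratio of 90 × 60 / 5 = 1080 between ext3 and XFS deletion times. A site wishing to check its own disk servers could use something like the following sketch. The function name, file naming and default sizes are assumptions chosen for this example (the defaults are deliberately tiny, whereas the quoted numbers refer to 2048 files of 1.5 GB), and it is not the test that produced the figures above.

```python
import os
import tempfile
import time


def time_deletion(directory, n_files=16, size_bytes=1024 * 1024):
    """Create n_files files of size_bytes each in directory, then time their removal."""
    paths = []
    chunk = b"\0" * size_bytes
    for i in range(n_files):
        path = os.path.join(directory, f"delete-test-{i:04d}.dat")
        with open(path, "wb") as f:
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # make sure the data really reaches the disk
        paths.append(path)

    # Only the unlink calls are timed, not the file creation.
    start = time.perf_counter()
    for path in paths:
        os.unlink(path)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Point the temporary directory at the filesystem under test via dir=...
    with tempfile.TemporaryDirectory() as d:
        print(f"deleted files in {time_deletion(d):.3f} s")
```

Running it once on an ext3 partition and once on an XFS partition, with realistic file counts and sizes, would give a site-local comparison.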

Upcoming WLCG Service Interventions

CERN
There will be a SCHEDULED downtime for SRM at CERN on 06-03-2008 from 08:00 to 12:00 (UTC+1) for the 2.1.6 CASTORLHCB upgrade. The machines are: castorsrm, srm.cern.ch, srm-durable-lhcb, srm-lhcb.cern.ch.
CERN
There was an UNSCHEDULED downtime during the weekend: lcg-voms.cern.ch was down due to a hardware problem. It is now back in service. Although the problem is not fixed, we will do our best to prevent this happening again (this requires a hardware change in the future). Note that voms.cern.ch wasn't affected, so voms proxy and gridmap file generation were fine during the weekend. This affected ALICE, ATLAS, CMS, LHCb, DTEAM, OPS, Sixt, Unosat, Geant4, Gear.

ATLAS Service

  • ATLAS asks all T2s to implement SRM 2.2 before the 2nd of April, to leave time to test the full system before CCRC08 phase 2.
  • ATLASMCDISK and ATLASDATADISK are the two space tokens that need to be implemented first.
  • Details about space tokens (configurations, quota, etc.) are in SpaceTokens#Tier_2 and in ATLAS-Document-for-FDR-and-CCRC-v10.pdf (in particular pages 7 and 9).

After the meeting the following data was produced: atlas-gluece-by-os.txt. It shows the number of GlueSubClusters in the whole grid that have each OS and then, for each OS, the sites and queues that ATLAS can use.

ALICE Service

Absent.

CMS Service

Nothing.

LHCb Service

  • The longest queue at NIKHEF is too short; LHCb will request that it be made longer. A ticket will be submitted.
  • Normalisation of CPU on the Grid is being discussed again.
    • Within CERN, for instance, we see that a machine with a normalization factor 2n takes one third of the time needed by a machine with factor n, instead of the expected one half.
      • Will be followed up in particular with CERN.
    • In WLCG as a whole the normalization of the CPUTime - even at batch system level - might not be (is not) accurate.
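The discrepancy reported by LHCb can be made concrete with a small worked example. The numbers below (a factor of 1.0 for the reference machine and a one-hour job) are assumptions chosen purely for illustration. If the normalization were accurate, raw time multiplied by the factor would be the same on both machines.

```python
def normalized_time(raw_seconds, factor):
    """Normalized CPU time: raw time scaled by the machine's normalization factor."""
    return raw_seconds * factor


def implied_factor(reference_factor, reference_seconds, observed_seconds):
    """Factor a machine would need for its normalized time to match the reference."""
    return reference_factor * reference_seconds / observed_seconds


# Hypothetical numbers: reference machine with factor n = 1.0 runs the job in one hour.
n = 1.0
reference_seconds = 3600.0

# A machine published with factor 2n is observed to need one third of the time.
observed_seconds = reference_seconds / 3

print(normalized_time(reference_seconds, n))      # normalized time on the reference machine
print(normalized_time(observed_seconds, 2 * n))   # normalized time on the faster machine
print(implied_factor(n, reference_seconds, observed_seconds))  # factor the data suggests
```

In this example the same job is accounted as 3600 normalized seconds on one machine and 2400 on the other, and the observed run time suggests a factor of 3n rather than the published 2n.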

WLCG Service Coordination

CCRC08 formally ended last week. CMS is stable at 800 Mb/s. ATLAS stopped in order to do a cosmic run, which will start next week. CCRC will continue, with the May run as the focus. The daily meetings will carry on. Tomorrow there is an F2F meeting concerning all of this, clashing nicely with the ROC managers meeting.

Question from Sven: If CCRC continues with releases at the same rate, this may be difficult for sites. They would like releases to be more like they were before; it is difficult when releases need to be installed the same afternoon.

Nick: We would hope that the releases now slow down compared with the frantic pace of February.

Nick: Some discussion is going on anyway about having releases on a per-service basis.

OSG Items

  • GGUS:33220 - Has been given back to OSG.
  • GGUS:31037 - Needs to be investigated why it is not closed, possibly a problem with the Fermi<->GGUS interface.
No one from OSG was present, so these will be followed up.

Action Items

Newly Created Action Items

Assigned to Due date Description State Closed Notify  
Main.SAM, Main.team 2008-05-27 Need to consider what SAM, the alarm system and the CIC portal should do to mitigate against a high-load CE.

*Update March 12th*
Had a discussion with Ulrich and also submitted a new test for SAM. It needs some
thought as to whether it is a good idea, but it would be a non-critical test on the
"GlueCEStateStatus: Production" attribute that the critical CE tests would then depend on.
This is the same logic as the existing SE free space tests.
https://savannah.cern.ch/bugs/?34443

*Update 31 March:* Request for new SAM sensor passed to SAM team.

*update 21 April:* On SAM work-list (Savannah). John thought that the item could be closed as far as the ROC managers are concerned, but Kostas was worried that the issue risked being forgotten. He suggested the possibility of a pending state for items that get transferred to other tracking mechanisms. Nick will think about it.

*Update 5th of May* Now on the SAM worklist. Nothing changed for now, ignore for 3 weeks.

*Update 2nd of June* No progress recorded.

*Update 11th June* This is present as BUG:34443 anyway so close here.

Closed: 2008-06-11
Main.LHCb 2008-03-17 LHCb and Kostas to contact one another about middleware version tickets within SouthEast region.


** solved:
LHCb runs a custom SAM test that checks the version of lcg_utils and spots sites with an obsolete version installed.

The person in LHCb following these tickets submitted 24 tickets for 24 different sites twice, because his first attempt (using the mail ticketing system of GGUS) failed to return the GGUS reference. For your information, this problem was due to a missing mapping between the submitter's mail address (used by GGUS for submission of tickets via mail) and his certificate.

Closed: 2008-03-05

Review of Open Action Items

102
Close here, is in ROC managers list.
138
Pass to ROC managers, close here.
139
Last week to pass comment.
140
Close with reference to node tracker.
136
List of SL3 sites next week. Steve.
137
Less urgent. ROC managers please check with your sites.
141
Stay the same, nothing to add.

Open Action Items

Id Submitter Description Creation Due Assigned To
Main.MaiteBarroso Action on all ROCs to check gstat2 storage numbers with sites 2009-12-03 2010-02-15 all.ROCs
000047 Main.MaiteBarroso Action on GGUS to investigate the site support metrics defined in the SLA. Draft implementation plan by the end of October.
20091103 Update by Maria. The implementation date is the end-of-November GGUS release. Specifications are in https://savannah.cern.ch/support/?110706
01/12/2009: pending on Maite to have a look at the proposal and give feedback
16/03/2010: ongoing implementation
2009-09-24 2009-10-30 Torsten.Antoni
000058 Main.MaiteBarroso prepare a procedure to deal with most urgent vulnerabilities, including timelines to start suspending sites, so this can be easily enforced by ROCs.

This was discussed at the EGEE PMB level, with the following conclusion:

SUMMARY: it is the PMB who decides the timeline to start suspending sites. In previous critical vulnerabilities, it was set to 7 days.


The PMB considers the security of the infrastructure as paramount to the operation and reputation of the project. The potential damage that such vulnerabilities could cause to the infrastructure, both in terms of loss of service and damaging publicity, was of great concern to the PMB.

The project has mandated the Grid Security Officer to work with the ROC Security Contacts (as well as the JSPG and GSVG) to pro-actively manage security policy and its operational implementation. The use of the security monitoring tool to mimic normal user site access patterns and to discover non-intrusively the host configuration information in order to infer potential site vulnerabilities was fully supported by the PMB. The PMB was disappointed that despite the established and agreed operational structures, sites were slow in responding, or refused to perform the necessary routine systems maintenance.

All sites are reminded that there is an established policy and procedure that allows a site to be suspended in general if the site is deemed to pose an immediate threat for the infrastructure. This is stated in the Grid Site Operations Security Policy (https://edms.cern.ch/file/819783/2/GridSiteOperationsPolicy-v1.4a.pdf):

"When notified by the Grid of software patches and updates required for security and stability, you shall, as soon as reasonably possible in the circumstances, apply these to your systems. Other patches and updates should be applied following best practice.
[...]
The Grid may control your access to the Grid for administrative, operational and security purposes and remove your resource information from resource information systems if you fail to comply with these conditions."

The federation representatives of the PMB are following up within their own regions to understand why their sites were not immediately patched. However, the PMB noted that if the fix has not been installed at a particular site by a given deadline then the PMB reserves the right to remove the offending site from the EGEE production infrastructure in accordance with the established security policy. As the Grid Security Officer we ask you to inform the sites of these issues and, as part of the site access agreement, the sites are mandated to follow these instructions or their access to the infrastructure could be curtailed.

We are now approaching the deadline for addressing the vulnerability that raised this issue. If, following circulation of this notice, sites are still exhibiting these vulnerabilities after 7 days, please work with the ROC Security Contacts to curtail access of these sites to the infrastructure or ensure that there is a clear upgrade plan in place to eliminate these vulnerabilities.


2009-11-18 2009-11-30 Romain.Wartel
000083 Main.AntonioRetico Provide instructions on how to preserve local policies during the upgrade of the Argus server to a newer version, both in an e-mail to the sites and in the PATCH:3536 2010-01-02 2010-02-03 Main.ChadLaJoie
000090 Main.AntonioRetico Provide functional specification of glexec tests being implemented at SRCE 2010-02-16 2010-02-19 Main.GianniPucciani
000118 Nick.Thackray Maite to check with Spanish NGI if there are any significant issues for the transition, regarding O-E-2. 2010-04-20 2010-04-27 Maite.Barroso
000119 Nick.Thackray James will get information on the process for including uncertified sites into the regional Nagios instances. He will then work with Vera to get this information into the operations procedures document. 2010-04-20 2010-05-04 James.Casey
000120 Nick.Thackray All ROCs to give details on how many sites still need 32-bit middleware, which middleware services they need this for and how many worker nodes the sites have. 2010-04-20 2010-04-27 ALL.ROCS
000121 Nick.Thackray Check on the status of the DPM bug with regard to Gstat 2. 2010-04-20 2010-04-27 All.OCC
000122 Nick.Thackray Give feedback on the updated version of the EGI VO Management specifications. 2010-04-20 2010-04-27 All.ROCs
000562 Main.MaiteBarroso Steve and ROCs to find a few representative sites to understand what the main issues are with the storage installed capacity published in gstat, work with them to solve them, and document the solutions (if relevant). After that, we will re-discuss it here to involve all sites.
01/12/2009: all ROCs to check Gstat2 published capacity with their sites
2009-07-28 2009-08-30 Main.AllROCs

Actions Closed in Last 20 Days

Id Submitter Description Creation Due Assigned To Closed

AOB.

Maria
Need to investigate how people on shift can handle tickets submitted as an individual. Support 103378 and Support 103578
Maria
A test of simple turnaround time has been done. Some T1s took 10 days.
Kostas
LHCb appear to be submitting automatic tickets about lcg-utils.
Nick
SAM UI is back in production.
Nick
SAM jobs are now being submitted requesting 600 seconds of time.

Next Meeting

The next meeting will be Monday, 10 Mar 2008 15:00 UTC (16:00 Swiss local time).

  • Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
  • The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0157610


Topic revision: r12 - 2008-06-11 - SteveTraylen