WLCG-OSG-EGEE Ops Minutes Mon 11 Feb 2008

Attendance

EGEE

  • Asia Pacific ROC: Min
  • Central Europe ROC: Marcin
  • OCC / CERN ROC: Antonio Retico, Nick Thackray, Farida Naz, Maite Barroso
  • French ROC: Rolf
  • German/Swiss ROC: Clemens Koerdt, Sven Hermann
  • Italian ROC: Alessandro
  • Northern Europe ROC: Apologies
  • Russian ROC: Lev
  • South East Europe ROC: Kostas
  • South West Europe ROC: Kai, Gonzalo
  • UK/Ireland ROC: Jeremy
  • GGUS: Thorsten
  • OSCT: Absent

WLCG

  • WLCG Service Coordination: Harry, Jamie

WLCG Tier 1 Sites

  • ASGC: Min
  • BNL: Absent
  • CERN site:
  • FNAL: Absent
  • FZK: Sven Hermann
  • IN2P3: Pierre
  • INFN: Alessandro, Alfredo
  • NDGF: Leif
  • PIC: Gonzalo
  • RAL: Absent
  • SARA/NIKHEF: Ron
  • TRIUMF: Rod Walker

Reports Not Received

  • WLCG Tier 1s:
  • VOs: Alice, Atlas, Biomed, CMS, LHCb
  • EGEE ROCs (Prod Sites): AP, Italy
  • EGEE ROCs (PPS Sites): AP, CE, IT, SEE, SWE

Feedback on Last Week's Minutes

None were given.

EGEE Items

Grid Operator on Duty Hand Over

        Primary Team    Secondary Team
From    DECH            UKI
To      SWE             Russia

Issues: GOCDB outage on the 7th-8th due to a power cut at RAL. No other problems.

PPS Release News:

  1. gLite 3.0.2 PPS Update45 was released to pre-production last Tuesday. It is currently in the pre-deployment testing phase. The update contains:
       * YAIM module for the 3.0 WMS, fixing the uid-limit bug for the gridftp server
     All details in: https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_302_PPS_Update45
  2. gLite 3.1.0 PPS Update17 was released to pre-production last Thursday. It is currently being installed at PPS sites after pre-deployment testing. The update contains:
       * glite-MPI_utils metapackage for gLite 3.1
       * Improved globus-gridftp startup script
       * Various improvements for glite-info-provider-ldap
       * lcg_util v1.6.8 (SLC4)
     All details in: https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update17

EGEE Items From ROC Reports

  1. (ROC France): This site had to change its domain name from "mrs.grid.cnrs.fr" to "in2p3.fr" as an emergency measure. A scheduled downtime is ongoing, but all old node names (and IPs) have already been replaced by the new ones in the GOC DB. During these operations, the site wondered whether it is possible to set an alias on a CE node. Is it possible? Has any other site tried this?
     At the meeting nobody seemed to be aware of the possibility of setting an alias on a CE. Here are the comments from the expert (Maarten Litmaath): An alias can be used, but only for a single CE, i.e. no round-robin. The host cert must be for the new (i.e. the canonical) name. So, it is OK for the case at hand.
  2. (ROC Russia): I would like to draw your attention to the long and unsuccessful history of lcg_util updates. A new one was issued for PPS. However, the update did not include the patch for lcg-rep and Classic SE (see BUG:32999 in Savannah), even though this bug was fixed two weeks ago. Maarten Litmaath said that "release to production would be 1 or 2 weeks later" (see "Re: [LCG-ROLLOUT] RM SAM test on CE and Classic SE" of Fri, 8 Feb 2008). So the sites which applied the "update", as the recommended operational procedure requires, and still use a Classic SE cannot work properly for a month or so. Is it really so complicated to roll back to the situation before the "updates"? Who can send a recommendation for site administrators to at least roll back manually? Btw, I think that a story like this may occur again in future. I propose to think about an emergency rollback procedure, manual or automatic.
     The fix had been ready to be integrated for 2 weeks. What happened this time will be checked offline, as the release process was not fully followed. Rollback is not so easy, as there are dependencies and related configuration. This case will be analyzed to see what can be done better next time.
     Checked offline after the meeting: The bug was accidentally closed after the certification. It has now been reopened and is following the release process.
  3. (ROC SWE): In the last few days several sites in the SWE federation have been experiencing problems with the Information System. It is not clear whether this is correlated with upgrading to the latest version of the m/w or with its YAIM configuration. We want to raise this in the GridOps meeting to see if sites in other federations are seeing something similar.
     Some other ROCs also faced various BDII-related problems with the gLite 3.1 upgrade. At SWE, sites that installed CE + site_BDII on one node couldn't publish correct information about the site. Downgrading the YAIM version solved the problem. At DECH the problem was a missing x-bit on site-bdii (on a CE). It was reported through GGUS: https://gus.fzk.de/ws/ticket_info.php?ticket=32473
  4. (ROC UKI): RAL suffered a power cut last week. This impacted the GOCDB, on which the Broadcast system relies. The UK will examine how well communications were managed during this incident.
  5. (ROC UKI): Most UK CA certificates were taken out of CERN VOMS last week as it was thought that they had been revoked. A user ticket hinted that a specific problem being experienced might be due to old CA DN information still being present, and certificates based on it were then removed without further investigation, i.e. the old root certificate was suspected to be compromised. This resulted in the majority of UK-issued certificates not working for a period of 24 hours, and until all users had been registered with both issuer names. There are lessons to be learnt about the cross-checks required for critical changes like removing a CA issuer name. We also need to look at VOMS updates, since recent revisions allow an option not to check the CA DN, but not all VOMS servers are up to date on this. There are currently 4-5 CAs that have recently gone through a rollover and can be affected.
     The related processes at CERN have been modified so this does not happen again. We will check the VOMS production server and apply the mentioned update in case it is not deployed yet.
     Checked offline after the meeting: The CERN VOMS servers are up to date, but this option is not enabled. The reason is that at CERN we're using VOMRS, which authenticates users with their DN and CA and is not able to do so based on the DN alone. Hence, after a CA rollover with the option enabled, the corresponding users would always be able to get VOMS proxies, but would no longer be able to re-sign VO AUPs on VOMRS! This would cause great confusion for everyone. We therefore cannot enable this option in VOMS at CERN, so that people are forced to add their new certificate in VOMRS. However, Tanya, the VOMRS developer, announced that a future version of VOMRS will have a new service allowing VO admins to simply update the DN/CA of users, which will help a lot during CA rollovers.

  • gLite Release News

gLite 3.1 Update13 was released to production today. The update contains:
    o A major upgrade to dcache (patch#1395)
    o An update from VDT to fix a gridftp issue
    o voms-admin client for UI and VOBOX
    o dcacheVoms2Gplasma, required for proxies created with grid-proxy-init
All details in: http://glite.web.cern.ch/glite/packages/R3.1/updates.asp

  • Phase out of classic SE (05')

Sites/VOs are requested to migrate in the next 3 months, before the end of May. A broadcast will be sent with the details. A migration to DPM is the suggested solution. https://twiki.cern.ch/twiki/bin/view/LCG/ClassicSeToDpm

(In the case of Atlas, the classic SEs at CERN will be replaced by gridftp servers.)

WLCG Items

CCRC'08 Operational Review (30')

Minutes of daily CCRC08 meetings

Weekly review of on-going CCRC'08 activities based on 3 agreed metrics: (see slides attached to agenda page)

      1. Experiments' scaling factors for functional blocks exercised in the challenge
      2. Experiments' critical services lists
      3. MoU targets

ATLAS Service

Transfers from Tier0 to Tier1 failed for some time due to a problem with SRM at CERN.

ALICE Service

No report

CMS Service

The file transfer test was OK. There were some problems related to LSF logs.

LHCb Service

No report

WLCG Service Coordination

  • NDGF: File transfers failed for users with the error: file exists.
  • SRM at CERN: Due to a problem, a number of files were lost on SRM. Some files were recovered for ATLAS at CERN, but 5000 files are still missing.
  • SRM (durable part): A few people wanted to write there but got errors.

OSG Items

OSG use the MSG messaging system to send their SAM results from RSV (the OSG testing framework). On Friday night there was a problem which crashed both the broker at CERN and the publisher in the OSG GOC. After the broker was restarted, everything was OK again. We will put an alarm on the broker to catch this in the future and automatically restart the broker.

Review of Action Items

See separate list of actions

AOB

* Simone: At the last GDB meeting, CMS users requested that all production glite-WNs be moved from SLC3 to SLC4. ATLAS now has the same requirement: they want all WNs moved from SLC3 to SLC4. The deadline will be 15 March 2008. This is because they need the latest version of srm in lcg-utils. A broadcast will be sent with the details of the request and the timeline to implement it.

* ATLAS requests all sites to increase the shared software installation area at the sites from 10 GB to 100 GB. A broadcast will be sent with the details of the request and the timeline to implement it.

* There was a broadcast sent by Russia: a failure of hard disks in lcg60.sinp.msu.ru was impossible to recover and all data stored on that SE is lost. Please clean the links to the replicas in the LFC. CERN saw the broadcast, reacted quickly, and unregistered the corresponding replica entries from the CERN LFC servers. Thanks to them. Is this the correct procedure? In this case yes, as Russia did not know which files were involved or which users/VOs were affected.

* Kostas: Is there any news on 64 bit? To be discussed at next meeting.

Next Meeting

The next meeting will be Monday, 18 Feb 2008 15:00 UTC (16:00 Swiss local time).

  • Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
  • The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0157610


Topic revision: r8 - 2008-03-03 - SteveTraylen
 