WLCG-OSG-EGEE Ops' Minutes Mon 19 May 2008

Attendance

EGEE

  • Asia Pacific ROC: Min Tsai
  • Central Europe ROC: Marcin, Malgorzata
  • OCC / CERN ROC: Antonio Retico, Nick Thackray, Diana Bosio
  • French ROC:
  • German/Swiss ROC: Torsten Antoni
  • Italian ROC: Alessandro Cavalli
  • Northern Europe ROC: Jules Wolfrat
  • Russian ROC: Lev Shamardin
  • South East Europe ROC: Kostas Koumantaros
  • South West Europe ROC: Kai Neuffer, Gonzalo Merino
  • UK/Ireland ROC: Jeremy Coles, Derek Ross, Catalin
  • GGUS: Torsten Antoni
  • OSCT:

WLCG

  • WLCG Service Coordination: Jamie Shiers

WLCG Tier 1 Sites

  • ASGC: Min Tsai
  • BNL: Absent
  • CERN site: Sophie
  • FNAL:
  • FZK:
  • IN2P3:
  • INFN: Alessandro
  • NDGF:
  • PIC: Gonzalo
  • RAL: Derek Ross
  • SARA/NIKHEF: Ron
  • TRIUMF:

Reports Not Received

  • VOs:
  • EGEE ROCs (Prod Sites):

Feedback on Last Week's Minutes

None were given.

EGEE Items

Grid Operator Hand Over on Duty

        Primary Team    Secondary Team
From    ROC Italy       ROC DECH
To      ROC UKI         ROC CE

  • Diana, CERN ROC: comment related to last week: some tickets were not associated with any alarm, which made it hard to check what was going on; a site had disappeared from SAM. Main point: if you open a ticket, please associate an alarm with it.

PPS Reports

  • Pilot of WMS at CNAF and CERN in progress. No major issues reported by CMS and ATLAS. The testing period agreed with the VOs expires next Wednesday. If no problems are reported, the SL4 WMS could be released on 29 May.

gLite Release news

  • See the agenda for details. Summary:
  • A rather important security fix was released last week, addressing a wrong group assignment of pool accounts; sites that have upgraded to gLite 3.1 update 20 should apply this patch.
  • Lots of new stuff released last week to PPS (read agenda).
  • This Thursday to production: VOMS certificate, job priority implementation for LCG CE

EGEE Items From ROC Reports

  • None this week!

URGENT upgrade of CA RPMs

The EUGridPMA have announced a new set of CA RPMs.

Upgrade for this release is considered to be urgent by the EGEE project. Based on this IGTF release, new CA RPMs have been packaged for EGEE.

Please upgrade within 1 day. SAM started a 1 day timeout (including time needed to complete this CA release procedure). When timeout is over, SAM will throw critical errors on CA tests if old CAs are still detected.
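The timeout rule above can be sketched as a small check. This is only an illustration of the rule as stated in these minutes, not SAM's actual implementation; the function name and the status strings are hypothetical.

```python
from datetime import datetime, timedelta

# Grace period stated in the minutes: 1 day from the CA release.
GRACE = timedelta(days=1)

def ca_test_status(release_time, check_time, old_cas_detected):
    """Return a hypothetical status for a CA test run at check_time."""
    if not old_cas_detected:
        return "ok"
    # Old CAs are tolerated only while the timeout is still running.
    if check_time - release_time <= GRACE:
        return "within-timeout"
    return "critical"

release = datetime(2008, 5, 19, 15, 0)  # assumed release time, for illustration only
print(ca_test_status(release, release + timedelta(hours=12), True))
print(ca_test_status(release, release + timedelta(days=2), True))
```

A site that still carries the old CAs is flagged as critical only once the one-day window has elapsed.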

See the following page for more details about this new EGEE CA release : http://grid-deployment.web.cern.ch/grid-deployment/lcg2CAlist.html

The UK CA has to be updated; this is critical. This information is distributed and followed up over the OSCT mailing list. For the further information being distributed in the UK regarding the Debian vulnerability and the CA update, please visit: http://tinyurl.com/5howus

WLCG Items

WLCG issues coming from ROC reports

  • None

Upcoming WLCG Service Interventions

  • GOG-Singapore would like to decommission their site by June 2, 2008. The hardware and services at the site will be shut down permanently. Please migrate data that is still needed by your VO before the site is disabled.
  • SARA: On May 21st from 9:00-14:00 CET there will be an outage of srm.grid.sara.nl due to network maintenance. This measure is necessary due to the installation of new storage hardware.
  • CYFRONET-IA64: We are going to shut down CYFRONET-IA64 completely at the end of May 2008. Please take care of any data you may have on our classic SE: ares03.cyf-kr.edu.pl.
  • BEIJING-LCG2: Our dCache SE atlasse01.ihep.ac.cn is planned to be removed from production after 20th May. Please back up your data before that date.

ATLAS Service

This is a short status of ATLAS M7 and CCRC08:

- M7 data distribution started yesterday (Sunday May 18th). Some site-related issues have been reported and solved:

  • RAL: Cannot write (https://gus.fzk.de/ws/ticket_info.php?ticket=36526) and an internal ticket.
  • SARA: Cannot access space token (https://gus.fzk.de/pages/ticket_details.php?ticket=36481, submitted last Friday).
  • CNAF: srm down? (https://gus.fzk.de/pages/ticket_details.php?ticket=36528).

Particular attention goes to the SARA problem (started Friday, back online today at 12:00), which was due to problems in the dCache upgrade. From H.C. Lee:

"For SARA, the problem comes from the upgrade of dCache to 1.8.0-15p3. It's a backward compatibility issue that has been reported to dCache team."

Overall, no current problems. Data registration happens in bursts. The system is capable of delivering very high throughput in short time periods. Data taking will continue until tomorrow morning at the latest. Some time is needed to drain the system (reprocessing jobs in queues etc.).

- Throughput tests: will start tomorrow late afternoon (at the latest Wednesday morning). More details tomorrow. Basically all clouds will also export to T2s (confirmation is still needed from the CA cloud, while ES would like to run some internal tests first, since the deployment of SRM2 at T2s is not yet complete).

ALICE Service

The most important point in ALICE operations activities is to try to bring RAL into the picture to start transfers, as they were not involved in the previous CCRC. This is being followed up.

CMS Service

Progress and issues in iCSA/CCRC activities are reported by mail and HNs; it is still hard to also keep the ELOGs up to date on a daily basis. However, the iCSA/CCRC tests are now running more stably overall, so we are catching up by collecting items from HNs/mails and filling in the ELOGs (with the original dates), the needed tickets, and https://twiki.cern.ch/twiki/bin/view/CMS/CCRC08-Phase2-OpsElog (bookmark and check back). Highlighted activities at the moment: analysis of the T1-T1 tests from last week; extension to non-regional T1-T2 transfer tests; production transfers with latency measurements to prepare for T1 workflows; T1 workflows consisting of (iCSA) re-processing and (CCRC) skimming at T1 sites, also exploiting non-custodial areas; final development on the monitoring side to accommodate feedback from CCRC running.

LHCb Service

Entering week 3 of CCRC. Problems with the online service after the intervention at PIC last Friday. Several DIRAC site problems. Before the weekend NIKHEF was banned due to the advertised problem with the cooling system; no EGEE broadcast was seen, due to a problem interfacing GOCDB and the CIC portal. This is being investigated. SARA was banned due to a bug in the latest version of dCache (reported by Simone); it will be fixed, and meanwhile it was manually fixed by the site. IN2P3 was also banned due to high DB load. Plans to restart PIC transfers at nominal rate and reconstruction for the rest of the T1s.

Sophie: AFS UI at CERN will be updated this week. Please contact us through helpdesk or GGUS.

WLCG Service Coordination

OSG Items

https://gus.fzk.de/ws/ticket_info.php?ticket=33220 It looks as if 2 of the 3 problem sites have been fixed (UFlorida-IHEPA and Nebraska). We are still trying to get a response from the UERJ resource, but have no further information about this resource.

Action Items

Newly Created Action Items

Assigned to: SteveTraylen
Due date: 2008-05-26
Description: The T0 FTS server is configured with 0 retries by default, while the T1s have 3 retries by default. This complicates the ATLAS workflow: if a transfer fails, we try to find another source with the same file. Could we have 0 retries in all FTS servers at T1s (this affects all channels, all VOs)? What is the position of the other LHC VOs?
- Not a problem for LHCb
- Ron (SARA): I thought this could be set up per channel, per VO agent. To be checked with Gavin & co

* Answer from Gavin:

The ‘retry’ count is a VO policy, so needs to be set in the relevant VO agent config for the FTS server (the default is 3 retries separated by minimum 10 minutes).

I know CMS’ PhEDEx prefers to fail fast (and see the error as early as possible), so they have asked T1 sites to set the retry count to 0. PhEDEx then retries externally (i.e. with another FTS job for the failed files).

LHCb and ALICE I think are still set to the default.

See: https://twiki.cern.ch/twiki/bin/view/LCG/FtsYaimValues20

Contact fts-support@cern.ch in case of problems.
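To illustrate the difference between the two policies described in Gavin's answer, here is a minimal sketch that lists the earliest times at which a single file transfer could be attempted. It models only the figures quoted above (3 retries, at least 10 minutes apart, versus 0 retries); the function name is hypothetical and nothing here reflects the real FTS agent code or its configuration keys.

```python
from datetime import datetime, timedelta

def plan_attempts(first_attempt, retries=3, min_gap=timedelta(minutes=10)):
    """Earliest possible attempt times: the first try plus `retries` retries."""
    return [first_attempt + i * min_gap for i in range(retries + 1)]

start = datetime(2008, 5, 19, 12, 0)  # arbitrary example time

default_plan = plan_attempts(start)               # default: 3 retries
fail_fast_plan = plan_attempts(start, retries=0)  # fail-fast: 0 retries

print(len(default_plan), default_plan[-1] - start)
print(len(fail_fast_plan))
```

With the default policy a failing transfer is reported no earlier than 30 minutes after the first attempt, whereas with 0 retries the failure is visible after the first attempt, so the VO framework can pick another source straight away.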

*Update June 11th* Steve should submit tickets to all FTS sites.

*Update June 13th* GGUS:37415 submitted and child tickets sent to ROCs of each Tier1.
Review in two weeks time.

*Update June 20th* All FTS instances have responded to GGUS:37415 that
the changes have been made, except for:

For USCMS-FNAL-WC1 in GGUS:37428
For BNL-LCG2 in GGUS:37427

Both will be contacted again this week.
*Update June 30th* Steve will escalate, two U.S. sites are problematic.

*Update July 7th* BNL and Fermi have now responded that they made the
configuration change. Action item to be closed after next meeting. Steve

2007-03-06 SteveTraylen

Review of Open Action Items

Open Action Items

Id Submitter Description Creation Due Assigned To

Actions Closed in Last 20 Days

Id Submitter Description Creation Due Assigned To Closed

AOB

The T0 FTS server is configured with 0 retries by default, while the T1s have 3 retries by default. This complicates the ATLAS workflow: if a transfer fails, we try to find another source with the same file. Could we have 0 retries in all FTS servers at T1s (this affects all channels, all VOs)? What is the position of the other LHC VOs? ACTION!
- Not a problem for LHCb
- Ron (SARA): I thought one could set it up per channel, per VO agent. To be checked with Gavin & co

Summary

Next Meeting

The next meeting will be Monday, 26 May 2008 15:00 UTC (16:00 Swiss local time).

  • Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
  • The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0157610


Topic revision: r9 - 2008-07-14 - SteveTraylen