WLCG-OSG-EGEE Ops' Minutes Mon 01 Sep 2008

Summary

  • The EMT made the decision to delay the deployment of the CREAM CE. This is because the WMS currently in production is not ICE-enabled, but could accidentally match any CREAM CEs in production and cause a submission failure. A work-around is being introduced into YAIM to configure the CREAM CE with the GlueCEStateStatus parameter set to ‘Special’. When this is ready, the CREAM CE will be released (on the order of 1-2 weeks). Users will then be able to submit jobs directly to the CREAM CEs (but not through a WMS). When the ICE-enabled WMS is ready for production, this work-around will be removed.
  • The release team presented a proposal for a centralized distribution mechanism for the gLite clients (WN) to all production sites. It will be discussed in detail in the coming weeks (LCG GDB and Mgt Board, and EGEE SA1 coordination meetings).
  • GOCDB and the CIC portal now have the features required to declare downtimes for the operations tools so that these are broadcast automatically. Each operations tool will be registered in GOCDB and will be able to declare downtimes, which will be broadcast to sites, ROCs and the grid operators on duty.

Attendance

EGEE

  • Asia Pacific ROC: Min Tsai
  • Central Europe ROC: Malgorzata Krakowian
  • OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Steve Traylen, Maite Barroso, Diana Bosio
  • French ROC: Pierre, Osman
  • German/Swiss ROC: Torsten Antoni
  • Italian ROC:
  • Northern Europe ROC: David Groep, Mattias Wadenstein
  • Russian ROC: Lev Shamardin
  • South East Europe ROC: Kostas Koumantaros
  • South West Europe ROC: Kai Neuffer
  • UK/Ireland ROC: Jeremy Coles
  • GGUS: Torsten Antoni

WLCG

  • WLCG Service Coordination: Harry Renshall

WLCG Tier 1 Sites

  • ASGC: Min Tsai
  • BNL: Absent
  • CERN site:
  • FNAL:
  • FZK: Angela Poschlad
  • IN2P3: Pierre
  • INFN:
  • NDGF: Mattias Wadenstein
  • PIC: Kai
  • RAL: Derek Ross
  • SARA/NIKHEF: David Groep
  • TRIUMF:

LHC Experiments

  • ATLAS: Alessandro di Girolamo
  • LHCb:
  • CMS: Daniele
  • ALICE:

Feedback on Last Week's Minutes

None was given.

EGEE Items

Grid Operator Hand Over on Duty

         Primary Team    Secondary Team
  From   ROC Russia      ROC Italy
  To     ROC SEE         ROC SWE

  • No issues this week

PPS Reports

  • The EMT made the decision to delay the deployment of the CREAM CE. This is because the WMS currently in production is not ICE-enabled, but could accidentally match any CREAM CEs in production and cause a submission failure. A work-around is being introduced into YAIM to configure the CREAM CE with the GlueCEStateStatus parameter set to ‘Special’. When this is ready, the CREAM CE will be released (on the order of 1-2 weeks). Users will then be able to submit jobs directly to the CREAM CEs (but not through a WMS). When the ICE-enabled WMS is ready for production, this work-around will be removed (see the query sketch below).
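A minimal illustration of how the work-around would show up (an assumption-based sketch, not part of the original minutes): CEs publish the GLUE 1.3 attribute GlueCEStateStatus in the information system, and the default WMS matchmaking normally only considers CEs whose status is 'Production', so a value of 'Special' keeps the non-ICE WMS away while direct submission stays possible. The Python sketch below queries a top-level BDII for CEs and prints their status; the BDII host, base DN and the use of plain ldapsearch are illustrative assumptions, and CEs that do not publish GlueCEImplementationName simply show as 'unknown'.

  #!/usr/bin/env python
  # Illustrative sketch only: list the CEs published in a top-level BDII with
  # their GlueCEStateStatus, so that CREAM CEs configured with the 'Special'
  # work-around can be spotted.  The BDII endpoint and base DN below are
  # assumptions, not values taken from these minutes.

  import subprocess

  BDII_URI = "ldap://lcg-bdii.cern.ch:2170"      # assumed top-level BDII
  BASE_DN = "o=grid"                             # assumed search base
  LDAP_FILTER = "(objectClass=GlueCE)"
  ATTRS = ["GlueCEUniqueID", "GlueCEStateStatus", "GlueCEImplementationName"]

  def query_ces():
      """Run ldapsearch and return one {attribute: value} dict per CE entry."""
      cmd = ["ldapsearch", "-x", "-LLL", "-H", BDII_URI, "-b", BASE_DN,
             LDAP_FILTER] + ATTRS
      proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                              universal_newlines=True)
      output = proc.communicate()[0]
      entries, current = [], {}
      for line in output.splitlines():
          if not line.strip():                 # a blank line ends an LDAP entry
              if current:
                  entries.append(current)
                  current = {}
          elif ":" in line and not line.startswith(" "):
              key, value = line.split(":", 1)
              current[key.strip()] = value.strip()
          # wrapped continuation lines and base64 values are ignored here
      if current:
          entries.append(current)
      return entries

  if __name__ == "__main__":
      for ce in query_ces():
          impl = ce.get("GlueCEImplementationName", "unknown")
          status = ce.get("GlueCEStateStatus", "unknown")
          # While the work-around is in place, CREAM CEs should show 'Special'
          # and therefore not be matched by the default (non-ICE) WMS.
          print("%-10s %-12s %s" % (impl, status, ce.get("GlueCEUniqueID", "?")))

Running the same query against a single site BDII would restrict the check to that site.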

EGEE Items From ROC Reports

  • None this week

WN distribution mechanism

SA3 put forward a proposal for a centralized distribution mechanism for the gLite clients (WN).
Several responses have been received so far and are attached to the agenda page.
David Groep described the position of the Benelux and Nordic federation (attached to the agenda page). In summary, they disagree for both procedural and technical reasons.
Oliver: the goal is to make client distribution much easier and lighter and to propagate updates more quickly; this will not increase the number of updates and will not replace the present mechanism; it is requested by the LHC VOs, for their benefit.
David: it might be beneficial for small sites that update their software automatically, but not for all sites.
Oliver: there is no indication that we would be overwriting any site defaults. We are trying to solve a problem: some sites are still running LCG 2_7, and the LHC VOs complain that sites do not upgrade quickly enough to the client versions they need.
David: this should be replaced by monitoring, leaving it to the federations to chase the sites to update according to the VO requirements.
FZK: scalability problems have already been seen with this model as applied by the experiments; also at SARA.
Kostas: it is not acceptable to have only a short time at the end of the summer to give an opinion on such an important new issue (new to non-LCG sites).
Maarten: this could also be used for scalability testing at large scale.
Flavia: this is also a problem for other VOs; the client updates are not done as quickly as they would like.
What to do next: gather feedback from all sites, and from other VOs. Encourage sites to participate in the coming discussions: tomorrow's LCG management board, next week's GDB, next week's SA1 coordination meeting.

Broadcasting of downtimes of Operations Tools

Osman: in collaboration with GOCDB we have implemented this. Each operations tool will be registered in GOCDB and will be able to declare downtimes, which will be broadcast to sites, ROCs and the CIC-on-duty. We will try it with the next GOCDB scheduled downtime and put it into production afterwards, following a broadcast.

WLCG Items

WLCG issues coming from ROC reports

* [SWE ROC]: CMS opened a ticket against the site LIP-Coimbra reporting that the disk space for CMS is full. Would it not be better to assign this kind of ticket to the VO instead of the site, assuming the site fulfils the capacities agreed in the MoU or similar? Answer: this was a miscommunication inside CMS. The ticket was originally opened for tracking and got wrongly assigned to the site. This is now fixed.

Upcoming WLCG Service Interventions

* Item 1

End points for FTM service at tier-1 sites

* ASGC: http://w-ftm01.grid.sinica.edu.tw/transfer-monitor-report/

* BNL: ???

* CERN: https://ftsmon.cern.ch/transfer-monitor-report/

* FNAL: https://cmsfts3.fnal.gov:8443/transfer-monitor-report/
https://cmsfts3.fnal.gov:8443/transfer-monitor-gridview

* FZK: http://ftm-fzk.gridka.de/transfer-monitor-report/

* IN2P3: http://cclcgftmli01.in2p3.fr/transfer-monitor-report/

* INFN: https://tier1.cnaf.infn.it/ftmmonitor/

* NDGF: Being installed.

* PIC: http://ftm.pic.es/transfer-monitor-report/

* RAL: No endpoint in production yet.

* SARA/Nikhef: http://ftm.grid.sara.nl/transfer-monitor-report
http://ftm.grid.sara.nl/transfer-monitor-gridview

* TRIUMF: http://ftm.triumf.ca/transfer-monitor-report/
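
As a complement to the list above, a minimal reachability probe (an illustrative sketch, not an agreed procedure) can show which of the declared transfer-monitor-report pages currently answer an anonymous request. Endpoints protected by host or grid certificates may legitimately refuse such a request, so a failure here does not by itself mean the FTM service is down; sites marked '???' or without a production endpoint are simply left out.

  #!/usr/bin/env python
  # Illustrative sketch only: probe the FTM 'transfer-monitor-report' endpoints
  # listed above and report whether each one answers an anonymous HTTP(S) GET.

  try:                                     # Python 3
      from urllib.request import urlopen
  except ImportError:                      # Python 2
      from urllib2 import urlopen

  # Endpoints copied from the list above (sites without one are omitted).
  FTM_ENDPOINTS = {
      "ASGC":   "http://w-ftm01.grid.sinica.edu.tw/transfer-monitor-report/",
      "CERN":   "https://ftsmon.cern.ch/transfer-monitor-report/",
      "FNAL":   "https://cmsfts3.fnal.gov:8443/transfer-monitor-report/",
      "FZK":    "http://ftm-fzk.gridka.de/transfer-monitor-report/",
      "IN2P3":  "http://cclcgftmli01.in2p3.fr/transfer-monitor-report/",
      "INFN":   "https://tier1.cnaf.infn.it/ftmmonitor/",
      "PIC":    "http://ftm.pic.es/transfer-monitor-report/",
      "SARA":   "http://ftm.grid.sara.nl/transfer-monitor-report",
      "TRIUMF": "http://ftm.triumf.ca/transfer-monitor-report/",
  }

  def probe(url, timeout=15):
      """Return a short status string for one endpoint."""
      try:
          response = urlopen(url, timeout=timeout)
          return "OK (HTTP %s)" % response.getcode()
      except Exception as exc:             # DNS errors, SSL errors, timeouts, ...
          return "FAILED (%s)" % exc

  if __name__ == "__main__":
      for site in sorted(FTM_ENDPOINTS):
          print("%-8s %s" % (site, probe(FTM_ENDPOINTS[site])))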

FTS SL4 - required by the experiments or tier-1 sites?

* ALICE: Neutral (as long as there is no disruption to the service).

* ATLAS: Prefer not to; to avoid introducing problems this close to data taking.

* CMS: Priority is stability during data-taking days. Whatever is scheduled in advance and allows some pre-testing can be negotiated, though. For the CERN migration, the PhEDEx /Prod vs /Debug instances can be used to allow testing before going into production (discussed with Gavin).

* LHCb: Neutral (as long as there is no disruption to the service).

* ASGC: ???

* BNL: Need to migrate (Has a fairly pressing need to move to SL/RHEL4 because of our site security situation. If it is made available in production soon, we would definitely switch over.)

* FNAL: Need to migrate (Hardware is dating fast. May be issues with maintenance.)

* FZK: Prefer to wait (to include patch for SRM1 requests issued by FTS)

* IN2P3: Can wait until next shutdown.

* INFN: ???

* NDGF: Prefer to wait until next shutdown.

* PIC: ???

* RAL: ???

* SARA/Nikhef: ???

* TRIUMF: Can wait until next shutdown.

ATLAS Service

Nothing to report.

ALICE Service

CMS Service

General: Global Run data taking with the magnet at 3T over some part of the weekend.
CERN-IT and T0 workflows: Migration of transferred data into the local CAF-DBS instance (for public information and access) became slow due to an issue that was debugged over the weekend and is now understood. About 11k blocks remain and may take up to 3 days to digest; this is not worth any action, just let it run, since insertion of CAF-urgent datasets can be (and already was successfully) forced manually, so there is no trouble for CERN-local analysis access.
Distributed sites issues:
  • T1_ES_PIC failures in CMS-specific SAM analysis test (missing input dataset: already fixed, thanks to Pepe Flix)
  • T1_DE_FZK failures in CMS-specific SAM analysis test (missing input dataset)
  • T2_CH_CSCS: No JobRobot jobs assigned (BDII ok?) + CMS-specific js and jsprod tests fail ("no compatible resources")
  • T2_US_NEBRASKA: No JobRobot jobs assigned (BDII ok?)
  • T2_UK_London_Brunel: Aborted JobRobot jobs ("Job got an error while in the CondorG queue")
  • T2_US_Wisconsin: No JobRobot jobs assigned (BDII ok?)
  • T2_ES_CIEMAT: CMS-specific SAM errors in analysis and js tests (timeout executing tests)
  • T2_PT_LIP_Coimbra: CMS-specific SAM CE errors in jsprod + dCache "No space left on device" (acknowledged)
  • T2_US_MIT: CMS-specific SAM Frontier error ("Error ping from t2bat0080.cmsaf.mit.edu to squid.cmsaf.mit.edu": the latter is down)
  • T2_US_Wisconsin: CMS-specific SAM tests not running since 8/29 (some problems with the BDII? JobRobot is not running either)

LHCb Service

Nothing to report.

WLCG Service Coordination

OSG Items

Action Items

Newly Created Action Items

Assigned to  Due date    Description
Main.OCC     2007-03-05  Extract from the information system the list of WMS 3.0
Update from Steve:
Does not look too bad; note that this covers only those WMS that are being published at all (see the query sketch after the site lists below).

Those with old WMS (SL3 in fact)

EENet (Estonia)
ITEP (Russia)
RTUETF ( Latvia)
UNI-FREIBURG (Germany)

Those with new WMS (SL4 in fact)

AEGIS01-PHY-SCL
Australia-ATLAS
BY-UIIP
CERN-PROD
CESGA-EGEE
CGG-LCG2
CNR-PROD-PISA
CY-01-KIMON
CYFRONET-LCG2
DESY-HH
FZK-LCG2
GR-01-AUTH
GRIF
HG-06-EKT
INFN-CNAF
INFN-PADOVA
ITEP
JINR-LCG2
KR-KISTI-GCRT-01
NCP-LCG2
pic
prague_cesnet_lcg2
RAL-LCG2
RO-03-UPB
RTUETF
RU-Phys-SPbSU
ru-PNPI
SARA-MATRIX
Taiwan-LCG2
TR-01-ULAKBIM
UKI-SCOTGRID-GLASGOW
Uniandes
VU-MIF-LCG2

Note that there may well be other WMS out there, not included by site BDIIs, that we know nothing about.

Update 10/9/08: The four sites running WMS on SL3 were asked to upgrade ASAP.
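
For reference, an extraction like the one above can be reproduced with a query against the top-level BDII for published WMS services. In the Python sketch below, the service-type string 'org.glite.wms.WMProxy' and the use of GlueForeignKey/GlueSiteUniqueID to recover the owning site are assumptions about the GLUE 1.3 publication conventions, not details stated in these minutes; as noted above, any WMS not published through a site BDII will not appear in the output.

  #!/usr/bin/env python
  # Illustrative sketch only: list WMS services published in a top-level BDII,
  # together with the owning site, as a starting point for the action above.
  # The service-type string and attribute names are assumptions about the
  # GLUE 1.3 publication conventions.

  import subprocess

  BDII_URI = "ldap://lcg-bdii.cern.ch:2170"                  # assumed
  BASE_DN = "o=grid"                                         # assumed
  WMS_FILTER = ("(&(objectClass=GlueService)"
                "(GlueServiceType=org.glite.wms.WMProxy))")  # assumed type

  cmd = ["ldapsearch", "-x", "-LLL", "-H", BDII_URI, "-b", BASE_DN, WMS_FILTER,
         "GlueServiceEndpoint", "GlueServiceVersion", "GlueForeignKey"]
  output = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            universal_newlines=True).communicate()[0]

  endpoint = version = site = None
  for line in output.splitlines() + [""]:    # trailing "" flushes the last entry
      line = line.strip()
      if line.startswith("GlueServiceEndpoint:"):
          endpoint = line.split(":", 1)[1].strip()
      elif line.startswith("GlueServiceVersion:"):
          version = line.split(":", 1)[1].strip()
      elif line.startswith("GlueForeignKey:") and "GlueSiteUniqueID=" in line:
          site = line.split("GlueSiteUniqueID=", 1)[1].strip()
      elif not line:                         # blank line: end of one LDAP entry
          if endpoint:
              print("%-25s %-10s %s" % (site or "unknown-site",
                                        version or "?", endpoint))
          endpoint = version = site = None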

2007-03-06 SteveTraylen

Review of Open Action Items

Open Action Items

Id  Submitter  Description  Creation  Due  Assigned To

Actions Closed in Last 20 Days

Id  Submitter  Description  Creation  Due  Assigned To  Closed

Next Meeting

The next meeting will be Monday, 08 September 2008 15:00 UTC (16:00 Swiss local time).

  • Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
  • The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0157610

