The EMT made the decision to delay the deployment of the CREAM CE. This is because the WMS currently in production is not ICE-enabled, but could accidentally match any CREAM CEs in production and cause submission failures. A work-around is being introduced into YAIM to configure the CREAM CE with the GlueCEStateStatus parameter set to 'Special'. When this is ready, the CREAM CE will be released (on the order of 1-2 weeks). Users will then be able to submit jobs directly to the CREAM CEs (but not through a WMS). When the ICE-enabled WMS is ready for production, this work-around will be removed.
The release team presented a proposal for a centralized distribution mechanism for the gLite clients (WN) to all production sites. It will be discussed in detail in the coming weeks (LCG GDB and Mgt Board, and EGEE SA1 coordination meetings).
GOCDB and the CIC portal now have the required features to declare downtimes for the operations tools so that they are broadcast automatically. Each operations tool will be registered in GOCDB and will be able to declare downtimes, which will be broadcast to sites, ROCs and the grid operators on duty.
Attendance
EGEE
Asia Pacific ROC: Min Tsai
Central Europe ROC: Malgorzata Krakowian
OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Steve Traylen, Maite Barroso, Diana Bosio
French ROC: Pierre, Osman
German/Swiss ROC: Torsten Antoni
Italian ROC:
Northern Europe ROC: David Groep, Mattias Wadenstein
The EMT made the decision to delay the deployment of the CREAM CE. This is because the WMS currently in production is not ICE-enabled, but could accidentally match any CREAM CEs in production and cause submission failures.
A work-around is being introduced into YAIM to configure the CREAM CE with the GlueCEStateStatus parameter set to 'Special'.
When this is ready, the CREAM CE will be released (on the order of 1-2 weeks). Users will then be able to submit jobs directly to the CREAM CEs (but not through a WMS).
When the ICE-enabled WMS is ready for production, this work-around will be removed.
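For illustration, the work-around relies on the CREAM CEs publishing a GlueCEStateStatus value other than 'Production', which the default WMS requirements would presumably not match. A minimal Python sketch of how the published status could be checked against a site BDII follows; the host name is a placeholder, and python-ldap plus the standard BDII port 2170 are assumptions.

  # Sketch only: list the CEs a site BDII publishes together with their
  # GlueCEStateStatus, e.g. to verify that a CREAM CE shows up as 'Special'
  # rather than 'Production'. The host name below is a placeholder.
  import ldap  # python-ldap

  conn = ldap.initialize("ldap://site-bdii.example.org:2170")  # 2170 = usual BDII port
  for dn, attrs in conn.search_s("o=grid", ldap.SCOPE_SUBTREE,
                                 "(objectClass=GlueCE)",
                                 ["GlueCEUniqueID", "GlueCEStateStatus"]):
      ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
      status = attrs.get("GlueCEStateStatus", [b"?"])[0].decode()
      # Anything not published as 'Production' should stay out of default brokering.
      print(ce, status)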
EGEE Items From ROC Reports
None this week
WN distribution mechanism
SA3 put forward a proposal for a centralized distribution mechanism for the gLite clients (WN).
Several responses have been received so far and are attached to the agenda page.
David Groep started by describing the position of the Benelux and Nordic federations (attached to the agenda page). In summary, they disagree, for both procedural and technical reasons.
Oliver: the goal is to make client distribution much easier and lighter, and to propagate updates more quickly. This will not increase the number of updates and will not replace the present mechanism. It is requested by the LHC VOs, for the benefit of these VOs.
David: it might be beneficial for small sites that update their software automatically, but not for all sites.
Oliver: there is no indication that we would be overwriting any site defaults. We are trying to solve a problem: there are sites still running LCG 2_7, and the LHC VOs complain about sites not upgrading fast enough to the client versions they need.
David: this should be replaced by monitoring, leaving it to the federations to chase the sites to update according to the VO requirements.
FZK: scalability problems, already seen with this model as applied by the experiments; also seen at SARA.
Kostas: it is not acceptable to have just a short time at the end of the summer to give our opinion on such an important new issue (new to non-LCG sites).
Maarten: this could also be used for scalability testing at large scale.
Flavia: this is also a problem for other VOs; the client updates are not done as quickly as they would like.
What to do next: gather feedback from all sites, and from other VOs. Encourage sites to participate in the coming discussions: tomorrow's LCG Management Board, next week's GDB, and next week's SA1 coordination meeting.
Broadcasting of downtimes of Operations Tools
Osman: in collaboration with GOCDB we have implemented this. Each operations tool will be registered in GOCDB and will be able to declare downtimes, which will be broadcast to sites, ROCs and the CIC-on-duty. We will try it with the next GOCDB scheduled downtime and put it in production afterwards, following a broadcast.
WLCG Items
WLCG issues coming from ROC reports
* [SWE ROC]: CMS opened a ticket against the site LIP-Coimbra reporting that the disk space for CMS is full. Would it not be better to assign this kind of ticket to the VO instead of the site, assuming the site fulfils the capacities agreed in an MoU or similar? Answer: this was a miscommunication inside CMS. The ticket was originally opened for tracking and got wrongly assigned to the site. This is now fixed.
FTS SL4 - required by the experiments or tier-1 sites?
* ALICE: Neutral (as long as there is no disruption to the service).
* ATLAS: Prefer not to, to avoid introducing problems this close to data taking.
* CMS: Priority is stability for data-taking days. Whatever is scheduled in advance and allows some pre-testing can be negotiated, though. For the CERN migration, the PhEDEx /Prod vs /Debug instances can be used to allow testing before going into production (talked to Gavin).
* LHCb: Neutral (as long as there is no disruption to the service).
* ASGC: ???
* BNL: Need to migrate (Has a fairly pressing need to move to SL/RHEL4 because of our site security situation. If it is made available in production soon, we would definitely switch over.)
* FNAL: Need to migrate (Hardware is dating fast. May be issues with maintenance.)
* FZK: Prefer to wait (to include patch for SRM1 requests issued by FTS)
* IN2P3: Can wait until next shutdown.
* INFN: ???
* NDGF: Prefer to wait until next shutdown.
* PIC: ???
* RAL: ???
* SARA/Nikhef: ???
* TRIUMF: Can wait until next shutdown.
ATLAS Service
Nothing to report.
ALICE Service
CMS Service
General: Global Run data taking with the magnet at 3T over some part of the weekend.
CERN-IT and T0 workflows: Migration of data transferred into the local CAF-DBS instance (for public information and access) became slow due to an issue that was debugged over the weekend and is now understood. About 11k blocks remain and may take up to 3 days to digest; this is not worth any action, just let it run, since insertion of CAF-urgent datasets can be (and already was successfully) forced manually, causing no trouble for CERN-local analysis access.
Distributed site issues:
T1_ES_PIC failures in CMS-specific SAM analysis test (missing input dataset: already fixed, thanks to Pepe Flix)
T1_DE_FZK failures in CMS-specific SAM analysis test (missing input dataset)
T2_CH_CSCS: No JobRobot jobs assigned (BDII ok?) + CMS-specific js and jsprod tests fail ("no compatible resources")
T2_US_NEBRASKA: No JobRobot jobs assigned (BDII ok?)
T2_UK_London_Brunel: Aborted JobRobot jobs ("Job got an error while in the CondorG queue")
T2_US_Wisconsin: No JobRobot jobs assigned (BDII ok?)
T2_ES_CIEMAT: CMS-specific SAM errors in analysis and js tests (timeout executing tests)
T2_PT_LIP_Coimbra: CMS-specific SAM CE errors in jsprod + dCache "No space left on device" (acknowledged)
T2_US_MIT: CMS-specific SAM Frontier error ("Error ping from t2bat0080.cmsaf.mit.edu to squid.cmsaf.mit.edu": the latter is down.)
T2_US_Wisconsin: CMS-specific SAM tests not running since 8/29 (some problems in the BDII? JobRobot is not running either)
Extract from the information system the list of WMS 3.0
Update from Steve: it does not look too bad; note this only covers the WMS that are being published at all.
Sites with an old WMS (on SL3): EENet (Estonia), ITEP (Russia), RTUETF (Latvia), UNI-FREIBURG (Germany).
Sites with a new WMS (on SL4): AEGIS01-PHY-SCL, Australia-ATLAS, BY-UIIP, CERN-PROD, CESGA-EGEE, CGG-LCG2, CNR-PROD-PISA, CY-01-KIMON, CYFRONET-LCG2, DESY-HH, FZK-LCG2, GR-01-AUTH, GRIF, HG-06-EKT, INFN-CNAF, INFN-PADOVA, ITEP, JINR-LCG2, KR-KISTI-GCRT-01, NCP-LCG2, pic, prague_cesnet_lcg2, RAL-LCG2, RO-03-UPB, RTUETF, RU-Phys-SPbSU, ru-PNPI, SARA-MATRIX, Taiwan-LCG2, TR-01-ULAKBIM, UKI-SCOTGRID-GLASGOW, Uniandes, VU-MIF-LCG2.
Note there may well be other WMS, not included by site BDIIs, that we know nothing about.
Update 10/9/08: the four sites running WMS on SL3 were asked to upgrade ASAP.
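For reference, a query of this kind against a top-level BDII could look like the minimal sketch below; the host name is a placeholder, and the GLUE 1.3 service attributes and the WMProxy service type string are assumed to be the ones the sites publish.

  # Sketch only: list the WMProxy (WMS) endpoints and versions published in a
  # top-level BDII, along the lines of the extract above.
  import ldap  # python-ldap

  conn = ldap.initialize("ldap://top-bdii.example.org:2170")  # placeholder top BDII
  query = "(&(objectClass=GlueService)(GlueServiceType=org.glite.wms.WMProxy))"
  for dn, attrs in conn.search_s("o=grid", ldap.SCOPE_SUBTREE, query,
                                 ["GlueServiceEndpoint", "GlueServiceVersion"]):
      endpoint = attrs.get("GlueServiceEndpoint", [b"?"])[0].decode()
      version = attrs.get("GlueServiceVersion", [b"?"])[0].decode()
      print(endpoint, version)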