WLCG-OSG-EGEE Operations Minutes Mon 3 Dec 2007
Attendance
There were many problems with the conferencing facilities, so some of
those listed as absent were probably unable to connect.
EGEE
- Asia Pacific ROC: Min
- Central Europe ROC: Someone?
- OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Steve Traylen
- French ROC: Gilles
- German/Swiss ROC: Clemens, Sven
- Italian ROC: Alessandro
- Northern Europe ROC: Apologies.
- Russian ROC: Lev
- South East Europe ROC: Kostas
- South West Europe ROC: Kai, Gonzalo
- UK/Ireland ROC: Unable to connect.
- GGUS: Thorsten
- OSCT: Absent
WLCG
- WLCG Service Coordination: Absent
WLCG Tier 1 Sites
- ASGC: Min
- BNL: Absent
- CERN site: Ignacio Reguero
- FNAL: Absent
- FZK: Sven
- IN2P3: Absent
- INFN: Alessandro, Alfrede
- NDGF: Absent
- PIC: Gonzalo
- RAL: Unable to connect.
- SARA/NIKHEF: Apologies
- TRIUMF: Rod Walker
Reports Not Received
- WLCG Tier 1s: None
- VOs: CMS, ALICE, ATLAS
- EGEE ROCs (Prod Sites): South Eastern Europe
- EGEE ROCs (PPS Sites): AP, IT, SWE, SEE
Feedback on Last Week's Minutes
None were given.
EGEE Items
Grid Operator Hand Over on Duty
| | Primary Team | Secondary Team |
| From | Germany/Switzerland | Taiwan |
| To | SouthWestern | UK/I |
- Quite a few nodes appear in the alarm table although they have monitoring disabled in GOCDB. The change may have been made only recently:
- srm-v2.cr.cnaf.infn.it
- gridse2.pg.infn.it
- dcsrmv2.usatlas.bnl.gov
- Open a GGUS ticket and include as much information as possible.
- 11/28: The CIC dashboard could not be accessed because of a connection problem to GOCDB.
- Ru-Troitsk-INR-LCG2 and KTU-BG-GLITE have not updated their CA RPMs to the latest version. Mail was sent to the site managers.
- New tickets: 24
- 2nd mail: 8
- Quarantine: 14
- Extend: 11
- Close: 19
PPS Reports
Interested sites are invited to come forward and contact Mario David, the coordinator of the pre-deployment, who will gladly provide the technical information they need.
We would be particularly happy to receive volunteers for this activity, in the framework of the "Special support to PPS Operations", from those certified PPS sites which do not yet appear in the lists at
http://www.cern.ch/pps/index.php?dir=./panel/
namely:
- DESY-PPS
- FZK-PPS
- GSI-LCG2-PPS
- SCAI-PPS
- PPS-SiGNET
- PreGR-01-UoM
- PreGR-01-UPATRAS
Suggestions from the ROCs/PPS sites on possible deployment scenarios for the AMGA service in PPS are also very welcome.
We are actively looking for a user community interested in trying out the newly released postgres-based version of AMGA.
Release News
- gLite3.1.0-PPS-UPDATE10 was released to PPS. This update introduces a number of new services to gLite 3.1 for SL4 (32 bit):
- glite-AMGA_postgres
- glite-LFC_mysql
- glite-LFC_oracle
- glite-PX
- glite-SE_dpm_disk
- glite-SE_dpm_mysql
- glite-VOMS_mysql
- glite-VOMS_oracle
- Records of the pre-deployment testing can be found at http://www.cern.ch/pps/index.php?dir=./release/testreports/
- Release of gLite3.1 Update07 to production is in preparation (to be announced early this week). This release will contain:
- JobWrapper tests - new version with no R-GMA dependencies
- glite-VOMS_mysql metapackage for gLite 3.1 and SL(C)4
- glite-VOMS_oracle metapackage for gLite 3.1 and SL(C)4
- Bug fixes for UI and WN
EGEE Items From ROC Reports
- SAM APEL test
- When is it scheduled to become a critical test? This was discussed at the ROC managers' meeting. ROC managers will take care of sites that aren't publishing accounting data. Once sites start publishing the data, the test will become critical. At the moment a lot of sites would fail. This will be discussed again at the next ROC managers' meeting (next Tuesday). Hopefully by then a reasonable number of sites will be publishing.
- Availability report
- The CYFRONET-LCG2 Tier-2 site remarks that, when analyzing availability reports, it is hard to determine the reason for decreased availability because the tools which affect (FCR) and compute (GridView) availability are based on SAM results, which are available only for the last 7 days. We are aware that a longer history is a performance problem, but maybe it would be possible to provide an interface to show some short period of SAM results in the past? We will check this with the GridView people and report back next week.
- SFU-LCG
- We have 400 queued atlasprd jobs for a 10-CPU cluster. Some SFT jobs fail because they could not be run for a long time. Rod Walker: the problem is fixed.
- CERN-PROD
- Soon after the release of GGUS we received a number of update e-mails from GGUS concerning the verification done by users of (sometimes) very old tickets. As the corresponding tickets were already frozen in our internal TT system, this caused a lot of new tickets to be opened. The issue was not systematic, in the sense that it did not concern tickets across the whole history, but it was nevertheless significant. We are asking the GGUS team if they are aware of possible causes. We think a post-mortem analysis would be worthwhile in order to correctly record and address the issue for future releases. GGUS reported in the meeting that this was a mistake and the e-mails should not have gone out. It was a one-off in any case.
- ROC-DECH
- Some sites are unsure about the correct procedure for introducing new service nodes into the production environment. Now that GOCDB no longer allows sites to switch off monitoring, should sites initially put the nodes in "maintenance"? This was discussed and seems logical. A request will be made to add this to the operations manual.
- RUSSIA
- It seems that some users try to submit jobs to the sites bypassing the RB/WMS system, directly using CE job submission APIs or globus tools. What should we do about this (i.e. don't care, encourage, prohibit in some way)?
- There was some discussion. In particular, Ireland blocks all submissions except those from trusted RBs, but this is not suitable everywhere. It comes down to tracking down the users and contacting them if they are being antisocial.
- RUSSIA
- ALICE have asked the Russian ROC to install a working PBS client on their VOBoxes. Some discussion occurred at the meeting, and a link is to be provided to the published description of the ALICE VO box: VoBoxesInfo
WLCG Items
None
Tier1 Reports
None
WLCG issues coming from ROC reports
None
Upcoming WLCG Service Interventions
- Major intervention on all VOMS and VOMRS production services at CERN on Monday. All components, as well as the database schema itself, will be changed dramatically. All LHC VOMS and VOMRS services will be unavailable on Monday morning, 8-11 UTC.
FTS Service Review
See TransferOperationsWeeklyReports, though there does not appear to be a report this week.
ATLAS Service
ALICE Service
CMS Service
LHCb Service
Last Friday the CNAF site admins carried out an extraordinary emergency intervention and, following the usual procedure described at
http://cic.gridops.org/index.php?section=home&page=SDprocedure
they put the CNAF batch farm in Scheduled Downtime (until today).
The procedure foresees a broadcast message being sent to the affected people (for LHCb this is the lhcb-production mailing list). We didn't receive any message, and it would be good to understand why. Since the procedure is very well defined (which minimizes the possibility of errors on the sysadmin side), I tend to believe that the broadcast tool didn't work properly this time, causing some disruption to the daily activity of LHCb. Could the relevant people (those maintaining these tools) look into this?
From the meeting: both CNAF and IN2P3 (CIC portal) are looking into it.
WLCG Service Coordination
Review of Action Items
Next Meeting
The next meeting will be Monday, 10 Dec 2007 14:00 UTC (16:00 Swiss local time).
- Attendees can join from 13:45 UTC (15:45 Swiss local time) onwards.
- The meeting will start promptly at 14:00 UTC.
- The WLCG section will start at the fixed time of 16:30.
- To dial in to the conference:
- Dial +41227676000
- Enter access code 0157610