WLCG Operations Coordination Minutes, January 7th 2016
Highlights
- WLCG workshop: registration closes on 22nd January.
- The HTTP TF can be closed once 90% of the sites have been shown as correctly configured, without interruption, for over a week. The TF has opened GGUS tickets giving the sites all relevant instructions.
- The Multicore Deployment TF announced that WLCG users should mainly use the Tier-1 and Tier-2 accounting views, which now use the same data as the production portal (i.e. they include core counts).
Agenda
Attendance
- local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath, David Cameron, Andrea Manzi, Jerome Belleman, Julia Andreeva, Gavin McCance, Helge Meinhard, Oliver Keeble, Marian Babik, Xavier Espinal,
- remote: Michael Ernst, Christoph Wissing, Jeremy Coles, Massimo Sgaravatto, Catherine Biscarat, Alessandra Doria, Gareth Smith, Rob Quick, Ulf Tigerstedt, Alessandra Forti, Di Qing, Renaud Vernet, Dave Mason, Daniele Bonacorsi, Antonio Yzquierdo, Josep Flix, Zoltan Mathe (LHCb), Javier Sanchez, Federico Melaccio, Anton Gamel, B. Jashal (T2_IN_TIFR).
- apologies: Vincenzo Spinoso (EGI)
Operations News
- Andrea Sciaba has stopped working in WLCG Operations. Many thanks for his valuable contribution! Maria Dimou and Maria Alandes will remain part of the Operations Coordination team at CERN, together with Pepe and Alessandra.
- WLCG workshop: registration closes on 22nd January.
- Memory limits for batch queues: at the MB of 27.10.2015 it was decided to put an action on WLCG Operations to produce a set of recipes on how best to configure memory limits for batch queues. Operations Coordination will open GGUS tickets to a selection of sites (mostly T1s and a few T2s). Please be ready to provide the necessary input. Thanks in advance.
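To illustrate the kind of recipe being requested (a generic sketch only, not a site recommendation; the 4 GiB value is an assumed example), a batch-system-independent way to cap a job's address space is a ulimit in the job wrapper:

```shell
# Illustrative sketch: cap the virtual memory of a job from its wrapper,
# independently of the batch system. The 4 GiB value is an assumption.
LIMIT_KB=$((4 * 1024 * 1024))   # 4 GiB, expressed in kB as ulimit expects
(
    ulimit -v "$LIMIT_KB"       # applies to this subshell and its children
    ulimit -v                   # print the effective limit for verification
)
```

Actual site recipes would more likely rely on each batch system's native mechanisms (e.g. cgroup-based limits), which is precisely the information the GGUS tickets aim to collect.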
Middleware News
- Useful Links:
- Baselines:
- As reported before the holidays, a problem affecting dCache pools at versions > 2.12 that use Berkeley DB as the metadata backend could lead to data loss. For such installations the baselines are now dCache 2.12.28, 2.13.6 and 2.14.5; the upgrade details circulated by the dCache developers are available at: https://twiki.cern.ch/twiki/pub/LCG/WLCGBaselineVersions/dcache-bug.txt
- Issues:
- T0 and T1 services:
- JINR
- dCache upgraded to v 2.10.48
- TRIUMF
- dCache upgraded to v 2.10.44
Maria Alandes asked whether Red Hat has released the OpenLDAP fixes that we tested successfully. The answer is 'not yet'.
Tier 0 News
- Condor: 86 kHS06 → 96 kHS06 out of a total of 784 kHS06 since 10 Dec
DB News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Best wishes for 2016!
- Normal to high activity levels during the break
- Thanks to the sites for keeping things in good shape!
- The first round of the heavy-ion reconstruction finished!
- CASTOR issues
- Dec 18: alarm ticket GGUS:118443 because the transfer manager was stuck
- Fixed later that afternoon, thanks!
- Dec 31: team ticket GGUS:118554 because of the same problem
- OK again since Jan 1 00:00, thanks!
- Jan 5: team ticket GGUS:118619, ditto
- debugged live by the devs
- root cause was not found yet
- EOS issues
- Dec 31: team ticket GGUS:118559 for EOS at CERN
- Partly due to EOS-ALICE being ~full!
- Some disk servers were unavailable
- Mitigated by the admins, thanks!
- KIT
- Dec 31: tape SE working again, thanks!
ATLAS
- Smooth operations over the whole Christmas break, with an almost steady 230-250k parallel running slots.
- Reprocessing:
- Almost the entire reprocessing campaign (around 1.8 PB of RAW input data) was finished during the Christmas break.
- This is quite a remarkable result: in the past, comparable reprocessing campaigns took 4-6 weeks.
- Thanks to the sites, which were extremely stable during the Christmas period, and to the experts who made sure the few issues were quickly understood and solved.
- FTS3:
- Another potentially quite dangerous bug.
- A few lost files (registered in Rucio but not on storage) were noticed on Monday; it took a few days to understand the issue, and today an email was sent to the FTS developers.
- Minor: some sites noticed that a very small number of jobs (event generation using the MadGraph library) were causing trouble on the WNs where they ran.
- This is because they produce a large amount of output, and the output log tarball contained data files.
- The problem has been understood. A fix in the ATLAS transformation is needed, which may take a few weeks to implement and put in production, so it was also decided to add a safety check to the pilot to make sure the problem is caught before it causes trouble on the WNs. This fix will most probably be released in a week to 10 days from now.
Maria Alandes asked about the reasons for the reduced reprocessing time. They are multiple: many more cores, better network performance and improved software quality.
CMS
- Happy New Year to everyone!
- Rather high production load over the Christmas break
- Ran more than 100k jobs in parallel on many days
- HLT (High Level Trigger) contributed a few thousand cores
- No major issues
- Tier-0 / PromptRECO
- Backlog of pending jobs not fully cleared during the break
- Partly due to a lack of resources at CERN
- Needed help from experts to provision fresh VMs (GGUS:118546)
- Tape operations
- Had a rather long backlog of unapproved tape migrations at FNAL before the Christmas break
- Sorted out via CMS site contacts
- Some datasets not moving at RAL
LHCb
- Activities:
- Monte Carlo and user analysis.
- Pre-staging the data for re-stripping is almost finished.
- Issue:
- Problem pre-staging files at RRCKI
- The Nickname VOMS attribute cannot be retrieved (GGUS:118361)
There was a discussion on why the above ticket has had no activity since Dec 16th and has the status "on hold". It should be followed up by LHCb offline.
Ongoing Task Forces and Working Groups
gLExec Deployment TF
Machine/Job Features TF
HTTP Deployment TF
- ETF is up and running in preprod
- A first set of tickets, around 20, has been assigned to sites.
- The next TF meeting has been confirmed as 20th Jan - https://indico.cern.ch/event/473194/
- The meeting will concentrate on setting up the operational plan for the campaign to get the monitoring green.
Information System Evolution
- IS TF meeting scheduled tomorrow, Friday 8th January. (Agenda)
- Definitions: summary of the proposed definitions and feedback from sys admins.
- Status of new IS: news on the feedback given so far by experiments.
- Preparation for the WLCG workshop discussion about the IS.
IPv6 Validation and Deployment TF
Middleware Readiness WG
The JIRA dashboard shows, per experiment and per site, the product versions pending Readiness verification. Changes since the Ops Coord meeting of Dec 17th are few, due to the year-end holidays. Details:
Multicore Deployment
- Accounting:
- John Gordon's update: the default EGI view has not changed, but WLCG users should mainly use the Tier-1 and Tier-2 views (e.g. http://accounting.egi.eu/tier1.php), which now use the same data as the production portal (i.e. they include core counts). The EMI3 (WLCG) view also includes cores and is useful for an integrated view of a country, covering its Tier-1, Tier-2, Tier-3 and other sites.
- On the ATLAS side, work is ongoing to compare accounting records in the dashboard and in APEL, site by site for the T1s and region by region for the T2s.
Network and Transfer Metrics WG
- Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard); a minor instability in the dashboard, reported yesterday, is being followed up by OSG.
- Additional monitoring metrics will be added to psomd.grid.iu.edu to capture collector's efficiency and report on freshness of the metadata in the OSG Datastore (for each sonar).
- Proposed re-organization of the WG meetings, split into two areas, perfSONAR operations (throughput calls) and research/pilot projects
- perfSONAR operations: the main scope would be to continue perfSONAR support and follow up on the existing infrastructure, while also starting to look into issues already shown by the existing tools and trying to fix them at the source. As this scope is well aligned with the existing North American throughput calls, the meetings could alternate and common notes be published.
- Research/pilot projects - will have separate on-demand meetings with notes published to WG mailing list
- F2F meeting once a year, co-located with GDB or other workshop/conference
- Pilot projects: LHCb DIRAC bridge available online
RFC proxies
Squid Monitoring and HTTP Proxy Discovery TFs
Action list
| Creation date | Description | Responsible | Status | Comments |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Maarten | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm; otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets were opened for incorrect SRM and MyProxy certificates; most are already closed. OSG and EGI were contacted (Maarten alerted the few affected EGI sites as well). On Jan 7 two tickets remain open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress). Maarten will follow up on the progress of these tickets. They will be mentioned at the 3pm Ops call on Jan 11th. |
| 2015-10-01 | Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites | SCOD team | CLOSE & open new | A Google calendar is not yet available, so for the moment the only way is to check GOCDB and OIM for downtimes, using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting. Julia explained why implementing a Google calendar for future downtimes in the SSB is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and use the current simple links for the OSG T1s. It was therefore agreed to close this action and open a new one for the GOCDB team to implement the feature in GOCDB, as they already agreed to do some time ago. Next year they will be contacted to define a reasonable timescale. At the Jan 7th meeting, Maria Alandes reported that she is in touch with GOCDB and more news will hopefully come next week. |
| 2015-12-17 | Recommend site configurations to enforce memory limits on jobs | | CREATED | 1) create a twiki, 2) ask T0/T1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: the existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D., Maarten and Alessandra F. Status of Jan 12th: a new twiki, BatchSystemsConfig, was instead decided to be a better idea. Tickets opened. |
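Since the Globus action above hinges on host certificates carrying correct hostname entries, a site can pre-check a service certificate with standard OpenSSL tools (a generic sketch; "hostcert.pem" is a placeholder for the service's actual certificate file):

```shell
# Generic sketch: list the subjectAltName entries of a host certificate,
# which the stricter Globus name matching relies on. "hostcert.pem" is a
# placeholder path, not a WLCG-mandated location.
openssl x509 -in hostcert.pem -noout -ext subjectAltName
```

If the output lacks a DNS entry matching the service's hostname, the certificate is likely among those the GGUS tickets flag for replacement.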
Specific actions for experiments
Specific actions for sites
AOB
--
MariaDimou - 2016-01-05