WLCG Operations Coordination Minutes, August 20th 2015
Highlights
- Calling for volunteers to write articles on interesting topics for WLCG Operations. There is a section in the portal for this: http://wlcg-ops.web.cern.ch/articles. If you are interested, please send a mail to wlcg-ops-coord-chairpeople for details.
- In a few weeks CERN will disable write operations to CASTOR via RFIO v2, as a first step in decommissioning RFIO access.
- A downtime of the Argus service at CERN is expected due to the pending filer migration. No date has been set yet, but it will affect Argus, all CEs and all batch worker nodes (gLExec) running grid jobs. The downtime is foreseen to last 1h-2h.
- Issues with VMs in the Tier-0 infrastructure were also reported by CMS
- LHCb is trying to integrate submission to the HTCondorCE instance @ CERN
- A pre-production message broker infrastructure has been set up at CERN to enable distribution of perfSONAR data to the experiments. OSG is enabling data publication from the ITB collector service
Agenda
Attendance
- local: Andrea Manzi (Chair, Minutes), Andrea Valassi, Marian Babik, Luca Tomassetti, Alberto Peon, Christoph Wissing, Giuseppe Lo Presti
- remote: Hung-Te Lee, Thomas Hartmann, Jeremy Coles, Steve Jones, Rob Quick, Di Qing
- apologies: Catherine Biscarat, Maarten Litmaath, Josep Flix, Andrea Sciaba', Maria Alandes, Maria Dimou, Alessandra Forti, Maite Barroso
Operations News
- The new WLCG Operations Portal is now online: http://wlcg-ops.web.cern.ch/. Please check the portal and do not hesitate to send us your feedback!
- We are also calling for volunteers to write articles on interesting topics for WLCG Operations. There is a section in the portal for this: http://wlcg-ops.web.cern.ch/articles. Please help us keep the portal attractive and useful for everyone in WLCG by submitting an article in which you share your expertise as a sys admin, member of an experiment, or task force or working group coordinator, for example describing new technologies or your experience running a service. If you are interested, please send a mail to wlcg-ops-coord-chairpeople and we will give you more details.
Middleware News
- Baselines:
- We would like to understand if the experiments are pushing for the deployment of Xrootd v4 in order to set it as baseline.
- Issues:
- NL-T1 reported an issue with dCache 2.10.35 at last week's ops meeting: if a pool queries its inventory, which is kept in a Berkeley DB, while the DB is locked by another query and does not become available within the default 500 ms timeout, the pool is disabled (this happened only under heavy load). For other installations that see the same issue, one possible workaround is to move the DB to a different block device to improve performance.
- T0 and T1 services
- CERN
- As RFIO is obsolete, its possible decommissioning will be discussed at the end of 2015, to take place in 2016 or later. As a first step, RFIO v2 will be closed for write operations only, in a few weeks (rfcp will keep working for both reads and writes). No RFIO v2 activity has been observed for several months in the CASTOR LHC instances.
- NDGF
- dCache upgraded to 2.13.4
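The failure mode in the dCache issue reported by NL-T1 above (a lookup that gives up when the database lock is not released within a timeout) can be illustrated with a minimal sketch. This is not dCache code: the `Pool` class, its fields and the shortened timeout used in the demo are hypothetical, and a plain Python lock stands in for the Berkeley DB lock.

```python
import threading

INVENTORY_TIMEOUT_S = 0.5  # the 500 ms default mentioned in the report


class Pool:
    """Hypothetical stand-in for a storage pool; not dCache code."""

    def __init__(self, db_lock, timeout_s=INVENTORY_TIMEOUT_S):
        self.db_lock = db_lock
        self.timeout_s = timeout_s
        self.enabled = True

    def query_inventory(self):
        # Wait for the database lock, but only up to the timeout.
        if not self.db_lock.acquire(timeout=self.timeout_s):
            # Lock still held by another query: the pool is disabled.
            self.enabled = False
            return False
        try:
            return True  # the inventory would be read here
        finally:
            self.db_lock.release()


db_lock = threading.Lock()
pool = Pool(db_lock, timeout_s=0.05)  # short timeout to keep the demo fast

ok_when_free = pool.query_inventory()  # lock is free, query succeeds

db_lock.acquire()                      # simulate another long-running query
ok_when_busy = pool.query_inventory()  # times out, pool gets disabled
db_lock.release()
```

In this picture the workaround of moving the DB to a faster block device shortens the time the lock is held, so fewer lookups reach the timeout.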
Tier 0 News
- A downtime of the Argus service is expected due to the pending filer migration. Since we don't have the NFS server box yet, we cannot say exactly when this will be, but it will affect Argus, all CEs (all flavors in all instances) and all batch worker nodes (gLExec) running grid jobs. Running jobs will therefore also be affected. The downtime is foreseen to last 1h-2h.
Christoph Wissing commented that a downtime of 1 or 2 hours is not an issue for CMS.
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- high activity
- new record of 97k concurrently running jobs briefly reached on Aug 10
- taking advantage of resources normally occupied by other VOs
- new US T2 site ORNL has ramped up this week
- 896 job slots
- 1 PB EOS storage
- CERN
- continued intermittent issues with raw data access by reco jobs
- the CASTOR team applied various mitigations to allow those jobs to progress
- a more robust solution will be investigated when all experts are back
Giuseppe Lo Presti commented that the CASTOR team has worked on a fix for the issue and plans to install it during the next technical stop in September.
ATLAS
- Apologies, due to holidays no one from ATLAS can attend today
- Last week a CERN-wide update of SLC6 suspiciously coincided with problems in pilot factories, which drained ATLAS jobs for a few days
- Apart from that, normal high activity. The last expected bulk reprocessing of 2015 data this year is getting underway now
- Problems copying data from EOS to Castor are discussed in GGUS:115680
CMS
- It's a holiday period
- Rather little activity for Computing
- New larger MC campaign (~1B events) still in preparation
- Tier-0
- PromptRECO (mostly) switched to multi-threaded applications
- Still some issues being cured
- Transfer Problem from CASTOR to (CERN) EOS
- Issues with third-party transfers
- Needed a fix on CASTOR
- Details: GGUS:115598
- Outages of DAS (Data Aggregation Service) last week
- The service got overloaded and caused further problems for other CMS web services
- Some tools were changed to query the underlying services directly (bypassing the Aggregation Service)
- Followup from last meeting
- Badly performing VMs in Tier-0 infrastructure
- Also observed by CMS
- Occasionally causes long tails
- Suspected to be caused by oversubscribed hypervisors
- Details: GGUS:115120
- Ticket handling at CERN
- Some tests by Maite and CMS experts
- Close to establishing a workflow
Andrea Valassi commented that the issue with VMs in the Tier-0 had already been reported by the other experiments and that Helge Meinhard gave a presentation on this topic at the last WLCG MB.
Christoph Wissing commented that CMS would like to have Xrootd 4 deployed in the infrastructure.
LHCb
- Operations
- Currently finishing the 25ns data validation (with a limited number of runs). The second production with a new calibration for the reconstruction of Full, Turbo and Turbo-calibration streams has ~90% data processed.
- Stripping production will be released and launched later today.
- Sometime next week we will likely go to full production for the processing of 25ns data. Input data is on disk-resident BUFFER spaces.
- Developments
- Submission to HTCondorCE (the previous solution, ARC CE to HTCondor, was successfully tested and used in production): a problem with certificates that prevents submission is under investigation.
- Issues
- Temporary glitches at some T1s, promptly solved
The issue reported with HTCondorCE refers to the instance running at CERN.
Ongoing Task Forces and Working Groups
gLExec Deployment TF
RFC proxies
Machine/Job Features
Middleware Readiness WG
Multicore Deployment
IPv6 Validation and Deployment TF
Squid Monitoring and HTTP Proxy Discovery TFs
Network and Transfer Metrics WG
- Established production and validation ActiveMQ brokers at CERN (netmon-mb.cern.ch and netmon-test-mb.cern.ch); they will be used to broadcast data collected by perfSONARs to the experiments.
- OSG will test-enable publishing of the perfSONAR results to the netmon-test-mb.cern.ch from the ITB collector service.
- Proximity service: developed a mapping matrix that experiments could use to map storages to sonars and to process the perfSONAR stream. Currently tested by LHCb, which is developing a perfSONAR to DIRAC connector.
- A new project to analyse meshes and distinguish infrastructure issues from network problems is being developed at AGLT2 (MadAlert, http://maddash.aglt2.org/madalert.html). The plan is to continue developing it, with the eventual aim of automating problem finding.
- perfSONAR operations status
- Progress made on the WLCG-wide meshes, latency mesh now with 70 sonars.
- Validation of the perfSONAR 3.5rc1 started, final release expected in October.
- ESnet is finalizing the design document for the perfSONAR configuration interface, an open source project motivated by the existing version developed for WLCG.
Rob Quick commented that OSG will provide the SLA document regarding the ITB data publication to ActiveMQ next week.
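As an illustration of the proximity mapping idea mentioned above, the following minimal sketch shows how a storage-to-sonar mapping could be inverted and used to attach storage pairs to a measurement. The host names, the record fields (`source`, `destination`, `value`) and both helper functions are assumptions made for this example; they do not describe the actual proximity service or the perfSONAR message format.

```python
from collections import defaultdict

# Hypothetical mapping matrix: storage endpoint -> nearby perfSONAR hosts.
PROXIMITY = {
    "se.site-a.example": ["ps-lat.site-a.example"],
    "se.site-b.example": ["ps-lat.site-b.example"],
}


def sonar_to_storages(proximity):
    """Invert the mapping so a sonar host can be looked up directly."""
    inverse = defaultdict(list)
    for storage, sonars in proximity.items():
        for sonar in sonars:
            inverse[sonar].append(storage)
    return dict(inverse)


def annotate(measurement, inverse):
    """Return (source storage, destination storage, value) tuples
    for one measurement record with the assumed fields."""
    src = inverse.get(measurement["source"], [])
    dst = inverse.get(measurement["destination"], [])
    return [(s, d, measurement["value"]) for s in src for d in dst]


inverse = sonar_to_storages(PROXIMITY)
record = {
    "source": "ps-lat.site-a.example",
    "destination": "ps-lat.site-b.example",
    "value": 12.5,
}
pairs = annotate(record, inverse)
```

Measurements between sonars that are not in the mapping simply yield an empty list, so unknown hosts in the stream can be skipped without special handling.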
HTTP Deployment TF
Information System Evolution
Action list
| Creation date | Description | Responsible | Status | Comments |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. |
Specific actions for experiments
| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-07-02 | Provide a description of each experiment computing support structure, so tickets wrongly assigned to the T0 (via SNOW or GGUS) can be properly redirected; evaluate the creation of SNOW Functional Elements for the experiments, if this is not already the case | all | n/a | ALICE done. ATLAS draft under discussion. CMS almost done. LHCb done | July 30th. Extended to September 3rd | ~75% |
Specific actions for sites
AOB
--
AndreaManzi - 2015-08-11