WLCG Operations Coordination Minutes, August 20th 2015

Highlights

  • Calling for volunteers to write articles on interesting topics for WLCG Operations. There is a section in the portal for this: http://wlcg-ops.web.cern.ch/articles. In case you are interested, please send a mail to wlcg-ops-coord-chairpeople for details.
  • In a few weeks, CERN will disable write operations to CASTOR via RFIO v2, as a first step in decommissioning RFIO access.
  • A downtime for the Argus service @ CERN is expected due to the pending filer migration. No dates yet but it will affect ARGUS, all CEs and all batch worker nodes (glExec) running GridJobs. The downtime is foreseen to last 1h-2h.

  • Issues with VMs in Tier-0 infra also reported by CMS
  • LHCb is trying to integrate submission to the HTCondorCE instance @ CERN
  • A message broker pre-prod infra has been set up @ CERN to enable distribution of perfSONAR data to the experiments. OSG is enabling the data publication from the ITB collector service.


Agenda

Attendance

  • local: Andrea Manzi (Chair, Minutes), Andrea Valassi, Marian Babik, Luca Tomassetti, Alberto Peon, Christoph Wissing, Giuseppe Lo Presti
  • remote: Hung-Te Lee, Thomas Hartmann, Jeremy Coles, Steve Jones, Rob Quick, Di Qing
  • apologies: Catherine Biscarat, Maarten Litmaath, Josep Flix, Andrea Sciaba', Maria Alandes, Maria Dimou, Alessandra Forti, Maite Barroso

Operations News

  • The new WLCG Operations Portal is now online: http://wlcg-ops.web.cern.ch/. Please check the portal and do not hesitate to send us your feedback!
  • We are also calling for volunteers to write articles on interesting topics for WLCG Operations. There is a section in the portal for this: http://wlcg-ops.web.cern.ch/articles. Please help us keep the portal attractive and useful for everyone in WLCG by submitting an article in which you share your expertise as a sysadmin, member of an experiment, or task force or working group coordinator, for example by describing new technologies or your experience running a service. If you are interested, please send a mail to wlcg-ops-coord-chairpeople and we will give you more details.

Middleware News

  • Baselines:
    • We would like to understand if the experiments are pushing for the deployment of Xrootd v4 in order to set it as baseline.
  • Issues:
    • NL-T1 reported an issue with dCache 2.10.35 at last week's ops meeting: if a pool queries its inventory, which is stored in a Berkeley DB, while the DB is locked by another query and does not become available within the default 500 ms timeout, the pool is disabled (this happened only under heavy load). If other installations see the same issue, one possible workaround is to move the DB to a different block device to improve performance.

  • T0 and T1 services
    • CERN
      • As RFIO is obsolete, its possible decommissioning will be discussed at the end of 2015, to take place in 2016 or later. As a first step, only write operations via RFIO v2 will be disabled in a few weeks (rfcp will keep working for both reads and writes). No RFIO v2 activity has been observed for several months in the CASTOR LHC instances.
    • NDGF
      • dCache upgraded to 2.13.4

Tier 0 News

  • A downtime for the Argus service is expected due to the pending filer migration. Since we do not have the NFS server box yet, we cannot say exactly when this will be, but it will affect Argus, all CEs (all flavors in all instances) and all batch worker nodes (glExec) running GridJobs. Running jobs will therefore also be affected. The downtime is foreseen to last 1h-2h.

Christoph Wissing commented that a downtime of 1 or 2 hours is not an issue for CMS.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • high activity
    • new record of 97k concurrently running jobs briefly reached on Aug 10
      • taking advantage of resources normally occupied by other VOs
  • new US T2 site ORNL has ramped up this week
    • 896 job slots
    • 1 PB EOS storage
  • CERN
    • continued intermittent issues with raw data access by reco jobs
      • the CASTOR team applied various mitigations to allow those jobs to progress
      • a more robust solution will be investigated when all experts are back

Giuseppe Lo Presti commented that the CASTOR team has worked on a fix for the issue and plans to install it during the next technical stop in September.

ATLAS

  • Apologies, due to holidays no one from ATLAS can attend today
  • Last week a CERN-wide update of SLC6 suspiciously coincided with problems in pilot factories, which drained ATLAS jobs for a few days
  • Apart from that, normal high activity. The last expected bulk reprocessing of 2015 data this year is getting underway now
  • Problems copying data from EOS to Castor are discussed in GGUS:115680

CMS

  • It's a holiday period
    • Rather little activity for Computing
    • New larger MC campaign (~1B events) still in preparation
  • Tier-0
    • PromptRECO (mostly) switched to multi-threaded applications
    • Still some issues being cured
  • Transfer Problem from CASTOR to (CERN) EOS
    • Issues with third-party transfers
    • Needed a fix on CASTOR
    • Details: GGUS:115598
  • Outages of DAS (Data Aggregation Service) last week
    • The service got overloaded and caused further problems for other CMS web services
    • Some tools were changed to query the underlying services directly (bypassing the Aggregation Service)
  • Followup from last meeting
    • Badly performing VMs in Tier-0 infrastructure
      • Also observed by CMS
      • Occasionally causes long tails
      • Suspected to be caused by oversubscribed hypervisors
      • Details: GGUS:115120
    • Ticket handling at CERN
      • Some tests by Maite and CMS experts
      • Close to establishing a workflow

Andrea Valassi commented that the issue with VMs in Tier-0 had already been reported by the other experiments, and that Helge Meinhard also gave a presentation on this topic at the last WLCG MB.

Christoph Wissing commented that CMS would like to have Xrootd 4 deployed in the infrastructure.

LHCb

  • Operations
    • Currently finishing the 25ns data validation (with a limited number of runs). The second production with a new calibration for the reconstruction of Full, Turbo and Turbo-calibration streams has ~90% data processed.
    • Stripping production will be released and launched later today.
    • Sometime next week we will likely go to full production for the processing of 25ns data. Input data is on disk-resident BUFFER spaces.
  • Developments
    • Submission to HTCondorCE (the previous solution, ARC CE to HTCondor, was successfully tested and used in production): a problem with certificates that prevents submission is under investigation.
  • Issues
    • Temporary glitches at some T1s, promptly solved

The issue reported with HTCondorCE refers to the instance running @ CERN.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

RFC proxies

  • NTR

Machine/Job Features

  • NTR

Middleware Readiness WG

  • NTR

Multicore Deployment

  • NTR

IPv6 Validation and Deployment TF

  • NTR

Squid Monitoring and HTTP Proxy Discovery TFs

  • NTR

Network and Transfer Metrics WG


  • Established production and validation ActiveMQ brokers at CERN (netmon-mb.cern.ch and netmon-test-mb.cern.ch); they will be used to broadcast data collected by perfSONARs to experiments.
  • OSG will test-enable publishing of the perfSONAR results to the netmon-test-mb.cern.ch from the ITB collector service.
  • Proximity service: developed a mapping matrix that experiments could use to map storages to sonars and to process the perfSONAR stream. Currently being tested by LHCb, which is developing a perfSONAR-to-DIRAC connector.
  • A new project to analyse meshes and distinguish infrastructure issues from network problems is being developed at AGLT2 (MadAlert http://maddash.aglt2.org/madalert.html). The plan is to continue developing it, with the eventual aim of automating problem finding.
  • perfSONAR operations status
    • Progress made on the WLCG-wide meshes, latency mesh now with 70 sonars.
    • Validation of the perfSONAR 3.5rc1 started, final release expected in October.
    • ESnet is finalizing the design document for the perfSONAR configuration interface, an open source project motivated by the existing version developed for WLCG.
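
The storage-to-sonar mapping mentioned in the proximity service item above could be sketched as follows. This is only an illustration in Python, not the proximity service's actual data model or API: the host names, the record fields (`source`, `destination`, `latency_ms`) and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping matrix: storage element -> nearby perfSONAR hosts.
# All host names are placeholders, not real WLCG endpoints.
STORAGE_TO_SONARS = {
    "se1.example.org": ["ps-lat.example.org"],
    "se2.example.net": ["ps-lat.example.net", "ps-bw.example.net"],
}


def invert_mapping(storage_to_sonars):
    """Build the reverse lookup: perfSONAR host -> set of storages it covers."""
    sonar_to_storages = defaultdict(set)
    for storage, sonars in storage_to_sonars.items():
        for sonar in sonars:
            sonar_to_storages[sonar].add(storage)
    return dict(sonar_to_storages)


def annotate(records, storage_to_sonars):
    """Yield (src_storage, dst_storage, record) for each measurement whose
    two endpoints both map to a known storage; other records are skipped."""
    lookup = invert_mapping(storage_to_sonars)
    for record in records:
        sources = lookup.get(record["source"], ())
        destinations = lookup.get(record["destination"], ())
        for src in sorted(sources):
            for dst in sorted(destinations):
                yield src, dst, record


# Hypothetical measurement records, standing in for the perfSONAR stream.
SAMPLE_STREAM = [
    {"source": "ps-lat.example.org", "destination": "ps-lat.example.net", "latency_ms": 12.5},
    {"source": "ps-lat.example.org", "destination": "unknown.example.com", "latency_ms": 40.0},
]

for src, dst, rec in annotate(SAMPLE_STREAM, STORAGE_TO_SONARS):
    print(src, "->", dst, rec["latency_ms"])
```

In this sketch the second sample record is dropped because its destination sonar is not in the mapping. A real connector would presumably read records from the message broker rather than from an in-memory list.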

Rob Quick commented that OSG will provide the SLA document regarding the ITB data publication to ActiveMQ next week.

HTTP Deployment TF

  • NTR

Information System Evolution

  • NTR

Action list

Creation date | Description | Responsible | Status | Comments
2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1.

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion
2015-07-02 | Provide a description of each experiment computing support structure, so tickets wrongly assigned to the T0 (via SNOW or GGUS) can be properly redirected; evaluate the creation of SNOW Functional Elements for the experiments, if this is not already the case | all | n/a | ALICE done. ATLAS draft under discussion. CMS almost done. LHCb done | July 30th. Extended to September 3rd | ~75%

Specific actions for sites

Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion
2015-06-18 | Some sites have still not enabled multicore accounting | All | Multicore Deployment | Instructions here | a.s.a.p. | Almost DONE. HERE is the list of the remaining still pending sites.
2015-06-18 | CMS asks all T1 and T2 sites to provide Space Monitoring information to complete the picture on space usage at the sites. Please have a look at the general description and also some instructions for site admins. | CMS | - | | None yet | ~10 T2 sites missing, Ticket open

AOB

-- AndreaManzi - 2015-08-11

Topic revision: r18 - 2015-08-24 - AndreaManzi
 