WLCG Operations Coordination Minutes, August 20th 2015

Highlights


Agenda

Attendance

  • local: Andrea Manzi
  • remote:
  • apologies: Catherine Biscarat, Maarten (ALICE)

Operations News

  • The new WLCG Operations Portal is now online at http://wlcg-ops.web.cern.ch/. Please check the portal and do not hesitate to send us your feedback!
  • We are also calling for volunteers to write articles on topics of interest for WLCG Operations. There is a section in the portal for this: http://wlcg-ops.web.cern.ch/articles. Please help us keep the portal attractive and useful for everyone in WLCG by submitting an article in which you share your expertise as a sys admin, member of an experiment, or task force or working group coordinator, describing new technologies, experience running a service, etc. If you are interested, please send a mail to wlcg-ops-coord-chairpeople and we will give you more details.

Middleware News

  • Baselines:
    • We would like to understand if the experiments are pushing for the deployment of Xrootd v4 in order to set it as baseline.
  • Issues:
    • NL-T1 reported an issue with dCache 2.10.35 at last week's ops meeting: when a pool queries its inventory, which is stored in a Berkeley DB, and the DB is locked by another query and does not become available within the default 500 ms timeout, the pool is disabled (this happened only under heavy load). If other installations see the same issue, one possible workaround is to move the DB to a different block device to improve performance.

  • T0 and T1 services
    • CERN
      • As RFIO is obsolete, its possible decommissioning will be discussed at the end of 2015, to take place in 2016 or later. As a first step, RFIO v2 will be closed for write operations only (rfcp will keep working for both reads and writes) in a few weeks. No RFIO v2 activity has been observed for several months in the CASTOR LHC instances.
    • NDGF
      • dCache upgraded to 2.13.4

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • high activity
    • new record of 97k concurrently running jobs briefly reached on Aug 10
      • taking advantage of resources normally occupied by other VOs
  • new US T2 site ORNL has ramped up this week
    • 896 job slots
    • 1 PB EOS storage
  • CERN
    • continued intermittent issues with raw data access by reco jobs
      • the CASTOR team applied various mitigations to allow those jobs to progress
      • a more robust solution will be investigated when all experts are back

ATLAS

  • Apologies, due to holidays no one from ATLAS can attend today
  • Last week a CERN-wide update of SLC6 suspiciously coincided with problems in pilot factories, which drained ATLAS jobs for a few days
  • Apart from that, normal high activity. The last expected bulk reprocessing of 2015 data this year is getting underway now
  • Problems copying data from EOS to Castor are discussed in GGUS:115680

CMS

  • It's a holiday period
    • Rather little activity for Computing
    • New larger MC campaign (~1B events) still in preparation
  • Tier-0
    • PromptRECO (mostly) switched to multi-threaded applications
    • Still some issues being cured
  • Transfer Problem from CASTOR to (CERN) EOS
    • Issues with third-party transfers
    • Needed a fix on CASTOR
    • Details: GGUS:115598
  • Outages of DAS (Data Aggregation Service) last week
    • The service got overloaded and caused further problems for other CMS web services
    • Some tools have been changed to query the underlying services directly (bypassing the Aggregation Service)
  • Followup from last meeting
    • Badly performing VMs in Tier-0 infrastructure
      • Also observed by CMS
      • Occasionally causes long tails
      • Suspected to be caused by oversubscribed hypervisors
      • Details: GGUS:115120
    • Ticket handling at CERN
      • Some tests by Maite and CMS experts
      • Close to establishing a workflow

LHCb

  • Operations
    • Currently finishing the 25ns data validation (with a limited number of runs). The second production with a new calibration for the reconstruction of Full, Turbo and Turbo-calibration streams has ~90% data processed.
    • Stripping production will be released and launched later today.
    • Sometime next week we will likely go to full production for the processing of 25ns data. Input data is on disk-resident BUFFER spaces.
  • Developments
    • Submission to HTCondorCE (the previous solution, with ARC CE to HTCondor, was successfully tested and used in production): a certificate problem preventing submission is under investigation.
  • Issues
    • Temporary glitches at some T1s, promptly solved

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

RFC proxies

  • NTR

Machine/Job Features

Middleware Readiness WG

  • NTR

Multicore Deployment

IPv6 Validation and Deployment TF

Squid Monitoring and HTTP Proxy Discovery TFs

  • Nothing to report

Network and Transfer Metrics WG

HTTP Deployment TF

Information System Evolution

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-07-02 | Provide a description of each experiment computing support structure, so tickets wrongly assigned to the T0 (via SNOW or GGUS) can be properly redirected; evaluate the creation of SNOW Functional Elements for the experiments, if this is not already the case | all | n/a | ALICE done. ATLAS draft under discussion. CMS will discuss with Maite next week. LHCb pending | July 30th. Extended to August 20th | ~50% |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-06-18 | Some sites have still not enabled multicore accounting | All | Multicore Deployment | Instructions here | a.s.a.p. | Almost DONE. HERE is the list of the remaining still pending sites. |
| 2015-06-18 | CMS asks all T1 and T2 sites to provide Space Monitoring information to complete the picture on space usage at the sites. Please have a look at the general description and also some instructions for site admins. | CMS | - | | None yet | ~10 T2 sites missing, Ticket open |

AOB

-- AndreaManzi - 2015-08-11

Topic revision: r14 - 2015-08-20 - MaartenLitmaath