WLCG Operations Coordination Minutes, July 30th 2015

Highlights


Agenda

Attendance

  • local: Maria Alandes (Minutes), Maria Dimou
  • remote:

Operations News

  • Following yesterday's GGUS release, this month's Did you know? article highlights the existing GGUS features that were requested in the WLCG Site Survey last autumn, as decided at this meeting a month ago.

Middleware News

  • Baselines:
    • Support for dCache 2.6.x ended in May 2015. The decommissioning deadline is 21/09/2015 and, starting from 31/08/2015, sites still running dCache 2.6.x will be ticketed (more details at https://wiki.egi.eu/wiki/Software_Calendars#dCache_v._2.6.x). About 20 instances are still running 2.6.x (no T1s).

  • Issues
    • A critical vulnerability affecting Red Hat 5, 6 and 7 was broadcast by EGI CSIRT (https://wiki.egi.eu/wiki/EGI_CSIRT:Alerts/libuser-2015-07-24); it allows a local root exploit. The vulnerability requires access via a local user account, so only UIs where access is granted via the local passwd file could be affected. A fix for the affected package (libuser) is available for Red Hat 6 and 7, but not for Red Hat 5, so sites should either upgrade to a newer OS release or apply the workaround described at https://access.redhat.com/articles/1537873 (an illustrative local check is sketched after the T0 and T1 services list below). According to the CSIRT procedures, sites not upgrading by 31/07 risk site suspension.

  • T0 and T1 services
    • JINR
      • dCache upgraded to 2.10.36
    • NL-T1
      • DPM upgraded to v1.8.9 at NIKHEF to fix a data transfer issue.
    • PIC
      • dCache upgraded to 2.10.37
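
  • Purely as an illustration of the advice above (this is not part of the EGI CSIRT advisory), the sketch below shows the kind of local check an admin could run to see whether a node still carries the vulnerable package; the logic and messages are assumptions, only the package name (libuser) and the workaround URL come from the broadcast.

    #!/usr/bin/env python
    # Illustrative sketch only: report whether this host still has the
    # vulnerable libuser package installed and which RHEL major release it
    # runs, so an admin can decide between updating the package (RHEL 6/7)
    # and applying the documented workaround (RHEL 5).
    import platform
    import subprocess

    def rhel_major_release():
        """Return the Red Hat major release number, or None if undetermined."""
        dist, version, _ = platform.linux_distribution()
        if any(name in dist for name in ("Red Hat", "Scientific", "CentOS")):
            return int(version.split(".")[0])
        return None

    def libuser_version():
        """Return the installed libuser package string, or None if absent."""
        proc = subprocess.Popen(["rpm", "-q", "libuser"],
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, _ = proc.communicate()
        return out.decode().strip() if proc.returncode == 0 else None

    if __name__ == "__main__":
        release = rhel_major_release()
        pkg = libuser_version()
        if pkg is None:
            print("libuser not installed - not affected")
        elif release in (6, 7):
            print("%s found on RHEL %s - update to the fixed package" % (pkg, release))
        else:
            print("%s found - no fixed package for this release; apply the "
                  "workaround from https://access.redhat.com/articles/1537873 "
                  "or upgrade the OS" % pkg)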

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • high activity
    • new record 83k briefly reached on Jul 29
  • CERN:
    • raw data copies from Point 2 to CASTOR were timing out (GGUS:115145)
      • raw data reconstruction jobs were keeping many disk server slots busy
      • many more slots appeared to be occupied by stale transfers
      • to be followed up further with the CASTOR devs
    • job submissions became really slow multiple times (GGUS:115153 and GGUS:115238)
      • some issues were cured on the Argus side
      • the root cause of these problems has not yet been identified
  • NDGF reported inefficient data transfers and noise in their logs
    • due to failed attempts with two copy methods before the third one succeeds
    • the Xrootd client only checks whether the source supports third-party copies
      • the destination should be checked as well (see the illustrative sketch after this list)
      • a bug report has been opened with the Xrootd developers
    • meanwhile a workaround has been applied on the ALICE side
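
  • Purely as an illustration of the behaviour described above (this is not the actual Xrootd client code), the sketch below shows the intended copy-method selection: a third-party copy is only attempted when both the source and the destination advertise support for it, otherwise the client falls back immediately to a copy routed through itself. The endpoint names and the capability table are invented for this example.

    # Toy capability table standing in for a real per-endpoint capability query.
    ENDPOINT_SUPPORTS_TPC = {
        "root://source.example.org": True,
        "root://destination.example.org": False,  # destination without TPC support
    }

    def supports_third_party_copy(endpoint):
        """Stand-in for asking an endpoint whether it supports third-party copy."""
        return ENDPOINT_SUPPORTS_TPC.get(endpoint, False)

    def choose_copy_method(source, destination):
        """Check BOTH endpoints (the intended fix) before attempting a TPC."""
        if supports_third_party_copy(source) and supports_third_party_copy(destination):
            return "third-party copy"
        # Falling back right away avoids the doomed attempts (and the log
        # noise reported by NDGF) that occur when only the source is checked.
        return "copy routed via the client"

    if __name__ == "__main__":
        print(choose_copy_method("root://source.example.org",
                                 "root://destination.example.org"))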

ATLAS

CMS

LHCb

  • Operations
    • Currently finishing a re-stripping of the Run 1 legacy data and of the data from the 50 ns Run 2 ramp-up
    • Discussions with the CERN LSF team about queue capabilities; problems were found both in LSF and in DIRAC (GGUS:115027)
    • Preparations for the 25 ns ramp-up are ongoing.
  • Developments
    • HammerCloud testing for LHCb is currently being revitalised. The probe will check the possibility of running user analysis jobs with protocol access at sites.
    • perfSONAR data extraction from WLCG sources is almost finished; work is ongoing on publishing the data into LHCbDIRAC

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

RFC proxies

  • NTR

Machine/Job Features

  • A Nagios probe checking the availability and sanity of Machine/Job Features has been developed. It is currently running in pre-production for the LHCb SAM instance. Results can be seen at http://cern.ch/go/Gzn8. An illustrative sketch of such a check is given after the list below. LHCb sites providing MJF are:
    • CERN
    • GRIDKA
    • LPNHE
    • Imperial College
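
  • Purely as an illustration of what such a probe might verify (this is not the actual Nagios probe), the sketch below checks that the MACHINEFEATURES and JOBFEATURES directories are advertised and that a few example keys exist and hold numeric values; the key names and the directory-based layout are assumptions made for this example.

    #!/usr/bin/env python
    # Illustrative MJF sanity check: assumes $MACHINEFEATURES and $JOBFEATURES
    # point to local directories containing one value per file. The key names
    # below are examples, not the real probe's list.
    import os
    import sys

    CHECKS = {
        "MACHINEFEATURES": ["hs06", "jobslots"],
        "JOBFEATURES": ["wall_limit_secs", "cpu_limit_secs"],
    }

    def read_value(directory, key):
        """Read one MJF value; return None if the file is missing or unreadable."""
        try:
            with open(os.path.join(directory, key)) as handle:
                return handle.read().strip()
        except (IOError, OSError):
            return None

    def main():
        problems = []
        for env_var, keys in CHECKS.items():
            directory = os.environ.get(env_var)
            if not directory:
                problems.append("%s not set" % env_var)
                continue
            for key in keys:
                value = read_value(directory, key)
                if value is None:
                    problems.append("%s/%s missing" % (env_var, key))
                elif not value.replace(".", "", 1).isdigit():
                    problems.append("%s/%s not numeric: %r" % (env_var, key, value))
        if problems:
            print("WARNING: " + "; ".join(problems))
            return 1  # Nagios WARNING exit code
        print("OK: machine/job features present and sane")
        return 0

    if __name__ == "__main__":
        sys.exit(main())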

Middleware Readiness WG


Multicore Deployment

IPv6 Validation and Deployment TF


Squid Monitoring and HTTP Proxy Discovery TFs

Network and Transfer Metrics WG


  • Successfully tested publishing of perfSONAR results to the message bus directly from the OSG collector (an illustrative publishing sketch follows this list). A possible SLA to run this as a production service in collaboration with OSG is being discussed.
  • The OSG datastore is on track to go into production at the end of July; it will be a service provided to WLCG, storing all the perfSONAR data and providing an API
  • Started testing the proximity service, which helps to map sonars to storage endpoints and thus enables integration of the network and transfer metrics.
  • Review of the experiments use cases was presented/discussed at the last meeting, see slides for details (https://indico.cern.ch/event/393101/)
  • FTS performance study update - see slides for details (https://indico.cern.ch/event/393101/), observations from the report so far:
    • Peak transfer rates between Europe and North America are less asymmetric than they were last month (to be followed up)
    • Almost all traffic incoming to BNL uses TCP=1 (Alejandro confirmed this is how BNL is configured right now; the other FTS instances use auto-tuning)
    • CMS T1s have better transfer rates compared to ATLAS and LHCb (to be followed up)
    • CMS uses TCP=1 more often than ATLAS and LHCb for large files
    • TCP stream=1 transfers do time out about 2-3% of the time; however, the timeouts are concentrated at a few sites.
    • Throughput dependence on TCP streams possibly understood (see http://egg.bu.edu/lhc/fts/docs/2015-05-26-status/results_so_far.pdf)
  • perfSONAR operations status
    • Agreed to establish WLCG-wide meshes for the top 100 sites (based on contributed storage and location). This will enable full-mesh testing of latencies, traceroutes and throughput (ongoing).
    • ESnet is interested in the perfSONAR configuration interface developed for WLCG; a design document for an open-source project based on it is currently being discussed.
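
  • Purely as an illustration of the publishing step mentioned in the first item of this list (this is not the actual OSG collector code), the sketch below sends one perfSONAR-style result as a JSON message to a STOMP message bus using the stomp.py library; the broker endpoint, topic name, credentials and payload fields are all invented for the example.

    # Illustrative publisher: one perfSONAR-style measurement to a STOMP bus.
    # Broker, topic, credentials and payload fields are made up for this sketch.
    import json
    import time

    import stomp  # provided by the stomp.py library

    BROKER = ("broker.example.org", 61613)            # assumed broker endpoint
    DESTINATION = "/topic/perfsonar.summary.example"  # assumed topic name

    # Example payload with invented values, loosely modelled on a throughput result.
    message = {
        "source": "perfsonar-src.example.org",
        "destination": "perfsonar-dst.example.org",
        "event_type": "throughput",
        "timestamp": int(time.time()),
        "value_bps": 850e6,
    }

    conn = stomp.Connection([BROKER])
    conn.connect("user", "password", wait=True)       # assumed credentials
    conn.send(destination=DESTINATION,
              body=json.dumps(message),
              headers={"content-type": "application/json"})
    conn.disconnect()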

HTTP Deployment TF

Information System Evolution


  • The first TF meeting took place last week (agenda, minutes)
    • It was agreed to implement a set of easy fixes in REBUS. For more details, please check REBUS known issues
    • A set of action items was defined; for more details, please check Task tracking and timeline. A summary is below:
      • Requirements to remove information (Physical CPU) or to change how information is collected (HS06) in REBUS will be followed up
      • Agree on a better definition of Installed Capacities, or even decide to rename it to "Available Capacities" or something similar
      • Discuss at the MB the possibility of adding T3s and also publishing pledges per site in REBUS
  • A draft document describing use cases from experiments and project activities relying on the information system has been circulated among TF members for their contribution. It will be presented at a future MB meeting (date to be confirmed); the aim is to have the document ready by the end of August

Action list

Creation date | Description | Responsible | Status | Comments
2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1.

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion
2015-07-02 | Provide a description of each experiment computing support structure, so tickets wrongly assigned to the T0 (via SNOW or GGUS) can be properly redirected; evaluate the creation of SNOW Functional Elements for the experiments, if this is not already the case | all | n/a | ALICE, ATLAS, CMS have made progress after discussing with the T0 manager. They will present at the next meeting. | July 30 | ~40%

Specific actions for sites

Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion
2015-06-18 | Some sites have still not enabled multicore accounting | All | Multicore Deployment | Instructions here | a.s.a.p. | Almost DONE. HERE is the list of the remaining still pending sites.
2015-06-04 | ALL ATLAS sites implementing a cap to their multicore resources (whether their configuration is dynamic just for a portion of nodes or it is a static partition) should review the cap to give 80% of the ATLAS production resources to multicore. As a reminder, the shares for ATLAS jobs are as follows: T1: 5% analysis and 95% production; T2: 50% analysis and 50% production. So multicore should get 80% of 95% at T1s and 80% of 50% at T2s. More info here | ATLAS | Multicore | | None | CLOSED
2015-06-04 | LHCb T1s requested to make sure that all the RAW data will be stored on the same tape set in each tape system when it is feasible | LHCb | - | More details in GGUS:114018 | |
2015-06-18 | CMS requests an adjustment of the Tier-1 fair share target for the following VOMS roles: /cms/Role=production 90% (was 95%), /cms/Role=pilot 10% (was 5%). Note that for CMS SAM tests the role /cms/Role=lcgadmin is used; it needs very little fair share but should be scheduled a.s.a.p. so that the tests do not time out. Overall at least 50% of the pledged T1 CPU resources should be reachable via multi-core pilots (this is as before, just mentioned for completeness) | CMS | | | None yet | CLOSED (confirmed regarding config); verification is longer term
2015-06-18 | CMS asks all T1 and T2 sites to provide Space Monitoring information to complete the picture on space usage at the sites. Please have a look at the general description and also some instructions for site admins. | CMS | - | | None yet | ~10 T2 sites missing, ticket open

AOB

GGUS: How do users (e.g. VO shifters) receive GGUS downtime notifications?

https://its.cern.ch/jira/browse/GGUS-1454

-- MariaALANDESPRADILLO - 2015-07-27

