WLCG Operations Coordination Minutes, July 2nd 2015

Highlights

  • The next WLCG workshop will take place in Lisbon in the first week of February 2016 and will be organised by LIP. The expected duration is 2.5 days; the exact days of the week will be defined a.s.a.p. The workshop agenda will contain a limited number of topics to leave enough time for discussion.
  • A new version of UMD (3.13) has been released, including bug fixes for gLExec, CREAM, StoRM, DPM-xrootd and fetch-crl
  • The size of the HTCondor pilot service at CERN is now almost 300 cores and it is fully used
  • Agreed to establish a full, WLCG-wide perfSONAR mesh for the top 100 sites (based on location and storage size), allowing full mesh testing of latencies, throughput and traceroutes.


Agenda

Attendance

  • local: Maria Dimou (chair), Andrea Sciabà (minutes), Maarten Litmaath, Christoph Wissing, Andrea Manzi, Thomas Hartmann, Alberto Aimar, Catherine Biscarat, Pablo Saiz, Maite Barroso
  • remote: Hung-Te Lee, Frederique Chollet, Di Qing, Dave Mason, Gareth Smith, Renaud Vernet, Rob Quick, Alessandra Doria
  • apologies: Alessandra Forti

Operations News

  • The next WLCG Workshop will take place in Lisbon, Portugal during the first week of February 2016, organised by LIP. Please note it in your calendars. The expected duration is 2.5 days; the exact days of the week will be defined a.s.a.p. The workshop agenda will contain a limited number of topics to leave enough time for discussion.

Middleware News

  • Baselines:
    • New version of UMD (v3.13) released today:
      • gLExec 1.2.3: lcmaps-plugins-c-pep and mkgltempdir enhancements and bug fixes
      • cream 1.6.5 (verified by MW readiness): bug fixes
      • storm 1.11.8 (verified by MW readiness): this release fixes a critical issue that prevented the proper cleanup of PtP request state after an srmRm was called on a SURL. The bug was introduced in version 1.11.5
      • dpm-xrootd 3.5.2 (verified by MW readiness): first version using xrootd 4.1.1 (also included in UMD)
      • fetch-crl 3.0.16: fixes intermittent authorization failures for services whose certificates come from a particular CA whose CRL was being served incorrectly. The client is more robust in this case
    • New version of the WN tarball (3.15.3-1_sl6v1) published to CVMFS grid.cern.ch; it includes updates and, for the first time, the gfal2-util package
    • Added xrootd v3.3.6: if requested by all experiments, we can move it to v4.1.1
    • Removed LFC
    • Baselines are also available from the MW readiness site https://wlcg-mw-readiness.cern.ch/baseline/current/

  • Issues
    • NTR

  • T0 and T1 services
    • CERN
      • decommissioned LFC service
    • CNAF
      • updated StoRM to version 1.11.9
    • JINR
      • dCache upgraded to 2.10.35 from 2.10.30, Postgres upgraded to 9.4.3 from 9.4.1
    • BNL
      • FTS was upgraded to version 3.3.0, but given some issues introduced in this version they downgraded to the previous version.

The very latest FTS patch is being tested in the CERN pilot service. The FTS development team would like more workflows to be tested, though. To be discussed offline.

Tier 0 News

  • Condor prototype at CERN: We have now increased the pool size to almost 300 cores and see that the pool is still fully used, which is a very good sign. Depending on available resources, we intend to increase the pool size further by a few hundred cores soon. Assuming the positive experience continues, we intend to enable accounting of the Condor resources into the APEL database by about the end of August. Soon after, we would like to start increasing the Condor pool further at the expense of the shared LSF instance, with a view to moving the base load of grid-submitted jobs from LSF to Condor.
  • ATLAS T0 LSF instance: the required capacity (110 KSpec) has been provided, and with the required performance; the issues with the 110K reported over the last few weeks have been solved. We have received an additional request of 30 K, and the additional capacity is on its way.
  • Experiment support: The T0 receives more and more tickets (coming from GGUS or directly from SNOW) which are related to experiment support (e.g. INC0808844). We would like to understand the support structures of each experiment to be able to help the users and redirect the tickets accordingly. In GGUS it is easy: support unit "VO support" per experiment; it is not the case in SNOW, and some users do not want to go back and create another ticket in SNOW. It would be very useful for us if each experiment would provide us a few lines with instructions for this type of tickets; something similar to this, covering cases in GGUS and SNOW: https://cern.service-now.com/service-portal/article.do?n=KB0002488

It is agreed that the experiments should think about how best to deal with SNOW tickets that should be addressed to the experiment support teams, either by providing CERN with appropriate instructions or by having ad-hoc experiment Functional Elements (for ATLAS this is already the case).

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • mostly high activity
    • 80k concurrent jobs reached for many hours on Jun 21-22
  • CERN: 2.5% raw data reco failures due to CVMFS cache corruption (team ticket GGUS:114534)
  • CERN: network incident Fri evening Jun 26 caused all VOBOXes for CERN to become unreachable
  • CERN: latest CASTOR versions are incompatible with old xrd3cp implementation
    • this prevents third-party copies from CASTOR to sites running Xrootd < 3.3.6
    • in particular the T1 sites are asked to upgrade to a recent Xrootd, preferably >= 4.1
      • it also brings support for IPv6 that will steadily become more important
    • in the meantime IT-DSS have set up a few gateway hosts for the raw data export
      • thanks in particular to Andreas Peters!
    • using 500 parallel transfers this allowed 50 TB to be transferred to KISTI in ~1 day at an average rate of 616 MB/s!
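
As a rough consistency check of the KISTI figures above, the sketch below computes how long it takes to move 50 TB at an average of 616 MB/s, and the average rate per transfer when 500 run in parallel. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the minutes do not specify, and an even split of the load across transfers; the function names are purely illustrative.

```python
def transfer_hours(volume_tb, rate_mb_s):
    """Hours needed to move volume_tb terabytes at rate_mb_s megabytes per second.

    Assumption: decimal units, 1 TB = 1e12 bytes and 1 MB = 1e6 bytes.
    """
    seconds = (volume_tb * 1e12) / (rate_mb_s * 1e6)
    return seconds / 3600.0


def per_stream_rate(rate_mb_s, streams):
    """Average rate of one transfer, assuming the load is spread evenly."""
    return rate_mb_s / streams


if __name__ == "__main__":
    print(f"{transfer_hours(50, 616):.1f} hours")
    print(f"{per_stream_rate(616, 500):.2f} MB/s per transfer")
```

Under these assumptions the transfer takes a little under 23 hours, consistent with the "~1 day" quoted, at roughly 1.2 MB/s per individual transfer.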

Maarten intends to start opening tickets to Tier-1 sites to request the upgrade to Xrootd 4.1 (or newer). There is still no deadline, as there is a temporary solution for sites that did not upgrade, but the sooner it is done, the better.

In the future it would be good to test xrootd 4 in the context of the middleware readiness validation for ALICE.

ATLAS

  • apologies, due to ATLAS SW & Computing week no one from ATLAS can attend the meeting
  • ATLAS Tier-0: in the last week the tests performed on the computing capacity throughput are satisfactory. In the process of increasing the cluster to be sure of having additional room in case of need
  • Lost files: ATLAS is seeing lost files (10 files reported over the past 2 weeks, all derived data which can be re-created, but still not good) and is actively debugging. The cause appears to lie in the interaction between Rucio and FTS: wrong duplicated messages from FTS have been observed, but Rucio also has an imperfection in its interaction with its DB. People are investigating.

CMS

  • Production overview
    • Good utilization of resources
    • Bulk of Run2 DIGI-RECO already done - but some more requests still in the queue
    • Continued MC production
    • High priority Upgrade studies - I/O demanding due to PU200

  • Faulty MC production workflows
    • Allocation of huge memory (100GB and more)
    • Jobs aborted
    • Investigations ongoing

  • Root certificate of Spanish CA recently expired at some sites
    • CA packages were updated in time
    • Also affected CERN GGUS:114709

  • Accounting issue for May 2015 for OSG sites
    • Already known and likely due to multi-core accounting issues
    • Seems to affect also ATLAS (since they also use multi-core jobs)

Rob adds that the accounting problem has been solved (apart from very minor issues) and a new report is being sent.

LHCb

  • Run2 offline processing
    • 2nd validation of offline workflows ongoing
    • all data from pit copied to offline
  • Issues
    • problems with DPM SEs at T2 sites. Some files had no checksum. Checking and recovering checksum for all files.

GGUS feedback from WLCG site survey

See the slides for the details.

Pablo reports the high level of satisfaction (98%) of the "users" (although the survey participants, being site managers, are much more often supporters than users) and points out that many of the requested features actually exist but are perhaps not sufficiently advertised. The discussion focused on the logic for priority colours.

The agreement is that the colour coding of tickets is very useful (even if it might put some pressure on the supporters) and that it is not worth investing too much effort in improving it. Still everybody is invited to check the algorithm (link) and give feedback to the GGUS developers if desired.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

RFC proxies

  • SAM
    • a proper patch of the proxy renewal sensor is available
      • automatically detecting the type of proxy stored on the MyProxy server
      • supporting legacy and RFC transparently
      • making the Nagios web page show the actual type being used
    • applied on sam-alice-preprod (RFC) and sam-alice-prod (legacy)
    • a new version of the corresponding rpm will be created
      • meanwhile the patch can be applied transparently on any SAM-Nagios instance
      • a preprod instance could then be switched to RFC proxies

Machine/Job Features

Middleware Readiness WG


  • Argus Future meeting tomorrow, July 3rd, at 11am CEST, focused on progress with EL7 support (agenda).
  • The latest version of pakiti-client (v3.0.1), with tag support, has been pushed to EPEL stable; we will soon contact the volunteer sites for the upgrade.
  • dCache testing for ATLAS at TRIUMF is paused to perform a re-configuration that will fix a problem with the SRM space token.
  • PIC and Brunel are collaborating on PhEDEx tests. This allows PIC to better compare one site against another.
  • Reminder: The next MW Readiness WG Vidyo meeting will take place on Wednesday September 16th at 4pm CEST. Please comment a.s.a.p. if this date is not good!

Multicore Deployment

  • Accounting: John Gordon will send out a broadcast and produce an updated list of missing CEs. The action should stay open until all sites are done.

IPv6 Validation and Deployment TF


Squid Monitoring and HTTP Proxy Discovery TFs

  • Alastair Dewhurst said he'd get the next to-do item done this month. Dates in the task list are updated.

Network and Transfer Metrics WG


  • perfSONAR status
    • Agreed to establish WLCG-wide meshes for top 100 sites (based on the contributed storage and location). This will enable full mesh testing of latencies, traceroutes and throughput
    • Working in collaboration with ESNet to narrow down an issue affecting latency measurements for long-distance testing (US to Europe, Europe to Asia, etc.). A fix has been released and will be auto-deployed to all sites.
    • perfSONAR 3.5 RC is planned to be released next week. The following sites agreed to participate in the validation testbed: Nebraska, BNL, SWT2, AGLT2, MWT, TAMU, IEPSAS-Kosice
    • perfSONAR support involved in debugging the network issues at RAL

  • Successfully tested publishing perfSONAR results directly from the OSG collector (that populates OSG/esmond datastore).
  • Started testing proximity service, which helps to map sonars to storages and thus enables integration of the network and transfer metrics.
  • Next meeting will be on 8th of July (https://indico.cern.ch/event/393101/), planning a detailed update on OSG datastore and FTS performance study.
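
To give a sense of the scale of a full mesh over 100 sites, here is a minimal sketch that counts the site pairs involved. It assumes each unordered pair of sites counts once; the minutes do not say whether tests run in both directions, in which case the number of directed tests would be double. The site names in the example are hypothetical.

```python
from itertools import combinations


def mesh_pairs(n_sites):
    """Number of unordered site pairs in a full mesh of n_sites."""
    return n_sites * (n_sites - 1) // 2


def mesh_links(sites):
    """Explicit list of unordered pairs for a (small) list of site names."""
    return list(combinations(sites, 2))


if __name__ == "__main__":
    print(mesh_pairs(100))  # pairs in a 100-site full mesh
    # Hypothetical site names, for illustration only.
    for a, b in mesh_links(["SiteA", "SiteB", "SiteC", "SiteD"]):
        print(f"{a} <-> {b}")
```

For 100 sites this gives 4950 pairs, which grows quadratically with the number of sites; that is one reason to restrict a full mesh to a bounded list of sites.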

HTTP Deployment TF

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-06-18 | Organise further discussions on the InfoSys future | Maria Alandes | ONGOING | Organise a new TF to discuss the future of the Infosys. |
| 2015-06-18 | Remove the LFC from the Baseline table | Andrea Manzi | CLOSED | To be done before the next meeting |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | Details on GGUS:114076 |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-07-02 | Provide a description of each experiment computing support structure, so tickets wrongly assigned to the T0 (via SNOW or GGUS) can be properly redirected; evaluate the creation of SNOW Functional Elements for the experiments, if this is not already the case | all | n/a | | July 30 | 0% |

Specific actions for sites

Creation date Description Affected VO Affected TF Comments Deadline Completion
2015-06-18 Some sites have still not enabled multicore accounting All Multicore Deployment Instructions here a.s.a.p. still partial
2015-06-04 ALL ATLAS sites implementing a cap to their multicore resources (whether their configuration is dynamic just for a portion of nodes or it is a static partition) should review the cap to give 80% of the ATLAS production resources to multicore. As a reminder the shares for ATLAS jobs are as follows T1: 5% analysis and 95% production; T2: 50% analysis and 50% production. So multicore should get 80% of 95% at T1s and 80% of 50% at T2. More info here ATLAS Multicore   None  
2015-06-04 LHCb T1s requested to make sure that all the RAW data will be stored on the same tape set in each tape system when it is feasible LHCb - More details in GGUS:114018    
2015-06-18 CMS requests an adjustment of the Tier-1 fair share target for the following VOMS roles: /cms/Role=production 90% (was 95%), /cms/Role=pilot 10% (was 5%). Note that for CMS SAM tests the role cms/Role=lcgadmin is used, it basically needs very little fair share but should be scheduled asap to have the test not timing out. Overall at least 50% of the pledged T1 CPU resources should be reachable via multi-core pilots (this is as before - just mentioned for completeness) CMS CLOSED (confirmed regarding config) Longterm verification pending   None yet  
2015-06-18 CMS asks all T1 and T2 sites to provide Space Monitoring information to complete the picture on space usage at the sites. Please have a look at the general description and also some instructions for site admins. CMS -   None yet ~10 T2 sites missing, Ticket open
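
The ATLAS multicore request in the table above amounts to a simple product of two fractions. The sketch below works it out per tier, using only the shares quoted in the action (95% production at T1, 50% at T2, 80% of production to multicore); the names are illustrative and not taken from any ATLAS or site configuration.

```python
# Production share of ATLAS jobs per tier, as quoted in the action above.
PRODUCTION_SHARE = {"T1": 0.95, "T2": 0.50}

# Fraction of the production resources requested for multicore.
MULTICORE_FRACTION = 0.80


def multicore_share(tier):
    """Fraction of a site's total ATLAS resources that multicore should get."""
    return MULTICORE_FRACTION * PRODUCTION_SHARE[tier]


if __name__ == "__main__":
    for tier in ("T1", "T2"):
        print(f"{tier}: {multicore_share(tier):.0%} of ATLAS resources")
```

This gives 76% of the ATLAS resources at a T1 and 40% at a T2.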

AOB

-- AndreaSciaba - 2015-06-30

Topic revision: r25 - 2018-02-28 - MaartenLitmaath
 