WLCG Operations Coordination Minutes - October 16th, 2014

Agenda

Attendance

  • local: Maria Dimou (chair), Nicolò Magini (secretary), Marian Babik, Alessandro Di Girolamo (ATLAS), Christoph Wissing (CMS), Andrea Sciabà, Maarten Litmaath (ALICE), Felix Lee (ASGC)
  • remote: Alessandra Forti, Antonio Calero, Burt Holzman (FNAL), Catherine Biscarat, Cristina Aiftimiei (EGI), Di Qing, Gareth Smith, Jose Hernandez, Pepe Flix (PIC), Maite Barroso (Tier-0), Michael Ernst (BNL), Renaud Vernet, Rob Quick (OSG), Sang Un Ahn, Thomas Hartmann (KIT)

Operations News

  • Andrea Sciabà presents the site survey. A link to the preliminary set of questions is in the agenda.
    • Sections on site (site name is confidential), managing changes, communication channels, monitoring, grid service administration
    • Alessandra and Pepe will try the survey first
    • Will be public after test runs
    • Send feedback by e-mail by next GDB

  • Christoph Wissing suggests adding a question on the number of VOs supported by the site. Agreed.

Middleware News

  • Baselines:
    • dCache 2.6.35 and 2.10.7 have been released. They fix a reading problem with ROOT6 affecting LHCb and CMS. Sites should be contacted to perform the upgrade ASAP, as this is a blocking issue for the experiments (ACTION for MW officer).
    • CREAM 1.16.4 was released in EMI3, to be verified by the MW Readiness WG (for both CMS and ATLAS)
    • gLExec 1.3.0 released in EMI3. Lots of fixes and improvements, waiting for UMD release to set it as baseline
      • ERRATUM: the actual update was ARGUS EES v. 0.2.1 which currently is not relevant for the WLCG baseline (see the Update 21 release notes)
    • perfSONAR 3.4 has been released. It includes performance improvements and security fixes. To be checked with the Network and Transfer Metrics WG whether it should become the baseline version, given the Shellshock issue affecting the previous version. The newly discovered POODLE SSLv3.0 vulnerability also has to be taken into account.

  • On perfSONAR, Marian comments that the current baseline is still 3.3.4. The baseline will be raised to 3.4 when the instructions are available; an announcement will be broadcast to WLCG. On POODLE, see the WG report.

  • MW Issues:
    • UPDATE on the issue “installation of several grid products is broken”. CREAM, WMS, L&B, UI and WN cannot be installed at the moment because the classads package (a dependency of all of them) was declared orphaned in EPEL and retired from the EPEL repository. The missing package has been included in the UMD and EMI third-party repositories. CESNET is going to take care of it, with Steve Traylen as co-maintainer, so it will soon be re-integrated in EPEL.

  • T0 and T1 services
    • CERN
      • FTS upgraded to v 3.2.28
    • BNL
      • FTS upgraded to v 3.2.28
    • KIT
      • dCache upgraded to v 2.6.35

Oracle Deployment

  • Marcin reports that the DB group is still working on the schedule for the announced DB migrations.

Tier 0 News

  • ACTION on WLCG Operations to review voms-admin feedback.

  • lxplus5 was smoothly switched off this Monday

  • Job efficiency meeting last Friday:
    • agenda and minutes available here: https://indico.cern.ch/event/346270/
    • egroup created for discussion, job-efficiency@cernNOSPAMPLEASE.ch
    • Nothing conclusive. In the plots from PES showing total usage / total wallclock time, the global efficiency is 80%. The different experiments are submitting a variety of jobs, some of them with very low efficiency, which requires further investigation.
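
As a rough illustration of the efficiency figure quoted above, the following sketch computes the ratio of total CPU usage to total wallclock time over a set of jobs, both globally and per experiment. The job records, field names and numbers are hypothetical; they are not taken from the PES plots, and single-core jobs are assumed.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; the field names are illustrative only."""
    experiment: str
    cpu_seconds: float        # CPU time actually used by the job
    wallclock_seconds: float  # elapsed time the job held its slot (single core assumed)


def efficiency(jobs):
    """Return total CPU time divided by total wallclock time."""
    total_wall = sum(j.wallclock_seconds for j in jobs)
    if total_wall == 0:
        raise ValueError("no wallclock time recorded")
    return sum(j.cpu_seconds for j in jobs) / total_wall


def efficiency_by_experiment(jobs):
    """Group jobs by experiment and compute the efficiency of each group."""
    groups = {}
    for j in jobs:
        groups.setdefault(j.experiment, []).append(j)
    return {name: efficiency(group) for name, group in groups.items()}


# Made-up sample: two efficient jobs and one very inefficient one.
jobs = [
    Job("expA", 9000.0, 10000.0),
    Job("expA", 4500.0, 5000.0),
    Job("expB", 500.0, 5000.0),
]

print(f"global: {efficiency(jobs):.0%}")
for name, value in efficiency_by_experiment(jobs).items():
    print(f"{name}: {value:.0%}")
```

A global ratio can look acceptable while individual job classes are far below it, which is why a per-group breakdown of this kind may help the follow-up investigation.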

  • Christoph reports that the job efficiency meeting agenda is not accessible; Maite to check offline. This has since been fixed; it should be accessible now.

  • FTS3: discussing with the FTS3 developers how to promote a high-availability FTS3 deployment choice for all experiments, similar to what ATLAS has done

  • Nicolò comments that LHCb is most affected due to its single-server deployment choice. CMS is planning to use a high-availability FTS3 deployment, but the switch currently requires manual action.
  • Alessandro comments that the switch is complicated for ATLAS as well; more operational experience is needed on when to perform it.

  • The Quattor end-of-service deadline has been extended by one month, to the end of November, for specific machines owned by the experiments. Please contact us through tickets to let us know which machines should be preserved. Otherwise, as for IT services (with the exception of DBs), the removal process will start on the 3rd of November. Support for SLS is extended until the end of the year, but this has no implication for the Quattor phase-out.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • CERN: major reduction of job failures thanks to improved parameters of the batch system queue and of the pilot code
  • NDGF: files accidentally cleaned up from tape are being replicated from CERN
  • major operational instabilities on Oct 15 throughout the day:
    • jobs at CERN often could not reach central AliEn services
    • other sites saw machines frozen due to CVMFS late afternoon (CEST)
    • all recovered by itself, after a wild goose chase by experts...
    • no explanation was present on the IT SSB

  • Maite is not aware of issues on the 15th; only of batch problems on the 16th. ALICE can send details if further investigation is needed

ATLAS

  • Commissioning activity: Rucio, ProdSys2, and Rucio-ProdSys2 integration. The Full Chain test is considered successful; it ran for over one week at 1M files/day. ProdSys2: many workflows integrated and tested, now also testing reprocessing. Rucio-ProdSys2 integration is ongoing: tasks have been injected and the bulk of the jobs have run successfully. Work continues on various types of tasks.
  • The reconstruction campaign (500M events), also discussed 2 weeks ago, has finished successfully. Now waiting for the derivation production framework to start soon.
  • ATLAS was hit by 2 FTS3 issues, one due to configuration and one still under investigation; discussing with FTS experts how to deal with the issue immediately. Experts from BNL are participating in a discussion to define procedures in case of FTS issues; when ready, the ideas will of course be shared with the other experiments.

CMS

  • Processing overview:
    • Not much work in the system
    • Preparations for new campaign (PHYS14) ongoing
  • Tier-0
    • PromptRECO will need Tier-1 sites in addition to CERN (Meyrin+Wigner)
    • Functional testing: CERN resources + fraction of the Tier-1 sites
  • Service Migrations
    • SLS
      • CMS depends heavily on the SLS service
      • Migration has started
      • Will not be able to finish by end of October
    • AFS-UI
      • Interested in access statistics after closing of lxplus5
      • Services are being looked at
      • Will not manage to finish everything by end of October - need some extension
    • Openstack
      • Many services need rather big VMs
      • Have some difficulties getting such big VMs provisioned
  • Problems with Dashboard reporting are fixed
  • Recent CMSSW releases have issues with dCache access using xrootd protocol
    • Updated dCache releases: 2.6.35, 2.7.20, 2.8.15, 2.9.11, 2.10.7
    • Sites are encouraged to update
  • Operational items
    • FTS3 trouble during the weekend Oct 4th/5th
    • Frequent overload of CERN Castor SRM endpoint
      • CMS needs to copy several million files to EOS for Tier-0 testing
      • In contact with FTS3 experts on how to tune

  • Maite agrees to re-run the statistics for the AFS-UI and says that extension is possible.
  • On SLS, Alessandro asks to check whether Helge mentioned an extension at the latest AI meeting. Maite will update the minutes.
  • Maite will also update minutes with Quattor extension for experiments.
  • Alessandro reports that many features used by ATLAS operations are missing in new SLS (structure of services in particular). Feedback provided to monitoring team, but it would be more effective to provide joint experiment requirements. ACTION on Alessandro to suggest to monitoring team to organize a common meeting with all experiments.

LHCb

Apologies from LHCb: neither the experiment nor its liaison is able to connect today

  • LHCb proposes to re-visit the list of critical services before Run2. Some services are new (e.g. DB on demand) some others are not used anymore.
  • Change of plans for the upcoming stripping campaign: pre-stage as much data as possible onto the "LHCb-BUFFER" disk-only storage element (i.e. LHCb-Disk ST). Currently working with the FTS team to commission this workflow.
    • The actual start for the stripping campaign is foreseen for beginning of November
  • LHCb is currently setting up a prototype installation for HTTP access and an access federation in which 70% of the storage endpoints are accessible

  • ACTION on Andrea Sciabà to review the critical services.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

Machine/Job Features

  • A discussion took place about a single possible implementation for cloud infrastructures (two prototypes existed). A conclusion on a single architecture was reached. This will now be shaped into a more easily deployable and documented package and then proposed at WLCG level.

Middleware Readiness WG

Full minutes of the last meeting are HERE. Summary:

  • The experience gained from the Readiness verification of DPM, the CREAM CE and BDII at different Volunteer sites was well documented and paved the way for testing the next DPM version and more products from our shortlist, starting with dCache, StoRM, xrootd and FTS3. This will entail the involvement of more Volunteer sites and the completion of the relevant experiment workflows' documentation.
  • A database (DBoD for now) was designed by the MW Officer Andrea M. and the Package Reporter developer Lionel to store the verification results.
  • ATLAS asked the MW Readiness WG to be involved in the HTCondor testing for various CE types. CMS takes care of such testing itself.
  • LHCb will test the VOMS client on behalf of the MW Readiness WG.
  • The Tier0 will participate in the MW Readiness effort by testing EOS and FTS3.
  • The developer of the MW Package Reporter presented its design, the number of hosts and sites it now runs on, and the alternatives being examined for interoperability with Pakiti. Pakiti expert Daniel Kouril was also connected to the meeting.
  • The next meeting will take place on November 19th at 4pm CET.

Right now the MW Officer and Volunteer Site managers are working intensively on the Readiness verification of dCache, StoRM and CREAM CE. Details on the experiment workflows and the specific sites involved are in our JIRA tracker.

  • Cristina reports that there was no communication to EGI on the new dCache baseline version 2.6.35, which was released on the 10th. UMD is doing verification of 2.6.31 which no longer has value. Maria Dimou agrees that effort between UMD staged rollout and MW readiness should be coordinated to avoid duplication of effort; to be followed up offline with MW officer (currently away).

Multicore Deployment

  • CMS activities:
    • Testing submission of multithreaded jobs started. Jobs running at PIC in multicore pilots along with single core jobs
    • Working on improving multicore pilots performance monitoring
    • Started testing submission of multicore pilots to CMS T2s (running regularly at T1s)
  • CHEP15 contribution: plans to submit an abstract to CHEP15 concerning TF activities. Draft currently being prepared by coordinators, to be circulated to TF members.

SHA-2 Migration TF

  • introduction of the new VOMS servers
    • Ops and ALICE have been opened to the world on Monday
      • ALICE site admins have been asked to adjust their VOBOX configurations to make use of the new servers
    • LHCb date to be agreed
    • CMS have been reminded by OSG-WLCG interop officer (see below)
    • ATLAS made significant progress
    • OSG will do a release on Nov 11 with the new servers added to the vomses files for ATLAS and CMS (GGUS:109265)
      • voms-proxy-init may then try the new servers first
      • it would be best if by that time the firewall is open also for those VOs!

IPv6 Validation and Deployment TF

IPv6 was the topic of two talks at HEPiX, one by Dave Kelsey and one by Rob Quick (agenda). Some highlights:
  • All Tier-1 sites should join the HEPiX IPv6 working group
  • The data transfer testbed should be moved to PhEDEx+FTS3
  • A dedicated instance of SAM running on dual-stack nodes should be put in place
  • Deploy perfSONAR in dual-stack during 2015
  • Prepare a plan to deploy dual-stack SEs, taking into account the needs of the experiments, the readiness of the middleware and the constraints from the sites
  • Concerning OSG:
    • The network infrastructure is IPv6-ready and tested
    • work on adapting the production deployment procedures to have dual-stack services is ongoing
    • tests are conducted on VMs not affecting any crucial service
    • Software testing scenarios are defined (server and client in dual-stack, server in dual-stack and client IPv4-only). IPv6-only clients will be tested later on.
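
The client/server combinations listed above can be sketched as a small classification helper. This is only an illustration under stated assumptions: the function names are hypothetical, and the addresses come from the reserved documentation ranges (192.0.2.0/24 and 2001:db8::/32), not from any real testbed host.

```python
import ipaddress


def ip_versions(addresses):
    """Return the set of IP versions (4 and/or 6) present in a list of addresses."""
    return {ipaddress.ip_address(a).version for a in addresses}


def stack_type(addresses):
    """Classify a host by the IP families of its configured addresses."""
    versions = ip_versions(addresses)
    if versions == {4, 6}:
        return "dual-stack"
    if versions == {6}:
        return "IPv6-only"
    if versions == {4}:
        return "IPv4-only"
    return "no address"


def shares_family(client_addresses, server_addresses):
    """True if client and server have at least one IP family in common."""
    return bool(ip_versions(client_addresses) & ip_versions(server_addresses))


# Hypothetical hosts using documentation-only addresses.
server = ["192.0.2.10", "2001:db8::10"]
clients = {
    "dual-stack client": ["192.0.2.20", "2001:db8::20"],
    "IPv4-only client": ["192.0.2.30"],
    "IPv6-only client": ["2001:db8::30"],
}

print("server:", stack_type(server))
for name, addrs in clients.items():
    print(name, "->", stack_type(addrs), "| reachable:", shares_family(addrs, server))
```

In this sketch a dual-stack server shares an address family with every client type, which is the property the test scenarios are meant to confirm in practice.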

Other news:

  • Re-added Imperial and added Wisconsin to the data transfer testbed mesh

  • Rob Quick emphasises the need for close collaboration between the HEPiX, WLCG and OSG IPv6 working groups to avoid duplication of effort.
  • Rob clarifies that OSG IPv6 readiness means that the OSG operations infrastructure in IU is ready - the OSG site resources are not all ready yet.

  • Marian reminds everyone that the details of the tests with SAM and perfSONAR need to be agreed: dual-stack, IPv6-only or also IPv4?
    • For perfSONAR discussions in progress. Instructions for dual-stack will be included in 3.4 documentation if ready.
    • For SAM, Maarten recommends an IPv6-only submission instance.
    • Marian to organize a meeting to clarify.

Squid Monitoring and HTTP Proxy Discovery TFs

  • The only change is that we pushed back the task update dates. We are now expecting the Squid registration campaign in November.

Network and Transfer Metrics WG

  • Update on WG presented at GDB last week (Details at agenda)
  • perfSONAR 3.4 was released on the 7th of October. We recommend that ALL sites wait with the upgrade until the re-install instructions are broadcast via WLCG and EGI
  • Performed internal security audit in collaboration with perfSONAR developers - summary to be provided in the re-install instructions
  • Metrics area meeting was canceled, doodle for the new one will be sent shortly
  • POODLE: SSLv3.0 vulnerability (CVE-2014-3566) announced yesterday - https://access.redhat.com/articles/1232123 - affecting perfSONAR instances as well. Patches from distributions are not available yet (16th Oct). The perfSONAR team provided their own fixes yesterday (perl-perfSONAR_PS-Toolkit-3.4-29.pSPS and perl-perfSONAR_PS-Toolkit-SystemEnvironment-3.4-29.pSPS). We recommend that all sites running 3.3 temporarily disable SSLv3. We recommend that ALL sites wait with the upgrade to 3.4 until the re-install instructions are broadcast via WLCG and EGI.
  • perfSONAR operations meeting held on Friday, Oct 3 at 3PM, minutes at https://indico.cern.ch/event/342995/
    • Highlights: Agreed to introduce several major changes in operations (introduce GGUS SU, security mailing list, setup infrastructure monitoring, introduce automated mesh configurations)
    • Next operations meeting will be held next week, please vote at http://doodle.com/qydib32fkv48er2r
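
The POODLE recommendation above amounts to refusing SSLv3 at the protocol level. As a generic client-side illustration, and not the perfSONAR fix itself, the following sketch builds a Python TLS context that will not negotiate anything older than TLS 1.2. The function name is hypothetical, and the choice of TLS 1.2 as the floor is an assumption of this example rather than something stated in the minutes.

```python
import ssl


def make_client_context():
    """Build a client TLS context that refuses SSLv3 and other legacy protocols."""
    # create_default_context() enables certificate verification and hostname checks.
    ctx = ssl.create_default_context()
    # Assumption of this sketch: require TLS 1.2 or newer, which excludes SSLv3.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_client_context()
print("minimum protocol version:", ctx.minimum_version.name)
```

Server-side services need the equivalent setting in their own configuration, for which the re-install instructions mentioned above remain the reference.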

Action list

  1. ONGOING on the experiments: check the usage statistics of the AFS UI, report on the use cases at the next meeting.
  2. CLOSED on Andrea S.: to understand with EGI if it is possible to bypass the validation of VO card changes in the case of the LHC VOs. Peter Solagna asked the developers to implement the change, Andrea will check status for next meeting.
    • Andrea reports that the implementation in EGI ops is not trivial, but EGI ops agreed to approve manually all requests to update LHC VO cards immediately - action CLOSED.
  3. ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Report about showstoppers. Status: the SAM team made a proposal on the steps to be taken to enable SAM. ATLAS is following up to make sure that the new CEs are correctly visible in AGIS, while for the CMS VO feed they will be taken directly from OIM. The plan is at first to test HTCondor-CEs in preproduction and later switch to production. It is not foreseen to monitor GT5 and HTCondor endpoints on the same host at the same time.
    • No showstopper for SAM. Need to discover topology; publishing queues in BDII not necessary for SAM probes since Condor can choose the queue based on the proxy.
  4. NEW on WLCG Operations - report on voms-admin test feedback
  5. NEW on Andrea Sciabà - review the critical services table
  6. NEW on MW officer - bridge communication between MW Readiness WG and EGI UMD Staged Rollout
  7. NEW on Alessandro Di Girolamo - suggest to monitoring team to organize common meeting with experiments to collect requirements for new SLS dashboards.
  8. NEW on Marian Babik - organize meeting to define details of IPv6 testing activities with SAM and perfSONAR

AOB

  • Next meeting on November 6th

  • INC:0657309 ticket opened for Vidyo issues in CERN meeting room.

-- NicoloMagini - 03 Oct 2014

Topic revision: r21 - 2014-10-20 - MaartenLitmaath
 