WLCG Operations Coordination Minutes, September 1st 2016

Highlights

  • The new accounting portal with the changes made for the WLCG view is ready for validation: https://accounting-next.egi.eu/. People are encouraged to start using it and provide their feedback or report any problems to wlcg-accounting-portal@cern.ch.
  • Regular Monitoring Consolidation and Traceability WG reports will be given in the Operations Coordination meetings from now on.
  • CC7 plans from the experiments were presented at today's meeting and are summarised in the minutes twiki. Site plans are to be collected by WLCG Operations.

Agenda

Attendance

  • local: Maria Alandes (minutes), Alberto Aimar, Stefan Roiser, Luca Tomassetti, Vincent Brillault, Thomas Oulevey, Maarten Litmaath, Julia Andreeva, Andrea Sciaba, Helge Meinhard, David Cameron, Andrea Manzi, Maria Dimou, Marian Babik
  • remote: Pepe Flix (chair), Nurcan Ozturk, Ulf Tigerstedt, Felix Lee, Di Qing, Javier Sanchez, Hiro, Christoph Wissing, Olivier Devroede, Antonio Maria Perez Calero Yzquierdo, Kyle Gross, Massimo Sgaravatto, Shkelzen Rugovac, Victor Zhiltsov, Dave Mason, A. Cavalli, Hironori Ito, Dave Dykstra, Ron Trompert, Vincenzo Spinoso, Eygene Ryabinkin, Tim Bell, Andrea Valassi

Operations News

  • Hope you had nice holidays! Welcome back!
  • Meetings:
  • Alberto Aimar (Monitoring effort leader) will from now on join the Monday 3pm Ops calls and the monthly Ops Coord meetings - experiments are encouraged to prepare issues for discussion.
  • From now on, we will have regular Traceability WG reports at this meeting.
  • The new EGI accounting portal has the changes made for the WLCG view and is ready for validation: https://accounting-next.egi.eu/. Please check it!

Middleware News

  • Baselines/News:
  • Issues:
    • GGUS:123613 reports crashes of openldap with ARC-CE on C7 (a similar problem also affected SL6 and has been fixed). The supposedly fixed rpms are available in RHEL 7.3 beta 1, but a first test showed that the issue is still there. We have contacted the ARC developers to do more testing.
    • GGUS:120586 concerns an issue with glite-ce* clients and dual-stack CEs. No news yet.
  • T0 and T1 services
    • CERN
      • All CASTOR instances upgraded to xrootd 4.4
      • ARGUS upgraded to 1.7.0 (CentOS 7)
      • FTS upgraded to v3.4.7
    • IN2P3
      • Minor dCache upgrade planned (2.13.32->2.13.41)
    • JINR-T1
      • Minor updates: dCache 2.10.59->2.10.61, Postgres 9.4.8->9.4.9
      • dCache 2.13.x planned for late November
    • NDGF
      • dCache upgraded to 2.16.10 last week and moved to HA messaging
      • Switch to a fully HA setup for database and SRM within September
    • RAL
      • FTS upgraded to v3.4.7

Tier 0 News

Highlights

  • Network traffic out of CERN peaked at 253Gbps in July, mostly driven by increasing volumes of LHC data being exported. To cope with the increased export rates, links to some Tier-1 centres have had their capacity increased.
  • Traffic volumes have also increased over LHCONE, leading GÉANT to deploy additional capacity. An estimate of likely traffic in 2017 will be required.
  • Due to record LHC performance, about 10 PB and about 6 PB were recorded into Castor in July and August, respectively.
  • A couple of CERN-Wigner link instabilities impacted EOS performance (a reconfiguration recipe is now in place to cope with these situations). Other transfer issues previously reported (mainly affecting ALICE and CMS) have been solved.
  • Minor upgrades (maintenance) have been completed on CASTOR (v. 2.1.16-6) and are still continuing on EOS (moving all instances to 0.3.193) in agreement with the experiments. The hardware retirement for Q3 is complete. Smaller retirements are expected in Q4 or beginning of 2017.
  • Obsolete WLCG-related Oracle accounts (for example SAM and GridView) are being deleted. Another major clean-up will take place next year after the migration of experiment dashboards to Hadoop.
  • Work is going on with CMS on using Spark on the CERN Hadoop services for CMS' use cases of Big Data Analytics and Deep Learning. There is a series of tutorials on Hadoop and related technologies to help users get started with their data processing use cases.
  • The Oracle security upgrade of July was not deployed, as it contains nothing relevant for the CERN installation.
  • The LSF instance for public batch has reached its supported limit of 5'000 nodes. Further capacity increases are implemented in the HTCondor pool.
  • The main HTCondor pool has reached 21'000 cores; users are starting to test local submissions. A significant number of worker nodes are being regularly defragmented in order to support multi-core as well as single-core grid submissions.
  • The external cloud resources (some 4'000 cores) are now processing jobs from all LHC experiments.
  • ARGUS has been upgraded to the 1.7 release on CentOS 7, which supposedly fixes all known bugs. Since the upgrade, the service has run with good stability; it hence appears that the previous issues with failing glexec probes, especially for CMS, have been resolved.
  • New resources have been added to the public batch pools and the ATLAS Tier-0 instance.

Issues

  • There have been performance problems in the CMS offline database (CMSR) caused by misbehaving applications. Work is going on with CMS to improve the applications concerned.

Highlights (for information only, not strictly Tier-0 related)

Tier 1 Feedback

Pepe invites Ulf to summarise the information that Ulf sent to the Ops Coord co-chairs in a mail thread before the meeting, describing NDGF plans to move to CC7. Ulf summarises that some NDGF sites are indeed going to deploy new HW in early 2017 and it would be good to understand experiment plans, to know whether CC7 can be used or not.

Tier 2 Feedback

  • T2_BE_IIHE: network issues with Russian T1/T2
    • Core problem: Russian sites not accessible via GÉANT (hence hitting the restriction on our commercial traffic)
    • Main identified impacting use case: Phedex dataset transfers coming from T1_RU_JINR
    • Investigated solution: LHCONE (it turned out not to be a solution)
    • Is this issue clearly acknowledged? What do other Tier 2 sites do? What is the recommendation?

RU-VRF (Eygene Ryabinkin) answer: we (.ru LHCONE) aren't currently connected to GEANT and I don't see any progress here; regarding this very site (going to the network via Belnet), we can try to peer with Belnet's LHCONE space directly, we're present at AMS-IX where Belnet is also present, so that's solvable technically. We had already offered this option to Belnet, but to no avail. We have direct peering with Belnet itself (in Amsterdam), but, probably, Belnet wants some money from T2_BE_IIHE for this kind of connectivity.

It is agreed that the affected Belgian sites will get in touch with both Belnet and Edoardo to understand how traffic from Russia could stop coming through the commercial route and use LHCONE links instead. For these cases there is no straightforward solution and every case needs to be investigated independently, since each country has its own connectivity.

LHCb mention that they experienced some problems downloading data from Russia last week, and this could very well explain them.

Experiments Reports

ALICE

  • Typically high to very high activity
    • New record: 110636 concurrent jobs
  • CERN: as of July 28 jobs are also being submitted to the T-Systems external cloud resources
    • up to 3k+ cores have been used in parallel
    • job success rates have been good
    • job types include simulation, reco and user analysis
      • analysis trains were excluded to reduce the load on the 10 Gbps link to CERN
    • use of the DPM has been postponed, pending a fix to allow the disk servers to be monitored in MonALISA
      • big UDP messages are not correctly handled by the T-Systems NAT setup

ATLAS

  • Smooth operations towards ICHEP, end user data delivered on time.
  • The grid is now full with Monte Carlo production from the new Sherpa samples. There was an occasional lack of payload and the grid was not fully utilized in August, but it was back to normal in the last week.
  • To help with the T0 backlog, 3 runs were spilled over to the grid and ran successfully. More automation is underway.
  • A new T0 workflow called RAWtoALL (which skips producing intermediate ESDs) was deployed at the T0 (15% gain in CPU); it was also tested on the grid for spill-over, and more tests are upcoming to better understand/control the high memory usage.
  • Slots unused by T0 processing are fully used by grid.
  • Waiting for news from CERN-IT on the new monitoring based on the new infrastructure, we would like to have our ADC live pages monitored there and provide feedback.

CMS

General CMS Activities

  • Successful data taking with a rather high logging rate (see also below)
  • High production activity in July to prepare for ICHEP
  • Larger disk cleaning campaign finished
    • We likely need another one to host the incoming data and 2017 MC
  • Tape cleaning campaign is ongoing
    • Actual deletion requests to be sent to the sites in September
    • Allow sites to re-pack (if needed) in October
  • New MC Campaign started
    • Jobs are now multi-threaded (4 threads)
    • Relies on remote data access to read premixed pileup data

Various Operational Items

  • Wigner network link
    • Redundancy was limited several times over the last few weeks
    • On July 25th, P5 to Tier-0 transfers were badly affected
    • No trouble reports in August

  • Data export from CERN
    • Developed a multi-PetaByte backlog of data to be exported
    • Transfer speed out of CERN greatly improved in July
      • Additional GridFTP doors got deployed in the CERN EOS instance
      • Faulty network components got identified and fixed
    • CMS reduced volume that needs to be transferred out of CERN
    • Present performance expected to be sufficient for remaining pp run

  • Monitoring infrastructure
    • Observed various Kibana instabilities: July 15th, Aug 9th, between Aug 24th-29th
    • Access issues for Dashboard metrics

LHCb

  • High activity
    • Data taking continues, transfers from the PIT are ok
    • Reconstruction, Stripping, MC and user jobs running "fine"
      • Reco is a little behind, waiting for data transfers, but catching up
  • T-Systems in use: download and upload problems from everywhere (~20% of the time) due to the network connection. IT people are aware.

  • CERN lbvobox108 was inaccessible because its hypervisor died (GGUS:123484). An alarm ticket was submitted on 18 Aug at 19:17. Fixed on 19 Aug, with the VM reachable at 11:40 (motherboard replaced).

Ongoing Task Forces and Working Groups

Accounting TF

Dave asks whether there are OSG people involved in the TF. Julia answers that Tanya Levshina is indeed part of the TF and recently presented the status of OSG accounting at a meeting.

Pepe asks how people can give feedback on the new portal. Julia says there is an e-group for this: wlcg-accounting-portal@cern.ch

Information System Evolution


  • At the last MB a proposal to adopt CRIC as the new Information System was approved. A new project associate will join the development team in the coming weeks.
  • The next IS TF meeting is scheduled on 22nd of September. Information sources and main functionality of central CRIC will be discussed.

IPv6 Validation and Deployment TF


Machine/Job Features TF

Working through the last step of the mandate: "Plan and execute the deployment of those implementations at all WLCG resources"

MJF deployments on batch systems are passing tests at Cambridge (Torque), GridKa (Grid Engine), Liverpool (HTCondor), Manchester (Torque), PIC (Torque), RAL Tier-1 (HTCondor) and RAL PPD (HTCondor). Other sites are known to be currently working on deployment (e.g. CERN, CC-IN2P3). The UK intends to deploy at all sites.

All Vac/Vcycle VM-based sites have MJF available as part of the VM management system.
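
As an illustration of what MJF provides to a running job, here is a minimal Python sketch of how a payload or pilot could read MJF values on a worker node. It assumes the directory-based flavour of the interface exposed via the $MACHINEFEATURES and $JOBFEATURES environment variables; the key names used (hs06, wall_limit_secs) follow the MJF specification but are shown for illustration only and are not part of these minutes.

    # Illustrative sketch: read Machine/Job Features values on a worker node.
    # Assumes the directory-based MJF interface ($MACHINEFEATURES / $JOBFEATURES
    # pointing to directories of key files); key names below are examples.
    import os

    def read_mjf_value(base_env_var, key):
        """Return the value of one MJF key, or None if it is not published."""
        base = os.environ.get(base_env_var)
        if not base:
            return None                      # MJF not available on this node
        try:
            with open(os.path.join(base, key)) as f:
                return f.read().strip()
        except OSError:
            return None                      # this particular key is not published

    if __name__ == "__main__":
        hs06 = read_mjf_value("MACHINEFEATURES", "hs06")           # machine power estimate
        wall = read_mjf_value("JOBFEATURES", "wall_limit_secs")    # wall-clock limit for this job
        print("HS06: %s, wall-time limit: %s s" % (hs06, wall))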

Pepe will get in touch with Andrew to report some outdated links in the TF twiki.

Monitoring Consolidation

Link: http://monit.cern.ch (open to all CERN authenticated users)

  • Will start reporting on the progress of Monitoring at this meeting.

  • Progressing with adding data (FTS, Xrootd, Job Monitoring, SAM/ETF) to the new portal based on Kibana and ElasticSearch
  • Prototyping plotting and reporting using the data stored in HDFS.

  • The service (not yet in production) has been less stable in the last few weeks.
  • Increasing and restructuring the ES (ElasticSearch) infrastructure in order to be more scalable and independent of other ES projects.
  • Participating in the benchmarking of ES resources.

On a question whether sites are also participating in the project, Alberto answers that this is not a WLCG project, so in principle sites are not involved, but of course useful feedback from sites can be taken into account. Maarten asks how sys admins could report possible issues, and Alberto answers that there will be means to do this through GGUS/SNOW.

Pepe also mentions that in any case it would be very good to hear about CERN's experience with this project, as it could be useful for other sites. Alberto is happy to give regular updates on its status.

MW Readiness WG


LHCb will provide the information about CC7 requested by the TF at today's meeting.

Network and Transfer Metrics WG


  • Network session is planned at the WLCG workshop covering IPv6, LHCOPN/LHCONE status and LHC network evolution
  • LHCOPN/LHCONE workshop will be held in Helsinki, Sept 19-20 (https://indico.cern.ch/event/527372/)
  • pre-GDB on networking focusing on the long-term network evolution postponed to January
  • Throughput meetings were held on July 27 and August 16:
    • Mark Feit from Internet2 presented pScheduler (new test scheduler in perfSONAR 4.0)
    • Xinran Wang and Ilija Vukotic from Univ. of Chicago presented their Network Analytics work
  • The OSG datastore and collector have been experiencing problems since the upgrade last week; the issue is being followed up by OSG
  • A plan for the migration to the new perfSONAR 4.0 configuration was drafted and will be followed up with OSG
  • We are now using a new mailing list, wlcg-network-throughput-wg@cern.ch - a joint mailing list for the European and NA throughput meetings
  • WLCG Network Throughput Support Unit: New cases were reported and are being followed up, see twiki for details.

On a question whether the WG should follow up on the issues reported by the Belgian sites, Marian replies that, since the issue is more architectural, it is better to follow up with LHCONE.

RFC proxies

  • OSG have made RFC proxies the default in their latest release (3.3.15)
    • they submitted a voms-proxy-init patch to the VOMS developers
  • an "official" new version of the VOMS client rpm should be released soon (GGUS:122269)
  • a new version of YAIM core for myproxy-init on the UI is available
    • temporary location: here
  • the intention is for the UMD to be updated when both rpms are available

Squid Monitoring and HTTP Proxy Discovery TFs

  • http://wlcg-wpad.cern.ch/wpad.dat now returns proxies based on the squids registered in ATLAS AGIS. http://wlcg-squid-monitor.cern.ch/worker-proxies.json contains the input to the service. Quite a number of sites are disabled because of overlapping GeoIP organizations.
  • Initial development has been done for the inclusion of squids from CMS SITECONF, but it needs more refinement. It still needs to map CMS site names to those registered in GOCDB/OIM. Also, CMS SITECONF allows the client to do load balancing between squids, which the WPAD format does not natively support. The format is a subset of JavaScript, however, so the plan is to distribute the load based on the lowest-order bits of the client's IP address (see the sketch below).
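
As an illustration of that plan, the following minimal Python sketch shows such a selection rule: derive a primary and fallback proxy from the lowest-order bits of the client's IPv4 address. In the real service this logic would be expressed in the JavaScript-subset PAC file served by wlcg-wpad.cern.ch; the squid host names below are hypothetical.

    # Illustrative sketch only: distribute load over a site's squids using the
    # lowest-order bits of the client's IP address. In production this rule
    # would live in the PAC file (a JavaScript subset) returned by the WPAD service.
    def choose_proxies(client_ip, squids):
        """Return an ordered proxy list (primary first) for the given client IP."""
        last_octet = int(client_ip.split(".")[-1])     # lowest-order 8 bits of an IPv4 address
        first = last_octet % len(squids)
        ordered = squids[first:] + squids[:first]      # rotate so every squid remains a fallback
        return ["PROXY %s:3128" % s for s in ordered] + ["DIRECT"]

    if __name__ == "__main__":
        site_squids = ["squid1.example.org", "squid2.example.org"]   # hypothetical site squids
        print("; ".join(choose_proxies("10.0.3.17", site_squids)))
        # -> PROXY squid2.example.org:3128; PROXY squid1.example.org:3128; DIRECT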

Traceability and Isolation WG

More details on the WG website

  • Traceability:
    • There was agreement that we should build upon each VO's existing logging infrastructure, which seems to already cover the expected logs. We might seek minor adjustments, but no global rewrite.
    • VOs are invited to do a self-assessment: identify the cost and bottlenecks of identifying a user and the associated payload from a timestamp and an IP address/worker node name.
  • Isolation:
    • There are missing components for full unprivileged containers on Red Hat 7.2. OSG and CERN have open tickets/feature requests. Red Hat has been expressing security concerns.
    • Brian Bockelman has identified an open-source software solution aimed at unprivileged "containers" (with no daemon) and has been working on building a RHEL6 job environment from CVMFS, to be presented at the next meeting.
  • Next meeting: September 14 at 1600 (after the GDB)

Theme: CentOS7 - Experiment plans and info for sites

Challenges of CentOS7 migration

Thomas presents the timeline of CC7. Now is the right time to propose changes, as it is the middle of the first production phase; afterwards it will be more difficult to add new features and bug fixes. It is good to discover issues as soon as possible. It has to be taken into account that the release process at Red Hat can be very long.

On a question whether Red Hat 8 is already planned, Thomas says there are no dates yet.

Pepe asks when the MW still missing on CC7 is planned to be added to the EGI repositories. Maarten replies that hopefully the missing bits will be ready by the end of the year.

Thomas mentions that the unprivileged mount namespace feature needed by WLCG has been rejected for CC7.3. It will be requested again for CC7.4 next year. Maarten explains that this is a very important feature for WLCG and that it is very important to have it available, as it will allow mounting SL6 software from CVMFS on CC7 boxes, which is a straightforward solution if experiments still need to run SL6 software while sites need to move to CC7 for new HW.

ALICE

  • the ALICE offline packages in CVMFS are still built on SLC5
    • used on SL6 (at most sites) and a few other OS flavors
  • an official validation has been run on CC7
  • the results were identical to those obtained on SLC6
  • conclusion: ALICE looks ready to run on CentOS/EL7

Tim asks whether there are any plans to stop building on SLC5, as it is coming to its end of life. Maarten replies that this will be discussed very soon.

ATLAS

See the ATLAS plans in the slides attached to the agenda. There are no plans to move ATLAS central services and WNs until LS2. If sites need to move to CC7 WNs, feedback is very welcome. ATLAS SW built on SL6 is being tested on CC7. Builds on CC7 are available and are being tested at CC7 sites. ATLAS relies on lcg-utils, which is not available on CC7; they need to move to gfal. Sites that want to test ATLAS SW on CC7 should contact Alessandra.

CMS

See the CMS plans in the slides attached to the agenda. It is not possible to run SL6 binaries on CC7. SW is built on CC7 but not tested yet. CMS needs feedback from WLCG on migration plans to CC7, so that CMS understands whether a container solution could be envisaged or not to meet the timeline.

Stefan mentions that the official timeline to move to CC7 announced by WLCG is by the end of LS2. Maria mentions that this needs to be revisited, as sites now seem to have earlier time constraints to install CC7 and WLCG would need to acknowledge this and work to cope with the new situation, finding solutions and recipes to be able to run SL6 on CC7 machines. Sites that have the effort to deploy container solutions are OK, but sites that lack this effort and deploy CC7 on newly bought HW are a problem.

It is mentioned that there is a new tool called Singularity that seems to offer an easy container solution. Brian Bockelman will give more details at the next Traceability WG meeting.

Tim asks whether the CC7 timeline refers to the end or the start of LS2. Helge replies that it is the end of Run 2.

It is agreed that WLCG could assess the current situation by collecting from the sites their plans to move to CC7 or to stay with SL6, and their current knowledge of and available effort for deploying container solutions.

LHCb

  • Simulation workflows with slc6 binaries have been executed on CC7 worker nodes successfully
    • No firm confirmation yet for slc5 binaries (very few simulation workflows) but seems possible
  • What are the plans for moving the WLCG middleware, VOBOX infrastructure and lxplus to CC7?
    • The lxplus alias should be moved to the CC7 cluster as the last action in the migration procedure
  • What are the future plans for the HEP_OSlibs meta-package? LHCb is interested in continuing to use it.

Stefan asks how long SL6 will be supported. Tim replies that it will be supported as long as it is needed. Tim adds that for VOBoxes sys admins can choose to deploy SL6 or CC7 images. Lxplus needs to be coordinated with the experiments; lxplus7 is already available. Stefan mentions that the alias change should only happen at the very end. Maarten adds that HEP_OSlibs is already available for CC7 in the WLCG repos and that EGI wants to offer the complete MW stack on CC7 this autumn/end of the year.

Helge mentions that performance indicators for CC7 show a 5% performance gain.

Regarding UI and WN, there is already a testing folder for the CC7 UI and there will be one for the WN too.

Action list

Creation date | Description | Responsible | Status | Comments
01.09.2016 | Collect plans from sites to move to CC7 | WLCG Operations | NEW |

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion
29.04.2016 | Unify HTCondor CE type name in experiments VO feeds | all | - | Proposal to use HTCONDOR-CE. Waiting only for ALICE and LHCb now. | | Ongoing

Specific actions for sites

Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion

AOB

-- MariaALANDESPRADILLO - 2016-08-30
