WLCG Operations Coordination Minutes - March 19th, 2015

Agenda

Attendance

  • local: Maria Dimou (minutes), Maarten Litmaath (ALICE), Marian (Network TF), Andrea Manzi (MW Officer), Stefan Roiser & Marc Slater (LHCb), David Cameron & Alessandro di Girolamo (ATLAS).
  • remote: Alessandra Forti (chairperson), Alessandra Doria (Napoli), Di Qing (TRIUMF), Maite Barroso (T0), Thomas Hartmann (KIT), Ulf (NDGF), Alessandro Cavalli (CNAF), Anthony Tiradani (FNAL), Massimo Sgaravatto, David Mason, Catherine Biscarat (French grids).

Operations News

  • A first presentation of the WLCG Sites' survey results was given at the Amsterdam GDB on March 11th. More details will follow at Okinawa (workshop and CHEP).
  • The updated Critical Services for the Tier0 were presented at the WLCG MB on March 17th; the same presentation was given to this meeting a few weeks ago.
  • The GGUS monthly release is next Wednesday, March 25th.

Middleware News

  • Baselines:
    • StoRM 1.11.8 was released last week with a fix for a critical issue introduced in version 1.11.5.
      • The baseline for WLCG is still 1.11.4, but ~11 instances run versions 1.11.5-1.11.7 (and some are already running 1.11.8).
      • StoRM 1.11.8 has been installed at QMUL for MW Readiness verification and it seems OK, so we suggest moving to 1.11.8 taken from the INFN repository (not yet in EMI 3).
      • Sites will also have to apply some DB cleaning scripts, as documented at http://italiangrid.github.io/storm/release-notes/storm-backend-server/1.11.8/
    • The latest versions of the UI (v 3.14.0-1_sl6v1) and WN (emi-wn-3.10.0-1_sl6v2) tarballs have been published to CVMFS (thanks to Matt Doidge); the latest UI is still not available via EMI/UMD.

  • MW Issues:
    • NDGF just reported an issue when upgrading to dCache 2.12.2 (also affecting 2.11.13); new releases with the fix are being prepared by the dCache team.
    • An OpenSSL release to fix high-severity issues is announced for today. It is not yet clear if and how WLCG is affected by this new release.

  • T0 and T1 services
    • ASGC
      • DPM upgraded to 1.8.9
    • CERN
      • CASTOR for LHCb upgraded to v 2.1.15
    • IN2P3
      • dCache upgraded to 2.10.21 (core servers)
      • PostgreSQL 9.3.6 on the SRM node
    • NL-T1
      • Upgraded dCache from 2.6.44 to 2.10.20
    • JINR-T1
      • Both dCache instances were moved to new, more powerful hardware
      • dCache upgraded to 2.10.22 and PostgreSQL 9.4.1
      • FTS3 upgraded to 3.2.33
    • NDGF
      • dCache upgraded to 2.12.2

Tier 0 News

  • VOMS-admin update: The experiments are still adapting to the new software and have requested some features from the VOMS developers, e.g. GGUS:112282. There were also some complaints about the high number of emails VOMS-admin sends about suspended users. We understand the cause and are working to improve the situation.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • high activity
  • started automatic replication to T1s and reconstruction of cosmics-trigger data, the same as in normal (beam-on) data taking.
  • steady progress with switching WLCG VOBOXes at the sites to RFC proxies
  • CASTOR at CERN:
    • previous releases' behavior of staging through Xrootd has been restored - thanks!
    • instabilities in the last few days, affecting both online and offline transfers
      • root cause was Oracle deciding to change the execution plan of some standard (and already optimal) queries
      • on top of that there were a few badly behaved nodes plus some problems in the monitoring
      • thanks for the prompt support in these matters!

ATLAS

  • Activities
    • Cosmic data taking and data export are going smoothly. T0 data-transfer stress tests also went smoothly.
    • MC15 simulation in MCORE (30M events), MC14 MCORE (8M events), and MC12 SCORE (20M events).
    • The MC15 pile-up reconstruction software is still not fully validated; a few more days are needed.
  • A network glitch in the CERN Computing Centre on March 13th caused problems for many ATLAS services. As this was discussed at the 3pm Ops call, Maria D., as SCOD, contacted the T0 network team for further information. They provided this page describing the 12-13 March incident. It was considered insufficient, given the trouble services experienced and the time it took to recover. The discussion concluded that the T0 manager (Maite) will discuss a clear SIR format with the T0 network team.
  • The ADCR database was partially down for a few hours on the morning of the 18th, causing further glitches.
  • Network issues with TRIUMF: some T2/T3 sites are on commercial networks and cannot transfer data to/from TRIUMF. There was discussion at the meeting on whether this is a technical or a policy issue and where the follow-up should be done. Di said that TRIUMF's commercial network bandwidth is only 40 Mbps; compared to their current connections to LHCONE (20 Gbps), LHCOPN (5 Gbps, soon to be upgraded to 10 Gbps) and the academic network (maximum 10 Gbps, though shared with other connections), it is almost nothing. Maarten recommended using sites with limited network capacity for MC running only, without much transfer activity, basically because this issue will get worse. Alessandra suggested looking at using the academic network instead of the limited commercial one. Alessandro said that the other ATLAS T1s don't have this policy as far as their T2s and T3s are concerned. The issue will be taken offline and Di will check with the TRIUMF management. Alessandro and David will provide ATLAS numbers to describe the expected transfer requirements. Ulf (NDGF) suggested using a 100 Mbit/s network port in the router to limit traffic.

CMS

  • Ongoing activities
    • Cosmics with main magnet off (CRUZET)
    • Mainly production of Upgrade MC
    • Main MC production campaign for Run2 to start beginning April
    • Moderate load in the system
  • Had some issues with EOS during stress testing; all under investigation by the experts.
  • Had a bad surprise with VOMS
    • After the migration a VO admin accidentally requested a re-signing of the AUP
    • Many users (O(1k)) were suspended this week because they had not re-signed the AUP
    • This created some issues when everyone attempted to re-sign at the same time
    • GGUS:112425
    • Led to some suggestions to improve the service
  • Larger campaign for distributed PromptRECO (needs to run at Tier-0 and Tier-1)
    • Reached almost 50% of the T1 resources with multi-core PromptRECO (which is the target)
  • Dave Dykstra agreed to join the HTTP TF wearing two hats: "CMS" and "Squid"

LHCb

  • Operations
    • Recent issue with overloading dCache SRMs with gfal_getturlfromsurl requests: found to be due to a bug fix in the srm_ifce library, which is used by gfal 1. A temporary fix of rolling back to the previous version is in place. TURLs will soon be constructed using string manipulation instead (see the sketch after this list). Many, many thanks for the help in debugging this from both the sites and the dCache developers!
    • "Run 1 legacy stripping" - all productions have been verified and closed.
  • VOMRS migration: Two issues still being investigated GGUS:112279, GGUS:112281
  • perfSONAR extraction: a first prototype installation is available, which extracts information from the esmond perfSONAR store and publishes it into the message bus. Some details still need to be sorted out concerning occasional empty messages.
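
A minimal sketch of what TURL construction by string manipulation could look like, in Python. The SRM endpoint, xrootd door and path layout below are hypothetical placeholders, not LHCb's actual mapping; the point is only that the TURL can be derived from the SURL without an SRM round trip.

    # Sketch: derive an xrootd TURL from an SRM SURL by string manipulation,
    # avoiding the gfal_getturlfromsurl SRM call. All host names and the
    # SURL->door mapping are HYPOTHETICAL examples.
    from urllib.parse import urlparse

    # Hypothetical per-site mapping: SRM host -> xrootd door.
    XROOT_DOORS = {
        "srm.example-t1.org": "root://xrootd.example-t1.org:1094",
    }

    def surl_to_turl(surl):
        """Turn srm://host[:port]/srm/managerv2?SFN=/path (or srm://host/path)
        into an xrootd TURL without contacting the SRM server."""
        parsed = urlparse(surl)
        # The physical path is in the SFN= query parameter if present,
        # otherwise in the URL path itself.
        if parsed.query.startswith("SFN="):
            path = parsed.query[len("SFN="):]
        else:
            path = parsed.path
        # path keeps its leading "/", giving the usual xrootd double slash.
        return "%s/%s" % (XROOT_DOORS[parsed.hostname], path)

    print(surl_to_turl("srm://srm.example-t1.org:8443/srm/managerv2"
                       "?SFN=/pnfs/example-t1.org/data/lhcb/somefile"))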

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • gLExec in PanDA:
    • testing campaign is covering 61 sites out of 94 (1 mistakenly included site was removed)
    • new versions of the pilot with more debugging handles were released on Tue and today

RFC proxies

  • ALICE: steady progress with switching WLCG VOBOXes at the sites to RFC proxies
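
As a side note, one way to verify the proxy type on a VOBOX is "voms-proxy-info -type". A minimal sketch in Python; the exact wording printed by voms-proxy-info varies between VOMS client versions, so the string match below is illustrative only.

    # Sketch: check whether the current proxy is an RFC 3820 proxy.
    # "voms-proxy-info -type" prints the proxy type; the exact wording
    # differs between VOMS client versions, so the "RFC" match here is
    # illustrative rather than authoritative.
    import subprocess

    def proxy_is_rfc():
        out = subprocess.check_output(["voms-proxy-info", "-type"],
                                      universal_newlines=True)
        return "RFC" in out

    print("RFC proxy" if proxy_is_rfc() else "legacy proxy")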

Machine/Job Features

  • Started working with the UK on the deployment of MJF on batch-system infrastructures; first trial with HTC implementations (a minimal MJF reader sketch follows below).
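
For reference, a sketch of how a job could consume MJF values: per the HEPiX MJF proposal, $MACHINEFEATURES and $JOBFEATURES point to directories with one file per key. The key names used below (hs06, total_cpu, wall_limit_secs) follow the draft specification and are meant as illustration only.

    # Sketch: read Machine/Job Features on a worker node.
    # $MACHINEFEATURES / $JOBFEATURES point to directories containing one
    # value per file; the key names here follow the draft MJF spec and
    # should be treated as illustrative.
    import os

    def read_feature(base_env, key, default=None):
        base = os.environ.get(base_env)
        if not base:
            return default  # MJF not provided on this resource
        try:
            with open(os.path.join(base, key)) as f:
                return f.read().strip()
        except OSError:
            return default

    hs06 = read_feature("MACHINEFEATURES", "hs06")         # HS06 power of the node
    ncpu = read_feature("MACHINEFEATURES", "total_cpu")    # cores on the node
    wall = read_feature("JOBFEATURES", "wall_limit_secs")  # wall-clock limit
    print(hs06, ncpu, wall)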

Middleware Readiness WG

  • The 9th WG meeting took place yesterday. The Summary is available HERE.
  • Please mark your calendars for the next meeting on May 6th at 4pm CEST, at CERN and via Vidyo!

Multicore Deployment

Passing parameters to the batch system: the last update on this was in November. Since then ATLAS has added extra memory parameters to pass to the batch systems, according to the plan presented at the last S&C week. While parameters were already passed to ARC-CE sites, they used a different memory scheme. The new scheme is now being tested at 3 ARC-CE/HTCondor sites in the UK on their multi-core queues. For CREAM sites, which were the most controversial and debated, it is in test at Nikhef and Manchester, which both have a CREAM/Torque combination. This has helped iron out the most macroscopic problems, such as the units used (a toy illustration of the unit and scheme differences follows below). In Manchester the new scheme has now been enabled on all the queues of one of the clusters, i.e. analysis jobs as well as production multi-core and single-core, and we will see in the next few days how it goes. If no adverse effects are observed, the plan is to extend to more queues at the sites already in test and to contact other sites to test other batch-system/CE combinations.
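
A toy illustration of why the units matter: the same memory request may have to be expressed as total-per-job or per-core, in MB or KB, depending on the CE/batch combination. The flavours below are hypothetical examples, not the actual ATLAS scheme.

    # Sketch: express one memory request in the form a given batch system
    # expects. The flavours are HYPOTHETICAL examples of the kind of
    # per-core/per-job and MB/KB differences being ironed out.

    def memory_for_batch(total_mb, cores, flavour):
        if flavour == "per-core-mb":   # e.g. Torque pmem: per process, in MB
            return "%dmb" % (total_mb // cores)
        if flavour == "total-mb":      # e.g. HTCondor RequestMemory: total, in MB
            return str(total_mb)
        if flavour == "per-core-kb":   # some systems expect KB per core
            return str((total_mb * 1024) // cores)
        raise ValueError("unknown flavour: %s" % flavour)

    # An 8-core job asking for 16 GB in total:
    for f in ("per-core-mb", "total-mb", "per-core-kb"):
        print(f, "->", memory_for_batch(16000, 8, f))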

In ATLAS only 13 sites are missing a multi-core queue: 3 are too small and 10 have a plan for the next few weeks/months.

IPv6 Validation and Deployment TF

Squid Monitoring and HTTP Proxy Discovery TFs

  • Nothing to report

Network and Transfer Metrics WG

  • WG meeting was held on 18th of March (https://indico.cern.ch/event/379017/)
  • perfSONAR status
    • All sites should be running 3.4.1; the final deadline was February 16th. 5 sites received tickets (3 of them responded).
    • Testing/evaluation of the 3.4.2 release candidate is ongoing; additional issues were identified and fixed by the ESnet developers.
    • The plan is to follow the testbed for the next couple of days; if no issues are reported, 3.4.2rc will get a green light (once released, it should propagate to all sites within 24 hours).
  • Datastore (esmond) status
    • Esmond testing is ongoing, gathering 100% of the meshes (some with missing data due to issues in 3.4.1)
  • Network performance incidents follow up
    • Procedure was proposed and is still under discussion within the WG.
  • Integration projects
    • Revised proposal for the experiments' interface to perfSONAR; the esmond2mq prototype was developed and tested (see the sketch after this list). Feedback will be reported to OSG and ESnet.
  • Next meeting: 8th of April (https://indico.cern.ch/event/382622/)
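
A minimal sketch of the esmond-to-message-bus idea behind esmond2mq: pull measurement metadata from an esmond REST endpoint and republish non-empty records on an STOMP bus. The broker, topic and endpoint names are hypothetical placeholders; the real prototype has its own configuration and schema.

    # Sketch of the esmond -> message bus flow (stomp.py 4.x style calls).
    # Host names, topic and query parameters are HYPOTHETICAL placeholders.
    import json
    import requests   # third party: pip install requests
    import stomp      # third party: pip install stomp.py

    ESMOND = "http://perfsonar.example.org/esmond/perfsonar/archive/"

    def fetch_metadata(time_range_secs=3600):
        # esmond exposes measurement metadata as JSON over HTTP
        r = requests.get(ESMOND, params={"time-range": time_range_secs})
        r.raise_for_status()
        return r.json()

    def publish(records, host="broker.example.org", port=61613,
                destination="/topic/perfsonar.metadata"):
        conn = stomp.Connection([(host, port)])
        conn.connect(wait=True)
        for rec in records:
            if rec:  # skip empty records rather than publish empty messages
                conn.send(destination, json.dumps(rec))
        conn.disconnect()

    publish(fetch_metadata())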

HTTP Deployment TF

The "HTTP Deployment TF" and put a very short status : https://twiki.cern.ch/twiki/bin/view/LCG/HTTPDeployment The news since my last email is that the T0 and dCache will join. Still waiting for a couple of members' names before scheduling the first meeting.

Action list

  • ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor-CE. Status: HTCondor-CE tests are enabled in production on SAM CMS; sites publishing sam_uri in OIM will be tested via HTCondor-CE (all others via GRAM). The number of CMS sites publishing HTCondor-CE is increasing. For ATLAS, Alessandro announced that a solution has been found with the USATLAS experts to publish the HTCondor CEs in the BDII and OIM in a way that satisfies both ATLAS and SAM needs. It is not a long-term solution, but it should be good for the next six months. Before closing the action, though, confirmation from Marian was needed; Marian was at the meeting today and agreed this action can be marked as [CLOSED].
    • Ongoing discussions on publication in AGIS for ATLAS.

AOB

  • The next meeting is on April 2nd.

-- AndreaSciaba - 2015-03-16
