Week of 140331

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: AndreaS, Felix/ASGC, MariaA, Massimo, Ignacio, Maarten, Zbyszek
  • remote: Onno/NL-T1, Alexei/ATLAS, Lisa/FNAL, Michael/BNL, Lucia/CNAF, Roger/NDGF, Joel/LHCb, Tiju/RAL, Tommaso/CMS, Rob/OSG, Rolf/IN2P3-CC

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • Panda Monitoring node voatlas139 in trouble; the node was removed from the load balancer
      • VOBOX ss-sara-ru is not stable -- to be understood
      • Glitch at INFN-T1 and on VOBOX ss-staging: according to the dashboard, all IT cloud transfers failed for 10 minutes and were successful afterwards. The VOBOX was failing at that time.

  • CMS reports (raw view) -
    • Production activities
      • Heavy Ion rereco plus standard MC activities are ongoing
      • Starting this afternoon, the HI rereco will also run at the HLT. The needed samples were transferred over the weekend.
    • Tickets on CERN:
      • INC:522849 : problems in reading from t0streamer. Seems SOLVED (waiting a few days to gather statistics).
      • (related) INC:525539 : machines no longer on critical power. Some machines will be declared without critical power starting on May 1st. This could be problematic for CMS, since services would need to be moved elsewhere if no automatic tool is present. Discussion between IT and CMS is ongoing.
    • Tickets on T1s
      • GGUS:101731 : Low transfer quality from T1_US_FNAL_Buffer to T2_CH_CERN. SOLVED
      • GGUS:101785 : Transfer quality lower than 50% in T1_DE_KIT. SOLVED
    • Other tickets (newer on top)
      • GGUS:102334 : T2_CH_CERN pilot issues -> still open; it seems it was an ARGUS overload. The last action is 6 days old, apparently waiting for a reply from the CERN side (to be confirmed). [Ignacio will follow this up. Maarten explains that the immediate problem was solved on Thursday, but the matter was not fully clarified.]

  • ALICE -
    • CERN: job failures due to SLC5 WNs missing the glibc-devel x86_64 package (GGUS:102823). [Maarten adds that it is not yet clear whether only a few jobs are affected because they belong to unusual workflows, or because there are a few bad WNs. Ignacio will contact the batch team.]

  • LHCb reports (raw view) -
    • Stripping, MC simulation and user jobs.
    • T0: NTR.
    • T1: the pledge at GRIDKA will be provided today (20 TB for LHCB-USER)
      • SARA: problem with the database behind their tape server.

Sites / Services round table:

  • ASGC: the tape library is not working. The vendor has been contacted but there is no news so far.
  • BNL: ntr
  • CNAF: a scheduled downtime tomorrow morning, affecting CMS, to update FTS-2
  • FNAL: ntr
  • IN2P3-CC: in the coming days, some of our CREAM CEs will be upgraded to EMI-3, as already announced. There should be no visible impact.
  • NDGF: a scheduled downtime tomorrow affecting ALICE and ATLAS
  • NL-T1: there was an MSS issue starting last midnight; it was fixed around 13:00 CEST and about one more hour will be needed to recover the backlog.
  • RAL: ntr
  • OSG: ntr
  • CERN batch and grid services:
    • CERN CA proxy, OTG8750: one week from today, on April 7th, the ca-proxy.cern.ch squid service will be migrated to a new service instance. The migration should be transparent to all. The proxy is used only inside CERN, for (an illustrative client-side setting is sketched at the end of this round table):
      • all CRL downloads of IGTF certificates within CERN;
      • all CVMFS downloads from clients within CERN.
    • CERN WMS: as announced in recent months, the WMS service at CERN is being decommissioned; we plan to shut it down by the end of April. (Note the exception of the WMS instances used by SAM, which will be kept running until the validation of the new SAM probes with direct CREAM and Condor-G submission, scheduled for June.) The WMS instances for the experiments will start being drained next Tuesday, April 1st. Please open a GGUS/SNOW ticket to the CERN site for any issue related to this activity.
    • CVMFS clients were upgraded to 2.1 on LXPLUS5 and LXBATCH; a rolling reboot of LXPLUS5 will follow.
  • CERN storage services: ntr
  • Databases: tomorrow there is a scheduled downtime of the ATLAS offline and ADCR databases, to upgrade to Oracle 11.2.0.4 and move to new, Puppet-managed hardware. It will last from 10 am to 2 pm.

  • GGUS: (MariaD) Last week's email notification problems for the test ALARMs to CERN have gone away. Details in Savannah:142611#comment8. [Maarten adds that GGUS repeated the test for the ALICE VO today (GGUS:102858) and the operator had not yet acknowledged the ticket after more than half an hour, as opposed to the usual ~15 minutes. When this is fully debugged, the test may be repeated for the other VOs as well, for completeness.]
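
As a footnote to the CERN CA proxy item above: clients inside CERN reach ca-proxy.cern.ch through their local CRL/CVMFS configuration, so the migration should indeed be transparent as long as the service name stays the same. A minimal, purely illustrative sketch of such a client-side setting, assuming the standard squid port 3128 and example repository names (this is not the official CERN configuration):

    # /etc/cvmfs/default.local -- illustrative sketch, not the official CERN settings
    CVMFS_REPOSITORIES=atlas.cern.ch,cms.cern.ch       # example repositories only
    CVMFS_HTTP_PROXY="http://ca-proxy.cern.ch:3128"    # proxy named in OTG8750; the port is an assumption

On a CVMFS 2.1 client, running "cvmfs_config reload" applies such a change without a reboot.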

AOB:

Thursday

Attendance:

  • local: Andrea Sciabà, Joel Closier/LHCb, Massimo Lamanna, Maarten Litmaath, Ignacio Reguero, Maria Alandes, Pablo Saiz
  • remote: Daniele Bonacorsi/CMS, Sang-Un Ahn/KISTI, Gareth Smith/RAL, Roger Oscarsson/NDGF, Pepe Flix/PIC, Pavel/ATLAS, Rob Quick/OSG, Rolf Rumler/IN2P3-CC, Michael Ernst/BNL, Sonia Taneja/CNAF

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • Scheduled ADCR DB intervention on Tuesday (4 hours); all services affected. The downtime published in GOCDB was longer and was shortened only after an explicit request from ATLAS (ATLAS automatic tools blacklist resources that use services in downtime, such as the LFC). [Joel comments that it is important that downtimes are closed as soon as the intervention is finished.]
      • On Tuesday atlas-SS07 was degraded, then on Wednesday atlas-SS04 and atlas-SS09 again, and today atlas-SS09 is affected once more; to be understood.
      • Rucio was moved from instance 4 to instance 3 of the ADCR DB; a short glitch was reported. The change benefits the LFC-to-Rucio migration, as Rucio will read the LFC data already cached in memory (clearly visible yesterday as a performance boost).
    • T0/1s
      • On Tuesday FTS3 at CERN was not accepting submissions; ALARM ticket GGUS:102881, solved in 26 minutes.
      • On Wednesday a networking issue at RRC-KI-T1 (GGUS:102942) caused transfer problems; solved within two hours.
      • CVMFS cache issue on CERN production nodes (GGUS:102824) affecting many jobs; understood: the cache size changed after the upgrade to 2.1 (a sketch of the relevant client setting follows this report).
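
As a hedged illustration of the cache-size point above (the actual CERN values are not stated in these minutes): in CVMFS 2.1 the local cache quota is controlled by CVMFS_QUOTA_LIMIT in the client configuration, so a change of this parameter across the upgrade directly affects how much software can stay cached on a node. A minimal sketch with assumed numbers:

    # /etc/cvmfs/default.local -- assumed values, for illustration only
    CVMFS_CACHE_BASE=/var/lib/cvmfs    # default cache location
    CVMFS_QUOTA_LIMIT=20000            # cache quota in MB; the real CERN value is not given here

If the quota is too small for the working set of the jobs on a node, heavy cache eviction can contribute to job failures of the kind reported above.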

  • CMS reports (raw view) -
    • Production activities
      • Heavy Ion rereco pass launched
        • currently also using HLT resources at CERN; working to expand this further
        • the jobs are memory-hungry (3-4 GB) and CPU-intensive (up to ~96 hours); job splitting is being tuned to mitigate this
    • No major troubles at the T1 level
      • main troubleshooting load is at the T2 level

  • ALICE -
    • CERN
      • SLC6 job submissions had to be stopped on Monday evening because the SLC6 CEs kept publishing 0 jobs for all VOs, which would have led to the batch system getting overloaded with waiting jobs (GGUS:102867)
    • KIT
      • Around 14:00 CEST on Apr 1 the number of running jobs in MonALISA started dropping from 5k+ to ~1k for unknown reasons; since then the numbers have fluctuated wildly around that low level, while the batch system typically sees ~3k jobs running at any time. No changes were made by ALICE and other sites are working OK; experts are investigating.

  • LHCb reports (raw view) -
    • Stripping, MC simulation and user jobs.
    • T0: one of our VOBOXes suffered a kind of DoS from a Russian site yesterday. After we blocked the IP range, the VOBOX was usable again.
    • T1: the pledge at GRIDKA will be provided soon. Thanks to PIC and RAL, which provided new pledged resources.
      • Joel added that they opened a GGUS ticket to KIT to report some observed network problems, but the site answered that no evidence of them was visible in their logs. Everything seems to be fine now, but the problem is not understood.

Sites / Services round table:

  • ASGC: our vendor eventually identified the problem: a glitch in the tape robot, which makes the entire tape system very unstable. The vendor is now working hard to get it repaired.
  • BNL: ntr
  • CNAF: ntr
  • IN2P3-CC: ntr
  • KISTI: ntr
  • NDGF: yesterday's downtime for maintenance and upgrade at the Norwegian site went OK, but there was a problem with the dCache SRM update and it was necessary to roll back to the previous version. Due to network problems at the Slovenian site, the storage there is now in read-only mode and, even though it is accessible, it is slow. A downtime for power maintenance is scheduled at Bergen for April 5; some ALICE and ATLAS data might become temporarily unavailable.
  • NL-T1: investigating a problem with slow transfers from NIKHEF SE to BNL or TRIUMF SE together with our network experts (GGUS:102716)
  • PIC: ntr
  • OSG: tried to configure the new CERN VOMS servers for OSG, but wondered whether they are actually available already. Maarten answers that they are able to generate grid-mapfiles, but proxy generation is not yet enabled. It will be enabled when the WLCG infrastructure is sufficiently ready, and will then be tested using SAM. An announcement will then be broadcast and sites showing issues will have time to fix them. After that, the firewall rules blocking proxy generation from the outside will be removed, and another broadcast will be sent telling sites to reconfigure their UI-based services accordingly, so that they use only the new VOMS servers. For OSG, a test will be arranged after next week.
  • CERN batch and grid services: ntr
  • CERN storage services: ntr
  • Dashboards: ntr
  • GGUS:
    • The alarms sent on Monday did reach the operators. The whole workflow works as expected.
    • The 'GGUS Shopping list' tracker was migrated from Savannah to JIRA on 1/4/2014. The 'Planned Release' field (a.k.a. 'Fix Version' in JIRA) was not migrated properly; still investigating how to fix it.
AOB:

-- SimoneCampana - 20 Feb 2014
