Week of 120116

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local (AndreaV, Stefan, Cedric, Maarten, Massimo, Jan, Eddie, Alessandro); remote (Michael/BNL, Rolf/IN2P3, Onno/NLT1, Ulf/NDGF, Gareth/RAL, Paolo/CNAF, Gonzalo/PIC, JhenWei/ASGC; Markus/CMS).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • Problem with EOS GGUS:78247 [Cedric: issue still there, any comments? Jan: the problem is with local batch jobs at CERN using SRM, causing an overload. As agreed with ATLAS, changed today to an experimental setup with 8 BestMan servers per box; we should soon see an improvement. Maarten: why "experimental"? Jan: running several servers per box is not in the BestMan design, but we are fairly confident this will work thanks to a special configuration of FTS and the firewall.]
      • DB load on ADCR on Friday afternoon.
      • [Gareth: is the DB downtime for tomorrow confirmed? Cedric: yes, will check that this was forwarded to the T1 contacts. Alessandro: the DB intervention is at 10am; the cloud will only be set offline at 7am, without draining the queues earlier. Jobs still running when the intervention starts will fail, but at least jobs can run until the very end; this is a compromise. Gareth: if drained earlier, the CPU could be reassigned to other VOs. Maarten: this is ATLAS quota anyway; ATLAS can decide how to use it.]
    • T1s
      • RAL reported that a disk server belonging to the AtlasTape space token is unavailable.
    • T2s

  • LHCb reports -
    • Experiment activities
      • MC11: Monte Carlo productions [Maarten: should this be MC12? Stefan: no, this is actually MC11a, the 2012 production.]
    • T0
      • SAM jobs failing when accessing CEs (GGUS:78185); fixed by restarting the gatekeeper
    • T1: ntr
    • T2

Sites / Services round table:

  • Michael/BNL: reminder, will take SE down to configure Postgres backups tomorrow, at the same time as the ATLAS DB intervention. [Alessandro: is this in GOCDB or OIM? Michael: will add a note in the ATLAS elog].
  • Rolf/IN2P3: ntr
  • Onno/NLT1: ntr
  • Ulf/NDGF: ntr
  • Gareth/RAL:
    • ATLAS mentioned problems with one disk server; a second disk server was found to have issues today
    • interventions on the SRM this week; warnings have been declared in GOCDB
    • network problem earlier today with transfers to T2s (not on the OPN to the T1s); still under investigation
  • Paolo/CNAF: ntr
  • Gonzalo/PIC: ntr
  • JhenWei/ASGC: ntr
  • Pavel/KIT: ntr

  • Jan/Storage:
    • Monday/today:
      • CMS and LHCB update to 2.1.11-9 (transparent)
      • ATLAS STAGER+SRM DB update (unified HW)
      • SRM-EOSATLAS change to use 8x parallel BestMan (to address overload issues via CERN batch nodes)
    • Tuesday:
      • will do EOSATLAS update to 0.1.1-7 tomorrow (rescheduled from today)
      • CMS stager+srm db update (downtime)
  • Eddie/Dashboard: ntr

AOB: none

Tuesday

Attendance: local (AndreaV, Jan, Eddie, Alessandro, Maarten, MariaDZ, Eva); remote (Ulf/NDGF, Michael/BNL, Xavier/KIT, Elizabeth/OSG, Ronald/NLT1, Jeremy/GridPP, Gonzalo/PIC, Burt/FNAL, Tiju/RAL, JhenWei/ASGC; Stefano/CMS).

Experiments round table:

  • ATLAS reports -
    • ADCR DB 11g upgrade schedule details (ATLAS internal). [Eva: all went ok from the DB point of view, Streams are now restarting.]
    • Central Services
      • all of ATLAS Distributed Computing was switched offline in Panda at 7:15am (ATLAS elog 33059)
    • T1s
      • TRIUMF-LCG2: can you please clarify the downtime? From GOCDB we see the LFC, CEs and SRM down but not the FTS, while we got an internal email in which the FTS downtime was mentioned.
      • GGUS:78298 was wrongly assigned by the shifter to BNL-ATLAS, while "Penn ATLAS Tier 3" was the problematic site. The Penn ATLAS Tier 3 site is not present in GGUS.
      • Have all the Tier1s reduced their shares for analysis? It should now be 95% prod, 5% analysis. Thanks

Sites / Services round table:

  • Ulf/NDGF: ntr
  • Michael/BNL: planned intervention was ok, hot backups are now set up
  • Xavier/KIT: ntr
  • Elizabeth/OSG: ntr
  • Ronald/NLT1: ntr
  • Jeremy/GridPP: ntr
  • Gonzalo/PIC: intervention this morning went OK and was transparent
  • Burt/FNAL: ntr
  • Tiju/RAL: intervention on ATLAS SRM took a bit longer than expected but issues have been fixed
  • JhenWei/ASGC:
    • reminder: 10-hour intervention tomorrow between 00:00 and 10:00
    • reminder, there will be a holiday for Chinese New Year from January 21 to 29
  • Reda/TRIUMF (offline notes/additions): we had to do an Oracle upgrade affecting the FTS services; we will check why the FTS didn't appear in GOCDB. The LFC was indeed up. Sorry for the confusion; we will do it properly next time. Also, we extended the downtime by ~2 hours (at risk) since we were still moving some 10G links to different port locations on our core switch in anticipation of the Tier-1 expansion in the coming weeks. Last week we put ~1500 new cores into production. Adding more hardware will be transparent in the future.

  • Eddie/Dashboard: ntr
  • Jan/Storage:
    • CMS CASTOR DB upgraded to 11g this morning, all ok
    • EOS ATLAS upgraded to latest version, took longer than expected but all is ok
  • Eva/Databases: nta

AOB:

  • In view of the T1SCM this week, please send GGUS tickets not followed up to your satisfaction to MariaDZ for the standard presentation.

Wednesday

Attendance: local (AndreaV, Maarten, Ricardo, Luca, Jamie, Jan, Alessandro, Eddie, MariaDZ); remote (Michael/BNL, Gonzalo/PIC, JhenWei/ASGC, Marc/IN2P3, Ulf/NDGF, Tiju/RAL, Pavel/KIT, Giovanni/CNAF, Ron/NLT1, Catalin/FNAL; Stefano/CMS).

Experiments round table:

  • ATLAS reports -
    • new 2012 Tape families for Tier1s: data12_hi, data12_2p76TeV, data12_900GeV, data12_7TeV, data12_calib, data12_1beam, data12_comm, data12_calocomm, data12_cos, data12_idcomm, data12_larcomm, data12_muoncomm, data12_tilecomm
    • ADCR DB migration to 11g successful so far. Keeping a close eye on it. [Alessandro: pandamon on ADCR is very slow. Luca: yes, we see contention on one table; it is mainly one statement that is slow and degrading the performance of the system. See the diagnostic sketch after this report.]
    • Central Services
      • The CERN-PROD network issue, as reported in the ITSSB, caused problems for many ATLAS services. ACRON was the last one to recover; problems were observed until 10:45. Now recovered. [Alessandro: all SAM tests also failed because of the network problem; this period should be discounted in the computation of the availability plots.]
    • T0/T1s
      • CERN-PROD GGUS:78328 priority increased: most of the data export is failing due to this problem; FTS does not use the full SURL and thus fails with "File not exist"... [Jan: asked ATLAS whether we should go back to the BestMan setup from before Monday; waiting for a reply. Alessandro: yes, please roll back. The high load observed with the BestMan setup over the weekend should now decrease, because ATLAS in parallel changed their configuration to use gsiftp instead of SRM.]
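
A minimal diagnostic sketch related to the ADCR slowness discussed above: it lists the statements with the largest total elapsed time on an Oracle 11g instance. The account name, password and privileges (SELECT on v$sql) are illustrative assumptions, not the actual ADCR setup.

    import cx_Oracle

    # Placeholder credentials and service name; an account with read access
    # to the dynamic performance views is assumed.
    conn = cx_Oracle.connect("monitor_user", "secret", "ADCR")
    cur = conn.cursor()

    # Top 5 statements by total elapsed time; on 11g, order in a subquery
    # and filter with ROWNUM to emulate a "top N" query.
    cur.execute("""
        SELECT sql_id, executions,
               ROUND(elapsed_time / 1e6, 1) AS elapsed_s,
               SUBSTR(sql_text, 1, 80)      AS sql_snippet
          FROM (SELECT * FROM v$sql ORDER BY elapsed_time DESC)
         WHERE ROWNUM <= 5
    """)
    for sql_id, execs, elapsed_s, snippet in cur:
        print(f"{sql_id}  execs={execs}  elapsed={elapsed_s}s  {snippet}")
    conn.close()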

  • CMS reports -
    • all systems recovered after the CERN network incident (which was noticed worldwide) [Stefano: CMS had a problem because the job client fetches a configuration file from a CERN node that was unavailable and the fallback did not work; this is an internal issue for CMS to solve.]
      • [MariaDZ: if there is a network problem at CERN, should this not affect all VOs because of VOMS issues? Maarten: not ATLAS, who have a replica at BNL, and eventually not CMS once their replica at FNAL is actually put into production (currently prevented by a technical issue), but ALICE and LHCb would definitely be affected. Note however that all 4 experiments use multi-day proxies, so the problem would have a limited impact; see the proxy-lifetime sketch after this report.]
      • [Jan: the network problem was a stuck switch that had to be rebooted, but this also had several consequences for other services, including AFS.]
    • CRC on duty: Stefano Belforte
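
To illustrate the multi-day proxy point above, a minimal sketch assuming a standard grid UI with the voms-proxy-init and voms-proxy-info clients, a valid user certificate, and the "cms" VO as an example; the experiments use their own renewal machinery, and the VOMS server may cap the attribute lifetime it grants.

    import subprocess

    # Request a proxy valid for 96 hours (4 days), so that a short VOMS
    # outage does not immediately interrupt running workflows.
    subprocess.run(["voms-proxy-init", "--voms", "cms", "--valid", "96:00"],
                   check=True)

    # voms-proxy-info --timeleft prints the remaining proxy lifetime in seconds.
    left = subprocess.run(["voms-proxy-info", "--timeleft"],
                          capture_output=True, text=True, check=True)
    print(f"proxy lifetime remaining: {int(left.stdout.strip()) / 3600:.1f} h")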

  • ALICE reports -
    • NTR [Maarten: the network problem surprisingly seemed to have no effect on ALICE]

Sites / Services round table:

  • Michael/BNL: ntr
  • Gonzalo/PIC: ntr
  • JhenWei/ASGC: intervention today finished, all OK
  • Marc/IN2P3: ntr
  • Ulf/NDGF: the OPN link to one site in Norway broke; investigating and following up
  • Tiju/RAL:
    • yesterday's upgrade on ATLAS SRM was cancelled due to unforeseen problems
    • presently upgrading ATLAS 3D; all is going OK
  • Pavel/KIT: ntr
  • Giovanni/CNAF: ntr
  • Ron/NLT1: this morning's upgrade at Nikhef took longer than expected but all is ok now
  • Catalin/FNAL: ntr, but one question: is there a new EOS version available? Jan: 0.1.1.7 is the current latest, but 0.1.1.8 will come out tomorrow, so wait for that one. Just contact Andreas or the EOS lists.

  • Jan/Storage:
    • Upgrading Castor Alice this afternoon
    • Asked ATLAS if can upgrade EOS to 0.1.1.8 tomorrow. Alessandro: yes please go ahead (unless we tell you not to).
  • Luca/Databases:
    • After the intervention yesterday, there was a problem with conditions streaming to the T1s. It was fixed in the evening.
    • A security patch is out for both 10g and 11g servers. It seems that there are important patches that we should apply.
  • Eddie/Dashboard: ntr
  • Ricardo/Grid: ntr

AOB:

  • MariaDZ: Progress in understanding the (non-)inclusion of some OSG T2s in the OIM view available to GGUS for Direct Site Notification can be followed via Savannah:125662.

Thursday - cancelled

Attendance: the meeting was cancelled

Experiments round table:

  • ATLAS reports -
    • CERN-PROD GGUS:78328: seems to be solved after the rollback of the BestMan version
    • IN2P3-CC GGUS:78374: temporary problem accessing files in dCache, related to the high load of data transfers.

  • CMS reports -
    • NTR
    • CRC on duty: Stefano Belforte

  • LHCb reports -
    • Experiment activities
      • MC11: Monte Carlo productions
    • T0: ntr
    • T1
      • IN2P3: problem with the SRM
      • IN2P3: migration of the data from the old space token to the new one is working. We need to cross-check that it is OK before we give the recipe to the other LHCb T1s using dCache.
    • T2: ntr

Sites / Services round table:

  • BNL - ntr
  • RAL
    • The Oracle 11g upgrade of the ATLAS 3D database was completed yesterday (Wednesday 18 Jan). The update overran its schedule but was completed successfully.

  • Storage/CASTOR:
    • CASTORALICE: Mon 2012-01-23 09:00-13:00 stager+srm DB upgrade to 11g, downtime
    • CASTORCMS: Mon 2012-01-23 update to 2.1.11-9, enable tape gateway (short interruption for migrations/recalls)
    • CASTORPUBLIC: Tue 2012-01-24 09:00-13:00 stager+srm DB upgrade to 11g, downtime (GOCDB: whole site)
    • CASTORALICE: Tue 2012-01-24 update to 2.1.11-9, enable tape gateway (short interruption for migrations/recalls)
    • CASTORLHCB: Thu 2012-01-26 09:00-13:00 stager+srm DB upgrade to 11g, downtime
    • upcoming: CASTORATLAS + CASTORLHCB transparent update to 2.1.11-9+tape gateway; EOSCMS update to 0.1.1

AOB:

Friday

Attendance: local (AndreaV, Maarten, Jan, Zbyszek, Eddie, Ricardo, Alexandre); remote(Michael/BNL, Catalin/FNAL, Elizabeth/OSG, Onno/NLT1, Jhen-Wei/ASGC, Rolf/IN2P3, Tiju/RAL, Thomas/NDGF, Jeremy/GridPP, Paolo/CNAF; Stephane/ATLAS, Stefano/CMS).

Experiments round table:

  • CMS reports -
    • NTR - crunching data happily
    • CRC on duty: Stefano Belforte

Sites / Services round table:

  • Michael/BNL: this morning a dedicated link to LHC1 will be activated and some traffic measurements will then be performed
  • Catalin/FNAL: updated EOS yesterday, behaves much better
  • Elizabeth/OSG: ntr
  • Onno/NLT1: ntr
  • Jhen-Wei/ASGC:
    • there was a hardware problem on a diskserver this afternoon
    • since tonight, there will be one week of holiday for the Chinese New Year; will still answer tickets 8h per day
  • Rolf/IN2P3: good news: progress on GGUS:75983; performance seems better, though the cause is not completely understood, so the ticket will stay open. A SIR will eventually be presented to the T1SCM. [Michael: heard about this issue yesterday at the T1SCM and started some measurements to investigate. Please keep the ticket open for the moment.]
  • Tiju/RAL: there will be an upgrade on CMS SRM on Monday
  • Thomas/NDGF: there will be a downtime on Monday afternoon to patch the dcache server
  • Jeremy/GridPP: ntr
  • Paolo/CNAF: ntr

  • Eddie/Dashboard: ntr
  • Alexandre/WMS: Upgrade of WMS servers to version 3.2 (EMI-1)
  • Ricardo/FTS:
    • The FTS pilot service at CERN will be down on Monday 23 January while it is "upgraded" from gLite 2.2.8 to EMI 2.2.8.
  • Jan/Storage: see Thu report
    • EOSCMS-0.1.1 update agreed for Wed 25.01.2012 10:00
  • Zbyszek/Databases:
    • Reminder, the Oracle critical security patches have been published and will need to be applied (for both 10g and 11g). All experiments have already been contacted to validate the patches against the integration databases next week.

AOB:

  • (MariaDZ) The reason why "Penn ATLAS Tier 3" is absent from the GGUS drop-down list of 'notifiable' sites is understood. Details in Savannah:125662#comment3.

-- JamieShiers - 12-Jan-2012
