Week of 130107

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance: local(AndreaS, David, Alessandro, Stefan, Ignacio, Manuel, Massimo, Maarten, MariaD);remote(Oli/CMS, Michael/BNL, Wei-Jen/ASGC, Onno/NL-T1, Lisa/FNAL, Gareth/RAL, Thomas/NDGF, Rolf/IN2P3, Paolo/CNAF, Gonzalo/PIC, Dimitri/KIT, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • ATLAS Distributed Computing activities went quite smoothly over the holidays: many thanks to all the sites!
      • still many GGUS tickets were submitted to handle daily issues, but no major problems were reported. GGUS team tickets between 21/12/2012 and 07/01/2013
      • many sites had PRODDISK fill up due to the high Group Production activity. Those sites show a high failure rate (output cannot be written) and receive fewer production jobs (input cannot be sent to the sites):
        • this is an ADC issue we are working on.

  • CMS reports -
    • LHC / CMS
      • Successful operations during Holiday Break with very few problems that needed attention.
    • CERN / central services and T0
      • NTR
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE reports -
    • The use of the grid has been quite good during the break, with just a few interventions, thanks to the sites.
      Happy new year!

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • First jobs run on the HLT farm!
    • T0:
      • SAM jobs to all sites were failing because of a certificate problem, solved since lunchtime. [Maarten suggests asking for a recalculation of the availability, if useful]
    • T1:
Sites / Services round table:
  • ASGC: ntr
  • BNL: ntr
  • CNAF: today jobs were killed when LSF was shut down during an intervention on the core and storage services.
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: tomorrow, maintenance of some internal routers during which the network connections to the worker nodes might be interrupted; declared "at risk" in GOCDB
  • NDGF: we are having file transfer problems with some Tier-1 sites (GGUS:90016); probably it is a network issue and we are trying to contact the administrators of the other sites.
  • NL-T1:
    • last Saturday we had a cooling problem which caused three dCache pool nodes to shut down; therefore some files were not available during the weekend.
    • SARA is now called SURFsara, but the host names and the site name in the BDII will remain the same. Email addresses will change, though.
  • PIC: ntr
  • OSG: during the Christmas break we had problems with one of the CERN BDII nodes, which gave lots of errors and was finally removed from the alias (GGUS:89927, GGUS:90041).
  • CERN batch and grid services:
    • the BDII problem is now fixed; it was due to a node that had been moved to Puppet.
    • Ignacio asked ATLAS and LHCb if it is OK to proceed with a schema update for the LFC this Wednesday. For LHCb, the sooner the better; for ATLAS, Alessandro will contact some people in ATLAS and reply offline.
  • CERN storage: ntr
  • Dashboards:
    • tomorrow at 10:00 we will upgrade the ATLAS DDM monitoring, which will take approximately 30 minutes. The new version features staging monitoring.
    • the WLCG transfer dashboard is showing no traffic due to a problem with the message broker that started during the Christmas break and is being investigated.
AOB:

Tuesday

Attendance: local(AndreaS, Alexei, David, Maarten, Ignacio, Manuel, Eva);remote(Gonzalo/PIC, Xavier/KIT, Wei-Jen/ASGC, John/RAL, Lisa/FNAL, Ronald/NL-T1, Rolf/IN2P3, Kyle/OSG, Thomas/NDGF, Saverio/CNAF, Matteo, Alessandro).

Experiments round table:

  • ATLAS reports -
    • ATLAS@P1: conducting combined and calibration runs
    • Tier-0 will be used for data quality checks
    • Grid : MC and Physics Groups production
      • any dCache site with EMI-2 WNs is affected by the bug reported in GGUS:89163: lcg-cp times out with big files (e.g. bigger than 5 GB). It seems that some dCache servers interpret the optional SRM parameter "desiredTotalRequestTime" as the timeout between a srmPut and a srmPutDone, instead of as the timeout of the srmPut operation itself. The standard is not precise on this point, which leads to this kind of problem. The easiest workaround is to add the option --srm-timeout=3600 to the request, which should solve the problem (see the usage sketch at the end of this report).
    • NDGF: a clarification about yesterday's WLCG report regarding GGUS:90016: the problem was observed between NDGF and some of the NDGF distributed dCache pools (unige in particular), not other Tier-1s (at least as far as ATLAS could see)
    • LFC (@CERN) schema update: the intervention will be transparent to ATLAS, but a final word is needed from RicardoR and the ATLAS DBAs
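    • A minimal usage sketch of the --srm-timeout workaround mentioned above (GGUS:89163); the local file, SE host and storage path below are illustrative assumptions, only the option itself comes from the report:
        # copy a large file to a dCache SE, raising the SRM timeout to one hour
        # so that the srmPut is not aborted before the transfer completes
        lcg-cp -b -D srmv2 --srm-timeout=3600 \
          file:///data/big_file.root \
          "srm://srm.example-t1.org:8443/srm/managerv2?SFN=/pnfs/example-t1.org/atlas/big_file.root"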

  • CMS reports -
    • LHC / CMS
      • CMS is planning to have the magnet back up on Thursday
    • CERN / central services and T0
      • Performing some replay work with the new Tier-0 infrastructure
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • T0:
    • T1:
Sites / Services round table:
  • ASGC: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: we remind LHCb of the downtime next Tuesday to split the current SE and provide a dedicated SE for LHCb; it will take three days. Another downtime will add tape drives to a library, which will be shut down for about two hours; during that time traffic will be redirected to the other libraries.
  • NDGF: yesterday evening there was a network problem for a few hours at one of our subsites. Some ATLAS data might have been unavailable. It was solved later in the evening.
  • NL-T1: ntr
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr
  • CERN batch and grid services: the LFC schema will not be updated this week due to concerns about a possible performance impact, which need to be cleared with the developers (currently still on leave). The point is that even if adding two empty columns is transparent and harmless, there could be an impact once they start being populated. Meanwhile, there is in any case some work to do on the SL6 templates for the new EMI LFC. Stefan reminds everyone that next week LHCb starts the data reprocessing, but Ignacio thinks it is safer to postpone the change until the developers are back, also considering that it will be a transparent change.
  • Dashboards: yesterday's upgrade of the ATLAS DDM monitoring went fine. The issue with the WLCG transfer dashboard is still ongoing and the developer is working with Ignacio. Ignacio took care of the problematic message broker node by removing it from the alias and will clean up the other one.
  • Databases: the LHCb online database was blocked yesterday for a few minutes due to the second instance running out of memory; it was fixed by a restart. We are investigating what caused it.

  • GGUS: Maarten noticed an issue with non-propagation of SNOW updates to GGUS. Waiting for information from the SNOW developers. More news tomorrow, we hope.
AOB:

Wednesday

Attendance: local(Massimo, Manuel, Luca, Alexandre, Stefan);remote(Woojin/KIT, Wei-Jen/ASGC, John/RAL, Lisa/FNAL, Ronald/NL-T1, Kyle/OSG, Thomas/NDGF, Saverio, Luca and Matteo/CNAF, Ian/CMS, Michael/BNL).

Experiments round table:

  • CMS reports -
    • LHC / CMS
      • CMS is planning to have the magnet back up on Thursday
    • CERN / central services and T0
      • Performing some replay work with the new Tier-0 infrastructure. Still working to get moved over. We expect to write the HI streamer files to tape.
    • Tier-1: NTR
    • Tier-2:
      • NTR

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • Next week restart prompt processing at CERN and 2011 data reprocessing at T1 sites + attached T2s
    • T0:
    • T1:
Sites / Services round table:
  • ASGC: server crash
  • BNL: ntr
  • CNAF: intervention on storage not yet completed. It is possible that the downtime will be extended to tomorrow (more news by mail)
  • FNAL: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Central services: LFC upgrade paused (developers in the loop to answer questions from ATLAS)
  • Databases: ntr
  • Dashboards: 2 brokers back (thanks to Ignacio); the transfer dashboard is now OK
AOB: CMS will change the T0 activity during the pA run in order to also write the "streamer" files. This will result in tape activity comparable to that of CMS pp running in Nov/Dec 2012 (~400 Hz). No change for export (all data goes only to FNAL).

Thursday

Attendance: local(AndreaS, Maarten, Stefan, Alexandre, Manuel);remote(Ronald/NL-T1, Michael/BNL, Wei-Jen/ASGC, Saverio/CNAF, Pavel/KIT, Alain/OSG, Thomas/NDGF, Tiju/RAL, Rolf/IN2P3, Alexei/ATLAS).

Experiments round table:

  • ATLAS reports -
    • ATLAS will take combined and calibration runs
    • Distributed computing : MC and physics groups production
    • Setting up the data replication scenario for the HI run

  • CMS reports -
    • LHC / CMS
      • Getting ready for HI
    • CERN / central services and T0
      • Performing some replay work with the new Tier-0 infrastructure. Still working to get moved over.
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR
Rolf: yesterday we had several SRM crashes due to long proxies; we pinned it down to ~20 users using proxy delegation too many times. As there is no fix yet from dCache, we would like CMS to contact these users (the local CMS contact has the list) and tell them to use shorter proxies. Maarten comments that, since the vast majority of analysis users do not use long proxies, it cannot be necessary to use them.
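
A minimal sketch of how an affected user could create a shorter VOMS proxy, as suggested above; the 24-hour lifetime is an illustrative choice, not a value given in the report:

    # create a CMS VOMS proxy valid for 24 hours instead of a long-lived one
    voms-proxy-init --voms cms --valid 24:00
    # check how much lifetime is left on the current proxy
    voms-proxy-info --timeleft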

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • Next week restart prompt processing at CERN and 2011 data reprocessing at T1 sites + attached T2s
    • T0:
    • T1:
      • CNAF: the downtime is over for LHCb, all SEs unbanned
      • GRIDKA: timeouts of transfers to Gridka Disk storage under investigation (GGUS:88425)
Stefan asks KIT if the storage split scheduled for next week will improve the transfer problems. Pavel says that it will not. Stefan and Maarten suggest that KIT contacts the dCache and the FTS developers to get some help.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: the downtime has finished for the four LHC experiments and everything is back to normal
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • RAL: ntr
  • OSG: ntr
  • CERN batch and grid services: ntr
  • Dashboards: ntr

AOB:

Friday

Attendance: local(AndreaS, Massimo, Eddie, Stefan, Eva, Manuel);remote(Tiju/RAL, Onno/NL-T1, Saverio/CNAF, Michael/BNL, Xavier/KIT, Wei-Jen/ASGC, Gareth/RAL, Alain/OSG, Rolf/IN2P3, Lisa/FNAL, Christian/NDGF, Alexei/ATLAS).

Experiments round table:

  • ATLAS reports -
    • SARA "un-announced" downtime. We noticed that pilots were not running and discovered a downtime that had not been announced. [Onno: this was a human error, apologies to ALICE, ATLAS and LHCb. We will take measures to prevent it from happening again.]

  • CMS reports -
    • LHC / CMS
      • Getting ready for HI
    • CERN / central services and T0
      • Performing some replay work with the new Tier-0 infrastructure. Still some patching.
    • Tier-1:
      • A problem with slow submission at IN2P3 was reported in SAV:134961
    • Tier-2:
      • NTR

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • Next week restart prompt processing at CERN and 2011 data reprocessing at T1 sites + attached T2s (validation over the weekend)
    • T0:
      • VOMS not reachable from outside CERN (GGUS:90295)
    • T1:
      • CNAF: 11 RAW files are missing from storage; it is currently being investigated how this happened.
      • GRIDKA: timeouts of transfers to Gridka Disk storage under investigation (GGUS:88425, GGUS:88906)
Stefan asks KIT how they can make progress on the transfer problems, because LHCb really wants to send RAW files to KIT, but currently it is just not possible. Xavier answers that it is still under investigation, but admits that little progress has been made. It is agreed once more that contacting the FTS and dCache developers would be useful.

Manuel confirms that VOMS is not accessible from outside CERN, but the network experts have not yet found the reason and would appreciate more information being added to the GGUS ticket.
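
A minimal sketch of the kind of connectivity check whose output could be attached to the ticket, run from a machine outside CERN; the host name and port below are illustrative assumptions, the actual VOMS endpoints should be taken from the VO's vomses configuration:

    # test whether the TCP port is reachable at all
    nc -vz -w 5 voms.cern.ch 8443
    # if it is, check that the TLS handshake completes
    openssl s_client -connect voms.cern.ch:8443 </dev/null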

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: a downtime is planned for next Tuesday and Wednesday to upgrade all the Frontier squid servers. ATLAS and CMS have two of them, so they should not notice anything; LHCb would, but it would be affected anyway by the planned storage intervention.
  • NDGF: next Monday there will be a downtime to allow Bergen to upgrade their batch system and CSC Finland to upgrade dCache.
  • NL-T1: ntr
  • RAL: ntr
  • OSG: ntr
  • CERN batch and grid services: ntr
  • CERN storage: ntr
  • Dashboards: ntr
  • Databases: ntr

AOB:
