Week of 120312

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability SIRs, Open Issues & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports WLCG Service Open Issues Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Massimo, Maarten, Peter, Maria, Elisa, Przemek, Zbyszek, Ivan); remote(Alessandro, Gonzalo-PIC, JhenWei-ASGC, Ulf, Gareth-RAL, Onno-NLT1, Rob-OSG, Burt-FNAL).

Experiments round table:

  • ATLAS reports -
    • Services
      • Priority of a GGUS team ticket could not be changed by any member of the team (GGUS:80121). "There is a fundamental problem with the SOAP interface", under investigation
    • T0/1:
      • FTS transfers CERN-IN2P3: FTS stuck (GGUS:80133); the channel no longer seems completely stuck, but is still a little problematic
    • T2s:
      • SRM problem at IFIC (failed to contact on remote SRM) GGUS:80114. Reply from site: "From time to time, StoRM crashes randomly. We have put an automatic recovery procedure in place until the problem is located and an upgrade/patch is applied"
      • Filesystem problem in the SRM server at DUKE, GGUS:80126. Marked in downtime in OIM
      • Transfers from FZK-LCG2_DATADISK to CA-SCINET-T2_DATADISK are failing with java.net.ConnectException, GGUS:80127. No answer so far
      • UKI-SOUTHGRID-OX-HEP: Failed to contact on remote SRM, GGUS:80131, fixed

  • CMS reports -
    • Tier-0 (plan):
      • Cosmics Data Taking on-going.
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 and Tier-2 sites
    • Tier-0/Tier-1
      • -
    • Tier-2
      • -
    • Services and Infrastructure
      • [Issues] Central JobRobot aborts affecting many sites. We actually seem to have 2 kinds of unrelated issues:
        • The aborts already reported here (GGUS:79858), which were first seen at the gLite WMS's deployed at INFN-CNAF. On Friday March 9th, the local admins at CNAF installed the patched packages for the WMS's used at CERN. However, according to the CMS JobRobot logs, the abort rate due to this problem stayed at the 1% level. As a consequence, CMS decided to remove the JobRobot alarms from its computing shift procedures until the problem is understood.
        • The issue reported in GGUS:80108 ("Globus 131: proxy expired"), which is also occurring at the 1% level, on both CERN and CNAF WMS's. According to the CMS JR expert this problem is not related to the above. Apparently this issue also triggers another bug affecting SUM jobs in OSG3, hence OSG sites seem to be affected more. Production and user jobs appear to proceed fine.
    • Notes
      • Next CMS CRC is Claudio Grandi, starting from tomorrow 8AM.

  • LHCb reports -
    • Experiment activities:
      • Data reStripping and user analysis at Tiers1
      • MC simulation at Tiers2
    • New GGUS (or RT) tickets
      • T0
        • CERN : Castor upgrade this morning
      • T1
        • Gridka: queues were closed much earlier than the start of the downtime, making the site unusable for almost 1 week.
      • All sites:
        • Scratch space on the WNs' local disk has been set to 20 GB in the VO card (it was 10 GB before). A broadcast message will be sent to all sites.

Sites / Services round table:

  • ASGC: ntr
  • CNAF: ntr
  • FNAL: FTS upgrade next week (tbc)
  • IN2P3:ntr
  • NDGF: ntr
  • NLT1: Downtime next week (20 March): several upgrades (NIKHEF-SARA link upgraded to 20 Gb, storage, etc.). Today we had an emergency vendor intervention on 2 pool machines.
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: CASTOR 2.1.12 in prod since this morning for CMS and LHCb
  • Data bases: ntr
  • Dashboard: ntr

AOB: (MariaDZ) File ggus-tickets.xls is up-to-date and attached to page WLCGOperationsMeetings . We had no real ALARM this week. Slides with totals and drills will be made next week for the MB.

Tuesday

Attendance: local(Massimo, Maarten, Claudio, Elisa, Eva, Ivan, JhenWei-ASGC); remote(Gonzalo-PIC, Ulf-NDGF, Gareth-RAL, Ronald-NLT1, Rob-OSG, Lisa-FNAL, Stefano-CNAF, Jeremy-GridPP, Xavier-KIT, Rolf-IN2P3).

Experiments round table:

  • ATLAS reports -
    • sorry no one can be present today
    • This week is ATLAS Software and Computing week.
    • nothing urgent to report

  • CMS reports -
    • Tier-0 (plan):
      • Cosmics Data Taking.
      • T0 prompt reconstruction stopped at Tue Mar 13 13:18:19 CET 2012 to fix an issue in the DB. CMS experts are working on it. Processing will resume as soon as this is done.
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 and Tier-2 sites
    • Sites:
      • Nothing specific
    • Services and Infrastructure
      • WMS related issues:
        1. GGUS:79858 : ("NO_ISB Cannot move ISB" error). Probably related to something new in the EMI WMS (either config or software). It affects all CE types in the same way. It affects the Job Robot globally at a few % level but alters the results for some sites.
          • -> Temporarily ignore the failures in the Job Robot if the error message is "Cannot download..."
        2. GGUS:80108 : ("Globus 131: proxy expired" error). Also a problem in the EMI WMS that hits all EMI WMS's in the same way. Also not related to the CE flavour. It is an issue for JR and SAM. No evidence that it is a problem for analysis.
          • -> Workaround: unset the MyProxyServer variable in the WMS client configuration for JR and SAM
        3. GGUS:80195 : (SAM tests to OSG sites stay in Ready status). Only the WMSs used by SAM are affected and only when submitting to OSG sites. A first investigation indicates that it is a problem related to submissions to ARC-CEs that causes crashes of the gridmanager (GGUS:80215).
          • -> Workaround: consider "unknown" as "warning" in Nagios so that the site availability is not impacted (a minimal illustration of this rule is sketched after this list)
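      • A minimal sketch of the availability rule in point 3, for illustration only (this is not the actual SAM/Nagios code; the status names and the rule that OK/WARNING count as available are assumptions):
        #!/usr/bin/env python3
        # Sketch of the workaround: treat UNKNOWN like WARNING so that stuck SAM
        # jobs do not count against site availability.
        DEGRADE_UNKNOWN_TO_WARNING = True

        def counts_as_available(status):
            """Decide whether a single test result leaves the site 'available'."""
            if DEGRADE_UNKNOWN_TO_WARNING and status == "UNKNOWN":
                status = "WARNING"
            return status in ("OK", "WARNING")

        def availability(test_results):
            """Fraction of test results that count as available."""
            if not test_results:
                return 0.0
            return sum(counts_as_available(s) for s in test_results) / len(test_results)

        # Example: the UNKNOWNs caused by the stuck WMS jobs no longer drag the number down.
        print(availability(["OK", "UNKNOWN", "OK", "CRITICAL", "UNKNOWN"]))  # prints 0.8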

Sites / Services round table:

  • ASGC: Upgrade to FTS 2.2.8 done successfully
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: At this point no queue can be opened for LHCb. On the other hand, for the next big upgrades we will consider offering access to the resources without draining, at the risk of running jobs being killed (question from LHCb). The present intervention is ongoing (1 h delay)
  • NDGF: Today 18:00 UTC OPN intervention for Finland (some ALICE data unavailable). Intervention should be short
  • NLT1: ntr
  • PIC: ntr
  • RAL: Question: it was observed that ALICE was running at 50% efficiency throughout February across all Tier-1s: is this a known fact?
  • OSG:ntr

  • CASTOR/EOS: CASTOR upgrade for ALICE: 19-MAR-2012 (8:00 - 12:00 UTC). Rate tests scheduled for this week (ALICE + CASTOR)
  • Central Services:
    • SAM WMS nodes have run into a Condor bug provoked by an issue with the ARC-CEs at the single NorduGrid site that is tested by CMS; GGUS:80215 opened for WMS developers and Condor developers contacted directly; asked the site if they can try rolling back the change; ultimate recourse is to exclude the site from the WMS information supermarket and to reset Condor on the affected nodes (zapping any unfinished jobs after some draining period); the Condor developers already provided a workaround, to be tested...
  • Data bases: ntr
  • Dashboard: ntr

AOB: (MariaDZ) Please submit, by tomorrow 4pm, GGUS tickets that don't get satisfactory support, to be presented at the T1SCM. Re-opened GGUS:80121 as Maarten reported some disappearance of "Involved others" values. Found Savannah:124395 to start debugging the issue. Those were different issues, so a new ticket GGUS:80199 was opened. More tomorrow.

Wednesday

Attendance: local(Massimo, Alessandro, Gavin, Claudio, Elisa, Ignacio, Ivan, Luca, Maarten, MariaD); remote(Gonzalo-PIC, Ulf-NDGF, Stefano-CNAF, John-RAL, Lisa-FNAL, ShuTing-ASGC, Pavel-KIT, Rolf-IN2P3, Rob-OSG, Ron-NLT1).

Experiments round table:

  • ATLAS reports -
    • This week is ATLAS Software and Computing week.
    • CERN-PROD GGUS:80170: how to monitor the TAPE queue with Lemon metrics. Waiting for clearer input from CERN to better instruct ATLAS shifters.
    • CERN-PROD got 3 GGUS tickets (GGUS:80236, GGUS:80239, GGUS:80241); the submitter was not very clear in explaining the problem. ATLAS is reviewing its instructions to improve the way in which problems will be followed up and reported.

  • CMS reports -
    • Tier-0 (plan):
      • Cosmics Data Taking on-going.
      • Prompt reconstruction stopped from ~1PM to 8 PM yesterday to allow for a (CMS) intervention on DBs
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 and Tier-2 sites
    • Sites:
      • Nothing specific
    • Services and Infrastructure
      • WMS related issues: the only issue still hitting CMS is GGUS:79858 ("NO_ISB Cannot move ISB" error). -> Temporarily ignore the failures in the Job Robot if the error message is "Cannot download...". The workarounds used for the other two issues are working.
      • cmsweb partially red in SLS for 22 hours. It looks like SLS was also reporting an alarm for transient exceptions for which Lemon should not have reported an error state. Furthermore, for part of that time the services had already recovered and were OK, but were still shown red in SLS. Apparently this happened already in the past. Diego Gomes contacted Ivan Fedorko.

  • LHCb reports -
    • Data reStripping and user analysis at Tiers1
    • MC simulation at Tiers2

New GGUS (or RT) tickets

    • T0
      • one ticket for pilots aborted at ce205 (GGUS:80190)
    • T1

Sites / Services round table:

  • ASGC:ntr
  • CNAF:ntr
  • FNAL: FTS being upgraded. It should be finished in ~30 minutes, but expect perturbations for ~2 h more
  • IN2P3:ntr
  • KIT: In response to yesterday's statement of LHCb concerning the GridKa queues: all experiments using GridKa were asked well in advance whether their queues should be closed prior to the downtime in order to let running jobs finish (the default) or whether the queues should be kept open until the downtime starts. There was no response from LHCb, so on March 7th the lhcbXXL (120 hours) and on March 8th the lhcbXL (96 hours) queues were closed. After the complaint of LHCb, both queues were opened again. All shorter queues were open all the time, therefore GridKa was usable. The PBS logs show that LHCb started several hundred jobs every day before yesterday's start of the downtime.
    Maintenance update: All work packages are within the time plan.
  • NDGF: Yesterday's intervention was very disruptive: the telecom operator disconnected the wrong fibre, leaving the Finnish sites disconnected for 40 minutes.
  • NLT1:Intervention on the 20th confirmed
  • PIC:ntr
  • RAL:ntr
  • OSG: Chasing Brian to understand the issues with Nebraska (affecting ATLAS but not CMS)

  • CASTOR/EOS: CASTOR/ALICE test starting (warm up). EOSALICE being set up. Some network glitches
  • Central Services:
    • CVMFS for LHCb will be upgraded (moved). For a few hours new versions cannot be uploaded; cached data will stay available
    • FTS issues on the DESY-CERN channel (FTS jobs getting stuck in the system). Developers investigating
    • WMS issues: patched versions (WMS and ARC-CE) becoming available. Roll-out: tomorrow?
      • Meanwhile the WMS workaround remains in place
    • As we are currently not able to upgrade to the EMI-1 WN, we are upgrading to the versions recommended in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions, namely glexec_wn 3_2_5-1 and wn 3_2_12-1. They will be going to preprod nodes as soon as possible (a minimal version-check sketch is given below). Once we are happy with these versions we will consider migrating from SCAS to ARGUS.
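    • A minimal sketch of such a version check, assuming the baseline metapackages are installed as RPMs; the package names used here (glite-WN, glite-GLEXEC_wn) are assumptions and not taken from the minutes:
      #!/usr/bin/env python3
      # Sketch only: compare locally installed metapackage versions against the
      # baseline versions quoted above. The RPM package names are assumptions.
      import subprocess

      BASELINE = {
          "glite-WN": "3.2.12-1",        # "wn 3_2_12-1" in the baseline list
          "glite-GLEXEC_wn": "3.2.5-1",  # "glexec_wn 3_2_5-1" in the baseline list
      }

      def installed_version(package):
          """Return 'version-release' of an installed RPM, or None if not installed."""
          result = subprocess.run(
              ["rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}", package],
              capture_output=True, text=True)
          return result.stdout.strip() if result.returncode == 0 else None

      for pkg, wanted in BASELINE.items():
          found = installed_version(pkg)
          status = "OK" if found == wanted else "CHECK"
          print(f"{pkg:<18} baseline={wanted:<10} installed={found} {status}")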
  • Data bases:ntr
  • Dashboard: ntr

AOB: (MariaDZ) GGUS Release next Tuesday 2012/03/20. With the 2012/03/20 GGUS Release the Remedy engine behind the user interface will be upgraded to 764 SP2. Details in https://savannah.cern.ch/support/?124546 The change should be totally transparent to users. Interface developers were informed on 2012/03/07 about this upcoming change. They can use https://train.ggus.eu/pages/home.php for testing.

Thursday

Attendance: local(Massimo, Claudio, Elisa, Ivan, Maarten, Oliver, Ignacio, Andrea, Eva); remote(Ulf-NDGF, John-RAL, Kyle-OSG, Rolf-IN2P3, Lisa-FNAL, ShuTing-ASGC, Paco-NLT1).

Experiments round table:

  • CMS reports -
    • Tier-0 (plan):
      • Data taking during beam commissioning
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 and Tier-2 sites
    • Sites:
      • Nothing specific
    • Services and Infrastructure
      • Nothing new w.r.t. yesterday for what concerns WMS problems

  • ALICE reports -
    • Many VOboxes were affected by a problem with the installation of a new version of an ALICE analysis package: the installer got into an infinite loop and filled up /tmp in many cases. This in turn led to a very low usage of the grid. The issue actually started yesterday afternoon, but was only recognized and fixed this morning when several site admins reported seeing problems on their VOboxes.
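    • A minimal sketch of the kind of guard that would have contained such an incident: bound the number of installation retries and refuse to retry when /tmp is nearly full. The install_package() callable and the thresholds are hypothetical placeholders, not the actual ALICE installer.
      #!/usr/bin/env python3
      # Sketch only: a bounded retry loop that checks free space in /tmp so a
      # failing installation cannot loop forever and fill the disk.
      import shutil
      import time

      MAX_ATTEMPTS = 5
      MIN_FREE_TMP_BYTES = 2 * 1024**3   # hypothetical threshold: 2 GB free in /tmp

      def tmp_has_space():
          return shutil.disk_usage("/tmp").free >= MIN_FREE_TMP_BYTES

      def install_with_guard(install_package):
          """install_package is a hypothetical callable returning True on success."""
          for attempt in range(1, MAX_ATTEMPTS + 1):
              if not tmp_has_space():
                  raise RuntimeError("not enough free space in /tmp, aborting install")
              if install_package():
                  return True
              time.sleep(60 * attempt)   # back off instead of retrying tightly
          raise RuntimeError("installation failed after %d attempts" % MAX_ATTEMPTS)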

  • LHCb reports -
    • Data reStripping and user analysis at Tiers1
    • MC simulation at Tiers2

New GGUS (or RT) tickets

    • T0
    • T1

Sites / Services round table:

  • ASGC: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3:ntr
  • NDGF: ntr
  • NLT1: ntr
  • PIC: ntr
  • RAL: see ATLAS report
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Central Services: The problem reported yesterday is wider than expected (affecting the WN software as well). Essentially there are name clashes across different (incompatible) RPMs, viz. the ones for gLite vs. the ones for EMI.
  • Data bases: ntr
  • Dashboard: ntr

AOB:

Friday

Attendance: local(Massimo, Claudio, Elisa, Maarten, Ivan);remote(Thomas-NDGF, John-RAL, Kyle-OSG, Rolf-IN2P3, Lisa-FNAL, JhenWei-ASGC, Alex-NLT1).

Experiments round table:

  • CMS reports -
    • Tier-0 (plan):
      • Data taking during beam commissioning
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 and Tier-2 sites
    • Sites:
      • RAL in unscheduled downtime since this morning for network problems
    • Services and Infrastructure
      • Still living with WMS problems that affect the Job robot results

  • ALICE reports -
    • Details on yesterday's VOBOX problem have been added to yesterday's report.
    • The Dashboard team noticed the SAM topology for ALICE had become unavailable due to an oversight: its original host was being prepared to be decommissioned while the new host still referred to it. Fixed.

Sites / Services round table:

  • ASGC: ntr
  • CNAF: ntr
  • FNAL:ntr
  • IN2P3:ntr
  • KIT: restarting after the intervention
  • NDGF: ntr
  • NLT1: one file server rebooted; now OK
  • PIC:ntr
  • RAL: network problems: during an intervention the network came to a halt (packet storm). We are slowly recovering (some parts of the centre are visible; others, notably some disk servers, are still unreachable). Hope to fix it in the next hour or so
  • OSG:ntr

  • CASTOR/EOS:ntr
  • Central Services: ntr
  • Data bases:ntr
  • Dashboard:ntr

AOB:

-- JamieShiers - 31-Jan-2012
