Week of 110307

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability SIRs, Open Issues & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports WLCG Service Open Issues Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday:

Attendance: local (AndreaV, Elena, Mattia, Daniele, Jamie, Ignacio, Maarten, Eva, Dirk, Alessandro, MariaDZ, Ricardo, Steve); remote (Jon, Dimitri, Onno, Felix, Rolf, Kyle, John, Lorenzo, Tore).

Experiments round table:

  • ATLAS reports -
    • ATLAS general info
      • project tag: data11_7TeV
      • The HI express stream reprocessing which started on Friday is now finished.
    • CentralServices
      • Problem with ATLAS Central catalog last evening. The problem was fixed by midnight. [Elena: this is a DQ2-related issue.]
    • T0
      • CERN-PROD: problem transferring 3 RAW files from CERN-PROD_TZERO for the reprocessing campaign (Friday evening): GGUS:68258. Solved quickly. Thanks.
    • T1
      • IN2P3: reduced efficiency (75%) for file transfers because of 'Unable to connect to server' errors. GGUS:68290, solved quickly by restarting dCache. Thanks.

  • CMS reports -
    • CMS / CERN / central services
      • issues with the SSB, work in progress (status to be discussed at today's Computing Ops meeting)
      • still struggling a bit with 'cernmx': details are being collected/provided to Stefan to help his investigations [Daniele: this is an elog-related issue. Alessandro: ATLAS did not observe any problems with elog, can you give more details? Daniele: Stefan did mention that there are some differences in elog for ATLAS and CMS, so this is not surprising. CMS is collecting stats that will be sent to Stefan so that he can debug the problem.]
    • Tier-0
      • NTR
    • Tier-1
      • MC production with CMSSW 3_11, mainly at FNAL, CNAF, IN2P3, KIT and RAL, though at comfortable submission rates
      • IN2P3: a fraction of the output of MC prod jobs was published to the co-located T2: investigating (SAV:119613)
      • new production infrastructure: tests only at a much reduced rate (to allow the MC production mentioned above)
      • good progress in the T1 deletion campaign to get prepared for the 2011 data taking period
    • Tier-2
      • NTR
    • [ CMS CRC-on-duty from Mar 1st to Mar 8th: Daniele Bonacorsi. ]

  • ALICE reports -
    • T0 site
      • Since Friday afternoon we have been testing the Torrent installation mechanism on voalice14 (WMS submission). We found some issues during the weekend:
        • LSF reported a large number of jobs running while MonALISA was showing just a few; this seems to be solved by now but we are keeping an eye on it.
        • PackMan is not updating the software area. Under investigation.
        • We also plan to use Torrent on voalice12 in the coming days.
      • The ClusterMonitor service was not working during the weekend on voalice12 (CREAM submission); the cause was a wrong configuration file. Fixed and working since this morning.
    • T1 sites
      • Deployment of the new AliEn version (v2.19.81) was done at T1s and T2s on Saturday morning
    • T2 sites
      • Usual issues at T2's

  • LHCb reports -
    • [No report. Maarten: there is an LHCb Tier1 jamboree at Lyon this week, which might explain it.]

Sites / Services round table:

  • FNAL: ntr
  • KIT: ntr
  • NLT1: Nikhef started draining queues for tomorrow's downtime.
  • ASGC: there will be a power maintenance next Sunday; it will be announced in GOCDB.
  • IN2P3: ntr
  • OSG:
    • will be quiet this week, there are several meetings going on.
    • gridview availability for BNL is being recalculated. [Alessandro: submitted a GGUS ticket 10 days ago to recalculate gridview availability for ATLAS. Is anything other than ATLAS being recalculated? Kyle: will check and reply to Alessandro.]
  • RAL: ntr
  • CNAF: ntr
  • NDGF: ntr

  • Database services: 1h shutdown of the ALICE database due to various ALICE interventions at the pit this morning

AOB:

  • (Steve) This TWiki now supports interwiki links to SNOW, e.g. INC:012634 and RQF:0001962. Of course take care, since viewing is not public to all.
  • (MariaDZ) A change made by the developers to the ggus.org/.eu DNS records temporarily blocked the internal campus firewall. This was the reason why https://ggus.org was unreachable last Wednesday 2011/03/02, whereas https://gus.fzk.de was working without problems.

Tuesday:

Attendance: local (AndreaV, Jarka, Maarten, Daniele, Mattia, Pedro, Jamie, Luca, Ignacio, Simone, Alessandro, Lola); remote (Jon, Ronald, John, Lorenzo, Rolf, Kyle, Jeremy).

Experiments round table:

  • ATLAS reports -
    • IN2P3-CC: GridFTP door issue, GGUS:68347; seems to be a dCache middleware bug, site admin filed a bug report with dCache. The GridFTP door service on ccdcatli012.in2p3.fr was restarted, which helped. GGUS ticket solved. [Mattia: IN2P3 had problems with SRM storage tests for ATLAS, CMS and LHCb. Now the issue seems solved.]
    • ATLAS observed glitches on the SLS monitor due to Lemon not providing data. We filed GGUS:68331, which was directed to SNOW:INC019411. Lemon support is CC'd in the GGUS ticket. No update yet. SNOW does not recognize INC019411.
    • Express stream reprocessing of HI data from 2010 done. Bulk reprocessing at Tier1s will start this week (Wed or Thu).

  • CMS reports -
    • LHC / CMS detector
      • Commissioning with circulating beam. Yesterday: progress with collimation of B1 and B2, a couple of test ramps/squeezes overnight. Today: firmware updates and other business, injection and dump protection at 450 GeV, then collimator set-up at 3.5 TeV from 4-5pm onwards.
      • CMS DataOps prepared the list of primary datasets for the first data taking epoch in 2011 (in jargon: "Run2011A"). Now circulated to the Computing Resource Board for comments. If blessed, Savannahs will be opened to T1s as usual for the tape families creation.
    • CERN / central services
      • Some issue with the t1transfer pool, overnight, for some hrs. No major impact, but we would like to know the root cause - if understood. [Ignacio: can you please open a ticket even when the issue disappears immediately? Then this can go to CASTOR or to Lemon, depending where the issue is. Daniele: OK, will open a ticket. Alessandro: ATLAS did open a ticket about the SLS/Lemon issues, though it is difficult to provide useful information as there is no error message.]
      • The VOMS core service was intermittently degraded yesterday for some hours. Specifically, the VOMS core service on lcg-voms.cern.ch was unavailable. It happened twice; I did not report it yesterday, but maybe it is worth a mention today (see here and here).
      • ELOG: on Tuesday March 8th, 11:30 GVA time, a configuration change was applied by Stefan Roiser to try to fix the 'cernmx' error seen from time to time when submitting ELOGs. Observations (for the record, to help Stefan) below. [Daniele: the changes made by Stefan seem to have solved the issue.]
    • Videoconferencing systems
      • message from EVO: "On Monday 7th March, our main server infrastructure faced major network instabilities in L.A. starting from 5:00 PST. It affected dozens of international meetings. Even if the network problem still there we have disconnected all our servers in this area to avoid the annoyance on our international service. Sorry about the inconveniences. The EVO Team."
    • Tier-0
      • Moved to CMSSW 412 for promptReco and AlCa processing. A few patches applied, also.
    • Tier-1
      • production with CMSSW 3_11 basically done. Continuing this week in 4.1 (at T1 and T2 sites)
      • downtimes: RAL (batch stop for Castor Nameserver update)
      • A note on ASGC: will upgrade Castor for CMS in early April. CMS is ending processing; ASGC is custodial of data that are also located elsewhere, so once processing is over (end of March) they can safely proceed. The sooner in April, the better for CMS.
      • Issues at IN2P3: 1) a fraction of the output of MC prod jobs was published to the co-located T2: investigating (SAV:119613); 2) deletions are progressing slowly, would need a check (SAV:119628)
    • Tier-2
      • MC prod with CMSSW 4.1 starting soon. Analysis as usual.
    • [ CMS CRC-on-duty from Mar 8th to Mar 14th: Daniele Bonacorsi ]

  • ALICE reports -
    • T0 site
      • Torrent testing is still ongoing and performing better. We got rid of the stuck jobs, which increased the number of running jobs, but the number of jobs ending in validation errors has also increased. Under investigation.
    • T1 sites
      • FZK: yesterday afternoon we observed many xrdcp clients hanging. One of the servers suffered from high load. The site admin was contacted and it was discovered that the problem was similar to what happened at CERN a week ago: a file that resided on only one server was accessed by many jobs. Mirroring the file fixed the problem.
    • T2 sites
      • Usual issues at T2's

Sites / Services round table:

  • FNAL: ntr
  • NLT1: Nikhef was in scheduled downtime, all seems ok, but still checking if this caused any other issues.
  • RAL: there will be a downtime for CASTOR, as reported by CMS
  • CNAF: ntr
  • IN2P3: ntr
  • OSG: ntr
  • GridPP: ntr

AOB: none

Wednesday

Attendance: local (AndreaV, Jarka, Pedro, Eva, Nilo, Jamie, Simone, Ricardo, Ignacio, Roberto, Mattia, MariaDZ, Lola, Massimo, Julia); remote (Jon, Rolf, Xavier, Tiju, Kyle, Felix, Onno, Tore).

Experiments round table:

  • ATLAS reports -
    • No update in GGUS:68331 for the Lemon glitch reported yesterday. Is the GGUS-SNOW link OK? [MariaDZ: the GGUS-SNOW link does work, will check whether the answer was posted in GGUS by the Lemon team. Ricardo: saw the ticket and sent it to the Lemon team.]
    • No major issues to report.

  • CMS reports -
    • WARNING: I will be giving a talk at the GDB at the time of this call - so I might not be able to connect. Report below (Daniele)
    • LHC / CMS detector
      • Commissioning with circulating beam. Last night: Beta* measurement and adjustment. Morning: access + tests. Afternoon: collimators into ramp, loss maps on flat top. Night: aperture measurements in dispersion suppressors.
    • CERN / central services
      • ELOG: the 'cernmx' problem seems better but not yet fixed. One notification of problems after the config change and restart yesterday morning (only 1 in 24 hrs).
    • Tier-0 / CAF
      • CAF: the CAF queue 'cmscaf1nd' had >1k pending jobs (see here) from one user. Legitimate, and no impact; just keeping an eye on it to see if it increases and/or causes problems.
    • Tier-1
      • MC production soon with CMSSW 4.1 (also at T2 sites)
      • RAL (one observation plus a fix): CREAM-CE SAM tests (as seen from here) seemed to stop yesterday evening (note RAL is in scheduled downtime until 2011-03-09 15:00 UTC, but tests are normally still sent there). Cinquilli noticed and Sciabà checked. SAM had test jobs scheduled as of 18:40 yesterday; in principle they should have aborted after 12 hrs for proxy expiration, but they didn't. Andrea cleaned the working dir of the involved CEs so that SAM will just send new jobs; the situation should recover in a few hours. [Tiju: yesterday's downtime could explain the problem, but it is not clear if the times match.]
      • Issues at IN2P3: still following up (see yesterday's report: 1) a fraction of the output of MC prod jobs was published to the co-located T2: investigating (SAV:119613); 2) deletions are progressing slowly, would need a check (SAV:119628))
    • Tier-2
      • MC production soon with CMSSW 4.1 (also at T1 sites)
      • Analysis as usual.
    • [ CMS CRC-on-duty from Mar 8th to Mar 14th: Daniele Bonacorsi ]

  • ALICE reports -
    • T0 site
      • The Central Optimizer was stuck until this morning, so the number of jobs in SAVED status but doing nothing increased enormously. Situation back to normal.
      • Yesterday we discovered that running jobs are using the r/w AFS area instead of the r/o one. There must be a misconfiguration; we have to investigate further to fix it. This could explain the poor performance of the SW area for ALICE. [Massimo: they are definitely using the R/W volumes, but it is not clear whether this is a misconfiguration. Ignacio: some volumes did not have a R/O replica.]
    • T1 sites
      • CNAF: was not available this morning due to a problem on the vobox. We confirm that it is fixed and jobs are already running.
      • FZK: GGUS:68387. There was a mismatch between the number of jobs in MonALISA and the ones reported by the information provider. The CM service was down for ~10 hours, which caused a decrease in running jobs. There was also a problem with the memory used (4 GB for the whole vobox), which is going to be solved this week. SOLVED
    • T2 sites
      • Usual issues at T2's

  • LHCb reports -
    • Restarted a few MC productions after the LHCb Jamboree. Nothing to report.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
      • [Roberto: noticed some dashboard tests failing just before the meeting, no GGUS ticket yet.]
    • Issues at the sites and services
      • T0
        • NTR
      • T1
        • RAL: DT for CASTOR upgrade. Banned SEs
      • T2 site issues:
        • NTR

Sites / Services round table:

  • FNAL: ntr
  • IN2P3: ntr
  • KIT: one Cream server gave problems with job submission and has been moved to downtime till Monday; this should not be a problem unless hardcoded addresses are used
  • RAL: Castor upgrade has been completed
  • OSG: ntr
  • ASGC: ntr
  • NLT1: ntr
  • NDGF: ntr

  • Dashboard services (Julia): before stopping SAM tests and moving to Nagios completely, we must make sure that the availability plots for the MB can be produced from Nagios too; this is being followed up with the experiments. For CMS all is fine. For ATLAS all is OK except the CREAM CE tests. For ALICE this is being followed up with Lola. For LHCb this check will start soon.

AOB: none

Thursday

Attendance: local(Ignacio, Jarka, Maarten, Maria D, Mike, Pedro, Ricardo);remote(Andreas, Gonzalo, Jeremy, Jhen-Wei, Jon, Kyle, Rolf, Ronald, Tiju, Tore).

Experiments round table:

  • ATLAS reports -
    • Lemon glitch reported ca. 2 days ago: according to Maria's response to GGUS:68331 the SNOW ticket (INC:019411) is resolved. Unfortunately, I am able neither to view the SNOW ticket nor to see the solution in the GGUS ticket. Can Lemon Support please comment on what the final response is? Thank you.
      • conclusions from investigation with Maria Dimou after the meeting:
        • 2nd line SNOW support erroneously converted the incident into a "request" (new SNOW ticket of a different type) on March 8 and closed the incident accordingly; Maria D will follow up with the SNOW team on why this happened
        • the change did not get propagated to GGUS, because the path for updates in that direction does not work yet
        • the "request" ticket has been assigned to the Lemon team: they should have received an e-mail, but did not update the ticket so far (maybe because it was only a "request", not an incident)
        • Jarka does not have access to the SNOW incident ticket, because she is not in any of the support teams known by SNOW; such access should not be needed anyway when the update path from SNOW to GGUS works
        • please do _NOT_ put xyz.support@cern.ch in CC of a GGUS ticket, because it would cause additional tickets to be created!
    • No major issues to report.

  • CMS reports -
    • LHC / CMS detector
      • Commissioning with circulating beam. Stable beams foreseen on Saturday afternoon.
    • CERN / central services
      • ELOG: the 'cernmx' problem seems better but not yet fixed. Still keeping an eye on it.
    • Tier-0 / CAF
      • NTR
    • Tier-1
      • MC production soon with CMSSW 4.1 (also at T2 sites)
      • RAL: SAM (MC) test errors on two CEs (SAV:119693, GGUS:68456). Actually RAL is in AT-RISK now ("T1_UK_RAL SCHEDULED downtime (AT_RISK) [ 2011-03-10 09:00 to 2011-03-11 15:00 UTC ]", reason: "At Risk while renaming Atlas files in bulk"). CMS will wait until RAL is out of this, and check back tomorrow.
        • Tiju: downtime should only affect ATLAS LFC; the CMS SAM test error was due to a bug in that test
      • IN2P3: still following up on output of MC prod jobs mistakenly published to the co-located T2 (SAV:119613) and deletions progressing too slowly (SAV:119628)
    • Tier-2
      • NTR

  • ALICE reports -
    • T0 site
      • CREAM submission has been disabled at CERN since last night because all jobs were failing immediately. The problem is on our side. Under investigation.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual issues at T2's

  • LHCb reports -
    • Experiment activities:
      • MC productions ongoing without major problems.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • volhcb29, the machine hosting the SAM suite, was rebooted this morning for a quick patch and SAM tests may have been affected. Back to normal now.
      • T1
        • NTR
      • T2 site issues:
        • NTR

Sites / Services round table:

  • ASGC - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - ntr
  • KIT
    • a routing problem from a subset of the WNs to a subset of the file servers caused job failures for ATLAS and LHCb; should be fixed now
  • NDGF - ntr
  • NLT1
    • GGUS:68420: transfers from NIKHEF to various sites had intermittent errors due to a new disk server having an MTU of 9000 instead of 1500; fixed now (see the path-MTU sketch at the end of this round table)
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • CASTOR - ntr
  • dashboards - ntr
  • grid services - ntr
  • GT group - ntr
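
  • Note on the NIKHEF MTU mismatch reported above: a quick way to see which MTU the local kernel currently assumes on the route towards a given storage host is the minimal Python sketch below (a Linux-only illustration, not part of the original report; the disk-server name and the GridFTP port are placeholders).

      import socket

      # Linux-specific socket option numbers (values from <linux/in.h>)
      IP_MTU = 14
      IP_MTU_DISCOVER = 10
      IP_PMTUDISC_DO = 2

      def path_mtu(host, port=2811):
          """Return the kernel's current path-MTU estimate towards host (Linux only)."""
          s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          # Enable 'don't fragment' behaviour so the kernel tracks the path MTU for this route.
          s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
          s.connect((host, port))  # for UDP this only selects the route, no packet is sent
          mtu = s.getsockopt(socket.IPPROTO_IP, IP_MTU)
          s.close()
          return mtu

      if __name__ == "__main__":
          # Hypothetical disk server name, for illustration only.
          print(path_mtu("diskserver.example.org"))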

AOB:

Friday

Attendance: local (AndreaV, Jarka, Alessandro, Pedro, Mattia, Jamie, Maarten, Stefan, Massimo, Ignacio, Eva, Nilo, Edoardo, Ricardo); remote (Jon, Gonzalo, Ulf, Jhen-Wei, Tiju, Xavier, Lorenzo, Onno, Rob; Daniele).

Experiments round table:

  • ATLAS reports -
    • ATLAS observed recent degradation of AFS response. It affects e.g. SW release building. Can AFS responsibles have a look, please?
    • [Massimo: known problem last week, but should have improved in recent days. Did you observe a degradation also in the last two days? Can you please open a ticket and provide details of the affected AFS paths? Alessandro: there was a general AFS slowness last week. Massimo: some problems were traced to the AFS client version and there has been a campaign to upgrade to the AFS client 1.4.14. AndreaV: did you upgrade only the central services like lxplus/lxbatch or did you contact the people responsible for individual VOboxes/clusters? Maarten: shouldn't there be something like an IT broadcast that people should reboot to pick up the latest AFS client? Massimo: will follow up.] (A rough sketch of how a node could check whether it still needs a reboot after the client upgrade follows at the end of this report.)
    • [Mattia: some LHCb SAM tests were failing due to the AFS problem.]
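    • Aside on the AFS client discussion above, a rough sketch (not an official procedure; it assumes the client is delivered as an RPM named 'openafs', which is an assumption and not something stated in the report) of how a node could check whether it has been rebooted since the client RPM was upgraded, i.e. whether it may still be running the old client:

        import subprocess
        import time

        def boot_time():
            """Epoch seconds of the last boot, derived from /proc/uptime (Linux only)."""
            with open("/proc/uptime") as f:
                uptime_seconds = float(f.read().split()[0])
            return time.time() - uptime_seconds

        def rpm_install_time(package="openafs"):
            """Epoch seconds when the RPM was installed; the package name is assumed."""
            out = subprocess.check_output(
                ["rpm", "-q", "--qf", "%{INSTALLTIME}\n", package])
            return int(out.decode().split()[0])

        if __name__ == "__main__":
            if rpm_install_time() > boot_time():
                print("AFS client RPM is newer than the last boot: a reboot is needed to load it")
            else:
                print("no AFS client RPM newer than the running system (based on timestamps only)")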

  • CMS reports -
    • LHC / CMS detector
      • Commissioning with circulating beam. Stable beams still foreseen on Saturday afternoon.
        • Details follow. Morning: ramp for collimation - lost beam from QTF trip. Collimator setup end of squeeze, then collide and setup of collimators with colliding beams. Afternoon: injection of multi-bunches, collimation, machine protection tests. Night: loss maps, async dumps.
    • CERN / central services
      • Jacek: "The CMSR database went unexpectedly down around 10:20. We are trying to restart the service ASAP." (10:38 GVA time). Major impact on most CMS services (PhEDEx, DBS, T0, DAS, Visualization... just to give examples). CMS operators put all the CMSWEB backends into maintenance mode to avoid useless service restarts by IT operators; once the Oracle servers come back, CMS operators will restart all the services running on the CMSWEB cluster. Jacek confirmed to us that the DB was restarted at ~12:30 GVA time on the standby hardware. Quoting: "The failure was caused by a local power cut in the CERN CC which in turn caused a multi disk failure and the lost of the primary DB. Due to the failover to the standby system few seconds of transactions committed just before the failure could be lost". CMS triggered a restart of all CMSWEB services as of 12:49 GVA time. All done and communicated by 13:28 GVA time. SUMMARY (from our side): total CMSR downtime ~2h10m, additional CMSWEB recovery time ~40m. [AndreaV: is there a ticket for this? Eva: there is a ticket about a possibly related connectivity issue; will prepare a SIR anyway.] [Added the following week: this is the link to the SIR.]
      • https://ca.cern.ch/ wasn't responding yesterday (at least since ~2pm) and again today there was no success. Stephen opened a ticket (INC020487, turned into RQF0002967, Stephen is not sure why). Now FIXED though. [Maarten: the INC ticket being turned into an RQF may be another occurrence of the problem seen by ATLAS, which MariaDZ is following up.]
      • ELOG: no new notification of the 'cernmx' errors, still asking shifters to monitor it though
    • Tier-0 / CAF
      • NTR
    • Tier-1
      • progress on tickets for RAL and IN2P3; CNAF is slow in migrating to tape, more details tomorrow
    • Tier-2
      • NTR
    • [ CMS CRC-on-duty from Mar 8th to Mar 14th: Daniele Bonacorsi ]

  • ALICE reports -
    • T0 site
      • CREAM submission is again working
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual issues at T2's

  • LHCb reports -
    • Experiment activities:
      • MC productions running at full speed.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • cvmfs is being installed in some batch nodes at the moment. This is going to be tested over the w/e on a subset of nodes.
      • T1
        • NTR
      • T2 site issues:
        • 3 T2 sites failed MC productions; GGUS tickets have been opened for these sites

Sites / Services round table:

  • FNAL: ntr
    • reminder: US will switch to daylight savings time this weekend
  • PIC: ntr
  • NDGF: link to OPN was severed last night, but backup worked fine, so nobody noticed
  • ASGC: reminder, 4h downtime on Sunday for power maintenance between 4am and 8am CERN time
  • RAL: ntr
  • KIT: ntr
  • CNAF: ntr
  • NLT1: announcement of a one-day maintenance of the SARA SRM on Tuesday 22 for software and firmware updates, triggered by the mail circulated by Gonzalo yesterday about cooling (thanks to Gonzalo!)
  • OSG: a few more records for BNL will probably need to be recalculated [Alessandro: please put me in CC so that I can follow up for ATLAS]

  • Database services:
    • LCGR was also affected by the power cut this morning, but the impact was lower: only 3 nodes went down and sessions were moved to the fourth node
    • the precise causes of the failures triggered by the power cut are still being investigated
    • the three ATLAS calibration centers in Rome, Munich and US started streaming data to ATLR yesterday
  • Grid services: cernvmfs has been deployed on some batch nodes; when this deployment is validated, cernvmfs will be deployed on all batch nodes in one or two weeks.
  • Network: ntr
  • Dashboards: nta
  • GT group: ntr

AOB: none

-- JamieShiers - 03-Mar-2011
