Week of 120206

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Jan, Pepe, Fernando, Xavier, Eva, Maarten, Mike);remote(Gonzalo, Ulf, Jhen-Wei, Lisa, Burt, Joel, Stefano*2, Onno, Lorenzo, Tiju, Rolf, Rob).

Experiments round table:

  • ATLAS reports -
  • Report to WLCGOperationsMeetings
    • T0/Central Services
      • CERN-PROD: SRM EOS problems writing into ATLASDISK with lcg-cr. Ticketed Sun 00:52. GGUS:78923 [ Jan - the error rate has gone down since Sunday; suspect it might be due to a new release of lcg_utils. The error message is 'missing space token' - being discussed with Alessandro ]
    • T1s
      • SARA-MATRIX: FTS server problem ticketed Sat 22:30. Looks like an Oracle problem; will be investigated Monday. GGUS:78922
      • NDGF-T1: Data staging failures on ND.ARC ticketed Sat 23:00. Five ATLAS and ALICE pools went offline, some data unavailable. GGUS:78924


  • CMS reports - NTR - crunching data happily


  • ALICE reports -
    • The xrootd redirector voalice16.cern.ch was stuck Sunday evening and, at our request, was reset around midnight by the operator.


  • LHCb reports -
    • T0
    • T1
      • GRIDKA: Problem with the LFC; changed to InActive in the configuration [ Eva - it seems they have some firewall problems with their new DB ]
      • SARA - looks like some problems there too.

Sites / Services round table:

  • PIC - ntr
  • NDGF - the down pools have been moved to the GPN; network problem in Norway. Tomorrow at 13:50 UTC the SRM head node will be down for 1h for servicing and to install xrootd patches so that ALICE can talk to that box.
  • FNAL - ntr
  • ASGC - ntr
  • CNAF - ntr
  • NL-T1 - as reported by ATLAS, there was an FTS issue this weekend caused by last week's Oracle upgrade: the upgrade switched off the autoextend setting in Oracle, so tablespaces were not automatically extended (see the sketch after this list). Fixed this morning. Currently we have another problem, probably also caused by the Oracle upgrade: an issue with Streams is being investigated; it affects the LHCb LFC. Reminder: tomorrow all grid services at SARA are down.
  • RAL - ntr
  • IN2P3 - reminder that tomorrow we have a complete outage for batch and dcache.
  • KIT - ntr
  • OSG - seem to be having problems accessing MyWLCG again; just got a message from Pedro saying that it is running slowly. Still having problems with tickets with attachments.
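
A note on the NL-T1 item above: the failure mode is Oracle's datafile autoextend flag being off, so a tablespace fills up and inserts start failing, which is how the FTS broke. Below is a minimal sketch of how a DBA could list datafiles that will not grow automatically; it assumes Python with the cx_Oracle driver, and the credentials/DSN are placeholders, not SARA's actual setup.

    # Minimal sketch (placeholder credentials): list datafiles whose
    # AUTOEXTENSIBLE flag is off, i.e. tablespaces that cannot grow and
    # will eventually make inserts fail, as happened to the FTS here.
    import cx_Oracle

    conn = cx_Oracle.connect("fts_admin", "secret", "db.example.org/FTSDB")
    cur = conn.cursor()
    cur.execute(
        "SELECT tablespace_name, file_name, autoextensible "
        "FROM dba_data_files ORDER BY tablespace_name"
    )
    for tablespace, datafile, autoext in cur:
        if autoext == "NO":
            print("WARNING: %s (%s) will not auto-extend" % (datafile, tablespace))
    conn.close()

The corresponding fix would be to re-enable autoextend per datafile, e.g. ALTER DATABASE DATAFILE '<file>' AUTOEXTEND ON NEXT 100M MAXSIZE UNLIMITED (sizes illustrative).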

  • CERN storage: have had problems with CMS EOS this morning: it kept switching from the master to the slave configuration; a network issue is suspected. Each time it switched, the service was unavailable for about 5 minutes.

  • CERN DB: the CMS archive DB is being switched back to the 11g copy; the LHCb online DB was moved to the CERN CC, upgraded to 11g and given the latest security patches; the ATLAS offline DBs ADCR and ATLR will be patched tomorrow, with a 20-minute downtime at the end to change some DB parameters. An intervention is ongoing on the hardware hosting the LCGR, CMS offline and CMS online DBs - it should be transparent.

  • CERN Grid: this morning SAM test failures were observed for OPS and CMS. It is not clear where these problems came from: random sets of CEs could not be resolved by the WMS (see the sketch below). The problem went away by itself.
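
One quick way to chase the "CEs could not be resolved" symptom above is to try resolving each CE hostname in a loop and see which ones fail, and whether the set changes between runs. A purely illustrative sketch in Python; the hostnames are made up, not the CEs involved in this incident.

    # Illustrative only: check which CE hostnames resolve, to narrow down
    # whether the SAM/WMS failures are DNS-related. Hostnames are made up.
    import socket

    ce_hosts = ["ce101.example.cern.ch", "ce102.example.cern.ch"]  # hypothetical
    for host in ce_hosts:
        try:
            print("%s -> %s" % (host, socket.gethostbyname(host)))
        except socket.gaierror as exc:
            print("%s FAILED to resolve: %s" % (host, exc))

Running such a loop from the WMS host while the SAM errors are occurring would show whether name resolution is really the culprit.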

AOB:

  • Input for T1SCM regarding GGUS - please send to Maria Dimou.

N.B. no meeting / call tomorrow due to WLCG Technical Evolution Group reports

Tuesday

N.B. no meeting / call today due to WLCG Technical Evolution Group reports

Wednesday

Attendance: local(Xavier, David, Jarka, Fernando, Jamie, Uli, Luca);remote(Sunrise, Xavier, Alexander, Burt, Lorenzo, Raja, Tiju, Rob, Ulf, Gonzalo, Jhen-Wei, Rolf).

Experiments round table:

  • ATLAS reports -
  • Report to WLCGOperationsMeetings
    • T0/Central Services
    • T1s
      • SARA-MATRIX: Scheduled downtime Tue 7am – 7pm UTC. Replacement of the 6620 SAN storage hardware, affecting a number of grid services such as SRM, dCache and the UI. Firmware updates will be carried out as well. Not all services will be down all day, but any of them can be rebooted at any time. Since the LFC has already been migrated to CERN, the cloud is left online and only SARA has been set offline in Panda and DDM. The site was set online Wednesday morning.
      • IN2P3-CC: Scheduled downtime Tue 6am – 7pm UTC. Maintenance and upgrade of various services and servers. The vast majority of services for ATLAS will be affected: LFC, dCache, FTS, batch system, worker nodes, etc. The complete cloud has been set offline in Panda and DDM. The downtime for most services finished in time; only the downtime for the CE and SE had to be extended until Wednesday midday.
      • RAL: Scheduled downtime Wed 9am - 4pm UTC. Outage for intervention on core network within the RAL Tier1. Affects all services. UK cloud set offline.


  • ALICE reports -
    • The xrootd redirector voalice16.cern.ch hung and had to be reset again this morning. It is not yet clear what is causing these instabilities; usage patterns look normal.


  • LHCb reports -

Experiment activities:

MC11 Monte Carlo productions and user analysis

New GGUS (or RT) tickets

  • T0
    • Looking forward to new hardware for DIRAC services.
  • T1
    • GRIDKA: Problems with LFC replication at GridKa. GGUS:79014.
    • RAL: Zombie jobs on the CREAM CE are preventing direct job submission, which is now LHCb's preferred submission method. Submission via WMSes is working for now. GGUS:78873.
    • IN2P3: Problem with CVMFS when too many jobs start at the same time; jobs hang while setting up the environment (LbLogin / SetupProject) - this was before the last downtime.


Sites / Services round table:

  • KIT - ntr
  • NL-T1 - ntr
  • FNAL - ntr
  • CNAF - ntr
  • RAL - downtime progressing according to plan
  • NDGF - ALICE is finally able to read from the SRM storage node; this had been failing for the past 3 years or so!
  • PIC - ntr
  • ASGC - ntr
  • IN2P3 - had some problems with dCache so had to extend the downtime; this should now be OK. There is still another problem, not fully understood, that appears when using the scp command internally for some grid accounts. As we cannot limit access to the accounts that work, we have to put the CEs in downtime until we understand the reason. Essentially there is a sudden interruption of the scp connection between the CEs and WNs. We hope to resolve this asap.
  • OSG - 1) the problem with running a query against MyWLCG has been resolved; now OK. 2) The GGUS attachment issue is still ongoing; trying to work with the GGUS developers to get this working again. [ MariaDZ - we discussed this issue in our weekly meeting this morning. We decided to put this in the test suite of all releases so that future problems would be caught as early as possible. ]

  • CERN DB - the ALICE online DB upgrade is ongoing; yesterday we had a reboot of node #3 of the CMS offline DB, affecting Frontier et al. - a human mistake while trying to correct the network. A similar incident affected node #5 of LCGR.

  • CERN Grid - the LCG CEs will go into draining mode tomorrow: final retirement. During the weekend we had some problems with one pre-production EMI CE; today a pending firewall issue was found, which should now be solved. There were problems with the EMI WN release that we tried to roll out in pre-production - we decided this morning that we had to roll back, due to a clash of some executables with CASTOR and some problems with lcg_utils.

  • CERN storage - planned intervention for EOS cancelled as we found a small bug.

AOB: Very Last Chance: Input for T1SCM regarding GGUS - please send to Maria Dimou. No 'ongoing issues' found on twikis or received so far.

Thursday

Attendance: local(Jarek, Fernando, Uli, Zbyszek, MariaDZ, Andrea);remote(Raja, Andreas, Gareth, Ronald, Michael, Ulf, Marc, Lisa, Stefano, Jhen-Wei, Rob, Ian).

Experiments round table:


  • CMS reports -
  • Tier-0 Services: Nothing to Report - global runs will start mid-week next week
  • Tier-1s:
    • ASGC: failing CE SAM tests [ Jhen-Wei - one CMS s/w release is missing; asking support to reinstall it ]
    • Between large-scale activities, so a low activity level is expected
  • Tier-2s:
    • Continued Analysis Activity and Simulation ramping up
  • CRC on duty: Ian Fisk


  • LHCb reports -

Experiment activities: MC11 Monte Carlo productions and user analysis

New GGUS (or RT) tickets

  • T1
    • GRIDKA: Problems with LFC replication at GridKa - solved. Ticket closed.
    • RAL: Zombie jobs on the CREAM CE preventing direct job submission - these jobs have been killed. Question: how did they arise, and can an automatic procedure be used to kill them?
    • IN2P3: "srm authentication failed" when trying to access some files. GGUS:79074 was opened and solved quickly. Waiting to see if it solves the problems seen by user jobs there.


Sites / Services round table:

  • KIT - nta
  • RAL - planned an update to the MyProxy server today but had problems with the new machine and backed out. Failing OPS VO SAM tests due to IS problems; chasing it down. An outage has been declared for Tuesday, when ATLAS will move the LFC from RAL to CERN; a CASTOR update will be done at the same time.
  • NL-T1 - ntr
  • BNL -
  • NDGF - short network break to Finland at 16:00 UTC today; some ALICE data may be unavailable. One site in Norway has a total network outage tomorrow (11:00, for a few hours), which will affect ALICE and ATLAS.
  • IN2P3 - several problems following the shutdown. SRM export problem: solved after updating some parameters. Another issue with the CEs, which were in downtime yesterday: a problem was found with the scp command, related to the configuration of the /etc/shadow file; it was solved and the downtime finished yesterday. Update on the LHCb Oracle DB: an update was planned this morning to install a workaround, but it failed; the downtime is over but the intervention will have to be rescheduled. [ Raja - can we assume the LFC is in production now? Zbyszek - yes, in production; we traced back all the changes and it should be OK under 10g. ]
  • FNAL - ntr
  • CNAF - ntr
  • ASGC - nta
  • OSG - ntr
  • GridPP - ntr

  • CERN DB - after some problems yesterday with the ALICE online DB (disk corruption in the pit, a power cut), we finally managed around 17:00 to upgrade it to 11g; it was also patched with the latest critical patch set. Today the ATLAS online DB was also patched, around 08:00. All experiment DBs are now patched.

  • CERN Grid - all LCG CEs now draining. Currently experiencing some problems with glexec tests.

AOB:

  • When CERN had a power cut the other day, a service that people depend on remotely (e.g. at FNAL) was impacted. Is there a mailing list for such announcements?

Friday

Attendance: local(Jamie, Eva, Maria, Maarten, Xavier, Fernando, Alex);remote(John, Onno, Ulf, Rolf, Raja, Rob, Dimitri)

Experiments round table:

  • ALICE reports -
    • Submission to the LCG-CE nodes at CERN was stopped yesterday evening. The VOBOX involved will be reconfigured to submit to CREAM instead.


  • LHCb reports -

Experiment activities:

MC11 Monte Carlo productions and user analysis

New GGUS (or RT) tickets

  • T1
    • IN2P3: Jobs remain waiting for a long time after finishing - the Tier-1 contact is investigating what could be happening there.
  • Other information
    • Need to avoid having two Tier-1 sites simultaneously down if possible, at least for scheduled downtimes.
    • Need CVMFS at GridKa (asap ...)
    • LHCb sees problems with proxy delegation with the EMI WMS at PIC [ Maarten - a bug in the EMI-1 WMS; a ticket has been opened. DIRAC can be adjusted to work around this bug ]


Sites / Services round table:

  • RAL - ntr except to reiterate Gareth's report from yesterday about next week's downtime
  • NL-T1 - last night a dCache pool node crashed; fixed this morning. GGUS:79089 from ATLAS to us: jobs are failing because they try to read a file we don't have. [ Fernando - think it can be closed, will check ]
  • NDGF - we have an outage in Norway which should have been finished by now. Monday at 12:00 UTC there will be a small break in Umeå to switch nodes back to the OPN.
  • IN2P3 - could LHCb please put the ticket number into the minutes. Regarding downtime collisions between T1s, our understanding was that announcing a downtime 4 weeks in advance should allow the experiments to react. [ Raja - we only noticed the downtime when a lot of users complained. We have not yet opened a ticket against IN2P3 as we are not yet sure the problem is not on our side. ]
  • BNL - ntr
  • KIT - ntr
  • FNAL - ntr
  • OSG - ticket attachment exchange: still ongoing, no update for over a week. GGUS:78844 needs input before we can move forward. Tickets are being exchanged, but not their attachments.

  • CERN Storage - would like to decommission the default pool for ALICE next Tuesday. Agreed by Latchezar.

AOB:

-- JamieShiers - 18-Jan-2012
