Week of 120820

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(AndreaS, Stefan (LHCb), Torre (ATLAS), Alex (Dashboard), AlexL (CERN));remote(Philippe (IN2P3), Saverio (CNAF), Michael (BNL), Gonzalo (PIC), Alexander (NL-T1), Thomas (NDGF), Tiju (RAL), Rob (OSG), Ian (CMS)).

Experiments round table:

  • ATLAS reports -
    • Added to ongoing issues
      • Problem seen in the PanDA pilot with JSON unicode when using python 2.6 and lfc_addreplicas(). It cannot be fully solved by the pilot due to a bug in the LFC API. A GGUS ticket has been created: GGUS:84716. Until this issue is resolved, we cannot use python 2.6 in combination with LFC registrations. It should be solved before the next update of the pilot (a minimal illustration of the issue follows at the end of this report).
    • T0
      • Nothing to Report
    • T1
      • Nothing to Report
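
    A minimal sketch of the python 2.6 / JSON unicode issue mentioned above (illustrative only: the payload and the to_bytes helper are made up, this is not the actual pilot or LFC code, and it assumes the underlying problem is that the LFC bindings do not accept unicode strings):

        # json.loads() on python 2.6 returns unicode objects for all strings,
        # while an API such as lfc_addreplicas() may only accept plain byte
        # strings ("str"), so values must be re-encoded before the call.
        import json

        payload = '{"lfn": "/grid/atlas/file.root", "guid": "0a1b2c3d"}'
        record = json.loads(payload)

        print type(record["lfn"])      # <type 'unicode'>, not <type 'str'>

        # Hypothetical workaround: encode every string back to UTF-8 bytes.
        def to_bytes(value):
            if isinstance(value, unicode):
                return value.encode("utf-8")
            return value

        record = dict((to_bytes(k), to_bytes(v)) for k, v in record.items())
        print type(record["lfn"])      # <type 'str'>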

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • HammerCloud failures at ASGC, caused by a particular CE
      • Requests for new tape families at several Tier-1s

  • ALICE reports -
    • CERN: EOS-ALICE was further upgraded to provide support for hosting the ALICE calibration data under a separate namespace: thanks!

  • LHCb reports -
    • T0:
    • T1 :
      • GridKa: GGUS:85270, problem with disk servers; all failed servers had been recovered by this morning. Jobs submitted earlier had a very high failure rate.
Sites / Services round table:
  • BNL: ntr
  • CNAF: ntr
  • IN2P3: concerning the transfer problems due to timeouts on long transfers that were reported last week, the timeouts have been increased. Will check if there is an improvement.
  • NDGF: will have a downtime tonight due to an intervention on the network link to a subsite. A backup should kick in, so the event should be transparent.
  • NL-T1: ntr
  • PIC: ntr
  • RAL: tomorrow morning there will be maintenance on the site firewall; services will be unavailable
  • OSG: ntr
  • Dashboards: the SAM update has been completed successfully. The user interfaces still point to the preproduction service and will be switched to production after a final validation
AOB:

Tuesday

Attendance: local(Andrea, Alexander (Dashboards), Torre (ATLAS), Jan (CERN), MariaD (CERN), Eva (CERN));remote(Philippe (IN2P3), Michael (BNL), Gonzalo (PIC), Salvatore (CNAF), Elisa (LHCb), Tiju (RAL), Thomas (NDGF), Ron (NL-T1), Ian (CMS), Lisa (FNAL), Rob (OSG), Xavier (KIT)).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • T0 export to EOS failures this morning from ~2:08 UTC: "eosatlas was automatically restarted and since the restart time was taking long (similar issue last week) I decide to put the instance in read only and apply a fix for the situation. eosatlas was available again around 5.30" GGUS:85370. Jan explained that this update was initially scheduled for next week, but they decided to apply it now and also took the opportunity to migrate to a machine with more memory; a further change is foreseen for next week.
      • Eager for resolution of the svn problems! http://itssb.web.cern.ch/service-incident/major-incident-svn-repositories/21-08-2012
      • T0 LSF: Jobs end in LSF but they are still marked as RUN by bjobs. This happens for fewer than 1 in 1000 jobs; it started around 19 August and is still ongoing (INC:155320)
      • Our SLS monitors were not updating for a short time around 12:00-12:30 CEST; the main sls.cern.ch page showed the same, and the logs reported "database unavailable". Is it known what happened? Eva explained that the SLS database is being overloaded by processes; the cause is being investigated.
    • T1
      • Nothing to Report

  • CMS reports -
    • LHC / CMS
      • Normal running
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • NTR

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • MC simulation and user analysis at T2s
    • T0:
      • constant rate of failed pilots observed during the last week, GGUS:85385. The failures are due to timeouts.
    • T1 :
      • GridKa: a bunch of failed FTS transfers to GridKa around 1 AM UTC, GGUS:85270. No failures in the last 2 hours; the ticket will be kept open until it is confirmed that the problem is fixed. Xavier said that this problem is actually not related to the one in the ticket.
Sites / Services round table:
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: the GGUS ticket about the transfer problems has been closed; increasing the timeouts solved it.
  • KIT: ntr
  • NDGF: will update dCache head nodes tomorrow afternoon, 1-2 minutes of outage, declared in GOCDB
  • NL-T1: ntr
  • PIC: ntr
  • RAL: the firewall intervention was successfully completed this morning
  • OSG: ntr
  • CERN storage: will apply an emergency update to EOS ALICE
  • Dashboard: ntr
  • Databases: ntr
AOB:

Wednesday

Attendance: local();remote().

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • Nothing to Report
    • T1
      • Nothing to Report once again; excellent stability! (knock knock)

  • CMS reports -
    • LHC / CMS
      • CMS had a magnet ramp-down yesterday afternoon. Some calibration runs were taken, and Tier-0 transfers were disabled for some time
    • CERN / central services and T0
      • We had a CASTOR failure last night. The alarm ticket was submitted and responded to immediately, thanks!
    • Tier-1/2:
      • We seem to have a problem with the pilot submission to PIC. Waiting on FNAL to wake up and check the factory.

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • MC simulation and user analysis at T2s
    • T0:
      • constant rate of failed pilots observed during the last week, slightly improved this morning (though no update in the ticket), GGUS:85385
    • T1 :
      • IN2P3: all transfers fail in the channel IN2P3-PIC and some transfers fail in channel IN2P3-CNAF, GGUS:85305
Sites / Services round table:
  • RAL: ntr
AOB:

Thursday

Attendance: local(Andrea (SCOD), Torre (ATLAS), Alex (Dashboard), Jan (CERN), AlexL (CERN), MariaD (CERN));remote(Philippe (IN2P3), Jeff (NL-T1), Michael (BNL), Ian (CMS), Lisa (FNAL), Salvatore (CNAF), Kyle (OSG), WooJin (KIT), Gareth (RAL), Jeremy (GridPP), Roger (NDGF), Jhen-Wei (ASGC), Gonzalo (PIC)).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • T0: Chronic LSF bsub slowness persists; most recently, high (~10 s) submit times for an hour this morning. ADC is organizing a meeting of experts to address it. A script causing the slowness was found and removed, so the problem should be solved.
      • File missing on EOS, thanks for the recovery! It is being investigated what went wrong in (not) detecting the transfer failure. GGUS:85421. Jan: this also happened last week; all but 65 files were recovered. We recommend making a second xrootd call to check that the file was written, as sometimes exit codes do not work correctly (see the verification sketch after this report).
    • T1
      • PIC: Missing files at PIC_DATADISK, site is investigating. GGUS:85426
      • RAL-LCG2: There has recently been a number of functional test errors from srm-atlas.gridpp.rl.ac.uk to a number of sites in the FR and IT clouds. The site is having a look. GGUS:85438. Gareth: this was because the test files were being moved to different disk servers, so the cause is understood.
      • NIKHEF-ELPROD: Transfers failing with 'file exists' errors. We are using the overwrite option in FTS 2.2.8; removal of existing files works elsewhere but not here. Being looked at. GGUS:85439. Jeff: FTS does not really overwrite; it first deletes the old version, and this was what failed. The cause was a faulty disk server and the problem is now solved. The FTS error message should be clearer (a schematic sketch of this delete-then-copy behaviour follows after this report).
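
    A minimal sketch of the post-transfer verification Jan recommends above (illustrative only: the host and paths are made up, it assumes the xrdcp and xrdfs command-line clients are available, and it is not the actual pilot code):

        # Do not trust the copy exit code alone: confirm the file on the
        # server with an independent second call (here a stat).
        import subprocess

        def transfer_and_verify(local_path, dest_url):
            # First call: the copy itself.
            if subprocess.call(["xrdcp", "-f", local_path, dest_url]) != 0:
                return False
            # Second call: independently confirm that the file really exists
            # on the server, since the copy exit code is not always reliable.
            host, remote_path = dest_url.replace("root://", "", 1).split("/", 1)
            return subprocess.call(["xrdfs", host, "stat", remote_path]) == 0

        if __name__ == "__main__":
            ok = transfer_and_verify("/tmp/file.root",
                                     "root://eosatlas.cern.ch//eos/atlas/test/file.root")
            print "verified" if ok else "transfer or verification failed"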

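    A schematic sketch of the FTS overwrite behaviour Jeff describes above (made-up helper names, not the actual FTS 2.2.8 code), showing why a failed deletion on the destination can surface as a 'file exists' error:

        # FTS does not overwrite in place: it deletes the existing destination
        # file first and then runs a normal copy. The callables passed in
        # stand in for the real storage operations and are purely illustrative.
        class TransferError(Exception):
            pass

        def transfer_with_overwrite(src_url, dst_url, exists, delete, copy):
            if exists(dst_url):
                if not delete(dst_url):
                    # This is the step that failed at NIKHEF-ELPROD (faulty
                    # disk server): the old file stays in place, so the error
                    # that reaches the user is a confusing "file exists"
                    # rather than "deletion failed".
                    raise TransferError("file exists: " + dst_url)
            copy(src_url, dst_url)
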
  • CMS reports -
    • LHC / CMS
      • New release for 0T running. It should increase the speed of the reconstruction, so we are holding until the release is validated
    • CERN / central services and T0
      • CERN T0 was failing the SAM test, but it looks like a local CMS configuration issue. Andrea: the SAM test will be updated today or tomorrow to a completely new version, which generates no errors at CERN.
    • Tier-1/2:
      • We seem to have a problem with the pilot submission to PIC. Ticket still open. We don't yet know if a reboot of a glidein-related service solved the problem.

Sites / Services round table:
  • ASGC: ntr
  • BNL: ntr
  • CNAF: 1) started working on an old ticket (GGUS:83304) about transfer problems from CNAF to UK Tier-2s. 2) There is a Savannah ticket about transfer problems with Caltech; waiting for them to run some network checks to understand where the problem is. 3) Problem with a WMS node (GGUS:85415) failing to accept new jobs; an RPM upgrade did not solve it.
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: an NFS problem on a worker node caused some ATLAS jobs to fail; it should soon be fixed.
  • NDGF: yesterday's dCache upgrade took longer than anticipated due to database problems. Another intervention was scheduled for today.
  • NL-T1: ntr
  • PIC: working on the problem reported by ATLAS, not yet understood, contacted the dCache support; any news will be posted to the GGUS ticket.
  • RAL: ntr
  • OSG: ntr
  • CERN Central Services: SIR on the SVN issues earlier this week
  • CERN Grid Services: some CMS Nagios tests show the CERN site in red (cf. GGUS:85431 & GGUS:85432); however, looking at the failed test's output (e.g. https://sam-cms-prod/nagios/cgi-bin/extinfo.cgi?type=2&host=ce204.cern.ch&service=org.cms.WN-mc-%2Fcms%2FRole%3Dproduction ) this seems to be a fault in the test script, or at least the test output does not give a clear indication of what component at CERN may be failing. Comments from CMS are welcome; hoping we can get the site green again soon.
    BTW, a link to the test output in future similar tickets (rather than a link to a dashboard) would save a lot of time. Thanks!
  • CERN storage: ntr
  • Dashboards: 1) the SAM Update 17 is finished, including the reconstruction of historical data, and we are waiting for the experiments to validate the data. 2) The TRIUMF FTS stopped sending messages for the WLCG transfer monitoring; we understand that this is because they are upgrading it, and it should soon be back to normal.

AOB: (MariaDZ) Concerning the Hadoop course in preparation at CERN, its content is taking shape and session dates may be in October and November. To better tailor the course to our needs, more information is now needed from you and/or your team members in the doodle http://doodle.com/xqb5bzpchcb52knu Please do read the course description in the doodle comments and select carefully:

  • The content you actually need (clearly distinguish between Hive/Pig, HBase or the Developer's course).
  • Whether your subscription will be approved by your management.
  • Add a comment, if necessary.
Community members not covered by CERN departmental training budgets will have to cover their course costs and accommodation. Instructions for paying for the course can be obtained from technical.training@cern.ch (please put maria.dimou@cern.ch in Cc).

Friday

Attendance: local(AndreaS (SCOD), Jamie, Torre (ATLAS), Alex (Dashboard), Jan (CERN), Marcin (Databases));remote(Philippe (IN2P3), Gonzalo (PIC), Lisa (FNAL), Alexander (NL-T1), Saverio (CNAF), Jhen-Wei (ASGC), Xavier (KIT), Gareth (RAL), Michael (BNL), Christian (NDGF), Ian (CMS), Woo Jin, Elizabeth (OSG), Scott).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • T0 data export to CERN-PROD degraded yesterday evening by "Mapped user 'atlas003' is invalid" errors. Promptly addressed, but is the origin of the problem understood yet? GGUS:85455. Jan: by the time we had identified the affected machine, the problem had already disappeared and the logs do not contain useful information. An RFE was filed to be able to spot it in time on a future occurrence.
      • T0: Morning spikes in LSF bsub time are reduced (5 s rather than 10 s) after the removal of script(s) by the lxbatch team, but still present
    • T1
      • INFN-T1_DATADISK full, free space down to 1 TB. Cleanup has yielded 10 TB so far. Removed from T0 export pending adequate space availability.
      • Taiwan-LCG2: CASTOR problems; excluded from T0 export this morning pending resolution (~8% T0 export success rate at the time of exclusion). Resolution was reported later this morning and transfers have shown no problems since; if this persists through the afternoon, T0 export will be restored. GGUS:85461

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Some persistent job failures on Tier-0. Under investigation
    • Tier-1/2:
      • HammerCloud problems at ASGC

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • user analysis at T2s
    • T0:
      • constant rate of failed pilots observed during the last week, GGUS:85385
    • T1 :
      • IN2P3: the problem with failed FTS transfers in the channels IN2P3-PIC and IN2P3-CNAF is solved; GGUS:85305 can be closed.

Sites / Services round table:

  • ASGC: this morning we had a RAC problem on the CASTOR database and had to reboot it; now it should be OK. Next Tuesday there will be a DPM storage hardware intervention; the downtime is in GOCDB.
  • BNL: ntr
  • CNAF: yesterday at 11 pm the PhEDEx debug instance stopped due to a shutdown of the virtual machine. It has now recovered.
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: next Monday we will have an intervention on the tape library and a dCache upgrade.
  • PIC: ntr
  • RAL: next Monday is a holiday in the UK, so we will not connect.
  • OSG: there is a problem with GGUS tickets: only the first update gets propagated to our ticketing system. Waiting for help from GGUS support.
  • CERN storage: ntr
  • Dashboards: ntr
  • Databases: ntr

  • GGUS: File ggus-tickets.xls is up to date and attached to page WLCGOperationsMeetings. There are 6 real ALARMs to drill into, so far, for next week's MB, which will cover 5 weeks of activity.
AOB:

-- JamieShiers - 02-Jul-2012
