Week of 120827

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Alexandre, Cedric, Luca M, Maarten, Maria D, Przemek, Stefan, Torre);remote(Alexander, Christian, Gonzalo, Ian, Jhen-Wei, Lisa, Michael, Rob, Rolf, Saverio).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • CERN-PROD transfer failures due to "No such file or directory", ticketed Fri evening; prompt response, dialogue ongoing to sort out the issue. GGUS:85488
        • Luca: waiting for the date and time of the failed transfers
      • T0: Morning spikes in LSF bsub time are still an issue; for about 1 hour this morning bsub times were over 10 s (15 s peak).
      • DDM central catalog load balancing was not functioning correctly over the weekend: the load balancer reported the correct machine to be used, but that was not the machine ultimately used, so some machines saw no activity for long periods, triggering SLS alarms. SNOW ticket INC:156560
    • T1/Calibration centers
      • Staging failures at NDGF-T1 due to a checksum mismatch, several hundred job failures. Ticketed Fri afternoon, awaiting site response. GGUS:85483 (a checksum-verification sketch follows this report)
      • Transfer failures to AGLT2 muon calibration center, ticketed Fri afternoon, promptly resolved. GGUS:85480
      • Brief PIC SRM service interruption Sat morning, overload on the PgSQL server, promptly resolved. GGUS:85491
      • TRIUMF job failures due to LFC registration failures, ticketed Sat morning; a known issue of heavy LFC load, and the load from a local LFC cleaning campaign underway was reduced. Still some errors on Sunday. GGUS:85494
      • IN2P3-CC jobs failing with "Staging input file failed", posted Sat afternoon. Heard from site this morning, hardware problem is being addressed. GGUS:85498
      • SRM failures at TAIWAN-LCG2_DATADISK reported Sun morning, T0 export degraded to ~40% success rate, promptly addressed, storage overheating, keeping an eye on it. GGUS:85499
      • TRIUMF FTS failures on inbound CA transfers, "Not authorised to query request", due to proxy expiration. T0 export to CA not seriously impacted. Ticketed Sat afternoon, resolved Sat night. GGUS:85502
      • T0 export failures to IFIC calibration center, 100% failures when reported Sun evening, SRM problems promptly resolved (stuck disk server), closed this morning. GGUS:85504
      • Transfer failures to TAIWAN-LCG2_DATATAPE and MCTAPE, including T0 export. Reported this morning. Site reports the tape service degraded; units were shut off to cool the computer room pending completion of the A/C repair. GGUS:85506
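    • Note on the NDGF checksum-mismatch item above: a minimal Python sketch of how such a mismatch can be confirmed by hand, assuming the catalogue stores the zero-padded 8-digit adler32 hex string commonly used on the grid; the file path and expected value below are hypothetical.

        import zlib

        def adler32_of(path, chunk_size=1024 * 1024):
            """Compute the adler32 checksum of a file, returned as the
            zero-padded 8-character hex string used in grid catalogues."""
            value = 1  # adler32 seed value
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    value = zlib.adler32(chunk, value)
            return "%08x" % (value & 0xFFFFFFFF)

        # Hypothetical example: compare the staged copy with the catalogue value.
        expected = "0f4d2a1c"                        # value recorded in the catalogue (made up)
        actual = adler32_of("/path/to/staged/file")  # hypothetical path
        if actual != expected:
            print("checksum mismatch: expected %s, got %s" % (expected, actual))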

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Some persistent job failures on the Tier-0 and in closing out luminosity sections. They did not appear to be caused by any problems with central computing services.
    • Tier-1/2:
      • HammerCloud problems at ASGC reported over the week

  • LHCb reports -
    • T0:
      • constant rate of failed pilots observed during the last week GGUS:85385
    • T1:

Sites / Services round table:

  • ASGC
    • one A/C system broke down over the weekend; some systems were switched off to avoid overheating; under control now
    • tomorrow: DPM storage firmware upgrade in parallel with a network intervention
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • NDGF
    • will ping colleagues about ATLAS ticket
  • NLT1
    • We are currently in downtime because of maintenance on our tape backend. We have already used this downtime to upgrade dCache to 2.2.4 and it seems to work fine.
  • OSG
    • follow-up on the issue reported on Fri: for tickets that start in Savannah and are bridged to GGUS, OSG only receives the initial message and the solution, while intermediate updates are not seen; furthermore, OSG updates do not make it to the Savannah ticket either; the GGUS developers said that locking does not work for Savannah tickets
      • Maria: at CHEP or the WLCG workshop in NY, CMS said they were considering abandoning the use of Savannah in favor of GGUS
      • Ian: we will look into accelerating that transition, but there are some technical issues to be solved, e.g. how to ensure the right squad is notified of a particular ticket; the time line would be this autumn
  • PIC
    • SRM incident early Sat morning due to overloaded DB host, solved quickly

  • dashboards - ntr
  • databases
    • yesterday around noon ATLARC became partially unavailable for a few hours; one instance had a shared memory problem, cured by a reboot; cause unknown
  • GGUS
    • File ggus-tickets.xls is up-to-date and attached to WLCGOperationsMeetings page. The 6 real ALARM drills for tomorrow's MB are attached to this page.
  • storage
    • ALICE to respond to GGUS:85497 (11 ALICEDISK CASTOR files cannot be accessed)
    • EOS-ATLAS upgrade Wed

AOB:

Tuesday

Attendance: local(Alexandre, Cedric, Ian, Luca M, Maarten, Marcin, Maria D, Stefan, Steve);remote(Christian, Gonzalo, Jeff, Jhen-Wei, Lisa, Michael, Rob, Rolf, Salvatore, Tiju, Xavier).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
    • T1
      • Some transfer errors from T0 to BNL: "[FTS] FTS State [Failed] FTS Retries [1] Reason [TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] globus_gass_copy_register_url_to_url transfer timed out] Duration [0]" GGUS:85541 (a log-scan sketch follows this report).
        • Michael: a few percent of the transfers time out, only between CERN and BNL; unclear where the problem is exactly; the CERN FTS logs may provide clues
        • Steve: will follow up
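    • Note on the CERN-BNL timeouts above: a minimal Python sketch of how the timed-out fraction per channel could be pulled out of an FTS log dump, to see whether the few-percent figure is specific to one channel. The one-transfer-per-line log format and the channel-naming pattern assumed here are hypothetical.

        from collections import Counter
        import re
        import sys

        totals, timeouts = Counter(), Counter()
        # Hypothetical channel naming, e.g. CERN-BNL or BNL-CERN
        channel_re = re.compile(r"\b(CERN-[A-Z0-9]+|[A-Z0-9]+-CERN)\b")

        with open(sys.argv[1]) as log:               # path to the log dump
            for line in log:
                m = channel_re.search(line)
                if not m:
                    continue
                channel = m.group(1)
                totals[channel] += 1
                if "transfer timed out" in line:     # error string from the GGUS ticket
                    timeouts[channel] += 1

        for channel, n in sorted(totals.items()):
            print("%-15s %6d transfers, %5.1f%% timed out"
                  % (channel, n, 100.0 * timeouts[channel] / n))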

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • File reported lost in EOS. Re-created
      • Ticket created for SRM yesterday afternoon
        • Luca: can you check the activity that is not coming from PhEDEx?
        • Ian: production jobs?
        • Luca: they should not be writing to t1transfer; some files have the string "CRAB" (or "crab") in their name (a simple listing scan is sketched after this report)
        • Ian: will check
    • Tier-1/2:
      • Some reports of poor transfer quality between Tier-1s, but the situation seems to have recovered.
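    • Note on the t1transfer discussion above: a minimal Python sketch for separating CRAB-like entries from the rest in a plain listing of file paths (one path per line). The listing file name is hypothetical; the only assumption taken from the discussion is that the analysis files carry "CRAB" or "crab" in their name.

        crab, other = [], []
        with open("t1transfer_listing.txt") as listing:        # hypothetical input file
            for line in listing:
                path = line.strip()
                if not path:
                    continue
                (crab if "crab" in path.lower() else other).append(path)

        print("entries with 'crab' in the name: %d" % len(crab))
        print("other entries:                   %d" % len(other))
        for path in crab[:10]:                                  # show a few examples
            print("  " + path)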

  • LHCb reports -
    • T0:
      • constant rate of failed pilots observed during the last week GGUS:85385
    • T1 :
      • GRIDKA: LHCb VOBOX not accessible (GGUS:85544), swiftly fixed this morning by rebooting the node.

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • see ATLAS report
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • downtime Sep 18: complete outage for batch all day; 1h for DB (which may affect dCache); the rest would not be affected
  • KIT - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • databases - ntr
  • GGUS
    • To make sure we don't forget: Savannah:131565 was opened following OSG's report yesterday on the Savannah-GGUS-OSG ticket interfaces.
  • grid services - ntr
  • storage - nta

AOB:

Wednesday

Attendance: local(Alexandre, Cedric, Luca M, Maarten, Marcin, Maria D, Stefan, Steve);remote(Christian, Gonzalo, Jhen-Wei, Lisa, Michael, Rob, Rolf, Ron, Saverio, Tiju).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • ntr
    • T1
      • INFN-T1 : "Problem on the StoRM BE. We restarted the service and it seems to be working now". GGUS:85580
      • SARA-MATRIX : [SE][StatusOfPutRequest][SRM_NO_FREE_SPACE]. Inconsistency between dCache and BDII. GGUS:85581
        • Ron: symptoms currently gone, but the problem is still being looked into; may be due to dark data (if any), which will then be removed

  • ALICE reports -
    • CERN: 157 ALICEDISK or EOS files cannot be accessed (GGUS:85570)
      • Luca: the CASTOR data is on a machine in standby and would need to be copied to another disk server, but currently there is no space remaining
    • CERN: IGCA (India) CRL out of date on multiple services including MyProxy and VOMS (GGUS:85537)

  • LHCb reports -
    • T0:
      • constant rate of failed pilots observed during the last week GGUS:85385
    • T1:

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • NDGF - ntr
  • NLT1 - nta
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW - ntr
  • grid services - ntr
  • storage
    • EOS-ATLAS updated between 14:00 and 14:40 CEST; currently still read-only while it is being checked out; should become read-write again at 17:00

AOB:

Thursday

Attendance: local(Alexandre, Cedric, Luca M, Maarten, Marcin, Stefan, Steve);remote(Burt, Christian, Gonzalo, Ian, Jeff, Jhen-Wei, John, Michael, Rolf, Saverio, WooJin).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • EOS problem after the upgrade. Some files cannot be accessed: GGUS:85616
        • Luca: there were also some put problems experienced by PanDA, which was a different issue; some files may have been replaced or deleted while the problem was present. The cause was an incident a few months ago which allowed files to be put without a name space entry; this has been rectified, but there was some fall-out from the subsequent cleanup attempts. Today another update was applied; the system should be OK now, except that some temporary quota issues may be observed, which can easily be cured directly by ATLAS.
    • T1
      • SARA-MATRIX : Missing files: GGUS:85581. Maybe due to a race condition between PanDA and the deletion service. Under investigation.

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • SUM Test failure reported at ASGC

  • LHCb reports -
    • T0:
      • constant rate of failed pilots observed during the last week GGUS:85385
        • Steve: will have this followed up
    • T1:

Sites / Services round table:

  • ASGC
    • recently high loads on the CASTOR DB have been observed during routine backups, which may lead to occasional SRM errors (cf. the CMS report); the cause is unclear; we are in contact with CASTOR support at CERN and RAL
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • databases - ntr
  • grid services - ntr
  • storage - nta

AOB:

Friday

Attendance: local(Alexandre, Cedric, Luca M, Maarten, Stefan, Steve);remote(Alexander, Burt, Elizabeth, Ian, Jeremy, Jhen-Wei, John, Michael, Saverio, Stephane, Thierry, Xavier).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • EOS error: GGUS:85667. The service stopped around 6 due to a SEGV in the XRootD GSI authentication module; the automatic restart wasn't enabled after the update yesterday. The service is back now (a generic auto-restart watchdog is sketched after this report).
        • Luca: another ticket was opened (GGUS:85672) because of EOS source errors rather than destination errors; an XRootD bug report has been opened; meanwhile we have the auto-restart as a workaround
    • T1
      • ntr
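    • Note on the EOS auto-restart mentioned above: a minimal Python watchdog sketch illustrating the general idea of restarting a daemon that has died. This is not the actual EOS/XRootD restart mechanism; the process name, restart command and check interval below are hypothetical placeholders.

        import subprocess
        import time

        PROCESS_NAME = "xrootd"                               # hypothetical process to watch
        RESTART_CMD = ["/sbin/service", "xrootd", "restart"]  # hypothetical restart command
        CHECK_INTERVAL = 60                                   # seconds between checks

        def is_running(name):
            """Return True if pgrep finds a process with exactly this name."""
            return subprocess.call(["pgrep", "-x", name],
                                   stdout=subprocess.DEVNULL) == 0

        while True:
            if not is_running(PROCESS_NAME):
                print("process %s not running, restarting" % PROCESS_NAME)
                subprocess.call(RESTART_CMD)
            time.sleep(CHECK_INTERVAL)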

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • One of our service components grew to 16 GB of RSS. Reset, but under investigation.
    • Tier-1/2:
      • NTR

  • LHCb reports -
    • T0:
      • constant rate of aborted pilots observed during the last week GGUS:85385
    • T1 :
      • IN2P3: Job failures ("cannot load shared library"); the site was banned for production (GGUS:85644)

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • yesterday there was a fiber cut in the link from Washington to New York, causing the OPN bandwidth to BNL to drop by 50% to 9 Gbps; fixed since 07:30 UTC today
  • CNAF - ntr
  • FNAL - ntr
  • GridPP - ntr
  • KIT - ntr
  • NDGF
    • this morning there was a power distribution issue affecting the Danish site, e.g. pools were down; fixed
  • NLT1 - ntr
  • OSG - ntr
  • RAL - ntr

  • dashboards
    • Tue morning the WLCG integration DB will have security patches applied, which may cause occasional glitches in the dashboards for ATLAS and CMS
  • grid services
    • there was a VOMS-Admin hiccup yesterday, noticed by 1 person; fixed
  • storage
    • transparent update of EOS-ALICE going on to prevent issues due to known bugs

AOB:

  • lately there have been instabilities with the Audioconf system (Alcatel MyTeamwork), e.g. preventing CNAF and OSG from connecting today; a ticket has been opened (INC:158097)

-- JamieShiers - 02-Jul-2012

Topic attachments

  • ggus-data.ppt (PowerPoint, 2491.5 K, 2012-08-27 11:44, MariaDimou): GGUS ALARM drills for the 2012/08/28 WLCG MB