Week of 100125

LHC Operations

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Jaroslava, Jamie, Harry, Ignacio, Julia, Timur, Nicolo, Eva, Patricia, Roberto, Miguel, Steve, MariaDZ, Gavin); remote(Jon Bakken (FNAL), Kyle (OSG GOC), Rolf (IN2P3), Michael (BNL), Gonzalo Merino (PIC Tier1), Gang Qin, Angela Poschlad (KIT), Rob Quick, Gareth (RAL)).

Experiments round table:

  • ATLAS reports - Weekend issues: srm-atlas.cern.ch SRM v2.2 endpoint timeouts, was red on the SLS monitor, high request load on _SCRATCHDISK, alarm GGUS:54949; no transfers between CERN and NIKHEF: CERN FTS restarted on 23 Jan, further details promised during Monday, GGUS:54935; slow transfers between TRIUMF and ASGC, improved during 23 Jan; observed _MCTAPEs short of space (BNL_OSG2, TRIUMF_LCG2, PIC), not sure whether the garbage collector will be fast enough under load or whether larger _MCTAPE buffers are needed. Steve - ATLAS NIKHEF downtime - see CERN report below.
  • CMS reports - Highlights:
    • T0
      1. T0 operators lost interactive access on cmst0 worker nodes (needed for debugging troublesome workflows). T0 ops reports successful login, requesting additional access from vocms68 and vocms69. GGUS #54848
      2. Saturated T0 with 2700 jobs running on Friday, some RFIO open errors: 'Job timed out while waiting to be scheduled' - maybe correlated to high number of queued transfers on T0EXPORT.
      3. New SLC5 VOBOXes vocms02,vocms03 available for PhEDEx, requested registration in myproxy Remedy #CT656082. PhEDEx Debug instance migrated to SLC5, checking for issues before migrating Prod instance.
    • FNAL
      1. New ReReco pass ran at FNAL.
      2. One very large (120 GB) file from a test replay timing out in transfers T0-->FNAL.
    • PIC
      1. JobRobot errors during weekend: MARADONA error.
      2. File waiting for a long time to be exported to T2s.
    • ASGC
      1. Following up on deployment of SLC5 software releases
    • CMS T2 sites issues: Highlights (selected ones):
      • Various MC production workflows running in KIT, IN2P3, RAL, CNAF, PIC, FNAL T2 regions - T2_FR_GRIF_LLR excluded for scheduled downtime.
      • Open tickets for T2_EE_Estonia, T2_BR_SPRACE
      • T2_UK_SGrid_Bristol migrated production storage element to StoRM.
      • T2_BE_UCL - SLC5 nodes configured. T2_UK_London_IC - requesting installation of dependency meta-RPM on SLC5.

  • ALICE reports -
    • T0 site
      • At Friday's ops meeting a kernel upgrade was announced (performed on Thursday; nodes were drained and rebooted on Friday). The operation was announced as almost done at the T0, however during the whole weekend and also today the SL5 batch system has been announcing 99 total CPUs. To be raised with the experts at today's ops meeting.
      • Access to the ALICE VOBOXes, reported last week: still having problems with voalice14. Access to this machine is immediately closed, even though it uses the same CDB templates as voalice11, 12 and 13. Remedy ticket opened yesterday: CT656294. This concerns the VOBOX submitting to the CREAM-CE, so the issue should be considered high priority.
    • T1 sites
      • CNAF T1: Regarding the CREAM-CE, the corresponding queue was reporting zero CPUs at the site yesterday afternoon. Reported to the site admin and corrected this morning. Regarding submission to the LCG-CE via gLite-WMS, a misconfiguration in LDAP was preventing the proper behaviour of the Packman service; solved this morning. Finally, the site is finishing the ALICE configuration required to test glexec at the site.
    • T2 sites
      • CREAM-CE systems announced at two French sites, entering production shortly: GRIF IPNO and IRFU.
      • Annotation in Savannah stating that CREAM for SGE is being certified: https://savannah.cern.ch/task/?9126#comment13. This will allow the Trujillo T2 to provide this service for ALICE as soon as the system is ready in production.
      • LPSC (Grenoble) VOBOX not reachable yesterday. Issue solved this morning; services restarted and back in production.
      • Poznan VOBOX suffering from hardware problems, announced this morning. Waiting for confirmation from the site admin before putting the system back in production.
      • The Madrid WMS was performing badly yesterday afternoon (jobs were not submitted to the queue although the site was empty). Reported to the site admin; WMS services restarted.

MariaDZ - why not contact the site directly via GGUS? A: the site initiated the contact, but we can do this via GGUS.

  • LHCb reports - Only bb and cc inclusive MC09 stripping in the system now (very few jobs in total at T1's). Launched an MC simulation but an application problem was found. T1 site issues: RAL: downtime; IN2P3: downtime; PIC: an issue with one file for some user analysis jobs, under investigation by our local contact person; CNAF: migration from CASTOR to TSM now completed and registered in the catalog.

Sites / Services round table:

  • FNAL - 1) One of the BDII servers is out of date and not providing correct info. Reported last week - ongoing problem. The ticket was closed as fixed, but the problem reappeared and the ticket was reopened: GGUS:54803. It would be good if all those BDIIs reported consistent info - if a site hits the wrong one it gets wrong info. 2) OSG reported an accounting correction - it was refused. OSG GOC 7986.
  • IN2P3 - Confirm still in downtime; it will continue until tomorrow. At the beginning of the downtime a lot of ALICE jobs were waiting. The operator in charge made a mistake and about 1/3 ended prematurely. Sorry. Hope the others will go through OK.
  • OSG - Reviewed the escalation report: 7 GGUS tickets, 4 closed, some open for a week or so. When is this report run? (Monday). Maria - on the index page there are many different reports. Tickets assigned to OSG - see the documents behind the ROC link; OSG appears as a ROC. No open issues this week!
  • BNL - Minor issue: failure of a storage server yesterday. For 3 hours jobs trying to access input files on that server were unable to get them.
  • PIC - Had a hiccup with the SRM server this morning. Causes still not clear - it was suffering from overload and some transfers timed out between 09:00 and 11:30 this morning.
  • KIT - ntr
  • ASGC - Minor update - performance degradation - continuing to install 800TB (of disk?).
  • anon -
  • RAL - 2 things: 1) currently "at risk" for UPS work, hoping to fix noise on the current (today, ongoing); 2) also draining batch for the Wed/Thu intervention migrating Oracle DBs back to the original disk arrays.

  • CERN FTS: The CERN-NIKHEF T0 FTS channel was down on Saturday 23rd from 04:09 till 23:53. The CERN operators did intervene earlier but their restart attempts failed. The service manager restarted the channel agent at 23:53, which corrected the problem. Log analysis gives no clues. Also reported by ATLAS. GGUS:54935.
  • CERN - ATLAS SRM incident yesterday, GGUS:54949. A post-mortem is being prepared. A user was accessing ATLAS scratch at a high rate and SRM request queues built up, so access via SRM timed out; a queued request can take much more than the 3' FTS timeout. If SRM is used to access a pool, user access should be limited to avoid queuing. Unscheduled xrootd access would not have shown this behaviour and would have worked better. (ATLAS - in contact with the user.) A back-of-the-envelope illustration of the timeout mismatch is sketched below.
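To make the point above concrete, here is a minimal sketch in Python using entirely hypothetical numbers for the backlog size and per-request service time (only the ~3-minute FTS timeout comes from the report). It simply shows why a transfer request that lands behind a large SRM request backlog waits far longer than FTS is willing to wait, so the transfer fails even though the SRM service itself is up.

  # Illustrative sketch only: backlog size and service time are assumed, not measured.
  FTS_TIMEOUT_S = 180          # FTS timeout of about 3 minutes, as quoted in the report
  SRM_SERVICE_TIME_S = 2.0     # hypothetical average time to serve one queued SRM request
  USER_REQUEST_BACKLOG = 500   # hypothetical backlog built up by a high-rate user

  def queue_wait(position_in_queue, service_time_s):
      # FIFO queue: a new request waits for all earlier requests to be served first
      return position_in_queue * service_time_s

  wait = queue_wait(USER_REQUEST_BACKLOG, SRM_SERVICE_TIME_S)
  print("Expected wait behind the backlog: %.0f s" % wait)
  print("FTS transfer would time out" if wait > FTS_TIMEOUT_S else "Within the FTS timeout")

With these assumed numbers the new request waits about 1000 s, well beyond 180 s, which is why limiting user SRM access to the pool (or using unscheduled xrootd reads, which do not go through the SRM request queue) avoids this failure mode.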

AOB:

Tuesday:

Attendance: local(Jamie, Gavin, Steve, Jaroslava, Harry, Eva, Lola, Nicolo, Roberto, MariaDZ); remote(Jon Bakken, Gonzalo, Angela, Gang, Ronald, Jeremy, Michael, Rolf, Pepe, Jason, Rob).

Experiments round table:

  • ATLAS reports - One issue: problems with transfers to the Italian T1 (CNAF). The SRM daemon needed to be restarted; since lunchtime transfers have been working well.

  • CMS reports -
    • T0
      1. T0 operators lost interactive access on cmst0 worker nodes (needed for debugging troublesome workflows). T0 ops reports successful login, requesting additional access from vocms68 and vocms69. GGUS #54848
      2. New SLC5 VOBOXes vocms02,vocms03 available for PhEDEx, requested registration in myproxy Remedy #CT656082. PhEDEx Debug instance migrated to SLC5, checking for issues before migrating Prod instance.
    • FNAL
      1. Some merge jobs in ReReco pass at FNAL were stuck, dCache intervention by FNAL admins.
    • ASGC
      1. Following up on deployment of SLC5 software releases

    • T2 sites issues: Highlights (selected ones):
      • T2_UK_SGrid_RALPP, T2_EE_Estonia, T2_IN_TIFR - ongoing errors in SAM tests.
      • T2_ES_IFCA - problem in job reporting to Dashboard from production jobs
      • T2_UK_London_IC SLC5 software deployed, now following up on T2_FI_HIP.
      • T2_IN_TIFR - network issue solved by GEANT, now good upload rates.
      • T2_PK_NCP - CERN-NCP channel demonstrated good stability, requested increase in number of files to saturate site bandwidth.
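As background to the T2_PK_NCP request above, the following minimal sketch shows the kind of estimate behind asking for more concurrent files on an FTS channel; the link capacity and per-file throughput are assumed values for illustration only, not measurements of the CERN-NCP channel.

  import math

  LINK_CAPACITY_MBIT_S = 155.0        # assumed site WAN capacity, e.g. an STM-1 link (illustrative)
  PER_FILE_THROUGHPUT_MBIT_S = 12.0   # assumed sustained rate of a single file transfer (illustrative)

  def concurrent_files_needed(capacity_mbit_s, per_file_mbit_s):
      # smallest number of parallel transfers whose aggregate rate fills the link
      return int(math.ceil(capacity_mbit_s / per_file_mbit_s))

  n = concurrent_files_needed(LINK_CAPACITY_MBIT_S, PER_FILE_THROUGHPUT_MBIT_S)
  print("~%d concurrent files needed to saturate the link" % n)

With these assumed numbers about 13 concurrent files would be needed; the channel's number-of-files setting is then raised towards that figure while checking that the destination storage does not become the new bottleneck.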

  • CMS Weekly-scope Operations plan

[Data Ops]

    • Tier-0: taking data in the MWGR (mid-week global run) Wednesday/Thursday and otherwise doing replays and transfer tests to T1.
    • Tier-1: waiting for new rereconstruction requests of 2009 data at custodial sites, possibly coming this week; backfill at IN2P3 and CNAF as preparation; more testing everywhere, especially FNAL.
    • Tier-2: ongoing MC production.

[Facilities Ops]

    • Managing VoBox requests for central servers at CERN. Migration of central services to SL5.
    • Several Tier-2 sites still have only SL4 worker nodes, and several more have SL5 WNs but no SL5 builds due to various problems. Tickets are open for them and progressing.
    • Follow up on the SL5 WN migration and tape recycling/repack at the ASGC Tier-1, to bring the site back to operations and be ready for the 2010 run.

Note: the Beam Commissioning 09 computing post-mortem session is scheduled today at 4.30pm, to discuss lessons learned during 2009 data taking.

  • ALICE reports - GENERAL INFORMATION: New MC production cycle, normal running with more than 13K concurrent jobs
    • T0 site
      • The issue reported yesterday regarding access to voalice14 is still being followed up together with Steve.
    • T1 sites
      • CNAF: the site admin reported yesterday a large number of ALICE agents being submitted to the site (over the limit defined by ALICE to avoid overloading the queue). It was reported to the site admin that the information provided by VOView for the ALICE queues did not reflect the real status of the queue. The site admin is following the issue.
    • T2 sites
      • Issues reported yesterday regarding Madrid and Grenoble solved.
      • Still waiting for a fix on the Poznan VOBOX (the system seems to be continuously overloaded).
      • The regional expert in Italy reports: Cagliari - the new VOBOX is suffering from some instabilities in the proxy renewal mechanism, being followed by the site admin; Catania - the CE at the site is showing bad performance, issue being followed by the site admin.

  • LHCb reports - No scheduled activities running in the system now.
    • T0 site issues:
      • CASTOR upgrade this morning.
      • VOBOXes: forced reboot this afternoon for a kernel upgrade.
      • volhcb15 (LHCb log SE service): delivered, but the machine does not seem to have the right partitioning (just one partition for "/" plus the usual OS ones, despite a different partitioning having been explicitly requested).
    • T1 site issues:
      • RAL: downtime.
      • IN2P3: downtime.
      • PIC: the SE was banned because many users were affected by a dCache pool which had a network problem. The content of the problematic pool has been migrated to a new one.
      • CNAF: CREAM CE problems when using the sgm account.
    • T2 site issues:
      • Shared area issues both at UKI-SOUTHGRID-RALPP and AUVERGRID.

Sites / Services round table:

  • FNAL - As Nicolo said, one problem with a pool which had its system disk replaced: an incomplete install of the dCache s/w, repaired last night and now OK. Rolling upgrade of dCache pools to a newer version to fix one of the bugs in the 1.9.5-11 series.
  • PIC - ntr
  • KIT - Need maintenance on part of one tape library - plan an 'at risk' on Monday 08:30 - 12:30. During this time 1/3 of the old data will not be accessible for reading. ATLAS has a downtime at this time and local CMS say 'OK'.
  • NL-T1 - At NIKHEF the kernel upgrade on WNs, UI and VOBOXes has been completed. SARA experienced a CREAM CE crash; also increased the FTS timeout to one of the Russian T2s.
  • BNL - Issue with firmware on the core switch (Force10), communicated to Force10. They have provided a fix which should arrive today; will apply it asap. This requires a reboot of the switch and T1 activities will stall for 5-8'. May see a couple of failed transfers but no disturbance to production jobs.
  • ASGC - ntr
  • OSG - a couple tickets open on BNL. 15089, 15121. Both opened last week(end?).
  • GridPP - Observation: an increasing number of RAID-related server problems at T2s. Incident at Glasgow - mostly user analysis affected. A reminder that these groups must make copies / backups of their files!
  • IN2P3 - Still in downtime; a small extension due to an unannounced update to the WNs, otherwise all on schedule. Storage OK, Oracle security patch underway. The batch controller is already there but no batch for the moment...

  • RAL - We wish to remind everyone that we have an intervention planned for the LFC tomorrow and Castor tomorrow and Thursday. We are also in the process of draining the batch system.

  • CERN - ALICE mentioned the capacity decrease following the kernel update; corrected this morning. ATLAS also noticed it - they ran out of SLC5 resources until the operation completed. Now OK.
  • CERN DB - ALICE: firmware of 3 disk arrays has been upgraded; still running on the DB in the CC. Will schedule the switchover for the end of this week - still TBC. Starting to deploy the latest Oracle security patches in production this week: Wed - ATLAS; Thu - LHCb; the rest next week. Rolling intervention.

AOB:

  • Reminder of the GGUS ticket on the BDII issue - Jon: checked this morning and it has been out of date for close to 80h. GGUS:54803.

  • Issue of recalculation of availability. Follow-up?

Wednesday

Attendance: local();remote().

Experiments round table:

  • ATLAS reports -
    • FTS channel IN2P3-SARA stuck: FTS channel monitoring did not show any Active FTS jobs, there were only Ready and Finished jobs. Problem under investigation, "FTS racing" with LHCb excluded so far. GGUS:55008. Similar issue observed during weekend on BNL-CNAF FTS channel (cause not found), GGUS:54943.
    • RAL on scheduled downtime.

Sites / Services round table:

Release report: deployment status wiki page

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 22-Jan-2010
