Week of 090511

WLCG Baseline Versions

WLCG Service Incident Reports

GGUS Team / Alarm Tickets during last week

Weekly VO Summaries of Site Availability

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Jamie, Miguel, Harry, Nick, Jean-Philippe, Markus, Jeff, Olof, Maria, Simone, Diana);remote(Angela, Jeremy, Daniele, Gareth, Michel, Gang).

Experiments round table:

  • ATLAS - (Simone): Dispatching of cosmic data has stopped. The exercise ran smoothly over the weekend; no problems to report. One problem at SARA with memory in dCache, but it was resolved within a couple of hours. BDII lookups from WNs at NIKHEF: not understood, but Graeme and Hurng are investigating (a small reproduction sketch follows the experiment reports below). Gareth: in standard ATLAS production jobs, when copying files back to the storage element, do you specify the space token? Simone: yes, but it depends on whether the files are being uploaded or downloaded. It might be that for some sites the space token is not passed on a 'get'.

  • CMS reports - (Daniele) Brief report: mostly following up a number of tickets for transfer problems. Some timeout problems for big files (>50 GB?!). Daniele has closed the ticket for CNAF about custodial data being slow to show up on T1_IT_CNAF_MSS; the issue is solved. Another issue with 11 unavailable files at IN2P3. Daniele mentioned that there are a number of announced interventions affecting CMS: for the CASTOR upgrade at CERN, the time slot will be discussed at the CMS ops meeting. Daniele also wondered about the rolling upgrades of the CMSONR and CMSR DBs foreseen for 14 May (http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/ScheduledInterventionsArchive/CMSONRDBrollinginterventionannouncement.htm): were they announced to CMS, and what is the impact? Daniele will follow up with the DBAs.

  • ALICE -
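
The BDII lookup problem from the NIKHEF WNs mentioned in the ATLAS report is still not understood. As a starting point for reproducing it, here is a minimal sketch (Python, run from a worker node) that times repeated lookups against a top-level BDII. The BDII hostname is a placeholder; port 2170 and the base "o=grid" are the standard BDII settings, and the ldapsearch options used are the usual anonymous-bind ones.

    #!/usr/bin/env python
    # Sketch only: time repeated BDII lookups from a WN to spot intermittent
    # slowness or failures. The BDII hostname below is a placeholder.
    import subprocess, time

    BDII = "ldap://top-bdii.example.org:2170"   # placeholder top-level BDII

    def query_bdii(ldap_filter="(objectClass=GlueSE)", attr="GlueSEUniqueID"):
        """Run one anonymous ldapsearch against the BDII; return (seconds, rc)."""
        cmd = ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid",
               ldap_filter, attr]
        devnull = open("/dev/null", "w")
        start = time.time()
        rc = subprocess.call(cmd, stdout=devnull)
        return time.time() - start, rc

    if __name__ == "__main__":
        # Repeat the query a few times; compare with the same test run from a UI.
        for i in range(5):
            elapsed, rc = query_bdii()
            print("query %d: rc=%d, %.2f s" % (i, rc, elapsed))

If the queries are slow or fail only from the WNs and not from a UI, the problem is more likely on the network or firewall side than in the BDII itself.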

Sites / Services round table:

  • NIKHEF (Jeff): tough weekend with storage. One of the pool nodes didn't come back after an intervention last week. dCache is running out of memory; the memory was increased by a factor of 4. Simone: is it known when the memory gets exhausted? Not yet. Jeff mentioned that a similar problem might have been seen by LHCb on dCache...
  • FZK (Angela) issues: one problem with an FTS channel, which was down over the weekend. It is not known why, and the developers don't understand it either. Will try to recreate the channel today. Planned downtime for part (1/2) of the tape system, probably on Thursday.
  • RAL (Gareth) things coming up: tomorrow a network intervention, with a 1-hour site downtime declared in GOCDB. The week after, on 18-19/5, the CASTOR DBs will be migrated to new hardware: a two-day outage for CASTOR.
  • GRIF (Michel) NTR
  • ASGC (Gang): timeouts from LCG commands. The problem is likely due to the BigId issue.
  • CERN (Miguel)

AOB:

  • Maria wonders about a CMS member with a DoE certificate; Daniele is following up. Another issue with a Korean person requesting a certificate: Maria will mail Gang.

Tuesday:

Attendance: local(Jean-Philippe, Olof, Simone, Roberto, Miguel);remote(Daniele, Michael, Brian, Gareth, Angela).

Experiments round table:

  • ATLAS - (Simone) Nothing special to report, other than that some of the ATLAS DM services at CERN will be redeployed in the coming days; there might be intermittent problems. Miguel (CERN): regarding the timeouts reported by Armin yesterday, the network switch turned out not to have a standard fiber for the router uplink. The disk servers on that switch have been removed from production until the link problem is fixed. There was a short incident on CASTORATLAS this morning where the execution plan had changed for a particular stager query. Brian (RAL): test transfers with the ATLAS production role between CERN and RAL after the network change did not manage to get files across; is this a known problem? Simone hasn't seen any transfer issues but suggested that Brian go through the whole chain, starting from lcg-cp with a space token; if that works, try with FTS. Jean-Philippe: instead of using lcg-cp, you may try lcg-gt (for a file at RAL), which will pinpoint whether it is an SRM or an FTS problem (a sketch of this check follows the experiment reports below).

  • CMS reports - (Daniele) Working on tickets related to site problems, especially for transfers. Have been seeing an impressive cycle of ticket resolution in the recent past, which shows that many Tier-2s are now able to reach the 80% availability target. CASTORCMS upgrade to 2.1.8-7: a good date would be May 25th? Miguel agrees. The rolling CMSONR and CMSR DB interventions are both supposed to be transparent, so they can go ahead.

  • ALICE -

  • LHCb reports - (Roberto) Another FEST week, which means that the usual transfers between CERN and the Tier-1s should be expected by tomorrow. Several problems with WMS testing at RAL, SARA and PIC; GGUS tickets have been opened to the relevant people. A problem with the GridKA SRM not returning TURLs has been understood: DIRAC doesn't always set the 'Done' state for 'get' requests, which caused the system to be unstable.
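
Following Jean-Philippe's suggestion in the ATLAS report, a minimal sketch of the check is given below (a Python wrapper around the lcg utilities). The SURL is a placeholder and must point at an existing file at RAL; "gsiftp" is the protocol requested from lcg-gt. If lcg-gt returns a TURL, the SRM side is answering; if the subsequent lcg-cp to local disk also works, the problem is more likely on the FTS side. For the write direction with a space token, lcg-cr/lcg-cp would be used analogously, with the space-token option appropriate to the installed lcg_util version.

    #!/usr/bin/env python
    # Sketch only: walk the SRM/gridftp part of the transfer chain for a file
    # at RAL. The SURL below is a placeholder for an existing file.
    import subprocess

    SURL = "srm://srm-atlas.example.org/some/existing/file"   # placeholder

    def run(cmd):
        print("+ " + " ".join(cmd))
        return subprocess.call(cmd)

    if __name__ == "__main__":
        # Step 1: ask the SRM for a gsiftp TURL (isolates SRM-level problems).
        rc_gt = run(["lcg-gt", SURL, "gsiftp"])
        # Step 2: copy the same file to local disk (exercises gridftp itself).
        rc_cp = run(["lcg-cp", "-v", SURL, "file:/tmp/lcg-cp-test"])
        print("lcg-gt rc=%d, lcg-cp rc=%d" % (rc_gt, rc_cp))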

Sites / Services round table:

  • RAL (Gareth) scheduled outage this morning for a network intervention. Unfortunately the intervention failed and had to be rolled back and repeated later.
  • BNL (Michael) scheduled intervention this morning (local time): move the PNFS PostgreSQL database from the current 32-bit system to a 64-bit one with large caching (up to 48 GB). This increased cache will hopefully improve the overall performance significantly.
  • FZK (Angela) NTR
  • CERN (Miguel) the scheduled SRM intervention this morning went fine. The Linux upgrade mentioned yesterday was only scheduled for today because of an urgent kernel fix. Unfortunately the upgrade caused problems for the LSF batch service (lost jobs). A post-mortem is at https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090512

AOB:

Wednesday

Attendance: local(Olof, Patricia, Gavin, Simone, Roberto, Jamie, Jean-Philippe, Wayne Salter, Miguel);remote(John Kelly, Michael).

Experiments round table:

  • ATLAS - (Simone) One relevant piece of news: after discussions with ASGC people, the site is now considered operational. All storage-related functional tests (SRM, LFC) work OK. However, when trying an FTS subscription, the transfer timed out in the gridftp phase. Investigations show that all transfers to/from ASGC end up at ~400-500 kB/s with CERN and the other Tier-1s. Is the routing via the OPN or not? If not, this is a blocking issue for STEP09. Jamie: do we know the hostnames? Yes, they can be provided. Wayne Salter (OPN responsible at CERN) will follow up once the hostnames have been provided to him (a sketch of a simple routing check follows the experiment reports below).

  • CMS reports - Daniele is busy at the GDB today.

  • ALICE - (Patricia) back from holidays... Production is running quite smoothly. Checking WMS submission issues with the CREAM CE at CNAF. Patricia is following this up directly with the site.

  • LHCb reports - See Recap on LHCb issues with dCache at FZK for a discussion of the issues seen and possible ways forward. (Roberto) Main points:
    • Issues with unstable WMS services at the Tier-1s: the service is not really usable for daily activity.
    • After the SRM upgrade at CERN yesterday, LHCb started to see massive error rates. Gavin: this seems to be a problem with the 2.7-17 release, which has a specific issue with multi-file 'get' requests and affects at least LHCb and CMS. A bug report has been submitted to the CASTOR/SRM developers. All production end-points (except PPS) have been downgraded to 2.7-15. A post-mortem will be produced.
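
For the ASGC transfer-rate question raised in the ATLAS report, a simple first check is whether the route from CERN towards the ASGC gridftp doors actually goes via the OPN. A minimal sketch is below; the door hostname is a placeholder (the real list will go to Wayne), and which address prefixes belong to the OPN has to be confirmed with the network team.

    #!/usr/bin/env python
    # Sketch only: traceroute from a CERN host towards the ASGC gridftp doors
    # so the hops can be compared against the OPN address prefixes.
    import subprocess

    ASGC_DOORS = ["gridftp-door.asgc.example.org"]   # placeholder hostname(s)

    if __name__ == "__main__":
        for host in ASGC_DOORS:
            print("=== traceroute to %s ===" % host)
            subprocess.call(["traceroute", "-n", host])   # -n: numeric output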

Sites / Services round table:

  • RAL (John Kelly): nothing specific to report.
  • BNL (Michael): yesterday's intervention was completed on time. The new PNFS server is running on improved hardware. Operation was stable during the night and the load has come down.
  • ASGC
    • ASGC T1 was brought back into the ATLAS DDM Functional Test this morning. Transfers from INFN, NDGF and TRIUMF to ASGC are fine, with 100% efficiency, but transfers from CERN to ASGC did not succeed due to the low transfer speed.
  • CERN (Miguel): besides the SRM issue mentioned above, the CASTOR Oracle DB services will be patched tomorrow with the latest security patch. The intervention is transparent and will start at 09:00 (CEST). The name server database will be patched at 13:00.

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • CMS reports - Report by Daniele: the WLCG Ops call today clashes with a CMS-internal STEP planning meeting, so I cannot attend, apologies. One thing to report is that an operational mistake in a removal cycle of the normal CMSSW deployment activity on EGEE sites has caused the CMS-specific swinst SAM tests to fail at all sites. It is a tool/infrastructure issue, not a site issue.

  • ALICE -

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

-- JamieShiers - 06 May 2009

