Week of 090511

WLCG Baseline Versions

WLCG Service Incident Reports

GGUS Team / Alarm Tickets during last week

Weekly VO Summaries of Site Availability

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Jamie, Miguel, Harry, Nick, Jean-Philippe, Markus, Jeff, Olof, Maria, Simone, Diana);remote(Angela, Jeremy, Daniele, Gareth, Michel, Gang).

Experiments round table:

  • ATLAS - (Simone): Dispatching of cosmic data has stopped. The exercise ran smoothly over the weekend. No problems to report. There was one problem at SARA with memory in dCache, but it was resolved within a couple of hours. BDII lookups from WNs at NIKHEF: not understood, but Graeme and Hurng are investigating this issue. Gareth: in standard ATLAS production jobs, when copying files back to the storage element, do you specify the space token? Simone: yes, but it depends on whether the files are being uploaded or downloaded. It might be that for some sites the space token is not passed on a 'get'.

  • CMS reports - (Daniele) Brief report. Mostly following up a number of tickets for transfer problems. Some timeout problems for big files (>50GB?!!!). Daniele has closed the ticket for CNAF about custodial data being slow to show up on T1_IT_CNAF_MSS - issue solved. Another issue with 11 unavailable files at IN2P3. Daniele mentioned that there are a number of announced interventions affecting CMS: CASTOR upgrade at CERN - the time slot will be discussed at the CMS ops meeting. Daniele also wondered about the CMSONR and CMSR DB rolling upgrades foreseen for the 14th of May (http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/ScheduledInterventionsArchive/CMSONRDBrollinginterventionannouncement.htm): were they announced to CMS? What is the impact? Daniele will follow up with the DBAs.

  • ALICE -

Sites / Services round table:

  • NIKHEF (Jeff): tough weekend with storage. One of the pool nodes didn't come back after an intervention last week. dCache is running out of memory. The memory was increased by a factor of 4. Simone: do you know when the memory gets exhausted? no? Jeff mentioned that a similar problem might have been seen by LHCb on dCache...
  • FZK (Angela) issues: one problem with an FTS channel, which was down over the weekend. The cause is unknown, and the developers don't understand it either. Will try to recreate the channel today. Planned downtime for part (1/2) of the tape system, probably on Thursday.
  • RAL (Gareth) things coming up: network intervention tomorrow, with a 1 hour site downtime declared in GOCDB. The week after, on 18-19/5, the CASTOR DBs will be migrated to new hardware - two days outage for CASTOR.
  • GRIF (Michel) NTR
  • ASGC (Gang): timeouts from LCG commands. Problem is likely to be due to the BigId issue.
  • CERN (Miguel)

AOB:

  • Maria wonders about a CMS member with a DoE certificate. Daniele is following up. Another issue with a Korean person requesting a certificate: Maria will mail Gang.

Tuesday:

Attendance: local(Jean-Philippe, Olof, Simone, Roberto, Miguel);remote(Daniele, Michael, Brian, Gareth, Angela).

Experiments round table:

  • ATLAS - (Simone) Nothing special to report other than that some of the ATLAS DM services at CERN will be redeployed in the coming days. There might be intermittent problems. Miguel (CERN): timeouts reported by Armin yesterday - the network switch turned out not to have a standard fiber for the router uplink. The diskservers on that switch have been removed from production until the link problem is fixed. There was a short incident on CASTORATLAS this morning, where the execution plan had changed for a particular stager query. Brian (RAL): ran test transfers with the ATLAS production role between CERN and RAL after the network change but didn't manage to transfer files across; is it a known problem? Simone hasn't seen any transfer issues but suggested that Brian try going through the whole chain, starting from lcg-cp with a space token. If that works, try with FTS. Jean-Philippe: instead of using lcg-cp, you may try lcg-gt (for a file at RAL), which will pinpoint whether it is an SRM or an FTS problem.

  • CMS reports - (Daniele) Working on tickets related to site problems, especially for transfers. Have been seeing an impressive cycle of ticket resolution in the recent past, which suggests that many Tier-2s are now able to reach 80% availability. CASTORCMS upgrade to 2.1.8-7: would May 25th be a good date? Miguel agrees. The rolling CMSONR and CMSR DB interventions are both supposed to be transparent, so they can go ahead.

  • ALICE -

  • LHCb reports - (Roberto) Another FEST week, which means that the usual transfers between CERN and the Tier-1s should be expected by tomorrow. Several problems with WMS testing at RAL, SARA and PIC - GGUS tickets opened to the relevant people. A problem at GridKa with the SRM not returning TURLs has been understood: DIRAC doesn't always set the 'Done' state for 'get' requests, which caused the system to be unstable.

Sites / Services round table:

  • RAL (Gareth) scheduled outage this morning for network intervention. Unfortunately the intervention failed and had to be rolled back and repeated later.
  • BNL (Michael) scheduled intervention this morning (local time): move the pnfs Postgres database from the current 32-bit system to a 64-bit database system with large caching (up to 48GB). The increased cache will hopefully improve the overall performance significantly.
  • FZK (Angela) NTR
  • CERN (Miguel) scheduled SRM intervention this morning went fine. The Linux upgrade mentioned yesterday was only scheduled today due to an urgent kernel fix. Unfortunately the upgrade caused problems for the LSF batch service (lost jobs). A post-mortem is available at https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090512

AOB:

Wednesday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

  • ASGC
    • The ASGC T1 was brought back into the ATLAS DDM Functional Test this morning. Transfers from INFN, NDGF and TRIUMF to ASGC are fine with 100% efficiency, but transfers from CERN to ASGC did not succeed due to low transfer speed.

AOB:

Thursday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

Friday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

-- JamieShiers - 06 May 2009

Topic revision: r6 - 2009-05-13 - GangQin
 