Week of 090824

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Jamie, Lola, Simone, Gang, Ewan, Edoardo, Patricia, Roberto, Julia, Harry, Andrea, Jean-Philippe, Jacek, Maria, Dirk, Markus, Diana, Olof (chair)); remote(Gonzalo, Michael, Andreas Heiss, John Kelly, Jos, Ronald).

Experiments round table:

  • ATLAS - (Simone) Announcements:
    • NIKHEF - Hoang informed ATLAS that the DC move has completed. Ronald confirmed that the move has completed and that the WNs are being brought back.
    • A new version of the ATLAS site services has been running in the French cloud for a week. It will now be changed to use LFC bulk inserts.
    • Reminder: the name change for BNL will take place tomorrow at 15:00 CEST (09:00 EDT). Sites are reminded to run the script 15 minutes after the start, i.e. at 15:15 CEST. The procedure is documented at https://twiki.cern.ch/twiki/bin/view/LCG/FtsProcedureSiteNameChange

  • ALICE - (Patricia) still finding small issues in the ALIEN configuration that prevent production from starting. This is a central configuration fault, not a problem at the sites (many sites have asked). The two WMSes for ALICE use in France are completely dead. For grid33 @ LAL, GGUS ticket 51107 has been submitted. Patricia discovered just before this meeting that the other one was also dead (a GGUS ticket will be submitted). CREAM CE issues at CERN: the ticket for ce201 had to be reopened over the weekend. It seems to be a grid-proxy related problem... POST-MEETING: France has confirmed that both WMS@FRANCE are in scheduled downtime.
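The timing in the ATLAS reminder above (name change at 15:00 CEST, script 15 minutes later) can be cross-checked with a minimal sketch. It assumes fixed summer-time offsets (CEST = UTC+2, EDT = UTC-4) and takes "tomorrow" to be 25 August 2009, the day after this Monday call; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed valid for late August 2009 (summer time on both sides).
CEST = timezone(timedelta(hours=2), "CEST")
EDT = timezone(timedelta(hours=-4), "EDT")

# Start of the BNL name change as announced in the minutes.
name_change = datetime(2009, 8, 25, 15, 0, tzinfo=CEST)
# Sites run the script 15 minutes after the start.
script_time = name_change + timedelta(minutes=15)

for label, moment in [("name change", name_change), ("run script", script_time)]:
    print(
        label,
        moment.strftime("%H:%M %Z"),
        moment.astimezone(timezone.utc).strftime("%H:%M UTC"),
        moment.astimezone(EDT).strftime("%H:%M %Z"),
    )
```

Under these assumptions the name change starts at 13:00 UTC and the script should be run at 13:15 UTC, which is a convenient common reference for sites outside both time zones.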

Sites / Services round table:

  • PIC (Gonzalo) - NTR
  • BNL (Michael) - name change already mentioned
  • FZK (Andreas) - short downtime for the CE pointing to the SL5 cluster. Still migrating the remaining WNs to SL5, which will continue until the end of the week. SE downtime for an SRM upgrade, scheduled for 30 minutes tomorrow morning.
  • RAL (John) - scheduled an 'at risk' intervention on one of the tape robots tomorrow to repair the damage from the leak.
  • NIKHEF - NTR
  • ASGC (Gang) - CMS migration started again. 2000 files have been migrated.
  • CERN (Ewan) - CASTORCMS default pool was overloaded this morning.

  • Network: (Edoardo)
      Since 24/8/2009 14:00 CEST, the CERN LHCOPN routers have been announcing the prefix 128.142.0.0/16 together with the previous, more specific 128.142.128.0/17.
      The BGP announcement of the old prefix will be stopped only once all the Tier1s have acknowledged that they are receiving and using the new /16.
      The new address space is needed for the new servers that will soon be installed in the CERN computer centre.

      The intervention is documented here: https://gus.fzk.de/pages/ticket_lhcopn_details.php?ticket=50916
  • A new router will be put into production tomorrow, starting at 10am.

  • middleware: (Markus) WMS 3.2 released
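The relationship between the two prefixes mentioned in the Network item above can be illustrated with a minimal sketch using Python's standard ipaddress module. The two sample addresses and the best_match helper are hypothetical, and the helper only mimics longest-prefix matching; it is not how any particular router is configured.

```python
import ipaddress

new_prefix = ipaddress.ip_network("128.142.0.0/16")    # newly announced
old_prefix = ipaddress.ip_network("128.142.128.0/17")  # previous, more specific

def best_match(address, prefixes):
    """Return the most specific prefix containing the address, or None."""
    addr = ipaddress.ip_address(address)
    candidates = [p for p in prefixes if addr in p]
    return max(candidates, key=lambda p: p.prefixlen) if candidates else None

# The old /17 lies entirely inside the new /16.
old_is_covered = old_prefix.subnet_of(new_prefix)

# While both are announced, the old range still matches the /17 ...
during = best_match("128.142.200.1", [new_prefix, old_prefix])
# ... and addresses in the lower half are reachable only through the /16.
lower_half = best_match("128.142.10.1", [new_prefix, old_prefix])
# Once the /17 is withdrawn, the old range falls back to the /16.
after = best_match("128.142.200.1", [new_prefix])

print(old_is_covered, during, lower_half, after)
```

This is why the old announcement can only be stopped safely after every Tier1 accepts the /16: a site that filters out the new prefix would lose its route to the old range as soon as the /17 disappears.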

AOB:

  • Maria: do ATLAS and BNL want to run an alarm ticket test late tomorrow afternoon, after the name change, to ensure that the workflow is still OK? Simone and Michael agree.

Tuesday:

Attendance: local(Simone, Gang, Dirk, Jamie, Maria, Harry, Patricia, Roberto, Maria, Julia, Dan, Olof (chair));remote(Gonzalo, Tiju, Daniele, Brian, Andreas).

Experiments round table:

  • ATLAS - (Simone) only concern: one Tier-2 (Milano) is failing all transfers due to SRM errors. FTS 2.2 testing: no problems seen here at CERN, and the checksum functionality was successfully tested yesterday with dCache sites. Still need to test with DPM. In summary, the tests look promising. Roberto: did you test with StoRM? Simone: no. Simone thinks we need to foresee a stress test with FTS 2.2 in the first two weeks of September. Brian: does this mean that ATLAS will require DPM 1.7.2 as a minimum version? Simone: we still need to verify that it works with DPM. In any case, the ATLAS production system does not require the checksum checks but will use them if the SE supports them.

  • ALICE - (Patricia) the CREAM CE problems at CERN have been fixed. More than 1000 concurrent jobs have run successfully so far today. Concerning the issue with the WMS in France reported yesterday: the sites were in scheduled downtime, so it was normal that they were down. To cope with similar situations in the future, ALICE can now use two WMS at CERN for submitting to the French CEs.

Sites / Services round table:

  • TRIUMF (Reda/offline): Yesterday we experienced reverse DNS lookup failures affecting all of our Grid services in the 206.12.1.0/24 address space (nodes on the lightpaths). The problem affected all sites outside the province of British Columbia. There was a configuration issue with a BCNET server, which took some time to resolve after a few iterations with the NOC. This essentially resulted in 0 MB/s of traffic to/from TRIUMF for about 7 hours, starting at about 5PM UTC.
  • PIC (Gonzalo) - NTR. Scheduled downtime at PIC as announced. Maria: forwarded a mail to the wlcg-operations list last week concerning the GOCDB problem reported by Gonzalo about not being able to broadcast early announcements for scheduled downtime. As there was no answer, Maria now suggests following up the issue with the GOCDB developers. This was agreed by the meeting.
  • RAL (Tiju): NTR
  • ASGC (Gang): the CASTOR operations team found a problem with the 'expert' system, but the CMS migrations have still been stalled for 24 hours. There is 110 TB of disk space available, so space is not yet critical.
  • FZK (Andreas): problem with PBS server this morning. Job submission didn't work for a few minutes. The SRM intervention scheduled this morning was successful.
  • CERN (Olof): LFC upgrade
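For readers less familiar with the reverse DNS lookups mentioned in the TRIUMF report above: a reverse lookup queries a PTR record whose name is derived from the IP address. A minimal sketch of that naming, using Python's standard ipaddress module, is shown here; the host address 206.12.1.10 is a made-up example inside the reported /24, and no actual DNS query is performed.

```python
import ipaddress

# Address space named in the TRIUMF report.
affected = ipaddress.ip_network("206.12.1.0/24")

def ptr_name(address):
    """Return the PTR record name that a reverse lookup of the address queries."""
    addr = ipaddress.ip_address(address)
    if addr not in affected:
        raise ValueError(f"{address} is outside {affected}")
    return addr.reverse_pointer

# Hypothetical host in the affected range.
example = ptr_name("206.12.1.10")
print(example)
```

All hosts in the /24 share the same parent name under in-addr.arpa, which is consistent with a single server-side configuration issue affecting every Grid service in that range at once.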

AOB:

  • Simone: as agreed yesterday ATLAS will send an ALARM and TEAM ticket to BNL this afternoon after the name change.

Wednesday:

Attendance: local(Jean-Philippe, Jan, Simone, Oliver, Maria, Gang, Antonio, Harry, Cedric, Roberto, Diana, Julia, Jamie, Olof (chair));remote(Gonzalo, Daniele, Michael, Tiju, Ron, Jos).

Experiments round table:

  • ATLAS - (Simone) BNL name change: situation of FTS (as known) - CERN, PIC, ASGC and IN2P3 applied the script. SARA applied it but the channels still existed; this is being followed up at the site. RAL and TRIUMF do not perform a daily BDII update and will therefore apply the script at the next intervention. NDGF and CNAF have not answered but the channels are working. GGUS test: the name has not been updated and is still BNL-ATLAS-T2. The ALARM ticket test will therefore be postponed until tomorrow. Maria: not sure it will be possible by then? Diana: GGUS has not been updated but that can be done tomorrow. However, another issue is that the resource grouping used for publishing sites in the EGEE sense has nothing to do with GGUS. This will be followed up offline. TEAM ticket: several BNL-... names exist, which is confusing and needs to be understood/fixed.

  • CMS reports - (Daniele) MC production ran throughout August: 265M raw events in 26 days in the Tier-2s. Remarkable results according to physicists. Transfers Tier-2 -> Tier-1 ~90TB. Correction to minutes submitted by Daniele after meeting: T0->Tier-*: 90 TB, T1->Tier-*: 210 TB, T2->Tier-*: 110 TB. Further details in the CMS twiki. Some transfer errors are being investigated. SAM status: Russian sites seem to have suffered some instabilities in August - being followed up with sites. All problems have been tracked in savannah tickets, which have doubled in number during August.

  • ALICE -

  • LHCb reports - (Roberto)
    • The request to reshuffle the CASTORLHCB disk pools has been fulfilled by the CERN CASTOR operations team. One open question, which needs to be confirmed by Philippe: it seems that reducing the lhcbraw space by removing out-of-warranty hardware is sufficient, and it shouldn't be necessary to drain and remove other servers.
    • PIC and NL-T1: the ROOT - dCache client incompatibility problem has been solved by running a specific version of the client packaged with the LHCb software. LHCb would like the sites to install the client natively on the nodes. Being followed up by the sites.
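The CMS figures reported above can be put into perspective with a short piece of arithmetic. This is only a sketch over the numbers quoted in the minutes (the corrected transfer volumes and the 265M events in 26 days); the variable names are illustrative.

```python
# Corrected transfer volumes from the CMS report, in TB.
corrected_transfers_tb = {
    "T0->Tier-*": 90,
    "T1->Tier-*": 210,
    "T2->Tier-*": 110,
}
total_tb = sum(corrected_transfers_tb.values())

# MC production: 265M raw events in 26 days.
raw_events = 265e6
days = 26
events_per_day = raw_events / days

print(f"total transferred: {total_tb} TB")
print(f"average production rate: {events_per_day / 1e6:.1f}M events/day")
```

Taken together, the corrected figures sum to 410 TB, and the production rate averages roughly 10M raw events per day over the 26 days.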

Sites / Services round table:

  • PIC (Gonzalo) - NTR
  • BNL (Michael) - NTR
  • RAL (Tiju) - NTR
  • NL-T1 (Ron) - next Monday the tape backend will be down for the whole day due to a tape robot change. The FTS issue with the name change has been solved.
  • FZK (Jos) - some dCache disk nodes went down due to hardware problems. The data are also on tape, so access should not be affected. There was a TEAM ticket about FTS transfers to Tier-2 sites in the US, which required some effort on the FZK side.
  • ASGC (Gang) - NTR
  • CERN (Jan) -
    • BNL name change went ok on the FTS @ CERN.
    • EGEE broadcast about an emergency kernel update. It will be rolled out tomorrow. Simone: kernel patch of the VOBoxes? Jan: everything will be prepared in Cdb, but it is up to the VO contacts to deploy the new kernel on their own machines. A notification will be sent out.

  • Middleware (Antonio): last Monday a new gLite release went into production: a complete redesign of the WMS service. Feedback from sites participating in the pilot is good. It has the advantage that it can submit to CREAM. The next release for production will be carried out this week or next (tomorrow or next Monday): a yaim core change that impacts all services. The main thing for the WN is the glExec wrapper scripts together with a corresponding fix to the LCAS interface. Harry: what are CERN's plans for the deployment of the new WMS?

AOB:

  • (MariaDZ) Appended a file with the email thread with the GOCDB developer on the broadcast tool behaviour change. Jamie: we need early notification of changes like that.

Thursday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

Friday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

-- JamieShiers - 2009-08-24

Topic attachments
GOCdb_broadcast_tool.eml (Microsoft Outlook e-mail file, r1, 8.4 K, 2009-08-26 10:07, MariaDimou): GOCdb_broadcast_tool_missing_buttons
Topic revision: r8 - 2009-08-27 - OlofBarring