Week of 080811

Open Actions from last week:

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (held in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Harry, Jean-Philippe, Miguel, Ewan, Luca, Daniele, Simone);remote(Derek, Michael).

elog review:

Experiments round table:

CMS (DB): Following up on mostly CMS-related issues and preparing for CRUZET 4, starting 18 August. One observation: about 6-7 hours after ATLAS started exporting data this weekend we noticed a slowdown in the CMS rate of data export, with no obvious reason. HR will look for correlations in Lemon.

ATLAS (SC): Had a continuous data run this weekend (12 hours until a luminosity block was written), hence 1 GB/second into CASTOR for 12 hours into a single ATLAS 'data set'. This splits into 16 streams, of which 5-6 are particularly big, so there were high-rate transfers, in fact successful, into the sites receiving those parts. Functional tests were stopped on Sunday but have now resumed. We had seen SRM problems at RAL before their LFC went down as scheduled. D. Ross reported they had some recurrence of the load issues seen last Friday and that today they were performing a global CASTOR upgrade, followed by an ATLAS-only one on Wednesday. SC reported a problem with 2 of their 4 muon calibration sites: both Naples and Nikhef report their disks are full when they should not be. Naples was cleaned up overnight so should now only have a few MB of disk occupied.
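
For scale, a back-of-envelope estimate (not stated in the minutes) of the volume written into that single dataset over the run:

$1\,\mathrm{GB/s} \times 12\,\mathrm{h} \times 3600\,\mathrm{s/h} = 43\,200\,\mathrm{GB} \approx 43\,\mathrm{TB}$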

Sites round table:

Core services (CERN) report: Today's scheduled rollout of the Oracle security patch to the public databases was cancelled to allow a second look. A new date will be discussed tomorrow.

DB services (CERN) report (LC):

- ATLAS offline DB node 1 crashed on Saturday night at 2:50 AM due to a hardware problem. The node rebooted and went back into production. Services were not affected.

- ATLAS offline DB node 2 crashed on Sunday night because of a core dump of the Oracle clusterware. The issue is under investigation and seems related to bug 7187896. The node was rebooted by the operators and went back into production. Services were not affected.

- LHCb offline database node 3 is currently down due to an issue that appeared after applying CPU JUL08. This issue has not been observed before and is currently under investigation. Services are not affected as they keep running on the remaining 2 nodes of the LHCBR cluster.

- As scheduled, tomorrow ATLR and ATONR will be patched with CPU JUL08 (rolling upgrade).

Monitoring / dashboard report:

Release update:

AOB: SC asked about the status of the two newly requested ATLAS pools. Miguel said they are about to discuss the strategy for analysis pools and so are not creating any before then. SC said that one of them, of 10 TB, is not for analysis, and he agreed to send a reminder to Miguel.

Tuesday:

Attendance: local(Jacek, Simone, Jean-Philippe, Andrea, Jamie, Harry, Miguel);remote(Michael, Gonzalo, Jeremy).

elog review:

Experiments round table:

  • ATLAS (Simone) - just one point: starting from 17:00 yesterday acron at CERN stopped working; OK again from ~10:30 today (the cause was a network switch!). Side-effect: the functional tests stopped for this period. Otherwise all OK. Sites performing well. Still a tail of jobs from the cosmic data taking this weekend.

  • CMS (Andrea) - similar! Also affected by the acron outage - submissions of SAM tests and some monitoring info in SLS (Frontier, DBS). Now OK. 2nd point: Daniele discussed an elog for CMS. Harry - James is away! Follow up with Julia... backup cover for James has been arranged.

Sites round table:

Core services (CERN) report:

  • (Miguel) CASTORCMS upgrade this morning - went ok. LHCb tomorrow.

DB services (CERN) report:

  • (Jacek) - LHCb offline cluster problem, just a few minutes before the meeting. The 3rd node was down; while trying to add it back, the 1st node went down. It was not possible to log in for a few minutes as all services were down. Investigating...

Monitoring / dashboard report:

Release update:

AOB:

Wednesday

Attendance: local(Julia, Simone, Harry, Jamie, Jean-Philippe, Nick, Gavin, Luca);remote(Derek, Michael, Jeremy, Gonzalo).

elog review:

Experiments round table:

  • ATLAS (Simone) - a problem overnight; it looked worse than it was! A dataset was created corrupted (a zero-length file) - still being investigated. Net effect: site services kept retrying to move the file, which gave FTS problems with an error message about a filesize mismatch (see the illustrative sketch after this round table). A huge number of errors appeared on the dashboard, but all for the same files. The subscriptions were removed this morning, so we are now left with the genuine problems. Unavailability of RAL - scheduled downtime (everyone trying to get data from RAL - all other T1s - fails). Digging a bit more: also problems putting files to NIKHEF & FZK, but both were buried in the 'noise'. Tickets to be opened. By the end of the month the new dashboard visualization per project will be there - this will help greatly. Announcement: tomorrow 12h of throughput tests, 10:00 - 22:00. Will then drain, after which the normal functional tests will start again. This weekly cycle will continue until data is upon us. Derek - upgrading to CASTOR 2.1.7, hopefully up later today.

  • CMS (Daniele, by e-mail): Preparing for the 4th (last) mid-week Global Run exercise. At the Tier-0, CMS has now moved to a single Tier-0 ProdAgent (repacker + real prompt-reco system): more solid, and what they wanted to have. CMS DataOps shifters are already looking after it constantly. On the repacker side, improvements on the migration/injection side. The prompt reco was rolled out in MW GR #3, i.e. successfully tested last week; all problems found have been patched, all patches have been applied, and it has already been put into production for the current MW GR #4. It is running now, and smoothly so far (this is the last one they run before the 1-week-long CRUZET-4). Apart from this, it is worth noting that DataOps reported some transfer issues to several T1 sites (maybe low babysitting due to holidays?); no tickets have been sent by them yet - to be followed up.

Action pending: news on setting up a CMS section in the ELOG at CERN? (we are defining sub-sections right now).
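
As an illustration of the retry behaviour described in the ATLAS report above - this is not ATLAS DDM or FTS code; the function, its inputs and the example sizes are assumptions made purely for the sketch - a check of this kind is what separates a genuinely corrupt source file (such as the zero-length file) from a transient transfer failure that is worth retrying:

from typing import Optional

def should_retry(catalogue_size: int, source_size: int, dest_size: Optional[int]) -> bool:
    """Decide whether retrying a failed transfer can plausibly succeed."""
    # A zero-length or mismatched *source* file is corrupt at the origin:
    # retrying only reproduces the "filesize mismatch" error seen on the dashboard.
    if source_size == 0 or source_size != catalogue_size:
        return False
    # Destination missing or truncated: likely a transient failure, so a retry makes sense.
    if dest_size is None or dest_size != catalogue_size:
        return True
    # All sizes agree: nothing left to redo.
    return False

# The corrupted (zero-length) source file should not be retried endlessly,
# whereas an interrupted transfer of a healthy file should be.
print(should_retry(catalogue_size=3_200_000_000, source_size=0, dest_size=None))             # False
print(should_retry(catalogue_size=3_200_000_000, source_size=3_200_000_000, dest_size=120))  # True

In the incident above the site services had no such cut-off, so the offending subscriptions had to be removed by hand.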

Sites round table:

Core services (CERN) report:

  • Upgraded CASTOR LHCb this morning to 2.1.7-14.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

  • Q (Simone) - LFC? A: rolled into this update.

  • CMS elog: separate instance and/or server? A backup for James is needed, as this is an important operations tool.

  • Gonzalo - is anyone from LHCb on the call? A question about the LFC for LHCb and whether the OPS SAM tests include a write test. JPB will check with the SAM people.

AOB:

Thursday

Attendance: local(Nick, Luca, Jamie, Gavin, Jean-Philippe, Julia, Harry, Simone);remote(Michael, Jeremy, Gonzalo).

elog review:

Experiments round table:

  • ATLAS (Simone) - not much to report. One observation: the 12h throughput test has started, so we should start seeing an increase in rate from about now; it stops at 22:00 and is followed by a draining period. No major problems. Two sites in scheduled downtime - IN2P3 & NIKHEF. Didn't realize the latter earlier and sent a ticket - please close it. One minor problem with the Napoli calibration site - unable to contact the SRM. Issues with T2 calibration sites go via GGUS.

  • CMS (Daniele - by email): Yesterday, soon after my report to you, we had issues at the repacker/PA/CMSSW level, now understood and addressed. Currently CMS is suffering from a serious CVS corruption issue; I have been informed that the relevant CMS stakeholders are waiting for feedback from the responsible people in CERN-IT - things move fast and they may have received it while I write.

Sites round table:

Core services (CERN) report:

  • CASTORALICE upgraded today (Gavin).

DB services (CERN) report:

  • Storage corruption on the ATLAS online database was discovered on Thursday 14-8. The issue has been traced to a defective disk and is being solved by using the healthy mirror side of ASM. Services are not affected.
  • LHCb offline database instance n.3 could not be started up after the rolling intervention on Monday morning. The online intervention to fix the issue on Tuesday prevented new logons for 15 minutes. On Thursday morning at 8:30 a 30-minute cluster freeze caused unscheduled downtime; the issue was solved manually by bouncing one of the cluster instances.

Monitoring / dashboard report:

Release update:

  • Nick - question regarding FTS clients on WNs; the assumption is that they are not needed. Gather information and then decide. (The idea is to reduce what is installed on the WNs.)
  • LFC - was included with the FTS SL4 release. It will be disentangled for a release, hopefully on Monday, which should fix the crashes sites are seeing.

AOB:

Friday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Topic attachments:

  • FTS-Aug12.pdf (2008-08-13, JamieShiers): Problem with SL4 version of FTS released today (gLite 3.1 Update 28)
  • Oracle-Aug13.pdf (2008-08-13, JamieShiers): Installation of Oracle July Critical Patch Update on the downstream databases (ATLDSC and LHCBDSC)