Week of 080811

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Harry, Jean-Philippe, Miguel, Ewan, Luca, Daniele, Simone);remote(Derek, Michael).

elog review:

Experiments round table:

CMS (DB): Following up on mostly CMS-related issues and preparing for CRUZET 4 from 18 August. One observation - about 6-7 hours after ATLAS started exporting data this weekend we noticed a slowdown in the CMS rate of data export with no obvious reason. HR will look for correlations in Lemon.

ATLAS (SC): Had a continuous data run this weekend (12 hours until a luminosity block was written), hence 1 GB/second into CASTOR for 12 hours into a single ATLAS 'data set'. This splits into 16 streams, of which 5-6 are particularly big, so there were high-rate transfers, successful in fact, into the sites receiving those parts. Functional tests were stopped on Sunday but have now resumed. We have seen SRM problems at RAL before their LFC went down as scheduled. D. Ross reported they had some recurrence of the load issues seen last Friday and that today they were performing a global CASTOR upgrade, then an ATLAS-only one on Wednesday. SC reported a problem with 2 of their 4 muon calibration sites: both Naples and Nikhef report their disks are full when they should not be. Naples was cleaned up overnight so should only have a few MB of disk occupied.
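(For scale: 1 GB/second sustained over 12 hours is roughly 3600 x 12 = 43,200 GB, i.e. about 43 TB written into that single data set.)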

Sites round table:

Core services (CERN) report: Today's scheduled rollout of the Oracle security patch to the public databases was cancelled for a second look. A new date will be discussed tomorrow.

DB services (CERN) report (LC):

- ATLAS offline DB node 1 crashed on Saturday night at 2:50 AM due to a hardware problem. The node rebooted and went back to production. Services were not affected.

- ATLAS offline DB node 2 crashed on Sunday night because of a core dump of the Oracle clusterware. The issue is under investigation and seems related to bug 7187896. The node was rebooted by the operators and went back to production. Services were not affected.

- LHCb offline database node 3 is currently down due to an issue that appeared after applying CPU JUL08. This issue has not been observed before and is currently under investigation. Services are not affected as they keep running on the remaining 2 nodes of the LHCBR cluster.

- As scheduled, tomorrow ATLR and ATONR will be patched with CPU JUL08 (rolling upgrade).

Monitoring / dashboard report:

Release update:

AOB: SC asked about the status of the creation of two new requested ATLAS pools. Miguel said they are about to discuss the strategy of analysis pools so are not creating any before then. SC said one, of 10 TB, is not for analysis and he agreed to send a reminder to Miguel.

Tuesday:

Attendance: local(Jacek, Simone, Jean-Philippe, Andrea, Jamie, Harry, Miguel);remote(Michael, Gonzalo, Jeremy).

elog review:

Experiments round table:

  • ATLAS (Simone) - just 1 point - starting from 17:00 yesterday acron at CERN stopped working - ok again from ~10:30 today (it was a network switch!). Side effect: the functional test stopped for this period. Otherwise all ok. Sites performing well. Still a tail of jobs from cosmic data taking this w/e.

  • CMS (Andrea) - similar! Also affected by the acron outage - submission of SAM tests and some monitoring info in SLS - Frontier, DBS. Now ok. 2nd point: Daniele discussed elog for CMS. Harry - James away! Follow up with Julia... Has arranged backup for James.

Sites round table:

Core services (CERN) report:

  • (Miguel) CASTORCMS upgrade this morning - went ok. LHCb tomorrow.

DB services (CERN) report:

  • (Jacek) - LHCb offline cluster problem, just a few minutes before the meeting. The 3rd node was down; when trying to add it back, the 1st node went down. Could not log in for a few minutes as all services were down. Investigating...

Monitoring / dashboard report:

Release update:

AOB:

Wednesday:

Attendance: local(Julia, Simone, Harry, Jamie, Jean-Philippe, Nick, Gavin, Luca);remote(Derek, Michael, Jeremy, Gonzalo).

elog review:

Experiments round table:

  • ATLAS (Simone) - problem overnight. Looked worse than it was! A dataset was created corrupted (zero-length file) - still being investigated. Net effect - site services kept retrying to move the file, which gave FTS problems - error message about a filesize mismatch (a purely illustrative pre-transfer size check is sketched at the end of this round table). Huge # of errors on the dashboard but all for the same files. Removed the subscriptions this morning. Now left with genuine problems. Unavailability of RAL - scheduled downtime (everyone trying to get data from RAL - all other T1s - fails). Digging a bit more: also problems putting files to NIKHEF & FZK, but both were buried in the 'noise'. Tickets to be opened. By end of month the new dashboard visualization per project will be there - will help greatly. Announcement - tomorrow 12h of throughput tests, 10:00 - 22:00. Will drain and then normal functional tests will start again. Will continue this weekly cycle until data is upon us. Derek - upgrading to CASTOR 2.1.7, hopefully up later today.

  • CMS (Daniele, by e-mail): Preparing for the 4th (last) mid-week Global Run exercise. At the Tier-0, CMS has now moved to a single Tier-0 ProdAgent (repacker + real prompt reco system): more solid, and the thing they wanted to have. CMS DataOps shifters are already watching it constantly. On the repacker side, improvements on the migration/injection side. The prompt reco was rolled out in MW GR #3, i.e. successfully tested last week; all problems found have been patched, all patches have been applied, and it is already in production for the current MW GR #4. It is running now, and smooth so far (it is the last one they run before the 1-week-long CRUZET-4). Apart from this, it is worth noting that DataOps reported some transfer issues to several T1 sites (maybe low babysitting due to holidays?); no tickets sent yet by them - to be followed up.

Action pending: news on setting up a CMS section in the ELOG at CERN? (we are defining sub-sections right now).
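Purely as an illustration of the zero-length-file problem ATLAS reported above (this is not the actual DDM or FTS code), a pre-subscription sanity check along the following lines would skip empty source files before site services start retrying them endlessly; the replica list and field names are hypothetical.

# Hypothetical sketch: filter out zero-length (corrupted) source files before
# subscribing a dataset for transfer, so FTS never sees a source/destination
# filesize mismatch and site services do not retry the same bad file forever.
# The replica dictionaries below stand in for a real catalogue lookup.

def filter_corrupted(replicas):
    """Split replicas into transferable ones and zero-length (corrupted) ones."""
    good, corrupted = [], []
    for rep in replicas:
        (corrupted if rep.get("size", 0) == 0 else good).append(rep)
    return good, corrupted

if __name__ == "__main__":
    # Toy input standing in for one dataset's file list (names are invented).
    replicas = [
        {"lfn": "data08_cos.0001.RAW._0001", "size": 2147483648},
        {"lfn": "data08_cos.0001.RAW._0002", "size": 0},  # the corrupted entry
    ]
    good, corrupted = filter_corrupted(replicas)
    for rep in corrupted:
        print("skipping zero-length file, not subscribing:", rep["lfn"])
    print("files queued for transfer:", len(good))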

Sites round table:

Core services (CERN) report:

  • Upgraded CASTOR LHCb this morning to 2.1.7-14.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

  • Q (Simone) - LFC? A: rolled into this update.

  • CMS elog: separate instance and/or server? Need backup for James as important ops tool.

  • Gonzalo - anyone from LHCb? LFC for LHCb. OPS SAM tests include write test. JPB will check with SAM people.

AOB:

Thursday:

Attendance: local(Nick, Luca, Jamie, Gavin, Jean-Philippe, Julia, Harry, Simone);remote(Michael, Jeremy, Gonzalo).

elog review:

Experiments round table:

  • ATLAS (Simone) - not much to report. 1. Observation - the 12h throughput test has started; start seeing an increase in rate from about now. Stops at 22:00, followed by a draining period. No major problems. 2 sites in scheduled downtime - IN2P3 & NIKHEF. Didn't realize the latter earlier & sent a ticket - please close. 1 minor problem with the Napoli calibration site - SRM could not be contacted. T2 calibration sites -> GGUS.

  • CMS (Daniele - by email): Yesterday, soon after my report to you, we had issues at the repacker/PA/CMSSW level, now understood and addressed. Currently, CMS is suffering from a serious CVS corruption issue, and I have been informed that the relevant CMS stakeholders are waiting for feedback from the responsible people in CERN-IT; things move fast and they may have got it while I write.

Sites round table:

Core services (CERN) report:

  • CASTORALICE upgraded today. (Gavin)

DB services (CERN) report:

  • Storage corruption on the ATLAS online database was discovered on Thursday 14 August. The issue has been traced to a defective disk and is being solved by using the healthy mirror side of ASM. Services are not affected.
  • LHCb offline database instance n.3 could not be started up after the rolling intervention on Monday morning. The online intervention to fix the issue on Tuesday prevented new logons for 15 minutes. On Thursday morning at 8:30 a cluster freeze of 30 minutes caused unscheduled downtime. The issue was solved manually by bouncing one of the cluster instances.

Monitoring / dashboard report:

Release update:

  • Nick - Q re FTS clients on WNs. The assumption is that they are not needed there. Gather info & then decide. (Idea is to reduce what is installed on WNs.)
  • LFC - was included with FTS SL4. Will disentangle for release hopefully Monday. Should fix crashes sites are seeing.

AOB:

Friday:

Attendance: local(Julia, Simone, Jean-Philippe, Harry, Jamie, Luca, Nick);remote(Daniele, Derek, Michael).

elog review:

Experiments round table:

  • CMS (Daniele) - posted a couple of issues yesterday. 1st solved: the issue related to the repacker - addressed. 2nd: CVS corruption. Not clear at the time - was waiting for IT feedback. In the end it was a wrong operation in a commit, hence both are now solved(!). The exercise from yesterday has finished. Preparing normal activities for CRUZET4, which starts 09:00 Monday CERN time. Progress in defining the summer 2008 MC production for the different tiers. Table of the amount of data for each of the 7 Tier-1s - from 13 TB at PIC to 90 TB at FNAL, average 40 TB (RAW+RECO). A list of logical file names has been prepared so sites can prepare tape families.

  • ATLAS (Simone) - open/closed problems. Yesterday mentioned the space on the SE at NIKHEF - solved since; now catching up the backlog. Some hours of unavailability of storage at BNL; Michael sent an announcement, people looking into it. Details? Michael - no downtime but failures; overall efficiency went down to 50% or lower. The problem was caused by user analysis jobs doing excessive metadata look-ups, making the PNFS server extremely slow. SRM prepare-to-get/put relies on a response from PNFS; if it is not received within the limit, SRM gives up and it appears as if the transfer fails. Investigated this morning. Identified the user analysis jobs - need to figure out how to handle this. This sort of situation could occur at any dCache site at any time. JPB - plan to use Chimera to solve? A: testbed on site, numerous issues to go through, incl. conversion of the existing PNFS inventory to Chimera. Tigran (developer) will be at BNL in Sep to work on the migration. Simone - solution is to throttle analysis jobs via PanDA? A - pathena jobs - a user opening thousands of files, currently > 300 jobs. Overall load on the PNFS postgres DB - 1.8 M blocks/minute accessed! Something that has to be better understood and controlled (an illustrative throttling sketch follows below this item). Since users will start running such jobs shortly, we are likely to see more of these problems. Simone - other remaining problem with dCache in Lyon - notified yesterday. Initially a wrong host certificate on a pool node, since fixed; now generic SRM unreachable / timeout > 180s. Lyon people are aware and looking into it... Today is a holiday in France... Concerning functional tests - stopped yesterday as announced. Did not restart at the normal 10% as there is a long tail of datasets which still have to get to NIKHEF. It looked as though ATLAS site services were not scheduling transfers. Want to understand this better before restarting / moving on - will debug with the ATLAS DM experts. Transfers will restart at some point in the afternoon... Other thing discussed yesterday: reprocessing. FDR data reprocessed at all Tier-1s - 90% of the total data volume. Report from Sasha on the remaining 10% - where the errors are. Mostly a computing exercise. The exercise will happen again... Demonstrates reprocessing at all sites. Q: includes calib DB lookup? A: yes, it addresses two points, COOL lookup and pre-staging. From Sep 1st an active 8 x 5 shifter at Point 1; starting from now this person will be the one reporting problems to sites etc. Experienced but not 100% so... Maybe some tickets will be incomplete in the first instance - ask for more info - necessary training.
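Purely as an illustration of the throttling idea discussed in the ATLAS report above (this is not PanDA, pathena or dCache code), a simple token-bucket limiter like the following could cap the rate at which analysis jobs issue PNFS metadata look-ups; the class, the stand-in stat_file function and the rate numbers are all assumptions.

import time
import threading

class TokenBucket:
    """Minimal token-bucket rate limiter: at most `rate` operations per second,
    with bursts up to `capacity`. Used here to pace metadata look-ups."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        # Block until one token is available, refilling at `rate` tokens/second.
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)

def stat_file(path):
    # Stand-in for the expensive PNFS metadata look-up an analysis job would do.
    print("stat", path)

if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)   # illustrative numbers only
    for i in range(20):
        bucket.acquire()
        stat_file("/pnfs/example/user.analysis/file_%04d.root" % i)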

Sites round table:

  • JPB - it would be very good if sites could document problems and their fixes - follow the example of Michael in reporting the exact cause and its resolution.

Core services (CERN) report:

DB services (CERN) report:

  • Luca - ATLAS_CONF_TRIGGER_V2 has been added to the ATLAS Streams replication chain, from the online DB to the offline DB and from offline to the Tier-1s.
  • Replication of ATLAS_COOLOFL_DCS from offline to the Tier-1s has been set up. Replication to Taiwan could not be set up because Taiwan's database is not using the same character set as the other Tier-1s, and this prevents the use of transportable tablespaces to move data in bulk for instantiation. We are in contact with the Taiwan DBAs on this issue (an illustrative character-set check is sketched below).
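For reference, the character-set mismatch mentioned above can be checked directly against the databases; the sketch below is only an illustration, not the procedure actually used, and the connection strings are placeholders. It queries NLS_DATABASE_PARAMETERS on the source and a destination and compares the results.

# Illustrative only: compare the NLS character set of two Oracle databases,
# e.g. a Tier-1 destination against the source, before attempting a
# transportable-tablespace instantiation. Connect strings are placeholders.
import cx_Oracle

QUERY = ("SELECT value FROM nls_database_parameters "
         "WHERE parameter = 'NLS_CHARACTERSET'")

def charset(connect_string):
    conn = cx_Oracle.connect(connect_string)
    try:
        cur = conn.cursor()
        cur.execute(QUERY)
        return cur.fetchone()[0]
    finally:
        conn.close()

if __name__ == "__main__":
    source = charset("reader/secret@source_db")        # placeholder credentials
    destination = charset("reader/secret@tier1_db")    # placeholder credentials
    if source != destination:
        print("character sets differ (%s vs %s): transportable tablespaces "
              "cannot be used for bulk instantiation" % (source, destination))
    else:
        print("character sets match: %s" % source)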

Monitoring / dashboard report:

  • elog request from CMS - James still away for another week. Need coverage for this.

Release update:

  • Still chasing LFC 1.6.11 - hopefully through by Monday. Still problems with coverage in the release area over the summer.

  • SL(C)5 - WN almost building. Maybe skip SL4 for FTS as SL5 is so close and now is not the time for (in)stability.

AOB:

Topic attachments:

  • FTS-Aug12.pdf (63.4 K, 2008-08-13, JamieShiers): Problem with the SL4 version of FTS released today (gLite 3.1 Update 28)
  • Oracle-Aug13.pdf (47.0 K, 2008-08-13, JamieShiers): Installation of the Oracle July Critical Patch Update on the downstream databases (ATLDSC and LHCBDSC)