Week of 081027

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Jamie, Steve, Jean-Philippe, Julia, Markus, Simone, Andrea, Patricia, Miguel);remote(Michael, Gareth).

elog review:

Experiments round table:

  • ATLAS (Simone) - the main problem ATLAS is facing is file registration with the various LFCs. Backlog of ~100K files (each) for registration at TRIUMF, CNAF, PIC and ASGC. Small backlogs at a few other sites rapidly disappeared. A lot of testing over the w/e: looked at what the ATLAS code does and put the same code in a test suite. No problem at the LFC level. The time taken to register depends on the RTT - always 10-15 Hz, worst case 1 Hz. At the DDM level ~500 files / hour. A difference of a factor 5-10! This is an (ATLAS) site services problem: under the current load they cannot schedule registration of files fast enough. Being looked at by the DDM developers (Miguel Branco et al). Slowest site - 1.5 s / file - ASGC (routes through Vancouver); TRIUMF ~1 s/file. Jean-Philippe - send the list of methods used; maybe a compound method with a single round-trip could help. Andrea - are files registered in parallel? Simone - single thread, opens a session. Other problem: CASTOR at ASGC (Oracle b/e) - Jason working on the problem but no estimate of how long it will take; started on Saturday. Some problems (mainly SRM) fixed over the w/e (IN2P3 & SARA).
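
    For scale, a quick back-of-the-envelope check of the rates quoted above (figures taken from the discussion; the bulk-call remark is only an illustration of Jean-Philippe's suggestion, not an existing LFC method):

        # Rough drain times for a ~100K-file registration backlog at the per-file
        # rates quoted in the meeting. Purely illustrative.
        BACKLOG = 100_000  # files per affected site (TRIUMF, CNAF, PIC, ASGC)

        rates_hz = {
            "LFC level (test suite, 10-15 Hz)": 12.5,
            "DDM site services (~500 files/hour)": 500.0 / 3600.0,
            "ASGC worst case (1.5 s/file)": 1.0 / 1.5,
        }

        for label, hz in rates_hz.items():
            hours = BACKLOG / hz / 3600.0
            print(f"{label:40s} -> {hours:6.1f} h to clear the backlog")

        # A compound (bulk) registration call would pay the round-trip time once per
        # batch instead of once per file: e.g. 1000 files over a 0.5 s RTT costs
        # ~0.5 s of network latency instead of ~500 s for file-by-file calls.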

  • ATLAS (Simone) - On Monday, 27th of October, the combined running for 2008 comes to an end. However, there will still be much activity, frequently involving the central trigger and DAQ systems. To plan this activity, address other running issues, and start preparations for data taking in 2009, we will still hold the weekly run meeting. The schedule will again be kept up to date on the online Twiki: https://pc-atlas-www.cern.ch/twiki/bin/view/Main/RunningSchedules (also reachable via the Operations page under "Operation Schedule" -> "Run Schedule").

  • ATLAS (Sasha) - follow-up on the streams issue affecting NDGF and the overall service:
    • ATLAS is no longer running any stress tests at the Tier-1 sites and does not plan to run those in the near future. As you know, our joint task force is now performing stress tests in a controlled environment at CERN IT.
    • The NDGF high load happened during typical ATLAS data reprocessing operations. (It is my understanding that the NDGF high load was caused by a server misconfiguration (to be checked with the DBA team), since after the instance restart the ATLAS data reprocessing task finished successfully at NDGF.)
    • Over the weekend the ATLAS reprocessing team monitored task progress at NDGF and was ready to abort the task in case of any problem with ATLAS Streams. As you know, the 3D Streams dashboard was all green during that weekend.
  • Comments from NDGF (Olli Tourunen [olli.tourunen@csc.fi]) - All the evidence I see points to the server starving under a load that was too heavy. However, I can see that on Monday, after the host was reset, an almost equal number of connections was made from Titan, and these did not cause any problems of this magnitude - although they slowed down the replication catch-up. Maybe this is an effect of the ramp-up rate? [Also] I enabled the sniped session removal right away on Monday. However, according to the procedure log table, no sniped sessions have been killed.
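
    As an aside, a minimal sketch of how the presence of sniped sessions could be checked on the Oracle side (assumes cx_Oracle and read access to v$session; credentials and DSN are placeholders, and this is not the NDGF removal procedure itself):

        # List sessions Oracle has marked SNIPED (idle-time limit exceeded). If this
        # returns nothing, a removal job has nothing to kill, which would be
        # consistent with the empty procedure log table mentioned above.
        import cx_Oracle  # assumes the Oracle client libraries are installed

        conn = cx_Oracle.connect("monitor_user", "password", "db-host/service")  # placeholders
        cur = conn.cursor()
        cur.execute("""
            SELECT sid, serial#, username, last_call_et
              FROM v$session
             WHERE status = 'SNIPED'
        """)
        for sid, serial, user, idle_secs in cur:
            print(f"sniped session: sid={sid} serial#={serial} user={user} idle={idle_secs}s")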

  • CMS (Andrea) - taking data without too many problems over the w/e. Yesterday at ~13:00 the magnet ramped down due to a failure in the cooling of the return yoke. Ramped up again - 3.6 T as of 30' ago. Some job failures (at Tier-0) because the AFS area for log files was full; cleaned up - now ok. Sites: CMS cannot use ASGC due to the CASTOR problem.

  • ALICE (Patricia) - very calm during the w/e. Some small issues at T2s re WMS. The WMS migration will be completed today (i.e. ALICE no longer uses the RB). Andrea - is now only SAM using the RB? A: yes, plus some LHCb usage and some non-HEP VOs.

Sites round table:

  • SARA - Currently we are experiencing problems with our mass storage backend and are suffering from performance problems with our SRM. For these reasons we are in unscheduled maintenance. We hope to be back asap.
    • Start of downtime [UTC]: 27-10-2008 07:16
    • End downtime [UTC]: 27-10-2008 14:31
    • SARA-MATRIX/srm.grid.sara.nl/SE
    • SARA-MATRIX/srm.grid.sara.nl/SRM
    • Production has been restarted and the CXFS cache file system restored. Further investigation is under way to determine whether any files were lost.

  • ASGC - CASTOR down since Saturday - 100% failure, degraded since Friday. (Oracle b/e).

Services round table:

  • voms.cern.ch - serving VOMS credentials for LHC VOs. Date: Friday 24th October. Unavailable: 5 minutes around 19:40 (automatic DNS update). Degraded: 19:40 -> 22:29. Analysis: possibly related to the security scan, which ran from 19:40 to 22:11; more investigation needed to confirm. (The alarm to the operator didn't make it through - to be followed up.) JPB - the VOMS memory leak in a library also affects the LFC. Markus - a new release has just been made; client libraries also take > 6 months to certify. Wait for this new fix before moving to production? Feeling is NO - that would possibly delay things too long - to be followed up.

  • DBs (Maria) - following the post-mortem on the streams problem of last w/e, will ask the DBA from NDGF to add an addendum describing the problem with the DB. Applied all October critical patches on the int, validation and test clusters and will schedule production deployment in 2 weeks. Continuing with Sasha's stress tests - 300 concurrent jobs do not load the DB; trying today with 3000. Received a patch for Streams which we had been waiting for for nearly 2 years - the interim fixes were not working. This patch seems to fix the ORA-600 error when dropping propagation. If it proves to be a good fix we will have a much simpler procedure for splitting / re-merging sites (the current work-around requires quite a bit of manual work). Scheduled intervention tomorrow from 18:00 for 1 day(!) for infrastructure work. The ATLAS & LHCb RACs are affected by a physical storage move; probably all 3D & FTS/LFC.

AOB:

Tuesday:

Attendance: local(Jamie, Eva, Julia, Jean-Philippe, Simone, Harry, Andrea, Patricia);remote(Michael, JT, Gareth, Jeremy).

elog review:

Experiments round table:

  • CMS (Andrea) - the "CRAFT" run is on-going - the magnet has been on since yesterday afternoon. Everything basically fine except for a backlog of queued transfers to CASTOR tapes over the w/e every time a new run started. From the CASTOR point of view "nothing wrong" - CMS is trying different patterns of copying data out to CASTOR. CAF - disk space running out; CMS in contact with IT to order new disks, and has also deleted some very old data. The problem with CASTOR at ASGC is still not solved; in contact with Oracle global support.

  • ATLAS (Simone) - status of the LFC registration backlog: yesterday and this morning the problem was traced to the ATLAS DDM site services. Several fixes have been applied to the site VO boxes; the PIC + FZK backlog went from 200K to 100K. The same remedy is being applied to the other boxes which were in trouble. This will help at the current scale - 1M files a day being moved across the grid. A major change in the site services foreseen in the next 15-20 days should solve the problem. JPB - no LFC change? A: correct. Exchange of email about testing the CASTOR SRM PPS at CERN: the problems came from an old bug in FTS on SLC3, cured in SLC4 - still have to check if this fixes it. Problems with sites: confirm the observation about ASGC, CASTOR still inaccessible; problem yesterday with a power cut at NIKHEF. JT: still restoring things, site not yet officially up; the remaining problems are with VMs. Relatively new technology here - most services are on VMs and this is the first time we had a power cut; some things are still not automated / configured correctly, so struggling to get all VMs up. The problem was rather interesting - no power cut from the external world, but a short circuit in one of the UPS batteries: everything with redundant power went down and everything without stayed up! This morning a traffic jam around AMS and major problems with trains - unrelated, but didn't help! The problem started about 21:00 last night. Simone - Kors raised the issue of how a site contacts the VO(s): there are alarms for contacting sites, but what about the other way round (e.g. a user abusing something - extremely important communication)? JT - what's wrong with a broadcast? We are in "damage control" mode. Simone - the problem is to disentangle very important broadcasts from the others. Jeremy - should (ab)user issues go straight to the VO? --> Maite & Nick, i.e. weekly ops; --> MB report later. Jeremy - ATLAS is running very few jobs across the whole grid - correct? A: this is just the current phase of activity. When the new s/w release is validated -> bulk processing. Until then "bits & pieces" of production here and there...

  • ALICE (Patricia) - one issue related to sites: the WMS migration. Beginning tests and optimisation of the distribution of WMS across sites. WMS at FZK (the largest T1 for ALICE) - 3 WMS for ALICE but none working correctly; one is using the LB system of a T2 in France! Configuration?? As FZK is providing 3, it would be valuable if these were configured and working correctly. The experiment is only running a tiny number of jobs at the moment; waiting for a new bunch of jobs to see...

Sites round table:

  • FZK - intervention today in the evening for ~1 day. Physical storage will be moved and several DBs will be affected at different times: ATLAS, LHCb & the LFC for LHCb; replication affected tomorrow 07:30 - 11:00.

  • RAL - multiple disk failure on a disk server that is part of the ATLAS MCDISK service class - some data may have been lost! At the moment still investigating whether we can recover the system - will keep posted.

Services round table:

  • LFC (David Smith) - We believe we've traced the underlying problem that gives the increasing memory usage - it is being worked on now: http://savannah.cern.ch/bugs/?43306. The component in question (voms) is a subsystem which the LFC needs: when possible we'll have the fix included for installation on LFC nodes, to avoid this problem in the future.
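
    As an aside, while waiting for the fixed library, steadily growing memory usage of the kind described can be confirmed with a trivial sampler. A minimal sketch, assuming a Linux host; the PID argument and the 10-minute interval are placeholders:

        # Sample the resident set size of a process (e.g. an LFC daemon) over time
        # to confirm a steady memory increase. Reads /proc/<pid>/status on Linux.
        import re, sys, time

        def rss_kb(pid):
            with open(f"/proc/{pid}/status") as f:
                return int(re.search(r"VmRSS:\s+(\d+) kB", f.read()).group(1))

        pid = int(sys.argv[1])      # PID of the daemon to watch
        while True:
            print(time.strftime("%H:%M:%S"), rss_kb(pid), "kB")
            time.sleep(600)         # one sample every 10 minutes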

AOB:

Wednesday

Attendance: local(Sophie, Jamie, Harry, Gareth, Eva, Maria, Simone, Olof);remote(Jeremy, Michael, Reda).

elog review:

Experiments round table:

  • ATLAS (Simone) - discovered a bug in the tool which assigns datasets to sites. Backlog of ~6 days - a lot of traffic to be digested; ~550 datasets assigned this morning, being digested very well. Per site: BNL at 900 MB/s from CERN; issues: FZK in scheduled downtime and on-going problems at ASGC; the situation for the rest looks very good. Yesterday, following communication from Olof & the FTS developers, the problem transferring to the CASTOR PPS was traced to the FTS bug which is cured on SLC4. Switched to this - rate now as expected and zero failures. xrootd CASTOR instance (CASTORT3) - Markus Elsing's group is now testing and things look good; work in progress... Olof: question to ATLAS - have to apply the Oracle security patch to the SRM DB. Intrusive intervention; would also profit from it to clean up the request table, which is growing very large. 1.5 - 2 hour intervention - Miguel Coelho will send the request.

  • ALICE (Patricia) : Continuing with the integration of the WMS into the ALICE-specific s/w. A pilot version of the ALICE submission module has been implemented in Torino and at CERN to ensure load balancing among the different WMS at each site. To be presented tomorrow during the ALICE TF meeting.
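
    The ALICE submission module itself is not shown in the minutes; below is a minimal sketch of the general idea of balancing submissions across several WMS endpoints (the endpoint URLs are made up for the example, and this is not the ALICE implementation):

        # Round-robin choice of a WMS endpoint per submission, skipping endpoints
        # currently flagged as bad. Illustrative only.
        import itertools

        WMS_ENDPOINTS = [
            "https://wms01.example-t1.org:7443/glite_wms_wmproxy_server",
            "https://wms02.example-t1.org:7443/glite_wms_wmproxy_server",
            "https://wms03.example-t2.org:7443/glite_wms_wmproxy_server",
        ]
        _cycle = itertools.cycle(WMS_ENDPOINTS)

        def pick_wms(bad=frozenset()):
            """Return the next healthy endpoint in round-robin order."""
            for _ in range(len(WMS_ENDPOINTS)):
                wms = next(_cycle)
                if wms not in bad:
                    return wms
            raise RuntimeError("no usable WMS endpoint")

        target = pick_wms(bad={WMS_ENDPOINTS[1]})  # e.g. skip a misconfigured WMS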

Sites round table:

  • RAL (John Gordon - to MB list) - load on the ATLAS SRM at RAL: the recent load drop since Monday was caused by doubling the number of front-end servers.

  • RAL (Gareth) - There has been a multiple disk failure in a server that forms part of the ATLASMCDISK space in Castor at the RAL Tier1. It is probable that data has been lost from this server. Approximately 4 Terabytes of the disk capacity was used on this system, and there are some 72,000 entries in the nameserver for files on this disk.

    The disk server (gdss154) has a RAID5 array with a hot spare. It suffered a double disk failure during the night of Friday 17th October. Replacement disks were ordered on the Monday for delivery the next working day. The disks didn’t turn up, and the server was overlooked. There was a further disk failure on Monday (27th October) which led to failing file transfers which were noticed yesterday. Work is ongoing to see if data can be recovered from the server, but this is rather hopeful. We are reviewing our procedures to learn from this.

    The disk server forms part of the ATLASMCDISK area. A first analysis of the data on the disk shows that 90% is MC data for which this is a secondary copy; of the remaining 10% we expect that the bulk has already been copied elsewhere. Simone - from the site point of view the action will be to provide a list of the files declared unrecoverable. Some might be at other Tier-1s, some might have to be processed or produced again. Harry - whose job is it to clean up the catalogue? Simone - the ATLAS contact in the UK.

  • TRIUMF (Reda(!)) - everything good!

Services round table:

  • DB (Eva) - applied security patches on validation systems two weeks ago, would like to move to production DBs. ATLAS Wed pm next week, LHCb Wed am, propose LCG next Tuesday am. Simone - need reminder of what is where. Eva - rolling intervention. Maria - LFC global in LCG.

AOB:

Thursday

Attendance: local(Jamie, Jean-Philippe, Harry, Simone, Nick, Andrea);remote(Gonzalo, JT, Derek).

elog review:

Experiments round table:

  • ATLAS (Simone) - still some cosmic data being collected. Not the whole detector, mostly the ECAL - not exported automatically but on demand (there is a tool to request it; no automatic distribution). There is also still a small queue of cosmics to be distributed. Stopped for CNAF (full - <2 TB free in the space token; some clarification needed on when new space will be available). FZK back to life at the end of the morning - the downtime was extended beyond what was scheduled and only finished late this morning. Last thing to report: a problem getting a few files from CERN and exporting them. Alessandro sent a team ticket to CERN this morning (the first time this sort of ticket has been used?). A team ticket goes directly to the site without GGUS escalation (ROC etc.) -> site manager. What does this mean in the case of CERN? Nick - in theory -> SMOD. Harry - will look in Remedy.

  • CMS (Andrea) - still going on with cosmic run. All relatively smooth. CAF: problem with low free disk space. CASTOR team gave +150TB to CAF. CMS will still make an effort to delete as much data as possible.

Sites round table:

  • RAL (Gareth, on behalf of John Gordon) - Castor at RAL has suffered several major incidents over the last 2 months, since the upgrade to 2.1.7 and conversion to Oracle RAC, which are causing significant concern to users and project managers. Some of these are still not understood. A summary of major recent/current issues:
    1. Possible crosstalk between stager schemas on same Oracle RAC. Trying to set up a test platform to see if this can be reproduced. Not seen since incident in Aug, since when the secret Oracle patches have been applied and Oracle upgraded to 10.0.2.4. We are hopeful this has been fixed.
    2. Very large, out-of-range ids inserted into the id2type table - a recurring problem, which recurred again in the last week on our CMS instance. Although the value of this id should come from an ‘Oracle sequence’, we believe the cause is within Castor. CERN developers are on standby to investigate the next occurrence. (A sketch of the kind of consistency check this implies follows at the end of this item.)
    3. Hardware issues in SAN for disk storage for Oracle RAC: We should have 4 redundant paths between Oracle servers and storage managed by Linux multipath. We have had at least 2 incidents - and possibly more - where all 4 paths go down simultaneously. Planning a review of this.
    4. Load issues for ATLAS: We are seeing multiple issues of not enough resource to support current ATLAS load, including 2 SRM frontends running at 100% CPU utilization, memory swapping on atlas stager, and insufficient disk space for LSF logs. Working on multiple upgrades to manage the current load (e.g. number of front ends already doubled) but are missing the data on what generated this load to enable effective capacity planning.

      We are planning an in-depth review of Oracle issues to consider if our current configuration of RAC is optimal, and to look at database support requirements.

      There has also been one incident of 3 disks failing in a RAID array which looks like resulting in data loss but this is a fabric failure unrelated to Castor, so not listed above. The supplier did not replace the first two failing disks quickly enough. We are putting procedures in place to prevent a repeat.
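
      Relating to item 2 above, a minimal sketch of the kind of consistency check this implies (assumes cx_Oracle access to the stager schema; the sequence name, credentials and threshold logic are assumptions for the sketch, not a CASTOR-supplied tool):

          # Look for suspiciously large ids in the ID2TYPE table, i.e. values well
          # beyond what the id-generating sequence should have handed out so far.
          import cx_Oracle

          conn = cx_Oracle.connect("stager_reader", "password", "cms-stager-db")  # placeholders
          cur = conn.cursor()

          cur.execute("""
              SELECT last_number FROM all_sequences
               WHERE sequence_name = 'IDS_SEQ'    -- hypothetical sequence name
          """)
          (ceiling,) = cur.fetchone()

          cur.execute("SELECT id, type FROM id2type WHERE id > :ceiling", ceiling=ceiling)
          for row in cur:
              print("out-of-range id2type entry:", row)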

  • NIKHEF (JT) - real quiet - no jobs!

Services round table:

  • DB (Eva) - We have prepared a postmortem report describing the problem with the ATLAS replication from ATLR to the Tier-1 databases which happened two weekends ago (18.10.2008-20.10.2008) and was caused by some problems at the NDGF database: https://twiki.cern.ch/twiki/bin/view/PSSGroup/StreamsPostMortem.
    Could you (NDGF) please send us a report explaining what exactly the problem at NDGF was, so that we can add/link it to the postmortem report?
    Thank you in advance.

AOB:

Friday

Attendance: local(Harry, Steve, Miguel, Olof, Jean-Philippe, Andrea, Simone, Julia, Patricia);remote(Michael, Derek, Jeff).

elog review:

Experiments round table:

CMS(AS): Nothing new. The global cosmics run continues till 11 November.

ATLAS: From atlas-uk - Following a double RAID array failure on a 4 TB disk server last Monday (27th Oct), we lost 58732 files that were stored on the ATLASMCDISK space token server at RAL. The loss is shared almost equally between production data and AOD replication. It is planned to start cleaning the LFC by 2 pm GMT this afternoon.

From SC - Many site problems:

  1. ASGC is still in downtime with CASTOR broken and no progress. They had set their downtime entry in GOCDB to "no notification", so it was not picked up.
  2. IN2P3 had a high load on their pnfs server and so decided to stop T1 to T2 transfers in their cloud, leaving only the T1 to CERN channel open.
  3. SARA have an SRM problem again from 01.00, probably due to a pnfs overload. An intervention was made this morning to increase the number of pnfs threads and that seems to have cured the problem. Their MCDISK endpoint is extremely unstable, so MC production in their Russian T2 cloud has been stopped.
  4. INFN have been unstable for a few days, with a problem in their StoRM SRM database. They had a 2 hour downtime this morning to clean it up and now seem ok.
  5. The FZK SRM is very problematic - may be another pnfs overload. Being looked into now.
  6. CERN had difficulties exporting to several T1s yesterday when an export file got copied to the default pool on a busy disk server by a user accessing it from the T0export pool. Subsequent export requests were tried from the busy disk server and exceeded the 180 second timeout. There was then confusion over the GGUS ticket ATLAS submitted, where the included FTS error message implied there was a problem with the destination being full. Michael asked if any more insight could be gained from this incident and Simone will follow this up. The plan is to eventually close down the default pools.
  7. BNL suffered from a network interruption at 09.00 that was rapidly fixed (by 09.20).

ALICE (PM): Running a small number of jobs at CERN and some T2 sites while waiting for a new software version. This small production is being used to test balancing the job load across WMS servers. Waiting for the French cloud to provide a WMS server.

Sites round table:

NL-T1 (JT): reported that they are following up on the services which did not restart automatically after their power-off incident, largely associated with virtual machines failing to restart. They are considering running systematic reboot tests for verification and will raise this at an MB meeting.

Services round table:

VOMS (ST): From the 20th to the 31st of October the VomsPilot service has been used 7 times by 4 unique dteam users. No requests have been made by ops, alice, atlas, cms or lhcb. We will arrange for the SAM validation service, and consequently ops, to use VomsPilot, but some LHC experiment use would be good. See: https://twiki.cern.ch/twiki/bin/view/LCG/VomsPilot . A postmortem of last Friday's VOMS interruption and degradation has been published at https://twiki.cern.ch/twiki/bin/view/LCG/VomsPostMortem2008x10x24

AOB: Andrea has submitted a Savannah bug report: if a CE accepts not an entire VO but only some VOMS groups or roles within that VO, the SAM database does not associate the CE with that VO. He proposes a change that should be implemented reasonably quickly because, due to this, the CEs at the IN2P3 CMS Tier-1 cannot currently be tested by SAM.
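
The mapping implied by the proposed change can be illustrated with a short sketch: derive the owning VO from a published VOMS group/role rather than requiring an exact match on the plain VO name. The rule strings and function below are illustrative only, not the actual SAM code.

    # A CE may publish access for a VOMS group/role such as "VOMS:/cms/Role=production"
    # instead of the plain VO "VO:cms". Taking the first path component of the FQAN
    # still yields the VO, so the CE can be associated with it.
    def vo_from_rule(rule: str) -> str:
        """'VO:cms' or 'VOMS:/cms/Role=production' -> 'cms'."""
        value = rule.split(":", 1)[1]
        return value.lstrip("/").split("/")[0]

    published_rules = ["VO:atlas", "VOMS:/cms/Role=production"]  # example access rules
    print(sorted({vo_from_rule(r) for r in published_rules}))    # -> ['atlas', 'cms']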

-- JamieShiers - 27 Oct 2008
