-- HarryRenshall - 29 Jan 2009

Week of 090202

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

  • Reports due from FZK for LFC/FTS DB b/e problem last w/e & CMS "lost files" (see MB report)

GGUS Team / Alarm Tickets during last week

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:


Attendance: local(Jamie, Maria, Jean-Philippe, Harry, Markus, Jan, Uli, Gavin, Andrea, Roberto, Simone, Julia);remote(Michael, Michel, Gonzalo, Gareth, JT, Brian).

Experiments round table:

  • ATLAS (Simone) - main ongoing activity is MC production. Many jobs running everywhere and a lot of data movement; over the w/e exceeded 1M file transfers per day. Production transfers still run with the old version of the site services - the 10M file test was with the new version. Critical to upgrade - will be done cloud by cloud, RAL foreseen for this pm, then the DE, IT, DL(?) clouds. During the w/e various problems at sites - the main issue was the downtime of the LFC@SARA, which came back a few hours ago. A consequence of moving the Oracle b/e to RAC? JT - got caught by an unexpected situation: the LFC for ATLAS was moved to the RAC i/f, but experience on that cluster had been with a R/O LFC, and the tuning was not appropriate for a "mostly writes" load - put back on the old server. Will stay with this h/w for the ATLAS LFC until the new 3D h/w comes in - ATLAS will then have its own machines on the new h/w. Was the write level particularly high this w/e?
    Simone - big network intervention tomorrow and the day after? JT - yes, not clear why "so big". Router intervention - OPN down. Failover to GEANT? To be discussed tomorrow in the T1 ops meeting. Simone - will leave the site services for SARA draining overnight and upgrade tomorrow. JT - a hopefully transparent network intervention at NIKHEF (20" downtime for new firmware, 20" to back out).
    Planning: one of the main activities of next week and the week after are the pre-staging tests before the next reprocessing round. As with the runs before Xmas, the ATLAS site services will issue bringonline and use srmls to check that files are indeed online. Graeme reported that the RAL CASTOR SRM has a bug that prevents bringing files online - can this be confirmed? Brian - the issue is the difference between the bringonline command and checking with srmls. Will discuss with the CASTOR people at RAL and get back. Is the issue generic to CASTOR or specific to the RAL setup? Jan - please take it to the hep project castor srm mailing list with Simone in cc; ASGC & CNAF are also on this list. Brian - to clarify, the main reason RAL is hardest hit is that RAL has the largest number of associated T2s - FTS channels etc.
    No other problem? Simone - yes, correct. Harry - what is the size of the staging test? Simone - bringonline 20TB worth of data per Tier1, measure the speed to bring files online and whether the buffer can be rotated as expected - keeping files online for 8 hours and then rotating. A test of the tape system, ATLAS services, etc.
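The buffer-rotation target above implies a sustained recall rate per Tier1. A minimal sketch of that arithmetic (the 20 TB and 8 h figures are from the minutes; the assumption that the full buffer turns over once per residency period is illustrative, not an ATLAS specification):

```python
# Sketch of the ATLAS pre-staging test arithmetic quoted above.
# From the minutes: 20 TB of data per Tier1, files kept online
# for 8 hours before the buffer is rotated.
TB = 1e12  # bytes, decimal terabyte

def required_recall_rate_mb_s(buffer_tb: float, residency_hours: float) -> float:
    """Sustained tape-recall rate (MB/s) needed to refill the buffer
    once per residency period (illustrative assumption)."""
    return buffer_tb * TB / (residency_hours * 3600) / 1e6

rate = required_recall_rate_mb_s(20, 8)
print(f"{rate:.0f} MB/s per Tier1")
```

If the whole 20 TB buffer really rotated every 8 hours, each Tier1 would have to recall at close to 700 MB/s from tape, which makes clear why this exercises the tape system as well as the ATLAS services.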

  • CMS (Andrea) - data ops people still working on the issue that very small files are created during reconstruction.

  • LHCb (Roberto) - the famous dummy MC production continues. During the w/e some "grid-wide" issues due to the application used for MC generation; will restart with the newest version of the MC simulation chain. FEST: over the weekend all FEST data that could be reconstructed was reconstructed. This week there is no FEST activity running continually; we may perform an afternoon of nominal running, but this
    is to be confirmed with the online people.
    So far - as reported last week - some problems, mainly at Lyon and CNAF. The CNAF problem (LSF scheduler for CASTOR) is fixed - now ok, though the GGUS ticket has no explanation. The 2nd and main issue is at Lyon, where the status of transferred files was reported as "unavailable" and not "online" - seen before Xmas in the staging activity. A problem with the configuration of the write pool at Lyon? A TEAM GGUS ticket with fairly high priority is open - still waiting for a solution. Miguel - does the data volume change, or the file size? A: the file size stays the same, so more data. 1/6 of the data is redistributed to the T1s, i.e. 1 file every 10'. The T1s will also get 20x more data - still "peanuts".

  • ALICE (Patricia) - the experiment ramped up to 1600 concurrent jobs during the weekend. This helped a lot with the testing of the latest WMS submission module, which has shown good behaviour. This morning the Russian federation was fully migrated to the latest submission module (it was still pending).

Sites / Services round table:

  • RAL (Gareth) - couple of hours break in OPN link Sat evening. Failed over to normal network and came back ok. Have an "at risk" tomorrow for Oracle patches behind CASTOR.

  • PIC (Gonzalo) - starting the migration of WNs to 64-bit tomorrow. Should be transparent. 3 CEs act as gateways to the queues; new queues will be deployed, and one CE will be reconfigured tomorrow to publish them. ~1 week during which the WNs will be reinstalled in bunches; after that week the 32-bit queues will be decommissioned. Warning - no hard-coding of queue names! The new names are in the IS.
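The "no hard-coding" warning above amounts to: discover queue names from the information system at run time instead of baking them into job scripts. A hedged sketch of the parsing side (the sample mimics the GLUE `GlueCEUniqueID` format `host:port/jobmanager-lrms-queue`; the hostnames and queue names are made up, not PIC's actual published values):

```python
# Parse queue names out of BDII/GLUE output rather than hard-coding them.
# SAMPLE_LDIF imitates what an ldapsearch against a site BDII returns;
# all names below are illustrative.
SAMPLE_LDIF = """\
GlueCEUniqueID: ce01.example.org:2119/jobmanager-pbs-glong_sl5
GlueCEUniqueID: ce01.example.org:2119/jobmanager-pbs-gshort_sl5
GlueCEUniqueID: ce02.example.org:2119/jobmanager-pbs-glong_sl4
"""

def queue_names(ldif: str) -> list:
    """Extract the queue component from each GlueCEUniqueID line."""
    queues = set()
    for line in ldif.splitlines():
        if line.startswith("GlueCEUniqueID:"):
            ce_id = line.split(":", 1)[1].strip()
            # The queue name is the final '-'-separated field
            # after the jobmanager-lrms part.
            queues.add(ce_id.rsplit("-", 1)[-1])
    return sorted(queues)

print(queue_names(SAMPLE_LDIF))
```

A job submission script that picks its target queue from such a query keeps working through a renaming like the one PIC announced.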


  • USAG (Maria) - meeting last Thursday with excellent representation; ALICE and CMS were reminded that they should use GGUS routinely. An FAQ page is in preparation explaining the differences between TEAM tickets, which now go directly to sites, and normal tickets.

  • SAM + GridView (Nick) - outage Wed morning due to h/w changes. Offline for a few hours.


Attendance: local(Maria, Jamie, Uli, Julia, Simone, Harry, Miguel, Roberto, Olof, Jan);remote(Gonzalo, Jeremy, Michael, Gareth, Daniele).

Experiments round table:

  • ATLAS (Simone) - the migration of the site services for the UK and NDGF clouds took place. It took a bit longer than expected, with a few problems (e.g. recreating the DB schema), and completed at 21:00. Since then transfers restarted at quite a high rate; the FTS job slots at RAL are filled fast enough - an improvement over before. ATLAS is ready to migrate the other boxes at CERN to the new version of the site services. Today the plan is to upgrade the boxes serving all clouds except SARA & CNAF, which have a backlog. The reinstall scratches the DB and hence the queues are lost; these VO boxes need ~12h to drain, so they will probably be done tomorrow. This completes the upgrade for the site services run at CERN. At the end of the day the US people will be contacted for BNL, then the T2s in the US. Miguel will circulate an email with the release notes & upgrade procedures.
    Gareth - had some FTS problems overnight, not directly linked to the ATLAS changes. Since sometime last week the FTS has been running on older h/w in preparation for the move to the new computer building. Will bring up a new FTS on new h/w in the new building and flip across with minimal interruption - the old FTS h/w will not move to the new building. Somewhat independent of the ATLAS changes, but ATLAS can now load us even more...

  • CMS (Daniele) - on Tuesdays a bit longer, following the Monday pm OPS meeting. Replay and CRUZET repack in progress. The LSF problems seen recently, due to a misconfiguration, are now understood and fixed. MC production continues - WMS-related issues fixed. Some new issues affecting MC, e.g. at the Pisa T2: ProdAgent related?? Tracked in Savannah. Transfers - 400TB last week over the full PhEDEx topology. Reprocessing up to date; T1 sites getting jobs running, still digesting transfer requests. The issue of small file sizes at IN2P3 is now being addressed. T1 site issues: IN2P3 file-size issue; all jobs of a skim dataset got stuck on WNs at IN2P3 - in running state for several days (normally hours), with errors on file open: unexpected failure from dCache, errno 33. CMS site contacts noticed that transfers from HPSS to dCache were not resumed after the Friday downtime. IN2P3: FTS server having problems transferring out to the Florida T2 (at least). FZK: many CMS files lost due to a dCache bug - files were deleted while being copied to tape. ASGC: back running CMS jobs since a while. PIC, RAL, CNAF, FNAL running stably. ~12 Savannah tickets to CMS contacts at T2 sites; no GGUS tickets as CMS-related so far.

  • LHCb (Roberto) - tomorrow pm a full nominal-rate test for FEST09 from 13:30 - 17:30: writing at 1.8kHz with an event size of about 25kB, i.e. Online -> CASTOR at 45-50MB/s. Each T1 gets 1/6 for reconstruction - ~100 jobs run per site. Issues: a gridftp server issue & a problem copying out using lcg_utils. The problem is due to the lcg_utils version still in certification - most likely will grab it from the AA into the DIRAC data management (mainly affects user activity). Lyon: wrong status reported (unavailable and not online) - waiting for a dCache patch expected Thursday. This takes IN2P3 out of tomorrow's test... Finally: the CNAF shared area is now fixed! Running happily. MC activity on the T2 part of the batch farm - some WNs still point to the old shared area but this is being fixed. Luca - yes, you are right!
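The FEST09 figures above are internally consistent; a quick cross-check of the arithmetic (all numbers taken from the minutes):

```python
# Cross-check of the FEST09 nominal-rate figures quoted above.
event_rate_hz = 1.8e3      # 1.8 kHz trigger rate from the minutes
event_size_bytes = 25e3    # ~25 kB per event

online_to_castor = event_rate_hz * event_size_bytes / 1e6  # MB/s
per_t1 = online_to_castor / 6                              # each T1 gets 1/6

print(f"Online -> CASTOR: {online_to_castor:.0f} MB/s")  # 45 MB/s
print(f"Per Tier1 share:  {per_t1:.1f} MB/s")            # 7.5 MB/s
```

45 MB/s matches the quoted 45-50 MB/s Online -> CASTOR rate, and the per-Tier1 reconstruction share of ~7.5 MB/s shows why even a 20x increase is still "peanuts".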

  • ALICE (Patricia) - currently running 158 concurrent jobs, coming from the last production bunch; as ALICE already announced, this is a non-massive production.
    It is also important to mention that during the last ALICE TF meeting, FIO members (Ulrich and Daniel) were invited to explain the latest CPU usage calculations applied to LSF at CERN for use by ALICE. It is a very interesting development that will also be used at the experiment level.

Sites / Services round table:

  • NL-T1 (JT) -
    • network router firmware at Nikhef upgraded successfully.
      Connectivity was lost for a short period of time (order of a minute).
    • new router at SARA : installation and configuration proceeding ahead of schedule as of 14.00.
    • ATLAS jobs at Nikhef using zero CPU time. Ganga problem (not sure what kind). User has been notified and working with ganga developer to diagnose.



Attendance: local(Jamie, MariaG, MariaDZ, Gavin, Alessandro, Uli, Harry, Nick, Steve);remote(Michael, Daniele, JT, Jeremy).

Experiments round table:

  • CMS (Daniele) - wrt yesterday, one site issue: CNAF came out of the upgrade of its FTS server to 2.1 (gLite update 38 on SL4 64-bit) - successful, downtime over, the experiment should report any issues. Scheduled downtime of PIC on Tuesday 10th February (announcement was sent today): affects several CMS+ activities. WMS at CERN only: prompt help (please!) on a problem seen only at CERN, i.e. not at CNAF. Is it a CMS problem or a WMS problem? Outstanding for some days - Vincenzo Miccio's last reply to the thread is some days old - some follow-up please? Some lines in the twiki. Nick - is there a GGUS ticket to go with the WMS issue at CERN? A: no, Vincenzo contacted wms.support, reply from Ewan - more details there.

  • ATLAS (Alessandro) - yesterday the site services were put in draining mode for ASGC, CERN, FZK, PIC and IN2P3. Today upgrading 3 boxes - still ongoing, hopefully will finish before 21:00 (as happened yesterday). FTS at RAL: GGUS ticket open, 5-10% inefficiency. The problem still occurred this morning, so it may persist even after the move back to the old h/w. Jeremy - the ticket is being followed up.

  • LHCb (Roberto) - the LHCb data quality team gave the green light, after analysing the express stream from the pit, to proceed with the full reconstruction at the T1s. Data is flowing, jobs will come soon. CERN issue with disk servers - news from Jan that they rolled back to the external gridftp configuration. Jan - sent a few questions to Andrew yesterday - a response please! The rollback was done due to problems with lcg_cp that were fixed in the code in December; would like to revert to internal gridftp asap. Roberto - testing of this activity on the AA - several bugs reported.

Sites / Services round table:

  • RAL (Gareth) - In preparation for the move to a new computer building the FTS agent was moved to different hardware last week. The aim of this was to reduce the need for any interruption to the service during the move. However, performance issues have arisen and the FTS was moved back to the former hardware this morning. The dates for the move to the new building are still uncertain.

  • NL-T1 (JT) - progress on the scheduled downtime at SARA. It was going well - the network part was fine - but there was a problem with the SRM: the postgres DB was very slow. The postgres problem was fixed 15' ago - should be out of scheduled downtime soon (now out, at 15:20). NIKHEF: at risk at the same time for some network router upgrades, which were apparently successful - the problem targeted by the firmware fix worked. Some ongoing network problems to NIKHEF, trying to fix asap; still within the scheduled "at risk", until 18:00 UTC.

  • dCache issue (From Onno Zweers, one of the dCache honchos at SARA) - We saw a lot of IOwait, but we were unable to pinpoint which process caused this other than Postgres. There were no SRM connections from outside. A simple vacuum of Postgres took two hours until we killed it. We did not reboot the machine immediately because first we were trying to find out what was wrong. After reboot, the same vacuum finished normally in much less than half an hour. Before rebooting, we restarted Postgres but that didn't solve it.
    After reboot, we also needed to fix our Postgres backup script, which had a bug.
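A wedged VACUUM like the one described can at least be bounded by running the maintenance under a timeout, so operators notice quickly instead of waiting hours. A minimal sketch using the standard PostgreSQL `vacuumdb` client tool (the database name and the 30-minute limit are illustrative choices, not SARA's actual settings):

```python
import subprocess

def vacuum_command(dbname, analyze=True):
    """Build a vacuumdb invocation (standard PostgreSQL client tool)."""
    cmd = ["vacuumdb", "--verbose"]
    if analyze:
        cmd.append("--analyze")
    cmd.append(dbname)
    return cmd

def run_vacuum(dbname, timeout_s=1800):
    """Run the vacuum, giving up if it wedges.

    The 30' limit reflects the observation above that a healthy
    vacuum finishes in well under half an hour.
    """
    try:
        subprocess.run(vacuum_command(dbname), check=True, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # A vacuum that normally takes <30' ran past the limit:
        # time to look at I/O wait and locks instead of waiting forever.
        return False
```

This would have flagged the two-hour vacuum after 30 minutes, prompting the I/O-wait investigation sooner.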

  • DB (MariaG) - ASGC has been reinstantiated using transportable tablespaces and PIC as source. Now both running in separate setup until tomorrow morning. Waiting for green light from ASGC to rejoin to normal setup.

  • MariaDZ GOCDB / GGUS - NIKHEF & SARA: please check the names your sites have in GOCDB against the names in the TierOneContactDetails Twiki page for Tier1s. Please use the same names and the same address.

  • FZK lost files - CMS data ops are looking for copies of the lost files and identifying those that cannot be recovered.

  • PIC (RS) - started the upgrade to a 64-bit O/S. Issue with the 32-bit compatibility libraries vs. those installed natively; a problem with the packaging of the dCache client puts both sets of libs in the same path.

  • gLite 3.0 (Nick) - a point from the weekly grid operations meeting: the proposal to retire all gLite 3.0 m/w by end of April. Discussion needs to start with the experiments + sites to see if this is feasible. Sites don't have to upgrade before this date, but it would be the end-of-support date, and testing after this date would not include gLite 3.0 services. Contact occ.support@cern.ch in case of comments. JT - RB = gone? A: Yes.




Topic revision: r11 - 2009-02-05 - JamieShiers