Week of 091214

LHC Operations

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs

Experiment and Site Plans for the end of 2009 Holidays here


Monday:

Attendance: local(Harry(chair), Simone, Gavin, Ueda, Alessandro, Andrea, Jan, Miguel, Nick, Eva, Ignacio, Jean-Philippe, Patricia, Roberto, Dirk);remote(Gonzalo(PIC), Kyle(OSG), Jason(ASGC), Tiju(RAL), Andreas(KIT), Rolf(IN2P3), Ron(NL-T1), Gang(ASGC), Daniele(CMS)).

Experiments round table:

  • ATLAS - 1) Timeout problems on Saturday (for more than an hour), and again this morning, transferring files via srmatlas. Found that this only happens when using FTS 2.1 (transfers to the 5 muon calibration Tier-2 sites). 2) A job has been stuck in FTS since 11 December - GGUS ticket 53695. 3) SAM tests failed on Sunday due to a user certificate problem; ATLAS will recompute the correct site availabilities.

  • CMS reports - ASGC: several issues with the repacking; a focussed meeting with ASGC is being held. Also deletions stuck (following up) and some specific file checks. IN2P3: 1) FNAL->IN2P3 transfer of big files still being worked on. 2) No space for CMSSW releases (actions in progress on both the IN2P3 and CMS sides): case closed. PIC: 1) CMS Squid problems: upgrade done, case closed. 2) Transfer problem with the Lisbon NCG T2. RAL: some files pending migration to MSS, still an open case.

  • ALICE - 1) On Sunday reconstructed the collision events taken on Saturday with no particular problems. 2) A new version of the 64-bit CREAM-CE 1.5 has been announced and 3 ALICE sites in Russia will be testing it over Xmas. 3) Sites running only SL4 and without a migration plan to SL5 will now be blacklisted - only 5 such sites are left.

  • LHCb reports - 1) We don't anticipate massive re-processing during the Xmas break and January. We have no particular MC requests pending; even if some come up during the week, they are unlikely to fill our share over the 2 weeks. Analysis (MC09 and real data) will go on at the Tier-1s - these are the plans after collecting ~250,000 collisions (5 seconds of LHCb at full power ;-)). 2) On Saturday the ONLINE-CASTOR link went down starting from 1:30 am because of a problem in the LHCb core router. No files were transferred to offline but the data were safe in the ONLINE buffer. As soon as the last run completed on Sunday (at ~11:00) the ONLINE people worked to send them to offline: runs 63807, 63809, 63813, 63814 and 63815 were sent to offline, migrated to CASTOR tape and reconstructed in less than 2 hours. 3) MC production running at a low pace; nonetheless about 50% of the jobs are crashing (again something related to the core application). 4) At CERN we received a "no read" alarm on our read-only LFC instance. The problem was due to a DIRAC agent on volhcb10 with an expired proxy looping infinitely and hammering the catalogue (see the sketch below). 5) IN2P3: users are having data access problems using the gsidcap protocol there. A GGUS ticket has been opened against IN2P3; this resembles the same issue experienced at SARA. 6) The LFC issue reported at RAL was a consequence of the 3D streaming incident (a couple of weeks ago), when we decided to keep the RAL and GridKA LFCs un-synchronized. 7) WMS216 at CERN is giving job submission errors from ICE - maybe a configuration issue.
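On the looping-agent incident in item 4 above: the obvious guard is to check the remaining proxy lifetime and back off instead of retrying the catalogue in a tight loop. Below is a minimal sketch of that idea, assuming voms-proxy-info --timeleft is available on the node; the agent loop and catalogue call are hypothetical placeholders, not DIRAC's actual code.

    import subprocess
    import sys
    import time

    MIN_PROXY_SECONDS = 600   # refuse to query the catalogue with <10 min of proxy left
    BACKOFF_SECONDS = 300     # sleep instead of hammering the service

    def proxy_timeleft():
        """Return the remaining proxy lifetime in seconds (0 on any error)."""
        try:
            out = subprocess.run(["voms-proxy-info", "--timeleft"],
                                 capture_output=True, text=True, check=True)
            return int(out.stdout.strip())
        except Exception:
            return 0

    def query_catalogue():
        """Hypothetical placeholder for the LFC/DIRAC catalogue call."""
        print("querying catalogue with a valid proxy")

    while True:
        if proxy_timeleft() < MIN_PROXY_SECONDS:
            # With an expired (or nearly expired) proxy every call would fail,
            # so back off rather than looping over the catalogue.
            print("Proxy expired or about to expire; backing off", file=sys.stderr)
            time.sleep(BACKOFF_SECONDS)
            continue
        query_catalogue()
        time.sleep(60)  # normal polling interval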

Sites / Services round table:

  • ASGC: Tracking file transfer issues related to small atlasmcdisk files.

  • RAL: Problems accessing a small number of CMS files - under investigation.

  • KIT: 2 problems over the weekend: 1) a black-hole worker node led to failing CE tests. 2) The information provider of the CMS dCache instance failed and was repaired this morning.

  • IN2P3: 1) The SIR on the DNS load-balancing failure has been submitted. 2) Mass storage will be 'at risk' on 22 December from 7.30 to 15.00. 3) SRM dCache will be down all day on 4 January.

  • NL-T1: 1) Disk storage has been increased at Nikhef. 2) Confirmed the migration of dCache to Chimera on 11 January.

  • RAL: Problems transferring data from RAL to BNL were due to a draining disk server. Looking at how to mitigate this.

AOB: Experiment Xmas plans have now been received from all 4 experiments and Tier-1 sites are invited to submit their plans.

Tuesday:

Attendance: local(Julia, Jean-Philippe, Andrea, Jamie, Ueda, Simone, Ale, Roberto, Maria);remote(Xavier, Daniele, Onno, Brian, Gonzalo, Gareth, Jason).

Experiments round table:

  • ATLAS - No particular issue to report. Yesterday we noticed that an alarm ticket sent to CERN did not go to the ATLAS operations mailing list - the GGUS ticket has been updated on this. The LHC stops tomorrow at 18:00 and ATLAS will stop at the same time; until then many things will ramp up, but no stable beam is expected during this period.

  • CMS reports - Closing 1 of the tickets to ASGC regarding the list of files to be invalidated - done. Work in progress on all the other tickets, including at the T2 level. Only 1 T2 ticket (Lisbon-PIC transfers) closed - not a Lisbon issue. Network issues affecting TIFR solved - new gear in production; resuming bulk transfers. Can they digest all the pending migrations? (Aim to close the pending transfers before Xmas; otherwise invalidate and follow up later.)

  • ALICE -

  • LHCb reports - About 400,000 collisions at 450 GeV per beam - most likely the data sample for Xmas. No large production activity ongoing: a few COLLISION09-type files received last night, no MC productions running at all, just a few hundred user analysis jobs. Problem with 1 WMS at CERN submitting to CREAM CEs. The site SB was stuck this morning - the developers were alerted and the problem solved. Looking for a suitable slot (in January) to deploy the new privileges in the production stagers for LHCb; we have been in touch with the CASTOR developers and confirmed that the setup on a development replica instance of the stager is OK. T1 level: SARA and IN2P3: there are two tickets open for similar file access problems. The SEs are currently banned from the mask; discussions are ongoing in LHCb on whether we want to risk having them off for the whole Xmas break. The ball is now in the dCache developers' hands (dCache ticket 5313), the issue being due to a third-party Java library. Andrea - has the dCache problem been recognised as a bug? A: yes, a third-party library problem.

Sites / Services round table:

  • KIT: ntr
  • NL-T1: issue with the LHCOPN connection - a cable is damaged somewhere. Emergency downtime this morning at 06:00 and another on Thursday at 11:00. Ale: is there a GGUS ticket for this? A: not sure if it was reported there. Q: is the backup link OK? A: yes
  • RAL: ntr
  • PIC: the intervention on the tape library happened this morning - transparent - very little ongoing activity. Really transparent :-)
  • ASGC:
  • IN2P3: correction of an item reported yesterday - the minutes mention an 'at risk' on 22nd Jan; it should read 22nd December (MSS at-risk outage).

  • CERN: 2 issues for ATLAS: 1) an alarm ticket didn't make it to CERN; tested with an alarm ticket - please call the operators for the time being, GGUS is investigating. [ Ueda will send a test ] 2) Lost 10K files from ATLAS after a disk server burned out following a power cut. The list has been passed to the collaboration :-( Ale - we have copies of some of the data.

AOB:

Wednesday

Attendance: local(Jamie, Ueda, Gav, Dirk, Antonio, Jan, Maria, Ale, Simone, Roberto, Edoardo);remote(Michael, Onno, Gonzalo, Tiju Idiculla (RAL), IN2P3).

Experiments round table:

  • ATLAS - No big issue - one peculiar error with SARA-MATRIX: authentication failed with a "bad sequence". No ticket yet - only 10% failures and retries seem to succeed; if it persists a ticket will be sent. Otherwise no big issues - the LHC stops tonight at 18:00 and stable beams are not expected before closing...

  • CMS reports - The main issue of the last 24 hrs is that the CMSSW deployment team has started installing the first CMSSW release built only for SL5 (version 3.4.0) at the distributed Tiers supporting the CMS VO. As of yesterday dinner time (CET), CMS already had ~15 T1/T2 sites (on both EGEE and OSG) with this release properly installed.

  • ALICE - report follows:
    • Yesterday during the evening shift (16:00-24:00) there was no beam, so no big activity in that period.
    • During the night shift (24:00-08:00) ALICE had a lot of activity, recording several good runs which immediately entered reconstruction. No issues to report in terms of services at the T0; again they have behaved perfectly.
    • One of the ALICE voboxes dedicated to raw reconstruction (voalice12) has been configured this morning to continue the new round of exercises testing virtual machines. This is done in collaboration with Ulrich's group; the idea is to maintain the setup during the whole vacation period to collect enough feedback. As the experts have already announced, ALICE agents are already arriving on the virtual machines.
    • Finally, we are just finishing the setup of the latest SL5 voboxes, which will enter production before Xmas. They are intended to be used for the ALICE MC production expected over the Xmas period, so we are finishing the setups to ensure stable production during that time.

  • LHCb reports - See the plot in the Twiki of reprocessing jobs over the last 24h. The reprocessing activity over all collisions collected so far was launched yesterday and "happily completed" - all data are now available for users. A couple of MC productions running - 2K concurrent jobs. No major issues to report apart from the previously reported issues at SARA & Lyon - the dCache issue is now in the developers' hands (a Java library used by dCache).

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • IN2P3: ntr
  • NL-T1: 1 issue - a gsidcap door crashed. The doors are behind a DNS round robin, so some WNs were still using the crashed door and jobs there were failing; the WNs cache the host info, hence these nodes turned into black holes. The door was restarted and the configuration changed so that host info is not cached, to avoid the black-hole problem (see the sketch after this list).
  • PIC: ntr
  • RAL: ntr
  • KIT: ntr

  • CERN: starting to plan the rollout of CASTOR 2.1.9 in January. Continuing the investigation of the double CASTOR SRM ATLAS failure - a post-mortem is being performed.
  • DB: an intervention is planned for tomorrow on the ATLAS online DB.
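On the NL-T1 black-hole issue above: the problem comes from worker nodes caching a single address behind the round-robin alias, so once that door dies every job on the node keeps hitting it. A minimal sketch of the alternative behaviour (re-resolve the alias on every attempt and skip dead hosts) follows; the alias name and port are made-up examples, not SARA's actual configuration.

    import random
    import socket

    DOOR_ALIAS = "gsidcap-doors.example.org"   # hypothetical round-robin alias
    DOOR_PORT = 22128                          # hypothetical gsidcap port

    def pick_live_door(alias=DOOR_ALIAS, port=DOOR_PORT, timeout=5.0):
        """Re-resolve the round-robin alias and return a socket to the first
        reachable door, instead of reusing a cached (possibly dead) address."""
        addrs = socket.getaddrinfo(alias, port, proto=socket.IPPROTO_TCP)
        random.shuffle(addrs)                  # spread load across the doors
        for family, socktype, proto, _name, sockaddr in addrs:
            s = socket.socket(family, socktype, proto)
            s.settimeout(timeout)
            try:
                s.connect(sockaddr)            # a crashed door fails here and is skipped
                return s
            except OSError:
                s.close()
        raise RuntimeError("no gsidcap door reachable behind %s" % alias)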

Release report: deployment status wiki page - nothing to add to last week's report

AOB:

Thursday

Attendance: local(Julia, Roberto, Dirk, Jean-Philippe, Jamie, Maria, Ale, Simone, Ueda, Ignacio);remote(Gang, Xavier, Tiju, Daniele, Lyon, Ronald, NDGF).

Experiments round table:

  • ATLAS - Yesterday the LHC stopped for the technical stop. Still some activity with the tails of reprocessing, which continued overnight, and now export to the T1s - ongoing without problems. Another T0 reprocessing campaign will happen tonight - just a few datasets; the outcome will be shipped tomorrow. A big reprocessing will take place over Xmas, as described in the ATLAS plans for the Xmas activities, and site validation has already started. Today at the ATLAS OPS meeting there will be more info on the volume of data per T1, number of jobs, etc. Some new datasets are not on disk - those were from cosmics, which will also be reprocessed; this will require recall from tape. Some data (express stream) have not yet been shipped out from CERN; these now have to be reprocessed and hence shipped out - this will happen today/tomorrow. ATLAS has collected 150 TB of data, a significant fraction of which will be reprocessed.

  • CMS reports - For the T1s, progress for FNAL (low transfer quality), RAL (pending migrations to MSS) and IN2P3 transfers to FNAL (large file sizes - FTS timeout increased to 6000 s; see the sketch after this list). T1 open issues: ASGC - repacking, deletions stuck; IN2P3 - not enough disk space in the area for CMS SW installation, potentially blocking s/w version 3.4.0, the one for the Xmas reprocessing - just a couple of days left to fix it! Minor issues at PIC (transfer problem to Lisbon) and RAL (the issue above is now closed!). T2: following up case by case, e.g. UERJ BR T2 and the Turkish T2 - both tickets closed. TIFR - promising news on the data connection; waiting for feedback from the site. IN2P3: some info from the local CMS support contact, who said that the space problem is partially due to old versions of the s/w not being deleted.

  • ALICE (Patricia, report sent before the meeting) - The final list of sites to blacklist in January 2010 is ready and will be presented during the ALICE TF meeting this afternoon. In total about 8 T2 sites will not be in time to provide both WNs and VOBOXes on SL5. The final registration of new VOBOXes before Christmas was sent to PX support this morning (concerning ITEP in Russia and Madrid). The final configuration of the new SL5 VOBOXes before the vacation period is ongoing; this task should be finished today.

  • LHCb reports - Nothing much to report apart from a few issues at non-T1 sites with instability of the shared area. Jobs crash with timeouts - GGUS tickets have been opened per site.
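On the FTS timeout raised to 6000 s for the large IN2P3->FNAL files (CMS item above): the per-file transfer time grows with file size, so the timeout has to cover the largest files at the achievable throughput. A back-of-the-envelope check, with purely illustrative numbers, is sketched below.

    def transfer_time_s(file_size_gb, throughput_mb_s):
        """Rough per-file transfer time in seconds (decimal units)."""
        return file_size_gb * 1000.0 / throughput_mb_s

    # Illustrative numbers only: a 20 GB file at a sustained 5 MB/s
    # already needs ~4000 s, so a short default timeout would kill it
    # while 6000 s leaves some headroom.
    for size_gb in (2, 10, 20):
        t = transfer_time_s(size_gb, throughput_mb_s=5)
        print(f"{size_gb:>3} GB -> {t:7.0f} s (timeout 6000 s: {'OK' if t < 6000 else 'too short'})")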

Sites / Services round table:

(Sites with ntr suppressed)

  • NL-T1: gridftp doors hang sometimes - 2 doors were restarted this morning. The dCache developers have made fixes that should be released soon.
  • NDGF: delay update of dCache head nodes until after Xmas.
  • ASGC: CMS data deletion - some 5K files pending. They can't be deleted by PhEDEx, only by the CASTOR admins. Any update? Pending for some days...

AOB:

CERN service database: cern.ch/servicedb

Friday

Attendance: local(Jamie, Julia, Ueda, Jean-Philippe);remote(Onno, Gonzalo, Michael, Brian, Daniele, Gareth, Rolf).

Experiments round table:

  • ATLAS - No problem - did some reprocessing since yesterday and exported some other data this morning. After confirmation of these results, reprocessing at the T1s will start during the vacation.

  • CMS reports - IN2P3 fixed the problem with disk space for CMS SW releases - added 60 GB, which is enough for the SL5-based area; CMSSW 3.4.0 installed. For the others: following up with PIC and ASGC (still no reply to the question from Gang - CMS Data Ops submitted 3 requests to remove 45K files; 90% of these were deleted manually by Gang, and the rest cannot be done by Gang due to permission problems). T2: closing the Purdue and Aachen tickets; TIFR: network problems in a better situation than a week ago - they can download files but still have problems uploading; cancelling the pending transfers.

  • ALICE (Patricia, before the meeting) - The final list of sites which will be blacklisted after the Christmas period was announced yesterday during the ALICE TF meeting. A small batch of last-minute setups was done yesterday afternoon/night to ensure that the maximum possible number of sites participates in the MC production expected during the Christmas period.

  • LHCb reports - A few remaining jobs of one MC production; in total <1000 jobs in the system now. T1: some files were discovered yesterday to be unavailable at RAL; these files are sitting on a disk server that was disabled. T2: yet another site with shared-area problems.

Sites / Services round table:

  • NL-T1: SARA SRM - a bug in dCache 1.9.5 has probably caused data corruption in files written using the xrootd protocol. A quick fix was applied at the beginning of the afternoon. To check whether data are corrupted, the site needs a list from ALICE of the files stored on the SARA SRM together with their md5 checksums, so that the site can verify them (see the checksum sketch after this list).

  • RAL - The disk server taken out of production (LHCb report) was showing errors with the fsprobe test. Scheduled outage on 22nd December to reboot some disk servers (in GOCDB).

  • ASGC: minor update on the SL5 migration: we will bring online 1540 WN cores with the SL5 OS and have the system running together with SL4 for another two weeks. The migration of the SL4 nodes will continue at the end of the year; the full-scale computing pool is expected early in 2010. Pending CMS issue - we will try to address it today.
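On the SARA corruption check above: once ALICE supplies the list of files with their md5 checksums, the comparison on the site side is straightforward. A minimal sketch follows, assuming a plain-text input of "<md5> <path>" pairs (the exact format agreed between ALICE and SARA is not specified in these minutes).

    import hashlib
    import sys

    def md5sum(path, chunk_size=4 * 1024 * 1024):
        """Compute the md5 of a file, reading in chunks to keep memory flat."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def check(list_file):
        """Each input line: '<expected_md5> <local_path>'. Report mismatches."""
        bad = []
        with open(list_file) as f:
            for line in f:
                if not line.strip():
                    continue
                expected, path = line.split(None, 1)
                path = path.strip()
                actual = md5sum(path)
                if actual != expected.lower():
                    bad.append((path, expected, actual))
                    print(f"CORRUPT {path}: expected {expected}, got {actual}")
        print(f"{len(bad)} corrupted file(s) found")
        return bad

    if __name__ == "__main__":
        check(sys.argv[1])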

AOB:

  • Mail from DG to CERN staff Thursday:
Dear CERNois,

Yesterday [ i.e. Wednesday - ed ] evening at 18:03, the LHC ended its first full period of operation in style. Collisions at 2.36 TeV recorded since last weekend have set a new world record and brought to a close a successful first run. The LHC has now been put into standby mode, and will restart in February 2010 following a short technical stop to prepare for higher energy collisions and the start of the main research programme.

A technical stop is needed to prepare the LHC for higher energy running in 2010. Before the 2009 running period began, all the necessary preparations to run up to a collision energy of 2.36 TeV had been carried out. To run at higher energy requires higher electrical currents in the LHC magnet circuits. This places more exacting demands on the new machine protection systems, which need to be readied for the task. Commissioning work for higher energies will be carried out in January, along with necessary adaptations to the hardware and software of the protection systems that have come to light during the 2009 run.

The success of the 2009 run is down to the skill and dedication of every one of you. Congratulations and thanks to you all.

Rolf Heuer

-- JamieShiers - 11-Dec-2009
