December 2009 Reports

18th December 2009 (Friday)

Experiment activities:


  • A few remaining jobs of one MC production. In total, fewer than 1000 jobs in the system now.

GGUS (or RT) tickets: 2 new tickets since yesterday

T0 sites issues: none

T1 sites issues: none

  • Some files were discovered yesterday to be unavailable at RAL. These files are sitting on a disk server that was disabled.

T2 sites issues:

  • Yet another site with shared area problems.

17th December 2009 (Thursday)

Experiment activities:


  • Just one MC production running at a low level, plus sporadic user activity.

GGUS (or RT) tickets: 1 new ticket since yesterday

T0 sites issues: none

T1 sites issues: none

T2 sites issues:

  • Shared area issues. Tickets opened against the affected sites.

16th December 2009 (Wednesday)

Experiment activities:


  • Here is the agenda of the last Collaboration meeting, where the first data-taking experience is reported and how everything (Grid Computing included) behaved smoothly.
  • Yesterday afternoon we launched (and completed successfully) the reprocessing of the 450GeV collision data using the right version of the Brunel application. This is visible from the plot in the figure below, showing the number of these reprocessing jobs run in the last 24h.
  • A couple of MC productions running at a pace of 2K concurrent jobs.

[Figure: reprocessing_jobs.png — reprocessing jobs run in the last 24h]

GGUS (or RT) tickets: 2 new tickets since yesterday

T0 sites issues: none

T1 sites issues: nothing new (the dCache issue at SARA and IN2p3 is still open)

T2 sites issues:

  • Shared area issues at 2 T2s.

15th December 2009 (Tuesday)

Experiment activities:

  • Reported today in LHCb that ~400K collisions at 450GeV per beam have been collected in total so far. This number might slightly increase in the next few hours before the shutdown.
  • No large production activity ongoing. A few COLLISION09-type files were received last night; no MC productions running at all.
  • Just a few hundred user analysis jobs.

GGUS (or RT) tickets: 1 new ticket since yesterday

T0 sites issues:

  • WMS216 has problems submitting to CREAMCEs. A GGUS ticket has been opened.
  • The SSB seemed to suffer some problems this morning, not refreshing information. Alerted the supporters; the problem is now gone.
  • Looking for a suitable slot (in January) to deploy the new privileges in the production stagers for LHCb. We have been in touch with the CASTOR developers and confirmed that the setup on a development replica instance of the stager was OK.

T1 sites issues:

  • SARA and IN2p3: two tickets are open (for similar file access problems). The SEs are currently banned from the mask; discussions are ongoing in LHCb to decide whether we want to risk having them off for the whole Xmas break. The ball is now in the dCache developers' hands (dCache ticket 5313), since the issue is due to a Java third-party library.

14th December 2009 (Monday)

Experiment activities:

  • Plans over Xmas: of course, we don't anticipate massive re-processing during the Xmas break and January. We have no particular MC requests pending; even if some come up during the week, they are unlikely to fill our share over 2 weeks. Analysis will go on (MC09 and real data) at Tier-1s. This is what we can say after collecting ~250,000 collisions.
  • On Saturday the ONLINE-CASTOR link went down starting from 1:30 am because of a problem in the LHCb core router. No files were transferred to offline, but the data were safe in the ONLINE buffer. As soon as the last run completed on Sunday (at ~11) the ONLINE people worked to send them to offline; indeed runs 63807, 63809, 63813, 63814 and 63815 were sent to offline, migrated to CASTOR tape and reconstructed in less than 2 hours.
  • MC production running at a low pace. Nonetheless, about 50% of the jobs are crashing (again something related to the core application).
  • Curiosity: last night (4-6 am) LHCb recorded the world's highest-energy pp collisions (as did the other experiments!). [Figure: nice_event.jpg]

GGUS (or RT) tickets: 2 new tickets since Friday

T0 sites issues:

  • Received a "no read" alarm on our read-only LFC instance. The problem was due to a DIRAC agent on volhcb10 that went crazy because of an expired proxy, looping infinitely and hammering the catalog (see the sketch below).
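
As an illustration only (this is not the actual DIRAC agent code), here is a minimal sketch of the kind of guard that avoids this failure mode: check the remaining proxy validity before each catalog query and back off instead of retrying in a tight loop. The helper names proxy_time_left and query_lfc are hypothetical placeholders.

import time

PROXY_MIN_VALIDITY = 600   # seconds of proxy lifetime required before querying
BACKOFF = 300              # seconds to sleep instead of hammering the catalog

def proxy_time_left():
    """Hypothetical placeholder: return the remaining proxy validity in seconds
    (in practice obtained from something like `voms-proxy-info -timeleft`)."""
    return 0  # dummy value for this sketch

def query_lfc(lfn):
    """Hypothetical placeholder for a single read-only LFC lookup."""
    print("would query the catalog for", lfn)

def catalog_cycle(lfns):
    """Process a list of LFNs, but stop and back off if the proxy has (nearly)
    expired, rather than looping forever against the catalog."""
    for lfn in lfns:
        if proxy_time_left() < PROXY_MIN_VALIDITY:
            time.sleep(BACKOFF)  # give up on this cycle; let the proxy be renewed first
            return False
        query_lfc(lfn)
    return True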

T1 sites issues:

  • IN2p3: users having data access problems there using the gsidcap protocol. A GGUS ticket has been opened against IN2p3. This resembles the issue experienced at SARA.
  • The LFC issue reported at RAL was a consequence of the 3D streaming incident (a couple of weeks ago), when we decided to keep the RAL and GridKA LFCs unsynchronized.

11th December 2009 (Friday)

Experiment activities:

  • Thanks to the very stable beam, there was fairly intensive activity (compared with previous days) from the pit last night and this morning, with tens of large (>3GB) COLLISION09-type files arriving and migrating to tape for later reconstruction; many more files for the calibration of various sub-detectors were also received.
  • Rolled back to direct file access (rfio/dcap).

GGUS (or RT) tickets: 6 new tickets since yesterday

T0 sites issues:

  • WMS: patch 3489 (which fixes bug #59054) urgently has to be rolled out on our WMS instances at CERN (wms203 and wms216) before Xmas. It indeed looks like, once we reach 1500 queued jobs there, the WMS will also stop receiving CondorG jobs, not just CREAMCE SAM test jobs. The request has been submitted via a GGUS ticket. [Figure: wms203_ICE.PNG]

T1 sites issues:

  • SARA: users reporting failures accessing data ("file doesn't exist"). The Netherlands Storage Element has been banned from the production mask.
  • RAL: LFC read-only slave: "No user mapping" issue.
  • GridKA: some issues reported with the shared area there (timeouts setting up the environment).

T2 sites issues:

  • Reopened the tickets against MILANO (SQLite issue) and GRIF (shared area) that got lost after the GGUS portal upgrade, as reported yesterday. Opened a ticket for an SQLite issue at Liverpool.

10th December 2009 (Thursday)

Experiment activities:

  • System now empty. Received 5 files from ONLINE and reconstructed them promptly.
  • One of the shifters reports that two LHCb TEAM GGUS tickets newly created in the last few days seem to have disappeared from the system after the recent November upgrade. The tickets were opened against GRIF and INFN-MILANO-ATLASC by user Vladimir Romanovskiy. A GGUS ticket has been opened for this problem (54002).

GGUS (or RT) tickets: 1 new ticket since yesterday

T0 sites issues:

  • none

T1 sites issues:

  • IN2p3: they fixed the published MaxCPUTime limit, which in the past caused the time-left utility to wrongly estimate the remaining time, with the batch system then killing jobs for exceeding the limit (see the sketch below). No more failures have been observed because of that problem.
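
For context, a minimal sketch (not the real DIRAC time-left utility) of why a wrongly published MaxCPUTime misleads jobs: the estimate is essentially the published limit minus the CPU already consumed, so an inflated limit makes a job believe it has far more time than the batch system will actually grant. Real implementations also normalise by the worker node's CPU power factor, which is ignored here; the numbers in the example are hypothetical.

def estimated_time_left(published_max_cpu_s, cpu_consumed_s):
    """Naive time-left estimate: published queue limit minus CPU already used."""
    return max(0, published_max_cpu_s - cpu_consumed_s)

# Hypothetical numbers: a queue that actually enforces 48h but publishes 72h.
print(estimated_time_left(72 * 3600, 47 * 3600))  # 90000 s "left": the job would pick up more work
print(estimated_time_left(48 * 3600, 47 * 3600))  # 3600 s left: the job would wind down in time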

9th December 2009 (Wednesday)

Experiment activities:

  • This morning received (and migrated to tape in just ~1 hour) a COLLISION09-type file, which was then reconstructed.
  • Running MC simulation at a very low level (in total fewer than 1K jobs in the system).
  • Curiosity: there is possible evidence for a Lambda0 mass peak in the first data, albeit with very few events in the plot after cuts.

GGUS (or RT) tickets: 1 new ticket since yesterday

T0 sites issues:

  • none

T1 sites issues:

  • A permission issue at SARA during an old test-data clean-up activity.

8th December 2009 (Tuesday)

Experiment activities:

  • Reconstruction of the (few) collision data was launched on Sunday, but reprocessing was done only yesterday (in just one hour) after fixing the problem with the Brunel application crashing and once the data conditions had been updated by hand. Everything went like a dream using the scheme of first downloading the data onto the WN and then opening it locally (see the sketch after this list).
  • Also a big achievement as far as Moore and physics stripping (b/c-inclusive and minimum bias) are concerned: they managed to run over all data (100%, except for RAL where we lost data). Merging has been flushed and produced the final DSTs.
  • At lower priority (than real-data reconstruction or MC stripping), also running MC simulation that keeps the non-T1 sites warm. Discussed at today's TF meeting (and also visible from the SSB) how roughly 25% of the resources can be considered wasted because of some internal problem with the LHCb Gaudi applications causing jobs to fail. This has to be addressed by the LHCb core application people.
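
Purely as an illustration of the two input-access modes mentioned above (this is not the actual DIRAC/Gaudi machinery), a minimal sketch follows; the copy command and the destination syntax are assumptions, and any Grid copy tool with a "<cmd> <source> <destination>" interface would do.

import os
import subprocess
import tempfile

def input_direct_access(turl):
    """Direct-access mode: hand the remote protocol URL (rfio/dcap/gsidcap)
    straight to the application, which then reads over the network."""
    return turl  # e.g. "gsidcap://host:22128//pnfs/.../file.raw"

def input_download_first(turl, copy_cmd="lcg-cp"):
    """'Download then open locally' mode: copy the input to the worker node's
    scratch area first and give the application a plain local path, so all
    reads become local POSIX I/O.  copy_cmd is an assumption for this sketch."""
    scratch = tempfile.mkdtemp(prefix="job_input_")
    filename = os.path.basename(turl).split("?", 1)[0]
    local_path = os.path.join(scratch, filename)
    # The exact destination syntax ("file:" prefix or plain path) depends on the tool used.
    subprocess.check_call([copy_cmd, turl, "file:" + local_path])
    return local_path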

GGUS (or RT) tickets: 1 new ticket since Friday

T0 sites issues:

  • VOLHCB09 had another swap-full alarm and the Python process was killed. Suspicion falls on one of the new pieces of code, which has been temporarily disabled.
  • Considering the very low DAQ rate, and in order to demonstrate how quickly the whole system can serve data to end users, the CASTOR team has been asked to relax the migration-to-tape policies a bit for the remaining days of data taking this year, moving from the current 8 hours to the lowest threshold feasible in CASTOR.

T1 sites issues:

  • The dcap file access issue at SARA and IN2p3 reported over the weekend has been confirmed to have the same root cause (thanks to Ron and Lionel), and the dCache developers have been alerted by Lionel.

7th December 2009 (Monday)

Experiment activities:

  • Interesting weekend with collisions and the magnetic field on: some data were fully reconstructed and made available in the LHCb BK for users. LHCb was prepared to receive 1 million collisions and we barely got 15,000. Some issues were observed both with the code (Brunel crashes) and with the Conditions database (wrong magnetic field). Both are being fixed now and reprocessing will start later today (not a big deal).
  • Following some problems observed at T1s, DIRAC will move to what had been decided long ago: data copy for reconstruction jobs. This will guarantee that data at least get reconstructed.
  • A few MC simulation requests received from the PPG are now running at a low pace in the system.
  • User jobs.

GGUS tickets: 6 new tickets since Friday

T0 sites issues:

  • VOLHCB09 also had a swap-full alarm during this weekend, preventing data from being accessed through the hosted BK service. The sysadmin found a Python process eating the swap. The process has been killed and the DIRAC experts are looking at the source of the problem.
  • Observed (and opened a TEAM ticket about) an anomalous delay in migrating real-data files to tape. The picture shows what LHCb felt as a problem; this triggered a discussion about data-migration policies. Next time we will be more prepared and confident in opening GGUS tickets when we suspect a real problem is happening. [Figure: castor_raw.gif]

T1 sites issues:

  • SARA: issue accessing raw data (both from NIKHEF and SARA WNs)
  • IN2p3: issue accessing raw data. The symptoms are similar to the ones observed at NL-T1:
Unknown code z 0
Unknown code z 0
Error in <TDCacheFile::TDCacheFile>: file gsidcap://ccdcacsn123.in2p3.fr:22128//pnfs/in2p3.fr/data/lhcb/data/2009/RAW/FULL/LHCb/COLLISION09/63480/063480_0000000001.raw?filetype=raw does not exist
2009-12-06 18:45:57 UTC IODataManager       ERROR Error: connectDataIO> Cannot connect to database: PFN=root:gsidcap://ccdcacsn123.in2p3.fr:22128//pnfs/in2p3.fr/data/lhcb/data/2009/RAW/FULL/LHCb/COLLISION09/63480/063480_0000000001.raw FID=8915ca90-e24d-11de-bfa2-00188b8565c8
2009-12-06 18:45:57 UTC EventSelector.D...  ERROR Failed to connect to:LFN:/lhcb/data/2009/RAW/FULL/LHCb/COLLISION09/63480/063480_0000000001.raw

The error (quoting Ron) seems to be related to a third-party library that might affect all dCache centers. We passed this bit of information to the various sysadmins (via GGUS).

4th December 2009 (Friday)

Experiment activities:

  • Mainly user activity going on.

GGUS tickets: 0 new tickets since Friday

T0 sites issues:

  • VOLHCB09 has a swap-full alarm and we cannot log in to the machine to see what happened. The sysadmin is looking at the problem.
  • Joel sent a mail to the administrator of the machine lxbra2509 (hosting the old LHCb AMGA service) asking to stop and completely remove the AMGA software from there. Farida agreed to do so.

T1 sites issues:

  • nothing

3rd December 2009 (Thursday)

Experiment activities:

  • LHCb week
  • There should be stable beam starting on Saturday morning, and LHCb is going to take data with the magnetic field ON. Since the SQLDDDB cannot give the correct value of the magnetic field, it is very important to be sure that the online DB is propagated correctly to all sites and that production is set up to use Oracle. This means that ConditionDB access and its stream replication become crucial this weekend. The operations team has been alerted to stay tuned so as not to be the show-stopper for physicists having a look at the first data.

GGUS tickets: 0 new tickets since Friday

T0 sites issues:

  • The issue reported yesterday about a bug affecting submission to CREAMCEs has a patch (3489) that is already certified and now in staged rollout, ready to be deployed at sites.
  • The AMGA server, after the power cut, was rebooted and brought back to life, hammering the LHCb DB. This service is not used anymore and should be removed.
  • Received a request to schedule an intervention on CASTORLHCB to apply level 2.1.8-17, and on SRM-LHCB to apply 2.8-5, for this Tuesday (8 December). Considering that even small hiccups have to be avoided by Monday, LHCb strongly discouraged doing it on either Monday or Tuesday and suggested this afternoon's slot instead. It has been decided to carry it out at 14:00 today. The intervention is, however, not proving to be as transparent as claimed, with many users complaining about CASTOR unavailability after 10 minutes ;-). The intervention is not in GOCDB as UNSCHEDULED.
[lxplus302] ~ $ stageqry 
send2stgd: STG00 - stage daemon not available on castorlhcb (Connection refused)
STG161 - Stage not available or in pause mode - Please wait

T1 sites issues:

  • RAL: requested a Post Mortem on Monday's data-loss incident. Agreed by Gareth.

2nd December 2009 (Wednesday)

Experiment activities:

  • LHCb week
  • Reprocessing and stripping in the last 24 hours: 80 jobs (see SSB). Mainly user activity going on.

GGUS tickets: 2 new tickets since Friday

T0 sites issues:

  • One of the LHCb dedicated WMS services (wms216) has run into trouble submitting to CREAMCEs. It seems that bug #59054 is the reason for this problem, and the machines at CERN (but not only there) must be upgraded with the patch.
  • A power cut last night affected many disk servers spread across all LHCb TxD1 space tokens. Zero space was returned and many activities were consequently affected. The disk servers were not on the critical power supply and had to be restarted manually this morning. Other perturbations were also observed (like acron jobs timing out). All VOBOXes restarted properly.

T1 sites issues:

  • LHCb is evaluating a Squid-based mechanism for software distribution at PIC.

1st December 2009 (Tuesday)

Experiment activities:

  • At the ongoing LHCb week the plans until the end of 2009 have been presented. The two-fold primary goal by the end of this year is to power ON the full detector (RICH included) and calibrate/polish all channels. This week they plan to switch on some other sub-detectors (VELO and ST included) to take alignment data with the magnet OFF. This alignment with collisions will proceed until 10K events are collected (~1/2 hour) and should have only a marginal impact on OFFLINE and Grid activities. For that reason LHCb is eagerly looking forward to stable beams (maybe starting from Thursday?).
  • Running at a very low rate the L0HLT and then physics stripping of MB and c/b-inclusive events from MC09 (of the order of tens of jobs). The OFFLINE system is basically empty.

GGUS tickets: 0 new tickets since Friday

T0 sites issues:

  • none

T1 sites issues:

  • RAL: a double disk failure in a disk array for LHCb caused many stripped (or being-stripped) data to be definitively lost. Envisaging the possibility of requiring a PM, even though this will cost some man-hours.

-- RobertoSantinel - 05-Jan-2010
