February 2010 Reports


25th February 2010 (Thursday)

Experiment activities:

  • Drained the merging jobs at CERN (the stalled-jobs issue); more MC jobs were submitted to the grid yesterday (4K jobs running).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • none
  • T2 sites issues:
    • EFDA-JET: illegal instruction
    • UKI-SOUTHGRID-BRIS-HEP: jobs aborting

24th February 2010 (Wednesday)

Experiment activities:

  • Ramped up MC productions + chaotic user analysis.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0 sites issues:
    • Having trouble with CASTOR at CERN: merging jobs got stalled because no CPU was being used. For a merging job this typically means that the connections to the disk server (lhcbdata) have been lost. It has been happening since yesterday and does not seem to be correlated with today's intervention.

queued_tranfers.png
No ticket has been opened yet; the issue is first being investigated on the LHCb side.

  • T1 sites issues:
    • none
  • T2 sites issues:
    • CBPF: object not found in the shared area
    • CY-01-KIMON: shared area issue

23rd February 2010 (Tuesday)

Experiment activities:

  • 8-9 MC requests to be verified prior to submission; the others are running fairly smoothly at a good rate of 10K concurrent jobs. For the stripping on minimum bias, tuning the retention factor, the number of events, the size of the output and the CPU length in order to accommodate the jobs in the available queues.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 2
  • other: 1

Issues at the sites and services

  • GGUS: observed a short unavailability this morning (#55807).
  • T0 sites issues:
    • LHCb will come up with a request to move their RDST space from T1D0 to T1D1 and to swap the shares between RDST and RAW.
  • T1 sites issues:
    • RAL: most likely some issue with gdss378 gridFTP server.
  • T2 sites issues:
    • g++ not found at INFN-TORINO and pilots not picking up payload at RRC-KI

22nd February 2010 (Monday)

Experiment activities:

  • The system was kept busy during the weekend by some MC productions and data stripping, though not very extensively. More productions were submitted this morning (currently 4K jobs).

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0 sites issues:
    • Reported very low efficiency transferring to CASTOR at CERN on Friday evening due to many gridFTP errors, as shown in the plot below. As a precaution, a mail was sent to Jan informing him that we were ramping up the activities and observing on Lemon an increased load on our lhcbdata pool. The problem seems to have gone away by itself, apart from RAL-CERN, where we still have failures at the source for file availability (to be investigated).

    • The ticket (TEAM, urgent) submitted for the above CASTOR issue is still in status "in progress" (set by the supporter fnaz this morning). Nobody had touched it since its original submission on Friday at 20:37 (CT0000000662919).

xfers_to_cern.png

  • T1 sites issues:
    • RAL: most likely some issue with gdss378 gridFTP server.
  • T2 sites issues:
    • Shared area issue at IL-TAU-HEP, g++ not found at BIFI, and SAM tests failing at egee.fesb.hr

19th February 2010 (Friday)

Experiment activities:

  • There are no large activities ongoing in the system right now. Mainly user problems due to a not completely transparent migration of the DIRAC backend (proxies not uploaded, and then jobs failing with proxy expiration; see the note below). For this weekend a couple of sizeable MC productions are about to come, plus stripping activity in the background at RAL, CNAF and CERN (the only T1s working for LHCb as of today because of the root/dcap incompatibility issue).
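    A hypothetical recovery sketch for the affected users (assuming the generic DIRAC client commands dirac-proxy-init and dirac-proxy-info; the exact LHCb wrapper scripts in use at the time may have differed):
      dirac-proxy-init   # create a fresh proxy (and, depending on the client configuration, upload it to the DIRAC ProxyManager)
      dirac-proxy-info   # check the time left on the proxy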

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 2

Issues with distributed services

  • SLS reported an outage for all T1s and the T0 read-only instance this morning at about 2:00 UTC. Perhaps connected with some known Streams replication issue?

SLS_replication.png

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
  • T2 sites issues:
    • Shared area problem at Barcelona
    • SQLite DB problem at CYFRONET-LCG2
    • Problems uploading output files at PDC

18th February 2010 (Thursday)

Experiment activities:

  • DIRAC was restarted yesterday according to the schedule. A few minor problems were reported by users, due to the changed interface and GANGA sticking to the old version of DIRAC; the GANGA/DIRAC alignment was promptly done. A small production was launched yesterday evening to test the brand-new setup and no major problems have been reported, with just a few percent of jobs failing. SAM activity wasn't impacted too much, while other information on SLS and the Dashboard had to be adapted to the new DIRAC. There are currently a few thousand jobs running in the system for some small MC productions (6M-12M events to be produced each), while some more sizeable production is about to come.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 5

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • none
  • T2 sites issues:
    • INFN-LHCb-T2: issue with permissions in the shared area
    • CESGA: issues uploading logs/sandbox to LogSE
    • Padova, Sofia and Bologna: issue with the g++ compiler missing. Worth mentioning it explicitly in the VO ID Card.

17th February 2010 (Wednesday)

Experiment activities:

  • DIRAC is switched off. No activity going on at all.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • none
  • T2 sites issues:

16th February 2010 (Tuesday)

Experiment activities:

  • All activities of the production team are focused on preparing the major migration to DIRAC v5r0. This implies testing the new (already certified) software in its new configuration, running on new machines. DIRAC will be switched off today at 17:00 (if this pre-production-like activity does not show any major problem).

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • Opened a low-priority ticket to track down the timeout issue accessing data via root at CERN, with some concrete information included to help the CASTOR team debug it.
  • T1 sites issues:
    • none
  • T2 sites issues:
    • none

15th February 2010 (Monday)

Experiment activities:

  • Very low-level user activity. Tomorrow at 17:00 DIRAC will be turned off, and on Wednesday the major s/w and h/w upgrade will take place.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • Reported some user jobs at CERN failing to access data, with requests timing out as happened last week. The problem was just transient.
  • T1 sites issues:
    • none
  • T2 sites issues:
    • none

12th February 2010 (Friday)

Experiment activities:

  • The system is draining the last MC productions in preparation for the major upgrade of all DIRAC central machines next Tuesday.
  • Released LCG58a from AA, which fixes the incompatibility issue preventing the use of dCache sites (currently banned).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 4

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • RAL: a disk server for the USER space token was not available for a short while yesterday.
  • T2 sites issues:
    • Shared area issues at: UFRJ-IF, ru-Moscow-SINP-LCG2 and INFN-NAPOLI-CMS.
    • SAM jobs failing at UKI-NORTHGRID-LANCS-HEP.

11th February 2010 (Thursday)

Experiment activities:

  • Not much activity happening (500 jobs in the system from a couple of users). Draining the previous MC productions, which are running at a very low level. No further MC will be submitted, so as to have an empty system for the upgrade of all DIRAC central machines next Tuesday.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • none

  • T1 sites issues:
    • All dCache sites: the file access issue reported yesterday was found to be due to a known incompatibility between the version of root (5.26/00a) used by the LHCb analysis application (DaVinci) since Monday and the dcap libraries.
    • Issue with the Spanish CA affecting all Spanish users and services running with Spanish credentials. Now fixed (the Spanish CA host had been blacklisted by mistake, so no certificates were downloaded at CERN).

10th February 2010 (Wednesday)

Experiment activities:

  • Yesterday small MC productions were submitted and run; today these same productions are complete and others are in the pipeline.
  • Internal problem with the software installation modules: only CNAF among the T1s seems to have the LHCb application properly installed. LHCb experts are looking at that.
  • SAM tests in the last 24 hours failed to publish to SAMDB. Now unblocked, but the ultimate cause is not clear, although it seems to be a problem with the dedicated VOBOX used for that at CERN.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 5
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • none

  • T1 sites issues:
    • SARA/NIKHEF: users with UK-issued certificates seem to experience problems accessing data
    • GRIDKA: users with UK-issued certificates seem to experience problems accessing data
    • IN2p3: users with UK-issued certificates seem to experience problems accessing data
    • pic: users with UK-issued certificates seem to experience problems accessing data
    • CNAF: users with UK-issued certificates seem to experience problems accessing data

9th February 2010 (Tuesday)

Experiment activities:

  • No large activity in the system today.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0 sites issues:
    • LFC: propagation of the trusted VOBOXes into the Quattor template. Has it been done?

  • T1 sites issues:
    • RAL: network intervention extended by about one hour. Some LHCb users noticed the outage.
    • GridKA: jobs having permission problems at the user script level on local WNs (ticket open).
    • CNAF: the CREAM CE is failing all jobs submitted through it.

  • T2 sites issues:
    • Jobs failing at BG05-SUGrid.
    • UKI-LT2-Brunel: SQLite DB problem.

8th February 2010 (Monday)

Experiment activities:

  • A few remaining MC productions ran to completion during the weekend; a few more productions were submitted (of the order of 10M events to be produced).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0 sites issues:
    • LFC Streams SAM tests failing against pic, RAL and CNAF all at the same time. We checked and found that an intervention on the downstream databases had been announced on the IT Status Board, but we missed it.

  • T1 sites issues:
    • NIKHEF: MC efficiency lower than anywhere else; long jobs landed in shortish queues. A GGUS ticket was opened and closed; most likely a problem on the application side.
    • GridKA: jobs having permission problems at the user script level on local WNs (GGUS ticket open, under investigation).
    • RAL: a shortage in the power supply affected the air conditioning system for a short while; no appreciable impact on LHCb. RAL: the network intervention was extended by one hour; users noticed the outage.

  • T2 sites issues:
    • Shared area issue at ITPA-LCG2.
    • Jobs failing at ITWM.

5th February 2010 (Friday)

Experiment activities:

  • Several MC productions were launched yesterday evening and are now running at a low rate of a few thousand jobs, having reached a peak of ~9k concurrently running MC simulation jobs last night. The plot below shows this activity over the last 24 hours, as reported by the SSB.

last_24_hs_MC_activity.png

GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 1

Issues at the sites and services

  • T0 sites issues:
    • SRM is unusable: LHCb opened a GGUS ticket after noticing SAM jobs failing since ~3:00 this morning, with all SRM requests timing out. Jan sent a mail at the same time reporting the odd state the SRM endpoint was in. There are a couple of possible reasons behind it, but LHCb does not believe it is putting a load on the system much larger than in the past (despite the increased activities in the last 24 hours). The issue was found to be related to new hardware delivered to the lhcbdata pool but not properly seen by LSF:
      1. high load on 'lhcbdata' by user "lhcbprod" (2.4k outstanding requests, these seem to write rather slowly (not at wire speed) to the handful of servers that have space left)
      2. SRM-2.8 bug that makes SRM a single point of failure across all pools (https://savannah.cern.ch/bugs/?45082; fixed in 2.9, but that needs further validation) - otherwise this load would only fail access to "lhcbdata", but now it affects other pools as well.

  • T1 sites issues:
    • IN2p3: the SRM endpoint became unresponsive: both SAM tests and normal activity from our data manager were failing with the error below. The suspicion is that some CA certificate is not properly updated on the remote SRM, in this case the CERN CA. The SRM was restarted.
[SE][srmRm][] httpg://ccsrm.in2p3.fr:8443/srm/managerv2: CGSI-gSOAP running on lxplus303.cern.ch reports Error reading token data header: Connection closed

    • RAL: an Oracle glitch reported by our contact person was immediately fixed by RAL, but it might have caused some jobs to fail/stall due to a temporary service unavailability.
  • T2 sites issues:
    • ITWM: one WN acting as black hole

4th February 2010 (Thursday)

Experiment activities:

  • Moderate user activity (~1000 jobs running concurrently).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 4
  • Other: 2
    • VOMS: 1 (via the new third party GGUS feature)
    • GGUS: 1

Issues at the sites and services

  • T0 sites issues:
    • The newly delivered VOBOXes have been registered in the list of trusted hosts of the LFC master. To be effective, this change has to be propagated to the Quattor template.

  • T1 sites issues:
    • None
  • T2 sites issues:
    • Lancaster: issue with the mount point of the SL5 sub-cluster shared area. They are using a single endpoint for the CE service, pointing to two different sub-clusters with different OSes.
    • IL-TAU-HEP shared area issue
    • UKI-LT2-IC: wrong mapping of Role=production (again the issue of too generous a mapping when the gridmap-file mechanism is used).
    • CYFRONET-LCG2: problems installing software
  • Services:
    • New GGUS portal: TEAM tickets lose the information about the concerned VO, ending up as affecting "none" instead of the expected "lhcb".
    • VOMS <-> VOMRS synchronization: probably due to the original UK CA certificate with which a user was first registered (and which expired a long while ago). Steve fixed it by hand with Tania's help.
      • Many other users are potentially affected by this.

3rd February 2010 (Wednesday)

Experiment activities:

  • No activity.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 1

Problems at the sites

  • T0 sites issues:
    • Request to verify whether the master instance of the LFC at CERN has the newly delivered VOBOXes in its list of trusted hosts.
    • Problem of synchronization between VOMS and VOMRS.
  • T1 sites issues:
    • IN2p3: Banned because of the intervention on BQS backend
  • T2 sites issues:
    • Shared area issue at Manchester.

2nd February 2010 (Tuesday)

Experiment activities:

  • No activity. Software week. Under discussion within the collaboration: extending to T2s (under specific and restrictive conditions) the possibility of hosting distributed analysis (besides the T1s and the CERN CAF). Of interest is this talk with the proposed amendments to the LHCb Computing Model (to be approved by the CB).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Problems at the sites

  • T0 sites issues:
    • Got another machine behind the LFC_RO at CERN. Everything seems to be OK.
    • The CASTOR intervention took longer than expected (one more hour).
  • T1 sites issues:
    • IN2p3: perturbation with the new MySQL DB put behind BQS last Tuesday, causing some jobs not to be submitted through.
    • RAL: LFC SAM tests for Streams replication have been failing systematically since yesterday, most likely due to the 3D intervention there.
  • T2 sites issues:

1st February 2010 (Monday)

Experiment activities:

  • No activity. Software week

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1

Problems at the sites

  • T0 sites issues:
    • CASTOR intervention tomorrow: LHCb asked for running jobs in its dedicated queues (grid and non-grid) to be suspended.
  • T1 sites issues:
    • IN2p3: 1 h unscheduled downtime on Saturday, but the notification from the CIC portal was received 13 hours after the END (see the picture below).
  • T2 sites issues:
    • ITEP: wrong mapping of the FQAN /lhcb/Role=production. The issue is more general and affects the backup mapping solution based on old static gridmap files. LHCb clearly states in its VO ID Card that static mapping through the gridmap file should only map users to the normal pool account (.lhcb) and not to any other super-privileged account (see the illustrative sketch below the picture).

unscheduled_in2p3.bmp
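
As an illustration of the mapping policy mentioned above, a minimal sketch of the kind of static grid-mapfile entry LHCb expects (the DN below is made up; the leading dot marks a pool account, so users are mapped to generic lhcb pool accounts rather than to a privileged one):

  "/DC=org/DC=example-grid/O=SomeInstitute/CN=Some User" .lhcb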

-- RobertoSantinel - 29-Jan-2010
