July 2010 Reports


30th July 2010 (Friday)

Experiment activities:

  • Received some data. Reconstruction, Monte Carlo, and user analysis jobs running.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0 site issues:
    • nothing
  • T1 site issues:
    • SARA: problems with NIKHEF_M-DST only
  • T2 site issues:
    • nothing

29th July 2010 (Thursday)

Experiment activities:

  • Received some data last night (3 out of 7 runs sent to OFFLINE). Much more expected tonight. A false alarm was raised when data was not yet migrated to tape on CASTOR. No disturbance following the intervention on SRM.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • Intervention on SRM to upgrade it to 2.9.4.
  • T1 site issues:
    • GridKA: GGUS:60647. Pilots aborting at GridKA. The problem seems related to 2 (out of 3) CREAM CEs there.
    • SARA: GGUS:60603. Reported timeouts retrieving tURL information for a set of files on the M-DST space token. It seems to have been a general problem on their SE from 2am till 7am UTC, and they claim the timeout in DIRAC is a bit too aggressive. We never saw this problem before (see the sketch after this list).
  • T2 site issues:
    • none
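
For context on the timeout discussion above: a client can trade a single aggressive deadline for a longer one plus retries. Below is a minimal Python sketch of that pattern; resolve_turl is a hypothetical stand-in, not the actual DIRAC or SRM API.

```python
import time

def resolve_turl(surl, timeout):
    """Hypothetical stand-in for an SRM tURL lookup; raises on timeout."""
    raise TimeoutError("SE did not answer within %ss" % timeout)

def resolve_turl_with_retries(surl, timeout=300, retries=3, backoff=60):
    """Retry the lookup with spaced-out attempts instead of failing at once.

    A more forgiving deadline plus retries can ride over transient SE
    degradations such as the 02:00-07:00 UTC window reported above.
    """
    for attempt in range(1, retries + 1):
        try:
            return resolve_turl(surl, timeout=timeout)
        except TimeoutError:
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)  # linear backoff between attempts
```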

28th July 2010 (Wednesday)

Experiment activities:

  • Many MC production jobs running (15K in the last 24 hours) and user analysis. Data reprocessing is running to completion, mainly at the NL-T1 site, where 90% of the jobs are failing to resolve data.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • volhcb12: the intervention ran without major problems.
    • Tomorrow LHCb can give the green light to run the intervention on SRM.
  • T1 site issues:
    • SARA: Problem with shared area (GGUS:60571)
    • SARA:GGUS:60603 Reported timeouts retrieving tURL information of a set of files on M-DST space token.
    • IN2p3: still some evidence of the shared area problem, with initialization scripts timing out
  • T2 site issues:
    • none

27th July 2010 (Tuesday)

Experiment activities:

  • Activities dominated by several MC production and user analysis. Data reprocessing is running to completion.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • Received a request to run an intervention on volhcb12 (development machine). No major problem, but not today.
    • Received a request to retire the out-of-warranty machine volhcb10. Still awaiting a reply from LHCb.
    • No major objections received to running the transparent intervention on SRM.
  • T1 site issues:
    • SARA: Problem with shared area (GGUS:60571)
    • SARA: reported timeouts retrieving tURL information for a pool of user data (GGUS:60603)
    • IN2p3: still some evidence of the shared area problem, with initialization scripts timing out (GGUS:59880)
  • T2 site issues:
    • none

26th July 2010 (Monday)

Experiment activities:

  • MC production running at full steam (27K jobs run in the last 24 hours); reprocessing of data ongoing reasonably smoothly almost everywhere. Also many user jobs running in the system, which makes LHCb the most active VO this week in terms of CPU time consumed.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 3
  • T2: 3

Issues at the sites and services

  • T0 site issues:
    • none
  • T1 site issues:
    • RAL: after the memory limit was increased to 3GB we no longer ran into problems there (apart from a few jobs running out of wall clock time)
    • SARA: after the SE was reintegrated, reconstruction started running smoothly over the weekend
    • IN2p3: ~250 jobs stalled. Some were killed by BQS (exceeding CPU or memory limits); another fraction is under investigation
    • CNAF: jobs failing while the pilot stays alive: GGUS:60445
    • CNAF: LFC replication problem. It has been excluded that this is related to the recent migration of the 3D database. GGUS:60458
    • PIC: many jobs (440) failed suddenly at 1 am on Sunday (GGUS:60451). This was due to a fraction of the farm being halted because of a cooling problem in the center.
  • T2 site issues:
    • Jobs failing at UKI-LT2-UCL-HEP and DORTMUND. Shared area issue at GRIF.

23rd July 2010 (Friday)

Experiment activities:

  • MC production running at lowest priority, to let reconstruction and analysis run first

GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • none
  • T1 site issues:
    • RAL: suffering job failures due to exceeded memory (the limit is a bit lower than what the new data requires). Requested an increase of the memory limit to 3GB (GGUS:60413); see the sketch after this list.
    • SARA: downtime AT RISK; many users affected when accessing data. The storage was banned yesterday. What is the status of this intervention? (We need to run reconstruction at SARA.)
    • IN2p3: shared area issue: 25% of reconstruction jobs failed today because of it (GGUS:59880)
    • IN2p3: CREAM CE cccreamceli03.in2p3.fr: the published information has been corrected, to be confirmed (GGUS:60366)
  • T2 site issues:
    • Shared area issue at some sites.
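
For context on the 3GB figure above: batch systems typically enforce such memory caps via kernel resource limits on the job process. Below is a minimal sketch of the mechanism, with illustrative values only, not RAL's actual configuration.

```python
import resource

# Cap the job's address space the way a batch-system memory limit would.
# 3GB matches the increase requested in GGUS:60413; the mechanism shown
# here is illustrative, not RAL's actual setup.
LIMIT_BYTES = 3 * 1024 ** 3

resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))
# From here on, allocations pushing the process past 3GB raise MemoryError.
```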

22nd July 2010 (Thursday)

Experiment activities:

  • A lot of MC production ran in the last 24 hours, and a lot of new data needs to be reconstructed and made available to users for ICHEP. Playing with internal priorities to boost the reconstruction activity.
  • All T1 site admins supporting LHCb are invited to update their WMSes with patch 3621. It has been running happily in production at CERN for weeks and should cure the CREAM CE problems.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • Keeping, for traceability, the still ongoing VOMS-VOMRS synchronization issue (GGUS:60017); awaiting the expert's return.
  • T1 site issues:
    • CNAF: GGUS:60314 ("HOME not set" issue, as seen at NIKHEF). Solved: problem with LDAP.
    • RAL: reported that some SAM jobs for FTS are failing.
    • PIC: noticed some SAM tests failing because of the shared area.
    • SARA: after the intervention, the queues there are not matched by the LHCb production JDL (GGUS:60343)
    • IN2p3: shared area issue: going to deploy a solution adopted at CERN across all WNs (GGUS:59880)
    • IN2p3: CREAM CE cccreamceli03.in2p3.fr does not seem to be publishing properly and is therefore not matched. GGUS:60366
    • IN2P3: all transfers to IN2p3 were failing. SRM bug; the service needed to be restarted at Lyon. GGUS:60329
  • T2 site issues:
    • none

21st July 2010 (Wednesday)

Experiment activities:

  • ~10K MC jobs plus user analysis. Reconstruction almost done. The backlog of jobs that formed in DIRAC has been drained by restarting one agent.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 3
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • Keeping, for traceability, the still ongoing VOMS-VOMRS synchronization issue (GGUS:60017); awaiting the expert's return.
  • T1 site issues:
    • CNAF: GGUS:60314 ("HOME not set" issue, as seen at NIKHEF)
    • RAL: following the intervention on Oracle yesterday, the LFC there was unreachable (GGUS:60274)
    • IN2p3: shared area still problematic (SAM jobs failing there; old ticket still open, GGUS:59880)
    • IN2p3: Inconsistency between BDII and SAMDB/GOCDB (GGUS:60321).
  • T2 site issues:
    • none

20th July 2010 (Tuesday)

Experiment activities:

  • There are some MC productions ongoing and a few remaining runs to be reco-stripped. Activity is dominated by user jobs, and a new group (lhcb_conf) has been added in DIRAC with the highest priority, so that jobs for the forthcoming ICHEP conference run with the highest precedence (a toy sketch of the idea follows below).
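
As a toy sketch of the priority mechanism described above (invented numbers, not DIRAC code): jobs from higher-priority groups are popped from the queue first, so the new lhcb_conf group always goes ahead.

```python
import heapq

# Invented priorities for illustration; lhcb_conf gets the highest value.
GROUP_PRIORITY = {"lhcb_conf": 10, "lhcb_prod": 5, "lhcb_user": 2}

def make_queue(jobs):
    """Heap keyed on negative priority so the highest-priority job pops first."""
    heap = []
    for seq, (group, job_id) in enumerate(jobs):
        heapq.heappush(heap, (-GROUP_PRIORITY[group], seq, job_id))  # seq breaks ties
    return heap

queue = make_queue([("lhcb_user", "j1"), ("lhcb_prod", "j2"), ("lhcb_conf", "j3")])
while queue:
    _, _, job_id = heapq.heappop(queue)
    print(job_id)  # j3 (lhcb_conf) comes out first
```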

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • The cooling problem in the vault reported yesterday also affected some LHCb VOBOXes, some of which host critical services for LHCb (sandboxing service). Through CDB templates the priority of these nodes has been raised, while other VOBOX priorities have been relaxed.
    • Keeping for traceability the still ongoing issue VOMS-VOMRS syncronization (GGUS:60017) awaiting for expert to come back.
  • T1 site issues:
    • SARA: due to an AT RISK intervention on the SARA storage, some user jobs were failing: they were declared stalled and then killed.
    • RAL: AT RISK intervention on various services today; the 3D database for LHCb went down for a while. Rebooting to roll back was no problem.
  • T2 site issues:
    • It seems we are again hitting the problem of uploading output from UK sites (and not only UK: Barcelona first of all).

19th July 2010 (Monday)

Experiment activities:

  • Reconstruction productions are progressing despite the internal issue in DIRAC, with many jobs (70K) waiting in the central Task Queue. Some MC productions running concurrently (~10K jobs running in the system).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • The VOMS-VOMRS syncronization problem still there (GGUS:60017)
    • problem with the motherboard of one of the CERN VOBOXES (volhcb12)
  • T1 site issues:
    • GridKA: now running at full capacity the reconstruction of the runs associated to the site, according to the CPU share. No problems.
    • NIKHEF: "HOME not set" issue: GGUS:60211. Restarted the nscd daemons that were stuck on some WNs.
    • IN2p3: shared area instabilities (GGUS:59880). The problem seems related to a site-internal script that makes WNs temporarily unavailable. Requested to increase the timeouts to 20-30 minutes; see the sketch after this list.
    • SARA: received a list of files lost following the recent data corruption incident.
    • PIC: in scheduled downtime tomorrow.
    • RAL: received the early announcement of a major upgrade of their CASTOR instance to 2.1.9. LHCb will give any help needed to validate it, and by the time of this intervention an instance of the LHCb HC will most likely be in place to contribute to the validation.
  • T2 site issues:
    • none
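
To illustrate the timeout increase requested from IN2p3 above: a job wrapper can poll for the shared software area for up to 20-30 minutes instead of failing at the first miss. A minimal sketch, with a hypothetical mount point (not the site's actual script):

```python
import os
import time

def wait_for_shared_area(path, max_wait=30 * 60, poll=60):
    """Poll for the shared area, riding over a temporary WN blackout."""
    deadline = time.time() + max_wait
    while time.time() < deadline:
        if os.path.isdir(path) and os.access(path, os.R_OK):
            return True
        time.sleep(poll)
    return False

# Hypothetical shared-area path, for illustration only.
if not wait_for_shared_area("/opt/exp_soft/lhcb"):
    raise RuntimeError("shared area still unavailable after 30 minutes")
```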

15th July 2010 (Thursday)

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 1

Issues at the sites and services

  • T0 site issues:
    • NTR
  • T1 site issues:
    • Pilots aborting at INFN-T1. Very urgent. GGUS:60104
    • The ticket against FZK-LCG2 is closed and verified. GGUS:60087. Site running at 20% capacity, CPU share momentarily reduced.
  • T2 site issues:
    • Site SNS-PISA removed from production after they closed the LHCb queue. Ticket unsolved. GGUS:55017

14th July 2010 (Wednesday)

Experiment activities:

  • Many reconstruction productions running, as well as MC. 35K jobs were submitted in the last 24 hours.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • NTR
  • T1 site issues:
    • FZK-LCG2: SAM tests failing, problems staging and accessing files. SEs put out of the mask. GGUS:60087
    • Shared area degradation at IN2P3: problem understood and reproduced. GGUS:59880
  • T2 site issues:
    • Transfers from UK (and other) sites to CERN: seems OK for the UK sites, not yet for the Spanish ones. GGUS:59422

13th July 2010 (Tuesday)

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0 site issues:
  • T1 site issues:
    • NTR
  • T2 site issues:
    • Jobs failing at IN2P3-LAPP. Possible shared area problems, might be temporary. GGUS:60040

12th July 2010 (Monday)

Experiment activities:

  • Reconstruction production, on a selected number of runs (only T1 sites involved), is running. The only problems are at GridKA (power cut) and IN2P3 (shared area).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • Downtime notifications: no notifications arrived over the weekend (e.g., the GridKA notification was not received).
  • T1 site issues:
    • Degradations in IN2P3 shared area. Still no solution provided. GGUS:59880

9th July 2010 (Friday)

Experiment activities:

  • Test reconstruction production stopped; a new test production was launched on smaller files. No MC ongoing.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0/T1 site issues:
    • NTR
  • T2 site issues:
    • Shared area quota not sufficient at INFN-TORINO; site now out of the mask: GGUS:59912
    • Shared area problem at INFN-LNS; site out of the mask: GGUS:59917

8th July 2010 (Thursday)

Experiment activities:

  • Reconstruction production, on a selected number of runs (only T1 sites involved), is running.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CREAM CE still aborting many pilots, more investigations requested. GGUS:59559 reopened
  • T1 site issues:
  • T2 site issues:
    • NTR

7th July 2010 (Wednesday)

Experiment activities:

  • Just launched a new reconstruction production, on a selected number of runs (only T1 sites involved).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0/T1 site issues:
    • CERN and CNAF CREAM CE patched, GGUS:59559 closed
  • T2 sites issues:
    • Closed tickets regarding certificates not updated at a couple of sites: GGUS:59734, GGUS:59735.
    • Sites WEIZMANN-LCG2 and RO-07-NIPNE put back in the production mask.

6th July 2010 (Tuesday)

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 2

Issues at the sites and services

  • T1 site issues:
    • Problems with SARA storage: since there is a possibility of data corruption, some space tokens have been excluded. LHCb can still write to tape and stage from it. We can write to T1D1 (100TB will not be available). User disk is not available.
  • T2 site issues:

5th July 2010 (Monday)

Experiment activities:

  • Decided to run new (test) reconstruction and stripping productions to process the new events (about 130M).
  • MC productions stopped due to an application problem.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 4

Issues at the sites and services

  • T0 site issues:
    • Problems accessing files with RFIO at CERN. That was due to a peak of queued transfers on the default service class on CASTOR. GGUS:59632
  • T1 site issues:
  • T2 sites issues:
    • UK sites issue uploading to CERN and T1: we don't see upload problems from Glasgow at the moment. Still a few issues from the other sites.
    • Could not determine shared area at site RO-15-NIPNE. GGUS:59688
    • SharedArea problem at dangus.itpa.lt ITPA-LCG2. GGUS:59691
    • Jobs aborted at ce.reef.man.poznan.pl PSNC. GGUS:59693
    • Jobs aborted at cert-15.pd.infn.it INFN-PADOVA. GGUS:59706

CREAM CE

  • The issue with the CREAM CE originally discovered at CERN (GGUS:59559) now requires the intervention of the developers: reassigned to CREAM-BLAH.

1st July 2010 (Thursday)

Experiment activities:

  • Reconstruction and stripping of recent (and forthcoming) data have been halted, as the LHCb applications are taking an anomalous amount of time to process each event.
  • Essentially MC production running (20K jobs finished in the last 24 hours, with a failure rate of ~7%).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0 site issues:
    • The issue with the CREAM CE requires the intervention of the developers (all the proxies held internally by CREAM are really expired). GGUS:59559
  • T1 site issues:
    • (cf. CNAF-T2) Many production jobs are also aborting at the T1.
  • T2 sites issues:
    • UK sites issue uploading to CERN and T1: people from Glasgow seem to have found a solution (see UK_sites_issue.docx) that boils down to network tuning; see the sketch after this list.
    • CNAF-T2: misconfiguration of ce02; many jobs failing because pilots are aborting. GGUS:59621
    • Sheffield: many jobs failing there, reason not yet clear (it seems a shared area issue); site banned.
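
As a rough illustration of what "network tuning" can mean for these long-distance uploads, the sketch below enlarges the kernel TCP buffer limits on a worker node. The values are illustrative assumptions, not the documented Glasgow recipe (which is in UK_sites_issue.docx), and the script must run as root.

```python
# Illustrative TCP buffer tuning for high-latency WAN uploads (WN -> CERN/T1).
# The values below are example figures, NOT the documented Glasgow settings.
SETTINGS = {
    "/proc/sys/net/core/rmem_max": "16777216",             # max receive buffer (bytes)
    "/proc/sys/net/core/wmem_max": "16777216",             # max send buffer (bytes)
    "/proc/sys/net/ipv4/tcp_rmem": "4096 87380 16777216",  # min / default / max
    "/proc/sys/net/ipv4/tcp_wmem": "4096 65536 16777216",
}

def apply_tcp_tuning():
    """Write the enlarged limits; requires root and lasts until reboot."""
    for path, value in SETTINGS.items():
        with open(path, "w") as handle:
            handle.write(value)

if __name__ == "__main__":
    apply_tcp_tuning()
```

Larger buffers let TCP keep more data in flight on high-bandwidth, high-latency paths, which is typically what throttles WN-to-CERN uploads.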
 

-- RobertoSantinel - 29-Jan-2010
