January 2011 Reports

31st January 2011 (Monday)

Experiment activities:

  • MC productions. Re-stripping. General overload on almost all SRM services, caused by demanding stripping jobs. Reduced the number of jobs running concurrently at each site.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 9
Issues at the sites and services
  • T0
    • CASTOR vs. xroot on Friday (GGUS:66731). The xrootd server crashed. It is OK now; the ticket is left open.
    • VOBOX: following the scheduled intervention this morning, an update broke the ssh port configuration and no login was possible. Fixed.
  • T1
    • IN2P3: many job failures on Saturday opening files via dcap or resolving the tURL (GGUS:66745). The ticket has not been updated since Friday.
    • RAL: scheduled downtime for Oracle upgrade. Site drained since Sunday.
    • PIC: pilot jobs aborting on the CREAM CE (GGUS:66793).
    • SARA: issue on Saturday with SRM, which was down and restarted a few hours later. Discussed in a thread; no GGUS ticket opened. A backlog of re-stripping jobs formed there.
  • T2 site issues:
    • Many problems at non-T1 sites, with ~10 tickets opened since Friday.

28th January 2011 (Friday)

Experiment activities:

  • MC productions. Re-stripping.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • xroot server down.
  • T1 site issues:
    • IN2P3: most re-stripping jobs failed. Under investigation.
    • GridKA: FTS problem.
    • It seems that the re-stripping jobs overloaded the SRM.

  • T2 site issues:

27th January 2011 (Thursday)

Experiment activities:

  • MC productions. Re-stripping launched yesterday.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0
Issues at the sites and services
  • T0
    • Experienced issues accessing data files on the lhcbrdst service class and also resolving tURLs. One part of the problem is a wrongly formed tURL that triggers a disk-to-disk copy (bug in Gaudi), but there is something not yet fully understood in the communication between CASTOR and xroot once this copy is performed. Ponce is investigating.
  • T1 site issues:
    • GridKA: shared area issue (not mounted after the intervention) (GGUS:66700). All jobs are failing; SAM also reports it.
    • NIKHEF: problem with CERNVMFS, with some software not properly propagated down to their cache. They need to kill the CVMFS process for the lhcb mount point on each WN and remount it (a remount sketch follows this list). NIKHEF has been banned to drain the jobs there.
  • T2 site issues:
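
A minimal sketch of the per-worker-node recovery step described for NIKHEF above, assuming the standard /cvmfs/lhcb.cern.ch mount point and an autofs-managed mount; the process pattern, command options and recovery order are illustrative assumptions, not the documented NIKHEF procedure.

    # Hedged sketch: kill the CVMFS client serving the lhcb mount point on a WN
    # and remount it. Names and options below are assumptions for illustration.
    import subprocess

    MOUNT_POINT = "/cvmfs/lhcb.cern.ch"  # assumed LHCb CVMFS mount point

    def remount_lhcb_cvmfs():
        # Kill the CVMFS client process attached to this mount point.
        subprocess.call(["pkill", "-f", MOUNT_POINT])
        # Lazily unmount the stale FUSE mount.
        subprocess.call(["fusermount", "-u", "-z", MOUNT_POINT])
        # Touching the path again lets autofs trigger a fresh mount.
        subprocess.call(["ls", MOUNT_POINT])

    if __name__ == "__main__":
        remount_lhcb_cvmfs()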

26th January 2011 (Wednesday)

Experiment activities:

  • MC productions. No issues. We have received a mail telling us, without any consultation, that ALL our VOBOXes and some disk servers will be unavailable on the 3rd of February. This is absolutely impossible for us. We would like to be consulted before any decision is taken.

New GGUS (or RT) tickets:

  • T0: 1 (ALARM), still open. Why?
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • Intervention on Oracle DB (scheduled) affecting many critical services (VOMS/LFC/FTS). After the intervention VOMS cannot be accessed (GGUS:66462); the ticket is still open.
    • Intervention on castorlhcb.
    • GridKA was announced as "AT RISK", but the message said "site fully down", so it should have been declared an OUTAGE.
    • RAL: problem with the information provider on one of the CEs, causing a huge number of pilots to go there even though the LRMS was already full.
  • T1 site issues:
    • NTR
  • T2 site issues:

25th January 2011 (Tuesday)

Experiment activities:

  • MC productions. No issues. We have received a mail telling us, without any consultation, that ALL our VOBOXes and some disk servers will be unavailable on the 3rd of February. This is absolutely impossible for us. We would like to be consulted before any decision is taken.

New GGUS (or RT) tickets:

  • T0: 1 (ALARM)
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • Intervention on Oracle DB (scheduled) affecting many critical services (VOMS/LFC/FTS). After the intervention VOMS could not be accessed (GGUS:66462).
  • T1 site issues:
    • NTR
  • T2 site issues:

24th January 2011 (Monday)

Experiment activities:

  • Only MC running so far. Submitted the request for stripping. MC09 clean-up campaign.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 4
Issues at the sites and services
  • T0
    • NTR
  • T1 site issues:
    • GridKA: Unscheduled Downtime after the scheduled one.
  • T2 site issues:
    • IC: jobs killed after running out of wall-clock time, which is set equal to the max CPU time on their queue (so any time not spent on CPU pushes jobs over the wall-clock limit before they reach the CPU limit).

21st January 2011 (Friday)

Experiment activities:

  • MC jobs running at full steam (30-40K jobs per day).

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0
Issues at the sites and services
  • T0
    • SAM spotting SRM problems at CERN. GGUS:66351
    • Observed an anomalous failure rate of MC jobs due to timeout in setting up the environment (shared area issue). Increased the timeout to 2400 seconds.
  • T1 site issues:
    • PIC: FTS issues submitting transfer jobs (seems to be a clock out of sync): GGUS:66355.
  • T2 site issues:

20th January 2011 (Thursday)

Experiment activities:

  • MC jobs running at full steam (30-40K jobs per day).
  • Stripping restarted

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1 site issues:
    • NIKHEF: issue with CERNVMFS, investigation ongoing. A potential problem with the LAN was found.
  • T2 site issues:

19th January 2011 (Wednesday)

Experiment activities:

  • MC jobs running at full steam (30-40K jobs per day).
  • Stripping restarted

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0
Issues at the sites and services
  • T0
    • SLS issue: Shibboleth authentication is always requested from outside CERN when accessing rrdgraph.php pages (used by internal components of the DIRAC portal at PIC). CT738660. Fixed promptly yesterday.
  • T1 site issues:
    • NIKHEF: SAM jobs for shared area failing occasionally. This is consistent with the 30% failure rate observed for MC jobs at NIKHEF setting up environment using CERNVMFS (GGUS:66287).
  • T2 site issues:

18th January 2011 (Tuesday)

Experiment activities:

  • MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference.
  • Today we will restart the stripping in manual mode to confirm that the results are OK before moving to automatic mode.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0
Issues at the sites and services
  • T0
    • To fix the MyProxyServer problem with the CREAM CE: can we change the configuration on AFS for LHCb only, or do we need a new release of the UI?
  • T1 site issues:
    • PIC: downtime (LHCb web portal redirected to the CERN one).
  • T2 site issues:

17th January 2011 (Monday)

Experiment activities:

  • MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference.
  • Today we will restart the stripping (we thought about restarting it Friday, still waiting for a last test result).

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 2
Issues at the sites and services
  • T0
  • T1 site issues:
    • IN2P3 : (GGUS:59880). "Historical" problem with shared area. Ticket finally closed.
  • T2 site issues:
    • CBPF : GGUS:66204 : all pilots aborted, host cert expired
    • Glasgow: GGUS:66203 : data uploading problems, investigating

14th January 2011 (Friday)

Experiment activities:

  • MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference. The decrease in running jobs seen yesterday was due to a problem in the LHCbDIRAC code (we tried to fix a proxy delegation problem for the CREAM CE, but it was not working, so we went back to the previous version and are discussing with the CREAM CE developer how to fix it properly).
  • Today we will restart the stripping.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1
Issues at the sites and services
  • T0
    • CERN: GGUS:66067, cannot import the lfc module with the SL5 grid environment (closed, but the problem is not solved); see the sketch after this list.
  • T1 site issues:
    • CNAF: VOVIEW problem fixed.
  • T2 site issues:
    • Torino: GGUS:66093: 4 cores with 4 GB of memory is a little too low for LHCb usage.
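
A minimal sketch of the check behind GGUS:66067 above: importing the lfc Python binding from the SL5 grid environment. The lfc module is the LFC client binding shipped with the grid middleware; the LFC_HOST value used here is an illustrative assumption, not the actual LHCb configuration.

    # Hedged sketch: verify that the LFC Python binding is importable in the
    # current environment, as in GGUS:66067.
    import os
    import sys

    os.environ.setdefault("LFC_HOST", "lfc-lhcb.example.org")  # assumed host, for illustration only

    try:
        import lfc  # LFC Python binding provided by the grid middleware
    except ImportError as exc:
        sys.exit("lfc module not importable in this environment: %s" % exc)

    print("lfc module imported from %s" % lfc.__file__)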

13th January 2011 (Thursday)

Experiment activities:

  • MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference.

New GGUS (or RT) tickets:

  • T0: 2
  • T1: 1
  • T2: 0
Issues at the sites and services
  • T0
    • CERN: GGUS:66067, cannot import the lfc module with the SL5 grid environment.
    • CERN: Problem with the information published by the CEs. The problem boiled down to the overload generated by concurrent ILC activity. Problem also spotted by Alice (GGUS:65947)
  • T1 site issues:
    • CNAF: ce07-lcg, lots of pilots sent even though the system is already full (investigation in progress). It seems related to a VOVIEW problem.

12th January 2011 (Wednesday)

Experiment activities:

  • MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1 site issues:
    • IN2P3: this afternoon's intervention is an outage, not at-risk.

11th January 2011 (Tuesday)

Experiment activities:

  • MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1 site issues:
    • IN2P3: after the software was installed yesterday, MC jobs ramped up at the IN2P3-CC and IN2P3-T2 centers.

10th January 2011 (Monday)

Experiment activities:

  • MC productions ongoing. Need to rerun the stripping for two streams (CHARM FULL and CHARM CONTROL). This is a very large activity over all 2010 data.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1 site issues:
    • IN2P3: still a problem installing software in their AFS area (vos release problem). An LHCb-IN2P3 meeting is currently being held to discuss the status of the shared area and plans for the future.

7th January 2011 (Friday)

Experiment activities:

  • MC productions only.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1 site issues:
    • NTR

6th January 2011 (Thursday)

Experiment activities:

  • Ongoing MC productions

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • Intervention on Oracle DB today affecting CASTOR and other Oracle-based services (BKK, LFC).
  • T1 site issues:
    • NTR

5th January 2011 (Wednesday)

Experiment activities:

  • Huge MC production during the Christmas period (at a pace of 40K jobs per day). Suffering problems with the logSE and, in general, with MC space tokens at CERN and various T1s. The stripping of a couple of streams has to be redone, and the reprocessing of 2010 data is still to be completed (30 problematic files still to be processed).

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • Intervention on the downstream capture database (impacting the replication of the LFC to the T1s). Done this morning. Tomorrow, a patch of the whole RAC serving CASTOR, LFC and the ConditionDB.
    • During the closure period there was a problem with all our VOBOXes due to the netlog process filling up the /var partition with zillions of messages. Joel will follow this up with the IT people (a simple disk-usage check is sketched after this list).
  • T1 site issues:
    • IN2P3: people there are close to a solution for the poor performance in setting up the environment of LHCb jobs. The solution to the software installation problem is still to be validated.
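
A minimal sketch of the kind of check that would catch the /var problem mentioned above on a VOBOX; the 90% threshold and the monitored path are illustrative assumptions.

    # Hedged sketch: warn when the /var partition is nearly full, as happened
    # when the netlog process flooded it with messages. Threshold is assumed.
    import shutil

    PATH = "/var"
    THRESHOLD = 0.90  # warn above 90% usage (illustrative choice)

    usage = shutil.disk_usage(PATH)
    fraction_used = usage.used / usage.total
    if fraction_used > THRESHOLD:
        print("WARNING: %s is %.0f%% full" % (PATH, 100 * fraction_used))
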
  -- RobertoSantinel - 02-Dec-2010
