April 2011 Reports

29th April 2011 (Friday)

Experiment activities:

  • RAW data distribution and its FULL reconstruction are going on at most Tier-1s.
  • A lot of MC continues to run.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • CERN: (GGUS:70135) files cannot be staged when attempting a replication.
  • T1
  • T2

28th April 2011 (Thursday)

Experiment activities:

  • RAW data distribution and its FULL reconstruction are going on at most Tier-1s.
  • A lot of MC continues to run.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
  • T1
    • IN2P3: solved problems with jobs hitting the memory limits
    • RAL: backlog of activities. Staging problems; disk pools were full. The backlog is starting to go through: submission was throttled for 2 days and is now slowly recovering.
  • T2

27th April 2011 (Wednesday)

Experiment activities:

  • RAW data distribution and its FULL reconstruction are going on at most Tier-1s.
  • A lot of MC continues to run.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 2
Issues at the sites and services

  • T0
  • T1
    • IN2P3: our jobs are hitting the memory limits and being killed. Not clear why this happens only there. Asked to increase the limits to 5 GB (see the sketch after this list).
    • RAL: backlog of activities. Staging problems. Disk pools were full.
  • T2
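
The memory issue above concerns per-process limits enforced on the batch nodes. Below is a minimal sketch in Python, purely illustrative and not taken from the report, of how an address-space cap of the kind requested (5 GB) behaves: a process that tries to allocate beyond the cap fails instead of running to completion. The limit value and the allocation size are hypothetical.

    import resource

    # Hypothetical cap of 5 GB on the process address space (RLIMIT_AS, Unix only),
    # roughly the limit the report says was requested from the site.
    FIVE_GB = 5 * 1024 ** 3
    resource.setrlimit(resource.RLIMIT_AS, (FIVE_GB, FIVE_GB))

    try:
        # A job whose memory use grows past the cap fails here; with a lower
        # limit the same allocation would be refused even sooner.
        payload = bytearray(6 * 1024 ** 3)  # tries to allocate ~6 GB
    except MemoryError:
        print("allocation refused: address-space limit exceeded")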

26th April 2011 (Tuesday)

Experiment activities:

  • RAW data distribution and its FULL reconstruction are going on at most Tier-1s.
  • A lot of MC continues to run.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 2
Issues at the sites and services

  • T0
    • Problem SOLVED: problems staging files out of tape (72 files). The files were requested to be staged yesterday and we would have expected them to be online by now.
    • Yesterday some files did not make it to the OFF-LINE due to a hardware failure. Files have now started to move to the OFF-LINE again.
  • T1
    • IN2P3: Problems with software installation, "share" set to zero. Solution is in progress.
    • RAL: Storage Elements full (RAW and RDST, which use the same space token). It was reported that "some tape drives were becoming stuck and not working", which seems to be fixed. However, there is still a big backlog.
  • T2

21st April 2011 (Thursday)

Experiment activities:

  • RAW data distribution and its FULL reconstruction (Magnet Polarity UP) are going on without major problems at all T1s.
  • Decided to stop the stripping of reconstructed data with Magnet Polarity DOWN after finding a problem in the options used. It will most likely restart today with just one input file per job (instead of 3) due to a known memory leak at the application level.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • Set up two new space tokens for LHCb: LHCb_Disk and LHCb_Tape. Migration of disk servers after Easter.
    • Problems staging files out of tape (72 files). The files were requested to be staged yesterday and we would have expected them to be online by now. A GGUS ticket will most likely be filed; Philippe is looking at that.
  • T1
    • PIC: back from the downtime, no major problems to report.
    • RAL: reported 72 corrupted files; trying to recover them from CERN.
  • T2

20th April 2011 (Wednesday)

Experiment activities:

  • Low rate of reconstruction and stripping activity running now.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

19th April 2011 (Tuesday)

Experiment activities:

  • Data taking over the weekend, no data now. Reconstruction and stripping activities are running now. Jobs were hitting memory limits due to application problems: a temporary solution was found; special thanks to IN2P3 and CNAF for letting us complete.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 3
  • T2: 0
Issues at the sites and services

  • T0
    • New space tokens have been set up.
  • T1
    • Two GGUS tickets (GGUS:69812, GGUS:69813) were opened and quickly closed by RAL and IN2P3 to increase the memory there. Thanks for increasing the memory to run our stripping over the next days.
    • GGUS:69827: slowness with FTS transfers at Lyon. The channel IN2P3-IN2P3 was closed.
  • T2

18th April 2011 (Monday)

Experiment activities:

  • Data taking over the weekend, no data now. Reconstruction and stripping activities are running now. Jobs were hitting memory limits due to application problems: a temporary solution was found; special thanks to IN2P3 and CNAF for letting us complete.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 1
Issues at the sites and services

15th April 2011 (Friday)

Experiment activities:

  • Data taking restarted.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • Tape drive allocation has been done. Modification of the space token at CERN (GGUS:69709).
  • T1

14th April 2011 (Thursday)

Experiment activities:

  • Validation of the stripping. Some code has been removed and the stripping is running.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • Modification of the space token at CERN (GGUS:69709).
  • T1
    • PIC: transfer problem identified and solved. Only one stream was defined between PIC and CNAF; it has been increased to 10 and transfers are happily running.

13th April 2011 (Wednesday)

Experiment activities:

  • Validation of the stripping. Stripping problem with memory usage.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • Any news about the increase in the number of tape drives for LHCb?
  • T1
    • SARA: the large amount of 'dark data' has been eliminated by cleaning the dCache instance at SARA, done by Ron Trompert (the discrepancy in SARA-DST was 60 TB, and in SARA-USER about 35 TB). See the sketch below.
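
The 'dark data' mentioned above are files physically present on the storage element but no longer registered in the file catalogue, so they occupy space without being accounted for. Below is a minimal sketch, assuming two hypothetical plain-text dumps (one of the SE namespace, one of the catalogue replicas for the same space token; both file names are made up for illustration), of how such a discrepancy can be spotted as a set difference:

    # Hypothetical inputs, one path per line:
    #   storage_dump.txt   - listing of the SE namespace provided by the site
    #   catalogue_dump.txt - replicas registered in the file catalogue for this SE
    def load_paths(filename):
        with open(filename) as f:
            return {line.strip() for line in f if line.strip()}

    on_storage = load_paths("storage_dump.txt")
    in_catalogue = load_paths("catalogue_dump.txt")

    dark_data = on_storage - in_catalogue   # on disk but unknown to the catalogue
    lost_files = in_catalogue - on_storage  # registered but missing from disk

    print("dark data candidates:", len(dark_data))
    print("possibly lost files: ", len(lost_files))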

12th April 2011 (Tuesday)

Experiment activities:

  • Validation of the stripping. Full reprocessing will be launched today

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0
Issues at the sites and services

  • T0
  • T1
    • PIC: Implementation of new space token today and tomorrow.
    • RAL: SRM CASTOR intervention. CVMFS is in production on all WNs.

11th April 2011 (Monday)

Experiment activities:

  • Reconstruction and MC simulation
  • Validation of the stripping.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 3
  • T2: 0
Issues at the sites and services

  • T0
    • Request to have more tape drives for LHCb.
  • T1
    • PIC: problem with FTS transfers to PIC-ARCHIVE
    • SARA: outage
    • RAL: small network problem during the weekend which impacted SRM for LHCb

8th April 2011 (Friday)

Experiment activities:

  • Reconstruction and MC simulation
  • Validation of the stripping.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0
Issues at the sites and services

  • T0
    • NTR
  • T1
    • PIC: reported a problem transferring data to PIC using the FTS server there. The problem is an ambiguity in the membership of the Data Manager credentials, which FTS does not resolve properly using the VOMS FQAN. They will open an FTS Support case (GGUS:69520).
    • NIKHEF: problem with CVMFS yesterday due to a reconfiguration of the WNs. Seems OK now. (GGUS:69501)
    • RAL: no callout has been set up yet for the CVMFS service, so they will revert to NFS for the weekend. It could well be that, since CVMFS was put in production, some applications are no longer deployed in the old NFS area. LHCb would appreciate not stepping back from CVMFS, even with reduced support over the weekend.

7th April 2011 (Thursday)

Experiment activities:

  • Reconstruction and MC simulation

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • NTR
  • T1
    • PIC: one file transfer problem in the channel RAL-PIC. Investigation is ongoing; the dashboard does not report any obvious problem with the FTS configuration at PIC.

6th April 2011 (Wednesday)

Experiment activities:

  • Reconstruction and MC simulation

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • ~20% of reconstruction jobs showed a problem at CERN related to CVMFS (either the application was not found or there was a problem setting up the environment). The problem, as pointed out by Steve, is due to cold caches on some "monster 48-core WNs": once the first job has been served, the rest of the jobs on these machines run fine.
  • T1
    • PIC: many transfers failing in the channel RAL-PIC. PIC people are looking at the FTS configuration to check whether the channel is properly configured.
    • GridKA: we confirmed that the dCache issue reported yesterday was affecting the transfers to GridKA for a few hours in the morning. Now it is OK.

5th April 2011 (Tuesday)

Experiment activities:

  • Launched this morning the FULL Reconstruction production over 2011 data.
  • MC productions running smoothly.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • First archival of 2.8 TB at CERN and CNAF. Some slowness in migration to tape at CERN due to the use of a single tape drive.
  • T1
    • RAL: discussion on how to deploy CVMFS in production there. LHCb proposed to set up a small bunch of batch nodes as pre-production before moving all of them to CVMFS.

4th April 2011 (Monday)

Experiment activities:

  • Waiting for a new CondDB TAG to run the Reconstruction for Collision11 data.
  • MC productions running smoothly.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • CASTOR: sent a request to evaluate the possibility of changing the LHCb space token definition by merging all space tokens with the same service class.
    • CVMFS: Steve will change the environment on all batch nodes to point to CVMFS as the mount point for the shared area (see the sketch after this list).
  • T1
    • RAL: accidentally kept banned during the weekend and therefore not running any jobs except SAM.
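
Below is a minimal sketch, in Python, of the kind of per-node check implied by the CVMFS item above: it verifies that the shared-area variable points at a CVMFS path. The expected mount point /cvmfs/lhcb.cern.ch is an assumption for illustration, not taken from this report.

    import os

    # Assumed CVMFS mount point for the LHCb software repository (illustrative).
    EXPECTED_CVMFS_PATH = "/cvmfs/lhcb.cern.ch"

    sw_dir = os.environ.get("VO_LHCB_SW_DIR")

    if sw_dir is None:
        print("VO_LHCB_SW_DIR is not set on this worker node")
    elif not sw_dir.startswith("/cvmfs"):
        print("shared area still points to the old location:", sw_dir)
    elif not os.path.isdir(EXPECTED_CVMFS_PATH):
        print("CVMFS repository does not appear to be mounted")
    else:
        print("shared area correctly served from CVMFS:", sw_dir)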

1st April 2011 (Friday)

Experiment activities:

  • EXPRESS and FULL validation of work flows for Collision11

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • SRM CASTOR problem: the list of protocols passed by the client was not taken in the right order. The problem was fixed by changing the default list of protocols on the SRM side until the fix is deployed in production.
    • All WNs at CERN are using CVMFS (problem with the value of the variable VO_LHCB_SW_DIR on some WNs for 2 hours).
  • T1
  • T2
    • NTR
  -- RobertoSantinel - 02-Dec-2010
