August 2012 Reports

31 Aug 2012 (Friday)

* T0:
* Constant rate of aborted pilots observed during the last week (GGUS:85385)

* T1 :
* IN2P3: Job failures, cannot load a shared library; the site was banned for production (GGUS:85644)
30 Aug 2012 (Thursday)

* Running user analysis, prompt reconstruction and stripping at T0 and T1s
* Simulation at T2s

* New GGUS (or RT) tickets

* T0:
* Constant rate of aborted pilots observed during the last week (GGUS:85385)

* T1 :
*


29 Aug 2012 (Wednesday)

* Running user analysis, prompt reconstruction and stripping at T0 and T1s
* Simulation at T2s

* New GGUS (or RT) tickets

* T0:
* Constant rate of failed pilots observed during the last week (GGUS:85385)

* T1 :
*


28 Aug 2012 (Tuesday)

* Running user analysis, prompt reconstruction and stripping at T0 and T1s
* Simulation at T2s

* New GGUS (or RT) tickets

* T0:
* Constant rate of failed pilots observed during the last week (GGUS:85385)

* T1 :
* GridKa: LHCb VOBOX not accessible (GGUS:85544); swiftly fixed this morning by rebooting the node.


27 Aug 2012 (Monday)

* Running user analysis, prompt reconstruction and stripping at T0 and T1s
* Simulation at T2s

* New GGUS (or RT) tickets

* T0:
* Constant rate of failed pilots observed during the last week (GGUS:85385)

* T1 :
*

24 Aug 2012 (Friday)

* Running user analysis, prompt reconstruction and stripping at T0 and T1s
* User analysis at T2s

* New GGUS (or RT) tickets

* T0:
* Constant rate of failed pilots observed during the last week (GGUS:85385)

* T1 :
* IN2P3: the problem with failed FTS transfers on the IN2P3-PIC and IN2P3-CNAF channels has been solved; GGUS:85305 can be closed.


22 Aug 2012 (Wednesday)

* Running user analysis, prompt reconstruction and stripping at T0 and T1s
* MC simulation and user analysis at T2s

* New GGUS (or RT) tickets

* T0:
* Constant rate of failed pilots observed during the last week, slightly improved this morning (though no update in the ticket) (GGUS:85385)

* T1 :
* IN2P3: all transfers fail on the IN2P3-PIC channel and some transfers fail on the IN2P3-CNAF channel (GGUS:85305)

21 Aug 2012 (Tuesday)

* Running user analysis, prompt reconstruction and stripping at T0 and T1s
* MC simulation and user analysis at T2s

* New GGUS (or RT) tickets

* T0:
* Constant rate of failed pilots observed during the last week (GGUS:85385)

* T1 :
* GridKa : a batch of FTS transfers to GridKa failed around 1 AM UTC (GGUS:85270); see the binning sketch below.
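
A burst like this is easy to spot by binning failure timestamps per hour. A minimal sketch, assuming a hypothetical dump file failed_transfers.txt with one ISO timestamp per failed transfer, one per line (the filename and format are illustrative assumptions, not an FTS export):

# Minimal sketch: bin FTS transfer-failure timestamps by hour to spot a
# burst like the ~1 AM UTC one reported above.
from collections import Counter

def failures_per_hour(lines):
    """Count failures per 'YYYY-MM-DD HH' bucket from ISO timestamps."""
    return Counter(line[:13] for line in lines if line.strip())

if __name__ == "__main__":
    with open("failed_transfers.txt") as f:  # hypothetical dump
        for hour, count in sorted(failures_per_hour(f).items()):
            print("%s:00 UTC  %4d failures" % (hour, count))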

20 Aug 2012 (Monday)

* New GGUS (or RT) tickets

* T0:
*

* T1 :
* GridKa : GGUS:85270, problem with disk servers, all failed servers have been recovered by this morning.


17 Aug 2012 (Friday)

* New GGUS (or RT) tickets

* T0:
* GGUS:85260 Missing files - actually caused by unmounted file system. Fixed.

* T1 :
* GridKa : GGUS:85270 SE problems since early this morning due to "Fault in storage subsystem, vendor support is involved to solve it."

16 Aug 2012 (Thursday)

* New GGUS (or RT) tickets

* T0:
* GGUS:85213 Thanks for the extra space.

* T1 :
* GridKa : GGUS:85208 SRM problems resolved, but a sudden downtime (?) of ~15 minutes at 2 PM as a consequence of restoring redundancy.
* CNAF : Diskserver rebalancing - possibly (?) completed - FTS backlog cleared overnight.

15 Aug 2012 (Wednesday)

* New GGUS (or RT) tickets

* T0:
* GGUS:85134 Lost files marked as bad within LHCb. Much lower failure rate at CERN now. However, as a result of pulling the faulty diskservers, we are down to a low level of storage: only 22 TB free now.

* T1 :
* GridKa : GGUS:85208 Possible SRM problems. TURLs not being returned for some files.
* CNAF : Diskserver rebalancing - slow access to storage has been causing jobs and transfers to fail or run very slowly since ~Monday. The SE is banned until the situation improves; a backlog of files to be transferred to CNAF is building up.


14 Aug 2012 (Tuesday)

* New GGUS (or RT) tickets

* T0:
* GGUS:85069 Ongoing problems with jobs resolving TURLs - files lost on a bad diskserver
* GGUS:85134 Opened a second ticket as requested in the ticket above. More files found lost.

13 Aug 2012 (Monday)

* New GGUS (or RT) tickets

* T0:
* GGUS:85062 Significant problems with jobs over the weekend again. Solved now.
* GGUS:85069 Ongoing problems with jobs resolving TURLs. Problematic diskserver(s)? Interestingly, these files are (wrongly?) reported as nearline (they should not have a tape copy).
* Lost files due to a diskserver failure - dealt with within LHCb (https://lblogbook.cern.ch/Operations/11304).

* T1:
* IN2P3 : New file with a corrupted checksum seen (re old ticket - GGUS:82247); a verification sketch follows this entry. It was created on 10 August. Filename : srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data/lhcb/MC/MC11a/ALLSTREAMS.DST/00019658/0000/00019658_00000818_5.AllStreams.dst

* Others
* FTS : Switched off checksum checks to allow very old files to be transferred to CNAF (GGUS:85039). For now we do not see significant errors transferring to GridKa - will keep an eye on it.
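
A minimal sketch of the kind of check that catches a corrupted replica like the one above: copy the file locally from the SURL, compute its ADLER32, and compare with the catalogue value. The local filename and the expected checksum below are illustrative assumptions:

# Minimal sketch: verify a replica's ADLER32 checksum against the
# catalogue value after copying it locally.
import zlib

def adler32_of(path, chunk_size=1 << 20):
    """Compute the ADLER32 checksum of a file, reading 1 MB at a time."""
    checksum = 1  # adler32 seed value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF

if __name__ == "__main__":
    local_copy = "00019658_00000818_5.AllStreams.dst"  # fetched from the SURL above
    expected = 0x12345678  # hypothetical catalogue value
    actual = adler32_of(local_copy)
    print("catalogue: %08x, replica: %08x -> %s"
          % (expected, actual, "OK" if actual == expected else "CORRUPTED"))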

10 Aug 2012 (Friday)

* New GGUS (or RT) tickets

* T0:
* A 'background' of aborted pilots. Jobs are still running though. CREAM losing contact with LSF?

* T1:
* GridKa:
* still stable - will close the tickets this afternoon (GGUS:84550, GGUS:84778 (Alarm), GGUS:84670).
* Slow computers (GGUS:84988). Seems like an odd clock speed problem.
* This morning at ~11am SRM authentication issues seen. Quickly fixed.

* CNAF:
* What look like SRM errors stopping both FTS and internal transfers (GGUS:85041)

9 Aug 2012 (Thursday)

* New GGUS (or RT) tickets

* T1:
* GridKa still stable (GGUS:84550, GGUS:84778 (Alarm), GGUS:84670).
* IN2P3: Question about short pilots - due to development testing

* Other :

8 Aug 2012 (Wednesday)

* New GGUS (or RT) tickets

* T1:
* GridKa transfers stable overnight. Will keep the tickets open for another 24 hours just to be sure (GGUS:84550, GGUS:84778 (Alarm), GGUS:84670).

* Other :

7 Aug 2012 (Tuesday)

* New GGUS (or RT) tickets

* T1:
* Ongoing issues with RAW export to GridKa. Still many files in "Ready" status. Very low transfer rate. GridKa still investigating (GGUS:84550, GGUS:84778 (Alarm), GGUS:84670).
* GridKa have added 40 TB of front-end disk to the tape system to allow transfers to come through (GGUS:84838).
* IN2P3: Re: pilots not using CVMFS - there was confusion because CVMFS is under /opt at IN2P3 rather than /cvmfs (see the mount-point probe sketch after this entry). The problem turned out to be an LHCb configuration issue.

* Other :
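
Since the CVMFS mount point differs between sites (/cvmfs at most sites, /opt at IN2P3, as noted above), a pilot-side probe of the candidate roots avoids hard-coding one path. A minimal sketch; the candidate paths and repository name are assumptions for illustration, not the actual LHCb pilot logic:

# Minimal sketch: locate the LHCb CVMFS software area when the mount
# point varies by site. Candidate roots are illustrative assumptions.
import os

CANDIDATE_ROOTS = ["/cvmfs/lhcb.cern.ch", "/opt/lhcb.cern.ch"]

def find_cvmfs_root():
    """Return the first candidate root that exists on this node, else None."""
    for root in CANDIDATE_ROOTS:
        if os.path.isdir(root):
            return root
    return None

if __name__ == "__main__":
    root = find_cvmfs_root()
    print("CVMFS software area: %s" % (root or "not found - fall back to shared area"))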

6 Aug 2012 (Monday)

* New GGUS (or RT) tickets

* Ongoing issues with RAW export to GridKa. Still many files in "Ready" status. Very low transfer rate. The latest updates on the tickets say that CERN sees SRM timeouts? (GGUS:84550, GGUS:84778 (Alarm), GGUS:84670)

* T1:
* GridKa: Very low space left on the GridKa-Tape cache. Possibly related to the above. Ticket opened (GGUS:84838)
* IN2P3: Investigating why pilots aren't using CVMFS both here and at the Tier2. This caused some job failures last week.

* Other :


3 Aug 2012 (Friday)

* New GGUS (or RT) tickets

* T0:
* RAW export to GridKa: many files in "Ready" status, executed jobs succeed with 55 % efficiency (GGUS:84550, GGUS:84778 (Alarm)); ticket priority raised, situation unchanged for one week. In addition, timeouts for the GridKa SRM are observed in SAM/Nagios tests
* Castor files unavailable; they have been recovered by the Castor team, but more files may be missing (GGUS:84763)

* T1:
* GridKa: Pilots aborted, fixed this morning (GGUS:84740)

* Other :

2 Aug 2012 (Thursday)

* New GGUS (or RT) tickets

* T0:
* Application crashes on certain CERN batch node types (GGUS:84672)
* RAW export to GridKa: many files in "Ready" status, executed jobs succeed with 55 % efficiency (GGUS:84550); ticket priority raised, situation unchanged for one week. In addition, timeouts for the GridKa SRM are observed in SAM/Nagios tests

* T1:

* Other :

1 Aug 2012 (Wednesday)

* New GGUS (or RT) tickets

* T0:
* Application crashes on certain CERN batch node types (GGUS:84672)
* Castor default pool under stress; the user has been contacted.

* T1:
* RAL: many pilots stuck in the state "REALLY-RUNNING"; they have been cleaned (GGUS:84671)
* CNAF: FTS transfers from RAL were not working because the SRM endpoint was not detectable from CNAF; fixed, it was a misconfiguration (site contacts informed)
* GridKa: RAW export to GridKa, many files in "Ready" status, executed jobs succeed with 55 % efficiency (GGUS:84550)
* PIC : SE under load and very slow at streaming data to user jobs. These (user) jobs end up with very low efficiency and get killed by the DIRAC watchdog (an efficiency-check sketch follows this entry). The situation has improved, probably due to the limiting of user jobs.

* Other :
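
For context on the PIC item above: a watchdog of this kind compares CPU time consumed against elapsed wallclock time and aborts payloads whose efficiency stays too low. A minimal sketch of such a check, not DIRAC's actual implementation; the threshold is an illustrative assumption:

# Sketch of a watchdog-style CPU-efficiency check (illustrative only).
# A job stalled on slow I/O from an overloaded SE, as in the PIC report
# above, would show near-zero efficiency.
import os
import time

MIN_CPU_EFFICIENCY = 0.05  # hypothetical kill threshold

def cpu_efficiency(wall_start):
    """Ratio of CPU time used by this process (user + system) to wallclock."""
    cpu = sum(os.times()[:2])
    wall = time.time() - wall_start
    return cpu / wall if wall > 0 else 1.0

# A supervisor samples this periodically and aborts the payload once the
# ratio stays below MIN_CPU_EFFICIENCY for several consecutive cycles.
if __name__ == "__main__":
    start = time.time()
    sum(i * i for i in range(10 ** 6))  # stand-in for payload work
    print("efficiency: %.2f" % cpu_efficiency(start))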

-- JoelClosier - 12-Sep-2012