December 2012 Reports
20th December 2012 (Thursday)
- Reprocessing running smoothly; all files submitted, should be finished by the weekend
- Simulation workflows on the online farm have been successfully validated; now starting to ramp up production activities
- T0:
- VOMS: (GGUS:89497) voms-admin command timing out, being investigated
- LHCb pilots failing on the grid during the network glitch, because they tried to access an AFS-hosted web service
- Redundant pilots (GGUS:87448), fixed on 3 CEs; started submitting pilots to those and no issues so far
- T1:
- RAL: picking up jobs again after the network problem yesterday
- GRIDKA: timeouts of transfers to Gridka Disk storage under investigation (GGUS:88425)
- GRIDKA: new CEs are publishing 999999999 in the BDII for max CPU time (GGUS:89857); a query sketch for cross-checking the published value follows below
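For the GRIDKA BDII item above, a minimal sketch of how the published value could be cross-checked, using the Python ldap3 package; the top-level BDII endpoint and the *gridka* filter are assumptions for illustration, not the exact query used by LHCb.

  # Sketch: query a top-level BDII for the max CPU time published by GridKa CEs.
  # The BDII endpoint and the CE-name filter are assumptions for illustration.
  from ldap3 import ALL, Connection, Server

  server = Server("ldap://lcg-bdii.cern.ch:2170", get_info=ALL)  # assumed top-level BDII
  conn = Connection(server, auto_bind=True)                      # anonymous bind

  # GlueCEPolicyMaxCPUTime is published in minutes; 999999999 effectively means "no limit"
  conn.search(
      "o=grid",
      "(&(objectClass=GlueCE)(GlueCEUniqueID=*gridka*))",
      attributes=["GlueCEUniqueID", "GlueCEPolicyMaxCPUTime"],
  )
  for entry in conn.entries:
      print(entry.GlueCEUniqueID, entry.GlueCEPolicyMaxCPUTime)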
19th December 2012 (Wednesday)
- Reprocessing running smoothly. All reprocessing for 2012 submitted. LHCb will start to test the ONLINE FARM to be used during Christmas.
- T0:
- VOMS: (GGUS:89497) voms-admin command timing out, being investigated
- T1:
18th December 2012 (Tuesday)
- Reprocessing running smoothly. All reprocessing for 2012 submitted. LHCb will start to test the ONLINE FARM to be used during Christmas.
- T0:
- VOMS: (GGUS:89497) GGUS ticket not answered until today
- T1:
- RAL : Outage for a short period.
17th December 2012 (Monday)
- Reprocessing running smoothly. All reprocessing for 2012 submitted. LHCb will start to test the ONLINE FARM to be used during Christmas.
- T0:
- T1:
- CERN: some pilots failing (still the same issue, tracked in an existing GGUS ticket)
14th December 2012 (Friday)
- Reprocessing running smoothly. All reprocessing for 2012 submitted.
- T0:
- T1:
- PIC: Jobs failing to access data due to TURL resolving errors (GGUS:89664). Reason: SRM instabilities; a huge queue of Get requests from different experiments. Max queue length increased and SRM restarted. Problem solved quickly.
13th December 2012 (Thursday)
- Reprocessing running smoothly.
- T0:
- T1:
- IN2P3-T2: Lots of stalled jobs. Pilot output "[Job has been terminated (got SIGXCPU); reason=152]" indicates CPU limits, although the jobs had been running for only 5 hours (see the decoding note after this list)
- CERN: Pilots still failing at CERN (GGUS:88796), submitted in November, still not resolved. Affects around 10% of our jobs. Error: Invalid CRL: The available CRL has expired.
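As a side note on the reason=152 in the IN2P3-T2 item above, a small sketch (plain Python, assuming a Linux worker node) of how that exit reason decodes to SIGXCPU, i.e. the batch system's CPU-time limit being hit:

  # Decode the pilot's "reason=152": values above 128 conventionally mean
  # "terminated by signal (reason - 128)".
  import signal

  reason = 152
  sig = reason - 128                  # 152 - 128 = 24
  print(signal.Signals(sig).name)     # 'SIGXCPU' on Linux: CPU time limit exceeded
  assert sig == signal.SIGXCPU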
12th December 2012 (Wednesday)
- Reprocessing running smoothly.
- T0:
- T1:
- NL-T1: Jobs failing to access files at SARA due to incorrect TURL resolution. Solved quickly. (GGUS:89511)
- CERN: Pilots still failing at CERN (GGUS:88796), submitted in November, still not resolved. Error: Invalid CRL: The available CRL has expired (affects only some WNs); a sketch for checking CRL expiry on a WN follows below. Also VOMS not responding yesterday (GGUS:89497). Today it seems to work fine, though there was no response on the ticket.
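For the recurring "available CRL has expired" error above, a minimal sketch of how one could list stale CRLs on a worker node, using the Python cryptography package; /etc/grid-security/certificates is the conventional location, and the check is an illustration, not the procedure used by the CERN admins.

  # Sketch: flag CRLs in the conventional grid certificates directory whose
  # nextUpdate time has already passed (i.e. the CRL has expired on that WN).
  import glob
  from datetime import datetime
  from cryptography import x509

  for path in glob.glob("/etc/grid-security/certificates/*.r0"):
      with open(path, "rb") as f:
          crl = x509.load_pem_x509_crl(f.read())   # CRLs are normally PEM-encoded here
      if crl.next_update < datetime.utcnow():
          print("EXPIRED:", path, crl.next_update)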
11th December 2012 (Tuesday)
- Reprocessing running smoothly.
- T0:
- T1:
- NL-T1: Downtime at SARA finished
- PIC: files lost from the archive during the migration, with no other replicas. Not a big issue; the files were old and supposed to be deleted anyway.
- RAL has another T2 attached now: LCG.Krakow.pl
10th December 2012 (Monday)
- Started reprocessing activities, which means there should be significant staging at the T1s.
- T0:
- T1:
- NL-T1: We expect the downtime of SARA to be finished today, but we would like a notification in case of delay.
- PIC: 930 files deleted from ARCHIVE by mistake, due to a bug after an Enstore update.
07th December 2012 (Friday)
- Prompt reconstruction at CERN + attached T2s. Monte Carlo at T1s and T2s
- The problem of agents submitting pilots to the sites seems related to network issues between our voboxes.
- T0:
- T1:
- GRIDKA: OK for the SE downtime of 15-17 January. Please enter it in the GOCDB and remind us a few days before.
- NL-T1: The SE downtime of 10th December is OK. Do you plan to stop just the tape backend or also the disk?
06th December 2012 (Thursday)
- Normal operation activities. Waiting for the new databases to start the last reprocessing step.
- Still some problems with the agents responsible for submitting pilots to the sites. Investigation is ongoing.
- T0:
- T1:
- NL-T1: A bunch of FTS transfers failed just before lunch.
05th December 2012 (Wednesday)
- Normal operation activities. Waiting for the new databases to start the last reprocessing step.
- T0:
- T1:
04th December 2012 (Tuesday)
- Prompt reconstruction: CERN + 5 Tier2 sites
- MC productions at T2s and T1s (until reprocessing restarts)
- Had a problem because a partition on a vobox got full. Hot-fixed. Plan to reshuffle the distribution of databases among the voboxes.
- T0:
- T1:
- RAL: Some problems in the early morning with FTS transfers from CERN. It seemed to be a corruption in the FTS database. It was fixed quickly.
- IN2P3: Lots of FTS transfer failures during the night (also for IN2P3-IN2P3 and IN2P3-IN2P3-T2 transfers). The problem disappeared in the morning.
03rd December 2012 (Monday)
- Reprocessing up to the last stop is finished. New databases for the last step (from 30th of November) will be ready around Thursday this week.
- Prompt reconstruction: CERN + 5 Tier2 sites
- MC productions at T2s and T1s (until reprocessing restarts)
- T0:
- The upgrade of the LFC to EMI is planned.
- T1:
- RAL: Some problem in accessing data. A disk server is down and need a fsck before to put it again in production (not before tomorrow). [Tiju announces that it was put back in production this morning]
- CNAF: Installed the new disk storage (many thanks to CNAF people)
- GRIDKA: transfer failures to/from several sites. No clue from the site yet. [Pavel adds that experts are working on the problem: they increased the debug level and do see some transfers failing (but not all of them)]