Difference: ProductionOperationsWLCGMay12Reports (1 vs. 2)

Revision 2 2012-09-12 - JoelClosier

Line: 1 to 1
 
META TOPICPARENT name="ProductionOperationsWLCG2012Reports"
Added:
>
>

May 2012 Reports

To the main
 

31st May 2012 (Thursday)

  • User analysis, prompt reconstruction and stripping ongoing at T1s

Revision 1 2012-06-06 - JoelClosier

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="ProductionOperationsWLCG2012Reports"

31st May 2012 (Thursday)

  • User analysis, prompt reconstruction and stripping ongoing at T1s
  • MC production at Tier-2s

  • T1:
    • SARA: decreased the submission rate of merging jobs to reduce the pressure on the SRM somewhat
    • CERN : RAW transfers failing randomly (GGUS:82745)

30th May 2012 (Wednesday)

  • User analysis, prompt reconstruction and stripping ongoing at T1s
  • MC production at Tier-2s

  • T1:
    • IN2P3 ticket for jobs killed due to memory limit has been closed (GGUS:82544)

29th May 2012 (Tuesday)

  • User analysis, DataReprocessing of 2012 data and prompt reconstruction ongoing at T1s
  • MC production at Tier-2s

  • T0:
    • ntr (nothing to report)
  • T1:
    • SARA: unscheduled downtime this morning
    • IN2P3: ongoing investigation about corrupted files (GGUS:82247)

  • Central services

25th May 2012 (Friday)

  • DataReprocessing of 2012 data and prompt reprocessing at T1s ongoing
  • MC generation mostly complete at present - more due

  • T0:
    • Inaccessible files have been verified (GGUS:82146)

  • T1
    • SARA / NIKHEF :
      • The previously reported Input Data Resolution errors were possibly due to the Swimming production overloading the SE. Config changes made to prevent this.
    • IN2P3 :
      • Ongoing investigations into corrupt files (GGUS:82247).
      • Some MC files that weren't accessible have been moved. The faulty disk server is due to be repaired soon, and no additional problems have been seen so far.

24th May 2012 (Thursday)

  • DataReprocessing of 2012 data and prompt reprocessing at T1s going well
  • MC generation mostly complete at present

  • T1
    • SARA / NIKHEF :
      • A large increase in 'Input Data Resolution' errors was seen this morning - any problems to report wrt storage?
    • IN2P3 :
      • Ongoing investigations into corrupt files (GGUS:82247)

23rd May 2012 (Wednesday)

  • DataReprocessing of 2012 data and prompt reprocessing at T1s going well
  • MC generation running well after fixing some Bookkeeping issues

  • T0
    • Files now seem accessible. Waiting for Joel to close the ticket (GGUS:82146)

  • T1
    • SARA :
      • Pilot problem seems to have resolved itself but I'll leave the ticket open for a bit longer just in case (GGUS:82368).
    • IN2P3 :
      • Ongoing investigations into corrupt files (GGUS:82247)

22nd May 2012 (Tuesday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • New (and quite significant) MC campaign starting, T1s included as well
  • Prompt reprocessing of data

  • T0
    • Waiting for the files from the faulty diskserver to be recovered. Any updates?
    • Failed pilot problem from yesterday solved (GGUS:82356)

  • T1
    • RAL :
      • Diskserver problem from yesterday quickly solved
      • Some network glitches this morning but with little impact on jobs
    • SARA :
      • Data access continues to be good, so I think we can regard this problem as solved (GGUS:82178).
      • Still a background of aborted pilots that started a few days ago. Not a big problem but would be nice to understand it (GGUS:82368).
    • IN2P3 :
      • Offlined this morning due to downtime

21st May 2012 (Monday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation mostly complete for the moment
  • Prompt reprocessing of data
  • Disabled filling mode for pilots for next 10 days.

  • T0
    • Ongoing "Input data resolution" problems for jobs at CERN - DIRAC tunings were tried over the weekend but there still seems to be an issue.
    • Waiting for the files from the faulty diskserver to be recovered.
    • A peak of failed pilots this morning (GGUS:82356)

  • T1
    • RAL : Reported an unavailable disk server in the LHCbUser space token - any news?
    • SARA :
      • Access to data seems to have improved considerably over the last 24h, though I assume we're waiting for the upgrade before marking this as solved (GGUS:82178).
      • Peak of aborted pilots this morning with qsub errors (GGUS:82368).

18th May 2012 (Friday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data
  • Disabled filling mode for pilots for next 10 days.

  • T0
    • Still "Input data resolution" problems for jobs at CERN. Waiting for the files from the faulty diskserver to be recovered.
  • T1
    • RAL : Problem with batch system / aborted pilots (GGUS:82304)
    • SARA :
      • Continuing huge problems with access to data (GGUS:82178). No new GGUS ticket opened.
      • Disk cache in front of tape very low (GGUS:82267). Solved.

16th May 2012 (Wednesday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
    • Still "Input data resolution" problems for jobs at CERN. Suspect that the files on one of the faulty diskservers have still not been recovered.
  • T1
    • IN2P3 :
      • Continuing problem with corrupted files - In addition to offline discussions, GGUS ticket opened (GGUS:82247).
    • SARA :

15th May 2012 (Tuesday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
    • Retiring diskservers : Waiting for information from CERN before deciding what to do about the files which are currently unavailable.
  • T1
    • GridKa : Problems accessing data (GGUS:82200) - restarted an LFCDaemon which had died and not restarted automatically. Okay now.
    • IN2P3 :
      • Continuing problem with corrupted files - offline discussions ongoing.
      • We would like to know the DN of the "short" jobs arriving at IN2P3, as reported yesterday, and if possible to see a plot of the extent of the problem. We cannot see any significant problem from the LHCb side - we are doing the same thing for IN2P3 as for all other sites.
      • Pilots failing at IN2P3-T2 (GGUS:82228). Apparently error-152

14th May 2012 (Monday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
    • Diskservers down at CERN. This also created a few problems for LHCb FTS transfers, due to the logic used for submitting the FTS requests (DIRAC issue - being looked at for improvement).
  • T1
    • SARA : Problem with FTS (GGUS:82178).
    • IN2P3 : Various corrupted files - some files apparently get corrupted (checksum mismatch) after some time, even if the file is seemingly okay after being transferred into IN2P3. There is an old, verified ticket (GGUS:80338), but the problem is ongoing. IN2P3 contact (Aresh) is aware of this. Corruption seen in files from different diskservers.

11th May 2012 (Friday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
  • T1

10th May 2012 (Thursday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
    • CERN : (GGUS:82056) pilot not mapped properly at ce208.cern.ch
  • T1

9th May 2012 (Wednesday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

8th May 2012 (Tuesday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
  • T1
    • NIKHEF : (GGUS:81930) Pilots aborted; the queue length was reset and put back this morning
    • IN2P3 : (GGUS:81927) Pilots failed at one CE (creamce05)
    • GRIDKA : What is the plan for the migration of the SRM instance? Meanwhile, can we have more space, as we are running low?
  • Others
    • FTS : (GGUS:81996) Problem with FTS transfers between RAL and CNAF (CNAF is not seeing the RAL FTS instance)

7th May 2012 (Monday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
  • T1
    • RAL : 1 disk server down during the weekend.
  • Others
    • FTS : Problem with FTS transfers between RAL and CNAF (under investigation by LHCb)

4th May 2012 (Friday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
    • Corrupted file (RAW data) re-transferred and migrated
  • T1
    • CNAF : Still trying to understand proxy issues with directly submitted pilots.
  • Others
    • FTS : Problem with FTS proxy expiration. GGUS ticket submitted to FTS by RAL FTS admins (GGUS:81844)

3rd May 2012 (Thursday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
    • 1 corrupted file was discovered on CASTOR; the original is still available online
  • T1
    • PIC
      • Problem with queues sorted. Queue lengths temporarily increased for LHCb.
        • Hopefully the basic cause will be fixed with the next version of our reconstruction and stripping application software.
    • RAL
      • SRM not visible from outside RAL (GGUS:81816)
        • Actually a problem with FTS. GGUS ticket submitted to FTS (GGUS:81844)
      • Prompt reconstruction failing due to squid proxy cache not refreshing itself.
        • Solved.
    • GridKa
    • CNAF : Trying to understand proxy issues with directly submitted pilots.

2nd May 2012 (Wednesday)

  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data taken overnight

  • T0
  • T1
    • PIC
      • Possible problem with queues / scaling of some nodes. Many jobs failing repeatedly due to lack of wall time before succeeding eventually.
        • GGUS ticket opened (GGUS:81814). Still trying to resolve the details of the problem.
    • SARA
      • Worker node with CVMFS possibly broken (GGUS:81787). Remounted just now.
      • Jobs failing to resolve input data. Investigating if it is an issue from LHCb side.
    • RAL
      • SRM not visible from outside RAL (GGUS:81816)
      • Prompt reconstruction failing due to squid proxy cache not refreshing itself. In contact with Catalin at RAL about it.

-- JoelClosier - 06-Jun-2012

 