April 2012 Reports

30 April 2012 (Monday)

  • Data reprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • CPU-intensive workflows successful at Tier-2s.

  • T0
  • T1
    • GridKa
      • Problem with LFC (GGUS:81734) from 1 PM on 29 April. Problem solved this morning.
        • Q: It is not clear why the ticket was assigned to GridKa only at 5:20 AM this morning.
        • Q: It is not clear why GridKa's internal monitoring did not pick this up earlier; it was found through failing LHCb jobs.
        • Ticket open for 1 month, requesting to introduce the LFC probes into the LHCb_CRITICAL profile (GGUS:80753). Still waiting for action on it.
    • PIC
      • Possible problem with queues / scaling of some nodes. Many jobs failing repeatedly due to lack of wall time before succeeding eventually.
        • In contact with the LHCb-site contact. Will open a GGUS ticket if we cannot resolve this internally.

27 April 2012 (Friday)

  • Data reprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • One T2 successfully attached to T1 storage for the new production workflow (CPU-intensive, low I/O); the plan is to attach one T2 to each T1 storage

  • T0
  • T1
    • IN2P3
      • So far 10 corrupted files have been found; the file size is correct but the checksum is not (GGUS:80338)
      • Pilots aborted tonight (GGUS:81677)
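
The IN2P3 corruption above (file size correct, checksum wrong) is the case that a size-only check cannot catch. A minimal sketch of a replica check that compares both, assuming Adler-32 is the catalogued checksum type and that the expected size and checksum are already known; the function names `adler32_hex` and `check_replica` are hypothetical and not part of DIRAC or any grid tool:

```python
import os
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def check_replica(path, expected_size, expected_adler32):
    """Classify a local replica as 'ok', 'size_mismatch' or 'checksum_mismatch'.

    expected_size and expected_adler32 are assumed to come from the file
    catalogue; how they are looked up is outside this sketch.
    """
    if os.path.getsize(path) != expected_size:
        return "size_mismatch"
    if adler32_hex(path) != expected_adler32.lower().zfill(8):
        return "checksum_mismatch"
    return "ok"
```

A file that reports `checksum_mismatch` here has the expected size but different content, matching the symptom described in the ticket.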

26 April 2012 (Thursday)

  • Data reprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Test: One T2 attached for new production workflow (CPU-intensive, low I/O)

  • T0
    • DB streaming online -> offline affected since yesterday afternoon
  • T1
    • GRIDKA
      • 12 files lost on GridKa disk servers. The removal happened in mid-March; log files are no longer available, so no further investigation is possible (GGUS:81322)
    • IN2P3
      • So far 10 corrupted files have been found; the file size is correct but the checksum is not (GGUS:80338)

25 April 2012 (Wednesday)

  • Data reprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s

  • T0
    • NTR
  • T1
    • RAL
      • Another 18 disk servers to be rebooted today to avoid a problem with the network card
    • GRIDKA
      • 12 files lost on GridKa disk servers. A first analysis shows that the files were deleted; how this could happen is under investigation (GGUS:81322)

24 April 2012 (Tuesday)

  • Data reprocessing of 2012 data at T1s with new alignment was launched today
  • MC simulation at Tier-2s

  • T0
    • NTR
  • T1
    • GRIDKA:
      • Job submissions to the GridKa WMS are failing (GGUS:81405). Ongoing (2 instances fixed, 2 failing); will be fixed during the next DT
    • RAL
      • One disk server stopped working yesterday at 9 PM; fixed this morning

23 April 2012 (Monday)

  • Prompt Reconstruction at T1s
  • Data reprocessing of 2012 data with new alignment to be launched today at T1s
  • MC simulation at Tier-2s

  • T0
    • NTR
  • T1
    • GRIDKA:
      • Job submissions to the GridKa WMS are failing (GGUS:81405). Ongoing (2 instances fixed, 2 failing); will be fixed during the next DT

20th April 2012 (Friday)

  • Prompt data reconstruction, data stripping and user analysis going on at Tier1s and T0.
  • MC simulation at Tier-2s

  • T0
    • 2500 LHCb jobs were SIGSTOPed last night. They were suspected, without confirmation, of overloading the batch system with too many queries, although nothing has changed on the DIRAC side in the last year with respect to queries to the batch system. Jobs were restarted this morning
    • Waiting for an update about the files lost due to a broken Castor disk server (GGUS:80973) (last update was on Monday)
  • T1
    • GRIDKA: job submissions to the GridKa WMS are failing (GGUS:81405). Ongoing.
    • SARA:
      • ticket (GGUS:81457) for pilots aborted with Reason=999
      • asked to upgrade to last CernVM-FS version (GGUS:81462)

19th April 2012 (Thursday)

  • Prompt data reconstruction, data stripping and user analysis going on at Tier1s and T0. Yesterday we stopped the reconstruction and stripping productions that used the previous application version: they were taking a very long time, causing jobs to hit the end of queues at sites and be rescheduled.
  • MC simulation at Tier-2s

  • T0
    • Waiting for an update about the files lost due to a broken Castor disk server (GGUS:80973) (last update was on Monday)
  • T1
    • GRIDKA
      • 70 files of the last 2011 re-stripping missing from GRIDKA SE (GGUS:81322).
      • Job submissions to the GridKa WMS are failing (GGUS:81405). Ongoing.
      • Some FTS transfers were failing (GGUS:81398); promptly fixed yesterday.
    • CNAF: WMS version upgrade ongoing (GGUS:81291)

18th April 2012 (Wednesday)

  • Prompt data reconstruction, data stripping and user analysis going on at Tier1s and T0
  • MC simulation at Tier-2s

  • Central services:
    • Waiting for the release that fixes the CernVM-FS stale cache problem (GGUS:81181)
  • T0
    • Waiting for an update about the files lost due to a broken Castor disk server (GGUS:80973)
  • T1
    • GRIDKA: 70 files of the last 2011 re-stripping missing from GRIDKA SE (GGUS:81322)
  • T2

17th April 2012 (Tuesday)

  • Prompt data reconstruction, data stripping and user analysis going on at Tier1s and T0
  • Almost no activity at Tier-2s (very few MC simulation jobs)

  • T0

  • T1
    • GRIDKA: 70 files of the last 2011 re-stripping are missing from GRIDKA SE, investigations ongoing. Opened a ticket
  • T2

16th April 2012 (Monday)

  • Prompt reconstruction and stripping going on at Tier1s and T0
  • MC simulation at Tier-2s

  • T0
    • Some production jobs failed because no space was left on disk. Opened a ticket, already solved: 20 GB of scratch disk space is guaranteed for LHCb jobs (according to the VO card)
  • T1
    • IN2P3: site banned for 3 hours yesterday due to unscheduled SRM downtime
    • General for all Tier-1s: a new version of the CernVM-FS client is available with a fix for the cache problem (see ticket); it should be deployed ASAP
    • CNAF: ticket for the WMS version update. Solved already.
    • PIC: ticket for the WMS version update
    • SARA: ticket
  • T2

13th April 2012 (Friday)

  • New Production going through smoothly now after yesterday's interruption.
  • MC simulation has completed for the moment. Waiting on a Stripping fix before more tasks are submitted.

  • T0
    • GEANT network problem: All jobs started reporting as stalled and had configuration service authentication issues. A fix was applied for the CS and, after the network recovered, all jobs went back to normal
    • Will there be an incident report covering the issues?
    • CVMFS/Broken file problem: Fix in CVMFS client prepared and is currently being rolled out to CERN WNs.
  • T1
    • RAL: Minor SRM glitch this morning but recovered OK
    • IN2P3: SAM jobs found as a possible cause of a lot of do-nothing pilots. Fix is ready and will be rolled out ASAP.
  • T2

12th April 2012 (Thursday)

  • New Production going through. Some site issues (see below) but generally significantly better than previously
  • Fix has been validated for the Conditions DB issue - Will now check local DB install and if not complete will download via web server squid/proxy
  • MC simulation at Tier-2s ongoing

  • T0
  • T1
    • GridKa: Transferred to CVMFS. Had a lot of failures overnight due to empty caches. This seems to be improving now, and the timeout on setup has been increased to compensate
  • T2

11th April 2012 (Wednesday)

  • Stopped the previous productions that were running over the bad runs and created a new one to run over the data from last night as a test
  • Jobs going through now - all T1s should be seeing Production jobs going through in the next 24 hours
  • There is still a known problem with big events slowing down the Stripping, but Reconstruction jobs should behave normally now
  • A 6-hour delay has been put in to avoid the CondDB issue reported yesterday
  • MC simulation at Tier-2s ongoing

  • T0
    • Any more news on the RAID controller? (GGUS:80973)
  • T1
    • GridKa have updated their LFC and it has been marked online again.
  • T2

10th April 2012 (Tuesday)

  • Prompt Reconstruction continued over Easter
  • Stripping jobs (both Re-Stripping 17b & Pre-Stripping 18) are 99.9% complete (last few files going through)
  • MC simulation at Tier-2s ongoing
  • Had issues with the ONLINE farm: DIRAC Removal & Transfer agents were hanging and thus slowing the distribution of data. Investigations ongoing.
  • Due to different Trigger settings, quite a few of these early runs are taking a long time to process/strip which results in long jobs. We have identified the problem and are in the process of stopping the current production, marking bad runs and creating a new one.
  • There is another issue with the 3-4 hour delay between a CondDB update and it being propagated to the WNs. Short term, will put in a 6 hour delay between new Conditions and creating jobs but a permanent solution is in progress.
  • Finally, after investigating issues with slow file access at GridKa and IN2P3, we would like to make a request for lcg-cp to handle the dcap protocol (currently uses GridFTP only). This would be preferable for both sites (fewer GridFTP connections) and us (faster transfer).
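
The short-term workaround above (waiting 6 hours after a CondDB update before creating jobs, to cover the observed 3-4 hour propagation to the WNs) can be sketched as a simple time gate. This is only an illustration under assumptions: the names `PROPAGATION_DELAY`, `can_create_jobs` and `earliest_job_creation` are hypothetical and do not correspond to actual DIRAC code.

```python
from datetime import datetime, timedelta

# Assumed safety margin taken from the report: 6 hours after a CondDB update.
PROPAGATION_DELAY = timedelta(hours=6)


def earliest_job_creation(conditions_updated_at, delay=PROPAGATION_DELAY):
    """Return the first moment at which jobs may be created for new conditions."""
    return conditions_updated_at + delay


def can_create_jobs(conditions_updated_at, now, delay=PROPAGATION_DELAY):
    """True once the delay since the last CondDB update has fully elapsed."""
    return now >= earliest_job_creation(conditions_updated_at, delay)
```

For example, with a CondDB update at 08:00, `can_create_jobs` stays false until 14:00 the same day.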

  • T0
  • T1
    • GridKa: Some nodes using CVMFS are being validated.
  • T2

5th April 2012 (Thursday)

  • Prompt Reconstruction started
  • Stripping jobs (both Re-Stripping 17b & Pre-Stripping 18) are ~complete at all sites except GridKa
  • MC simulation at Tier-2s ongoing

  • T0
    • CERN: loss of one disk server. No notification to LHCb (GGUS:80973)
  • T1
    • GridKa: staging has been progressing very slowly since last week (GGUS:80794)
    • Upgrade of LFC server versions needed at GRIDKA (GGUS:80777), RAL is done (GGUS:80775)
    • Gridka conditions DB not accessible by CERN/SLS sensors (GGUS:79800)
  • T2

4th April 2012 (Wednesday)

  • Final validation of 2012 workflows starting
  • Stripping jobs (both Re-Stripping 17b & Pre-Stripping 18) are ~complete at all sites except GridKa
  • MC simulation at Tier-2s ongoing

  • T0
    • CERN: loss of one disk server. No notification to LHCb.
  • T1
    • GridKa: staging has been progressing very slowly since last week (GGUS:80794)
    • Upgrade of LFC server versions needed at GRIDKA (GGUS:80777), RAL (GGUS:80775)
    • Gridka conditions DB not accessible by CERN/SLS sensors (GGUS:79800)
  • T2

3 April 2012 (Tuesday)

  • Final validation of 2012 workflows starting
  • Stripping jobs (both Re-Stripping 17b & Pre-Stripping 18) are ~complete at all sites except GridKa
  • MC simulation at Tier-2s ongoing

  • T0
    • NTR
  • T1
    • GridKa: staging has been progressing very slowly since last week (GGUS:80794)
    • Upgrade of LFC server versions needed at GRIDKA (GGUS:80777), RAL (GGUS:80775)
    • Gridka conditions DB not accessible by CERN/SLS sensors (GGUS:79800)
    • PIC in downtime today, site banned

2nd April 2012 (Monday)

  • Stripping jobs (both Re-Stripping 17b & Pre-Stripping 18) are ~complete at all sites except GridKa
  • MC simulation at Tier-2s ongoing

-- JoelClosier - 07-May-2012

Topic revision: r2 - 2012-09-12 - JoelClosier