
July 2012 Reports


31st July 2012 (Tuesday)

  • Validation productions for new versions of LHCb application software to be launched tomorrow

    * New GGUS (or RT) tickets

    * T0:
    * Application crashes on certain CERN batch node types (GGUS:84672)
    * CASTOR default pool under stress; the DMS group has been contacted for more information about the current activity on the pool.

    * T1:
    * RAL: many pilots stuck in state "REALLY-RUNNING"; already observed before, and site admins have a procedure to clean them up (GGUS:84671)
    * CNAF: FTS transfers from RAL not working because the SRM endpoint is not reachable from CNAF (site contacts informed)
    * GridKa: RAW export to GridKa has many files stuck in "Ready" status; the transfers that do execute succeed with only 55% efficiency (GGUS:84550)
    * PIC: SE under load and very slow at streaming data to user jobs. These (user) jobs end up with very low CPU efficiency and get killed by the DIRAC watchdog (see the sketch after this list). User jobs are now limited to 150 at PIC. Ongoing consultation with the LHCb contact at PIC.

    * Other :
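
    For context on the PIC item above: the DIRAC watchdog kills jobs whose CPU efficiency stays too low, which is what happens when a job stalls waiting on a slow storage element. Below is a minimal sketch of such an efficiency check, assuming a simple cpu_time/wall_time definition of efficiency; the 10% threshold, the grace period, and all names are hypothetical and not DIRAC's actual implementation.

      import os
      import time

      START = time.time()
      MIN_EFFICIENCY = 0.10   # hypothetical cutoff; the real watchdog's value may differ
      GRACE_PERIOD = 1800     # seconds to wait before judging a freshly started job

      def cpu_seconds():
          """CPU time (user + system, including children) consumed so far."""
          t = os.times()
          return t.user + t.system + t.children_user + t.children_system

      def should_kill_job():
          """True once the job has run past the grace period with efficiency
          (CPU time / wall time) below the minimum, e.g. while stalled on a
          slow storage element."""
          wall = time.time() - START
          if wall < GRACE_PERIOD:
              return False
          return cpu_seconds() / wall < MIN_EFFICIENCY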

30th July 2012 (Monday)

  • Validation productions for new versions of LHCb application software.

    * New GGUS (or RT) tickets

    * T0:

    * T1:
    * PIC: SE under load and very slow at streaming data to user jobs. These (user) jobs end up with very low CPU efficiency and get killed by the DIRAC watchdog. User jobs are now limited to 150 at PIC. Ongoing consultation with the LHCb contact at PIC.

    * Other :

27th July 2012 (Friday)

  • User analysis and reconstruction at T1s
    * MC production at T2s
    * Prompt reconstruction at CERN + Tier-1s
    * Validation productions for new versions of LHCb application software.

    * New GGUS (or RT) tickets

    * T0:

    * T1:
    * PIC: SE under load and very slow at streaming data to user jobs. These (user) jobs end up with very low CPU efficiency and get killed by the DIRAC watchdog. User jobs are now limited to 150 at PIC. Ongoing consultation with the LHCb contact at PIC.

    * Other :

26th July 2012 (Thursday)

  • User analysis and reconstruction at T1s
    * MC production at T2s
    * Prompt reconstruction at CERN + Tier-1s


    * New GGUS (or RT) tickets

    * T0:

    * T1:

    * Other :
    * Transfers from CERN->GridKa (GGUS:84550) have improved. No understanding yet of why the throughput fell the way it did.
    * The DIRAC server hosting output sandboxes ran out of space, causing job failures. Firefighting ongoing; the immediate problem has been alleviated.

25th July 2012 (Wednesday)

  • User analysis and reconstruction at T1s
    * MC production at T2s
    * Prompt reconstruction at CERN + Tier-1s


    * New GGUS (or RT) tickets

    * T0:

    * T1:

    * Other :
    * Transfers from CERN->GridKa stopped suddenly at 11 AM. More than 2200 files are waiting to be transferred, and a strange pattern has been observed, with the throughput decreasing in steps over the last day (see the plot and the sketch below). A GGUS ticket is being opened.

    [DIRAC accounting plot: CERN->GridKa transfer throughput over the last day]
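
    The stepwise throughput decrease is the kind of pattern that is easy to flag automatically from accounting data. Below is a minimal sketch assuming hourly throughput samples in MB/s like those behind the plot above; the sample values and the 30% drop threshold are made up for illustration, and this is not part of DIRAC or FTS.

      def find_throughput_steps(samples, drop_fraction=0.3):
          """Return indices where throughput falls by more than drop_fraction
          relative to the previous sample (candidate 'steps')."""
          steps = []
          for i in range(1, len(samples)):
              prev, cur = samples[i - 1], samples[i]
              if prev > 0 and (prev - cur) / prev > drop_fraction:
                  steps.append(i)
          return steps

      # Hypothetical hourly samples: throughput decreasing in steps, then stopping.
      hourly = [220, 215, 150, 148, 90, 88, 40, 0, 0]
      print(find_throughput_steps(hourly))   # -> [2, 4, 6, 7]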


24th July 2012 (Tuesday)

  • User analysis and reconstruction at T1s
    * MC production at T2s
    * Prompt reconstruction at CERN + Tier-1s


    * New GGUS (or RT) tickets

    * T0:
    * CERN: Backlog of jobs waiting to run

    * T1:
    * GridKa: (GGUS:84476) mysterious problem with SRM. It appeared at 3 PM yesterday (23 July) and went away at 10 AM today (24 July) with no action taken by anyone. It mainly affected FTS transfers into GridKa; some jobs were rescheduled.

23rd July 2012 (Monday)

  • User analysis and reconstruction at T1s
    * MC production at T2s
    * Prompt reconstruction at CERN + Tier-1s


    * New GGUS (or RT) tickets

    * T0:
    * CERN: (GGUS:84386 and GGUS:84126) User jobs and pilots fine now.

    * T1:
    * SARA: missing files (GGUS:84328). More missing files have been found; we would like to understand what happened to the new set (a minimal consistency-check sketch follows this list).
    * IN2P3: the batch system reports jobs running over the CPU limit every night at ~8-9:30 PM UTC. Investigation ongoing with Aresh (LHCb site contact); waiting to understand the cause before opening a GGUS ticket.
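
    A missing-file check like the one behind the SARA report boils down to a set difference between the file catalogue and the SE namespace. Below is a minimal sketch assuming plain-text dumps with one replica entry per line; the file names are hypothetical, and this is not the actual LHCb consistency machinery.

      def missing_files(catalog_dump, se_dump):
          """Return entries listed in the catalogue but absent from the SE namespace."""
          with open(catalog_dump) as f:
              in_catalog = {line.strip() for line in f if line.strip()}
          with open(se_dump) as f:
              on_se = {line.strip() for line in f if line.strip()}
          return sorted(in_catalog - on_se)

      # Hypothetical dump files from a catalogue query and an SE namespace scan.
      for entry in missing_files("catalog_dump.txt", "sara_namespace_dump.txt"):
          print(entry)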

20th July 2012 (Friday)

19th July 2012 (Thursday)

18th July 2012 (Wednesday)

17th July 2012 (Tuesday)

16th July 2012 (Monday)

6th July 2012 (Friday)

  • User analysis and reconstruction at T1s
    * MC production at T2s


    * New GGUS (or RT) tickets

    * T0:

    * T1:
    * PIC: Files unavailable after downtime (GGUS:83916); Fixed
    * IN2P3: Reconstruction jobs failed due to memory issues; Under investigation
    * FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
    * IN2P3 / GridKa file corruption


5th July 2012 (Thursday)

  • User analysis and reconstruction at T1s
    * MC production at T2s


    * New GGUS (or RT) tickets

    * T0:

    * T1:
    * IN2P3: Reconstruction jobs failed due to memory issues; Under investigation
    * FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
    * IN2P3 / GridKa file corruption

4th July 2012 (Wednesday)

  • User analysis and reconstruction at T1s
    * MC production at T2s


    * New GGUS (or RT) tickets

    * T0:
    * Problems with the SAM SRM probes (GGUS:83782); requested removal of the probes [org.lhcb.SRM-VOLs, org.lhcb.SRM-VODel] from the LHCB_CRITICAL profile


    * T1:
    * FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
    * IN2P3 / GridKa file corruption
    * IN2P3: CVMFS problem; fixed


3rd July 2012 (Tuesday)

  • User analysis at T1s ongoing, reconstruction started
    * MC production at all sites


    * New GGUS (or RT) tickets

    * T0:
    * Moving DIRAC accounting services to new machines.
    * CERN: (GGUS:83713) 30 TB are still missing on LHCb-Disk; waiting for repair
    * Problems with the SAM SRM probes (GGUS:83782); requested removal of the probes [org.lhcb.SRM-VOLs, org.lhcb.SRM-VODel] from the LHCB_CRITICAL profile


    * T1:
    * FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
    * IN2P3 / GridKa file corruption


2nd July 2012 (Monday)

  • User analysis at T1s ongoing
    * MC production at all sites


    * New GGUS (or RT) tickets

    * T0:
    * Moving DIRAC accounting services to new machines.
    * CERN: (GGUS:83713) 30 TB are still missing on LHCb-Disk

    * T1:
    * FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
    * IN2P3 / GridKa file corruption

-- JoelClosier - 12-Sep-2012

 