Difference: ProductionOperationsWLCGMay10Reports (1 vs. 3)

Revision 3 (2010-06-11) - PeterJones

 

May 2010 Reports

  Issues at the sites and services
  • T0 site issues:
Changed:
  <     • LFC:RO still overloaded by some users using old version of the framework (w/o CORAL patch) and/or not using the workaround put in place in DIRAC.
  >     • LFC:RO still overloaded by some users using old version of the framework (w/o Persistency patch) and/or not using the workaround put in place in DIRAC.
 
  • T1 site issues:
    • NL-T1: the problem with the SE is still present, also affecting reconstruction jobs attempting to upload output (GGUS: 57812).
    • IN2p3: going to upgrade their MSS to put tape protection in place, allowing only the lhcb production role to stage files in. Access to data staged on disk will occur through the dcap protocol w/o any authorization.

Revision 2 (2010-06-01) - unknown

 

May 2010 Reports

Added:

31st May 2010 (Monday)

GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 4

Issues at the sites and services

  • T0 site issues:
    • Problem with LHCBMDST fixed on Friday by adding 11 more disk servers to the service class serving this space token. During the weekend no further problems were observed or reported by users.
    • AFS problem starting from 10:30 (ALARM GGUS: 58643). Many users affected, many monitoring probes submitting jobs via acrontab affected, the e-logbook not working, and other services relying on it severely affected. The shared area for grid jobs was not affected, however. The issue boiled down to a power failure on part of the rack.
  • T1 site issues:
    • The issue at Lyon throttling activities last week was not due to a limitation of the new very-long queue but to CREAM CE job statuses being wrongly reported by the gLite WMS.
    • Issue at SARA with SAM tests. It seems to be due to the CERN CA having expired on one gridftp server (GGUS 58647).
  • T2 sites issues:
    • Shared area issues at CBPF and PSNC. Jobs failing at UKI-LT2-QMUL and UFRJ

28th May 2010 (Friday)

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • In the last days many queued transfers have been observed on CASTOR LHCBMDST, causing a large fraction of user jobs to fail at CERN as well as many transfers of DST outgoing to T1's (GGUS 58523). As a temporary measure CERN was banned yesterday from accepting further user jobs; as soon as the backlog had been drained, jobs and transfers of DST started to succeed again (see picture). There are 490 TB of new h/w delivered. As an emergency measure Ignacio started to install 4 new disk servers on lhcbmdst (the loaded service class at the root of these problems) and we expect them in production today (once a few issues with the configuration are sorted out). Also asked to disable the GC on this service class (it is T1D1 and not T1D0). getPlotImg-4.png

  • T1 site issues:
    • IN2p3: huge number of queued jobs and very few running. This is throttling the draining of one of the reprocessing activity backlogs. The problem seems to be due to the very-long queue recently opened to lhcb, which accepts too few concurrently running jobs (GGUS 58572).
    • GRIDKA: Shared area issue affecting user jobs (GGUS 58595)
  • T2 sites issues:
    • none

27th May 2010 (Thursday)

GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 1

Issues at the sites and services

  • T0 site issues:
    • We are seeing many queued transfers on CASTOR LHCBMDST (GGUS 58523). See attached plots showing active and queued transfers. We are investigating if we can slightly modify how the LHCb applications handle open files, but we would also like new hardware as fast as possible. A request has been made to CASTOR.
mdst_transfers.png

queuedmdst.png

  • T1 site issues:

    • IN2p3: actively investigating the shared area shortage of last week.
    • IN2p3: the CREAM CE cccreamceli01.in2p3.fr currently has 7386 scheduled pilots while the BDII publishes ~4500 free slots. This skews the ranking mechanism, wrongly attracting jobs there (see the sketch after this list). GGUS ticket submitted (58572).

  • T2 sites issues:
    • SharedArea problem at epgr04.ph.bham.ac.uk UKI-SOUTHGRID-BHAM-HEP
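
The ranking issue above arises because CEs are compared on the numbers they publish in the BDII. The Python sketch below only illustrates that mechanism; it is not the actual gLite WMS matchmaker, the scoring formula and the second CE are hypothetical, and the attribute names are taken from the Glue 1.3 schema.

def rank(ce):
    # Higher is better: prefer CEs advertising many free slots and few waiting jobs.
    return ce["GlueCEStateFreeJobSlots"] - ce["GlueCEStateWaitingJobs"]

published = [
    # Values as published in the BDII, not the real load on the CE.
    {"name": "cccreamceli01.in2p3.fr", "GlueCEStateFreeJobSlots": 4500, "GlueCEStateWaitingJobs": 0},
    {"name": "other-ce.example.org", "GlueCEStateFreeJobSlots": 200, "GlueCEStateWaitingJobs": 50},
]

best = max(published, key=rank)
print("Jobs keep being sent to:", best["name"])
# In reality the IN2p3 CE already held ~7386 scheduled pilots, so the
# published 4500 free slots made this ranking systematically wrong.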

26th May 2010 (Wednesday)

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0 site issues:
    • Ticket against CASTOR closed. Not a CASTOR problem.

  • T1 site issues:
    • PIC: PIC-USER space token is full.
    • NL-T1: SARA dCache is banned due to ongoing maintenance.

  • T2 sites issues:

25th May 2010 (Tuesday)

  • Experiment activities:
  • Very intense data taking activity doubling the 2010 statistics.
  • MC production ongoing.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0 site issues:
    • On Monday the M-DST space showed many queued transfers and users reported their jobs hanging (and then being killed by the watchdog). SLS clearly showed this problem yesterday; today it has recovered. We are also seeing continued degradation of service on the service class where the RAW data reside. Can we have some indication of what happened and why? GGUS ticket submitted (58482). mdst_transfers.png
    • On Saturday afternoon the default pool was overloaded, triggering an alarm on SLS; it recovered by itself.

  • T1 site issues:
    • IN2p3: opened a GGUS ticket because ~10% of the jobs were failing yesterday with shared area issues (GGUS 58283). The problem has been reproduced. Any news from IN2p3?
    • RAL: request to increase the current limit of 6 parallel transfers allowed in the FTS for the SARA-RAL channel, in order to clear the current backlog, which is draining too slowly (see the drain-time sketch after this list).
    • RAL: lost a disk server (same one as last time). Files have been recovered.
    • CNAF: some FTS transfers seem to fail with the error below. CNAF people discovered a bug in StoRM in the clean-up of failed transfers at the basis of this problem.
      SOURCE error during TRANSFER_PREPARATION phase: [INVALID_PATH] Requested file is still in SRM_SPACE_AVAILABLE state!
    • PIC: Power failure over the weekend caused the site to be unavailable.
  • T2 sites issues:
    • CSCS and PSNC both failing jobs with SW area issue
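
The FTS request above (RAL, SARA-RAL channel) comes down to simple drain-time arithmetic: with only 6 concurrent transfers a backlog empties much more slowly than with a higher limit. A minimal sketch with assumed numbers (file count, file size and per-transfer rate are illustrative, not figures from this report):

# Back-of-the-envelope drain time for an FTS channel backlog.
# All numbers are illustrative assumptions.
backlog_files = 10000        # files queued on the channel (assumed)
avg_file_gb = 3.0            # average file size in GB (assumed)
per_transfer_mb_s = 30.0     # sustained rate of a single transfer in MB/s (assumed)

def drain_hours(concurrent_transfers):
    """Hours needed to drain the backlog with N parallel transfers."""
    total_mb = backlog_files * avg_file_gb * 1024
    return total_mb / (per_transfer_mb_s * concurrent_transfers) / 3600

for slots in (6, 20):
    print("%2d parallel transfers -> ~%.0f hours to drain" % (slots, drain_hours(slots)))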

21st May 2010 (Friday)

Experiment activities: 13 bunch crossings are expected for this weekend, which means LHCb will collect as much data as it already has for 2010. Problems with data processing are due to DaVinci being sensitive to new releases of SQLDDDB when it shouldn't be. This has to be fixed today. Ongoing processing will be stopped.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 3
  • T2: 2

Issues at the sites and services

  • T0 site issues:
    • One of the service classes in CASTOR (serving the FAILOVER and DEBUG space tokens) was so highly overloaded yesterday evening that it concurrently triggered worries from both Jan and Andrew. This is due to an unexpected load that LHCb started to put on this token, in turn caused by too many jobs crashing because of the problem with the DaVinci application. Suspiciously, there is a concurrent problem also affecting many users' activities accessing data in other space tokens (M-DST, M-MC-DST) (GGUS 58435).

  • T1 site issues:
    • IN2p3: opened a GGUS ticket because ~10% of the jobs were failing yesterday with shared area issues (GGUS 58283). The problem has been reproduced; IN2p3 plans to increase reliability by adding read-only volumes.
    • SARA: the dcap port crashed, causing many jobs to fail when accessing data (GGUS 58396).
    • SARA: the WMS was erroneously matching queues not supporting lhcb (low-memory queues at RAL, causing many user jobs to crash) (GGUS 58399). The top BDII at SARA was not refreshing its views and was publishing these queues at RAL as supporting LHCb. Problem fixed by restarting the top BDII.
    • RAL: request to increase the current limit of 6 parallel transfers allowed in the FTS for the SARA-RAL channel, in order to clear the current backlog, which is draining too slowly.
    • CNAF: some FTS transfers seem to fail with the error
      SOURCE error during TRANSFER_PREPARATION phase: [INVALID_PATH] Requested file is still in SRM_SPACE_AVAILABLE state!
    • PIC: the SRM seems to be down, with all attempts to contact the endpoint failing with the error:
      [SE][srmRm][] httpg://srmlhcb.pic.es:8443/srm/managerv2: CGSI-gSOAP running on volhcb20.cern.ch reports Error reading token data: Connection reset by peer
      (GGUS 58430) ... the issue went away by itself.
  • T2 sites issues:
    • INFN-TO and GRISU-SPACI-LECCE shared area problem

20th May 2010 (Thursday)

Experiment activities: New requirements for MaxCPUTime formulated on the VO ID Card. Currently running productions are affected by a severe application problem that makes jobs crash. Urgent intervention from the core application people is needed.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0 site issues:
    • LFC-RO timing out requests again (GGUS 58380).
  • T1 site issues:
    • IN2p3: opened a GGUS ticket because ~10% of the jobs were failing yesterday with shared area issues (GGUS 58283).
    • PIC: our contact person informed us that the SRM had problems between 11:30 pm and 3:30 am (CET) due to a disk controller.
  • T2 sites issues:
    • INFN-PADOVA: shared area problem

19th May 2010 (Wednesday)

Experiment activities: In the last 24 hours 30K jobs ran (20K from users, the rest MC production and reconstruction/reprocessing). No major problems to report. Going to update the VO ID Card: the max CPU time sites should provide should go from the current 12000 HS06 minutes to 18000 HS06 minutes, to properly accommodate reconstruction jobs with 3 GB input files, currently throttled to 2 GB because of the CPU time limitation.
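
The new 18000 HS06-minute limit only works if each queue's CPU limit can be normalized to HS06, which is what the CPUScalingReferenceSI00 publication requested from IN2p3 below is meant to enable. A minimal sketch of that normalization, assuming the conventional 1 HS06 ≈ 250 SI00 conversion and a purely hypothetical queue; this is not the exact DIRAC formula:

# GlueCEPolicyMaxCPUTime is assumed to be expressed in minutes on a reference
# CPU whose power is published as CPUScalingReferenceSI00 (SpecInt2000 units).
HS06_PER_SI00 = 1.0 / 250.0   # conventional WLCG conversion (assumption)

def queue_capacity_hs06_minutes(max_cpu_time_min, scaling_reference_si00):
    """CPU work a queue can give one job, in HS06 minutes."""
    return max_cpu_time_min * scaling_reference_si00 * HS06_PER_SI00

REQUIRED_HS06_MIN = 18000     # new VO ID Card requirement (was 12000)

# Hypothetical queue: 2880-minute CPU limit on a 2000-SI00 reference CPU.
capacity = queue_capacity_hs06_minutes(2880, 2000)
print("queue offers ~%.0f HS06 minutes: %s" %
      (capacity, "enough" if capacity >= REQUIRED_HS06_MIN else "too short"))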

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 2

Issues at the sites and services

  • T0 site issues:
    • none
  • T1 site issues:
    • IN2p3: opened a GGUS ticket to request the correct publication of a new GlueSchema variable to be used to normalize CPU time: CPUScalingReferenceSI00.
    • IN2p3: opened a GGUS ticket because ~10% of the jobs were failing yesterday with shared area issues (GGUS 58283).
    • RAL: Shaun re-staged last night all data from the faulty disk server to another disk server to make them available to users. Storage elements at RAL have been re-enabled for users. Any news from RAL people concerning the requested SIR?
    • PIC: 10 minutes intervention to restart SRM (new postgres DB). Transparent.
  • T2 sites issues:
    • UK sites uploading MC job output are timing out against many T1's. Firewall issue. The contact person in the UK is following this up.
    • GRIF and UFRJ-JF problems (too many jobs)

18th May 2010 (Tuesday)

Experiment activities: NIKHEF and RAL transfer backlog recovered in spectacular fashion.

sara_ral.png

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 2

Issues at the sites and services

  • T0 site issues:
    • The lfc_noread alarm received on Saturday evening has been found, after some digging in the logs, to be mainly due to an incompatibility between IPv6 support and DNS load balancing as implemented at CERN; almost always the same front-end box was picked, leaving not enough threads on lfclhcbr01.
  • T1 site issues:
    • IN2p3: opened a GGUS ticket because ~10% of the jobs were failing yesterday with shared area issues (GGUS 58283).
    • RAL: we need to re-stage the rest of the files that were on the old (faulty) disk server to another disk server, to allow transfers out of RAL to proceed and to allow users to access them. Despite the impressive SARA-RAL throughput, we have many failures out of RAL because the source files are supposed to be available on a (now disabled) disk server, so no staging from tape is issued by CASTOR. Many transfers (all those looking for files on the removed disk server) are failing with the error below:
       SOURCE error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] source file failed on the SRM with error [SRM_FAILURE]
      
      We know Shaun moved all migration candidates off to different servers in the same service class and the bad disk server has been disabled again. LHCb would like to ask for a detailed analysis report of what happened since Saturday.

  • T2 sites issues:
    • PDC shared area issue; USC-LCG2 too many pilots aborting

17th May 2010 (Monday)

Experiment activities: Data taking at full steam (~2KHz).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • none
  • T1 site issues:
    • GRIDKA: Many jobs declared stalled by the watchdog. It looks like the LRMS on the site sent a STOP signal to the jobs (before 18:00 on Sunday) and some hours later sent a CONT signal. A first interpretation is that GridKA detected some problem and decided to pause jobs in the batch system. Could GridKA people confirm that?
    • SARA: Storage still not working for lhcb while it seems to work for ATLAS. People at SARA are looking at the problem; it seems to be due to the CERN CA not being updated there, after which a restart of the SRM is needed.
    • IN2p3: some shared area issue, with some (74) production and user jobs hanging while sourcing the environment. No ticket submitted yet (it will arrive). Under monitoring.
    • IN2p3: is not properly publishing the normalization factors of their queues, and jobs are failing the CPU time calculation (GGUS 58279).
    • RAL: the disk server issue of the weekend has some side effects: there are some 300-odd files which have not been migrated and will be unavailable until the RAID system of the newly added server has rebuilt. A backlog of transfers out of RAL has formed, with an error like:
       SOURCE error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] source file failed on the SRM with error [SRM_FAILURE]
      
  • T2 sites issues:
    • none

13-14-15-16 May 2010 (long Ascension weekend)

Experiment activities: Long weekend with many intense activities. Huge activity from the pit, with about 3 TB (~3K files) of data taken as of Sunday morning (amount expected to increase by Monday). Huge number of MC simulation jobs and many user jobs too. Stripping-03 started while stripping-02 on previous data runs until completion. Two pictures summarize the amount of jobs (dominated by MC) and the transfer of data from the pit.

activity.png

fromPIT.png

GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 2

Issues at the sites and services

  • T0 site issues:
    • gLite WMSes (wms203 and wms216) were reporting on Friday night the status Running (glite-wms-job-status) for all jobs submitted against CREAM CEs, while the LB was correctly reporting these jobs as Done. This is not a show stopper but, considering the high load put on the system, it makes the pilot monitoring system's life more difficult. Offline discussions with Maarten on Saturday showed this problem to be due to known bugs for which Savannah tickets were already open, in particular this one (63109).
    • Received an lfc_noread alarm on Saturday evening at around 20:00. Checking the problem at 20:30, there was no impact on users and the service was working fine. SLS and SAM reported some degradation of availability (and a huge number of active connections) around the time the alarm was raised.
    • On Friday evening several users' jobs were found failing because of an AFS shortage serving the application area. The jobs were just hanging and were eventually killed by the watchdog for consuming too little CPU.
  • T1 site issues:
    • SARA: SRM down (gSOAP error message), preventing all activities from going through. Submitted an ALARM ticket (daytime) on Saturday at noon. No progress (as of Sunday morning) since the acknowledgement received from Jeff as soon as the ticket was opened (GGUS 58244).
    • RAL: all transfers started to time out, turning into a fairly important degradation of the service. This might be related to the shortage of space on the M-DST and MC-M-DST space tokens (less than 2.5 TB free out of 69 TB allocated) (GGUS 58253). The problem was found to be due to a faulty disk server removed on Saturday, which dropped the amount of available space. This had not been announced by RAL people.
  • T2 site issues:
    • USC-LCG2 too many pilots aborting; INFN-NAPOLI shared area problems

12th May 2010 (Wednesday)

Experiment activities: No data recently taken. The next fill will be tonight or tomorrow. It will be for MagDown, as we have less data there than with MagUp. Stripping-02 progress so far is 66.2% for MagUp and 86.4% for MagDown (productions launched on Friday). Jobs during the night were all declared stalled (DIRAC issue). The workflow for stripping-03 has been commissioned and is ready to go. A clean-up campaign of stripping-01 data will be done on Monday.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 3

Issues at the sites and services

  • T0 site issues:
    • CREAM CE: all jobs failing on the 10th (GGUS 58120).
  • T1 site issues:
    • none
  • T2 sites issues:
    • too many jobs failing at: USC-LCG2, INFN-NAPOLI, RO-07-NIPNE

11th May 2010 (Tuesday)

Experiment activities:

  • Most of the data collected from the long fill this weekend will be marked as BAD. Some MC productions are running without problems (less than 1% failure rate). Furthermore, the Reco-2 activity, which right now counts about 2.5K concurrently running jobs, does not present any special issue to be reported either. Interestingly, the integrated throughput yesterday (with real data) reached 1 GB/s (RAW data export + DST replication). Impressive considering the small size of the reconstructed (and stripped) DSTs.

throughput_real_data.jpg

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • none
  • T1 site issues:
    • SARA-NIKHEF: The SE was down yesterday, which is the reason a backlog formed; it had not yet fully recovered this morning.
    • pic: Accounting system down (application level, not much the pic site admins can do).
  • T2 sites issues:
    • none

10th May 2010 (Monday)

Experiment activities:

  • Long fill this weekend, but because of some issue with one subdetector these data might end up being flagged as BAD. It would be a pity; 2.7 TB were happily replicated at all T1's by this morning. Because a DIRAC agent got stuck during the weekend and was restarted only this morning, the first reconstruction jobs are being created only now.

GGUS (or RT) tickets:

  • T0: 2
  • T1: 1
  • T2: 2

Issues at the sites and services

  • T0 site issues:
    • On Friday it was reported that CERN-SARA FTS transfers were stuck in Active status (GGUS 58048).
    • LFC-RO issue (apparently not because of exhausted threads this time, GGUS 58067).
  • T1 site issues:
    • pic: DIRAC services at pic are down again. The MySQL database machine was down (ALARM GGUS 58063 raised at 3:54 UTC in the morning, reaction time less than 20 minutes). Further investigation revealed an issue at the application level. There is not much more that the site can do.
  • T2 sites issues:
    • Shared area issue at INFN-TO and BG01-IPP

7th May 2010 (Friday)

Experiment activities:

  • Launched yesterday a few productions to validate new workflows. No big deal. One production is at 91.6%, with a few stalled jobs under investigation and the remaining 156 RAW files expected to finish later in the afternoon. A few small MC productions have been requested and are about to come in over the weekend, together with high-intensity data (tonight?).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 1

Issues at the sites and services

  • T0 site issues:
    • none
  • T1 site issues:
    • pic: DIRAC services at pic are down (GGUS 58036)
  • T2 sites issues:
    • RO-15-NIPNE failing all jobs.
  • CIC: still some instabilities, with bunches of notifications (backlogs formed on the server and then sent all at once) and/or missed announcements.

6th May 2010 (Thursday)

Experiment activities:

  • No activity apart from user analysis.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • The default pool was under heavy load yesterday and SLS also showed this unavailability (due to the high number of concurrent transfers; see the snapshot available here). This was affecting users running through LXBATCH and LXPLUS (GGUS: 57982).
  • T1 site issues:
    • CNAF: none of the transfers at around 8 pm yesterday were going through. Opened an ALARM ticket (57996), as this was perceived as a show stopper. Problem fixed very quickly, within one hour. Much appreciated.
    • CNAF: LHCb_RDST space token issue. Found a bug in the clients used. Corrected immediately.
  • T2 sites issues:
    • UK T2 sites upload problem: seems to be due to disruptive interference from the firewall.

5th May 2010 (Wednesday)

Experiment activities:

  • No activity apart from user analysis.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 1

Issues at the sites and services

  • T0 site issues:
    • Yesterday's problem wasn't on lhcbdata but on the default pool, which makes much more sense.
    • During a one-hour time window yesterday (from 13:30 to 14:30 UTC) all reconstruction jobs at all sites were failing to access a table in the conditions DB. Opened a GGUS ticket against the DB people at CERN; it turned out the problem wasn't service related: a user accidentally altered a table and the (wrong) information was then correctly propagated down to the T1s, affecting jobs. Eva provided all the details to chase this up on the LHCb side.
  • T1 site issues:
    • NL-T1: dramatic improvement since yesterday in transferring the backlog of RAW data to PIC. This confirms that the variety of problems reported by Ron has been properly addressed. getPlotImg.png
    • CNAF: LHCb_RDST space token issue. The SLS sensor got stuck trying to parse the output of the metadata query over there, which contains an SRM error message (GGUS: 57948).
       Space Reservation with token=499eb057-0000-1000-8366-b54ffea05e44
               StatusCode=SRM_INTERNAL_ERROR explanation=stage_diskpoolsquery
      returned error

  • T2 sites issues:
    • USC-LCG2: failing voms SAM tests

4th May 2010 (Tuesday)

Experiment activities:

  • No beam, so no data taken since yesterday morning. Reconstruction of the 450 GeV data is almost finished. Put in place a workaround to prevent the watchdog from killing jobs in the phase of uploading the large amount of small (streamed) reconstructed data, which in the last days was causing a big fraction of jobs to be killed (CPU/wall clock time ratio below the threshold); see the sketch below.
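
For context, the watchdog decision mentioned above is based on the ratio of consumed CPU time to wall-clock time. The sketch below illustrates that check and the kind of exemption the workaround amounts to; the threshold and phase names are made up for illustration, and this is not the actual DIRAC Watchdog code.

MIN_CPU_WALL_RATIO = 0.05     # assumed threshold, not the real DIRAC value

def should_kill(cpu_seconds, wall_seconds, phase):
    """Kill jobs whose CPU efficiency drops below the threshold."""
    if phase == "uploading_output":
        # Workaround idea: uploading many small output files is I/O bound,
        # so the CPU-efficiency check is not applied during that phase.
        return False
    return wall_seconds > 0 and (cpu_seconds / wall_seconds) < MIN_CPU_WALL_RATIO

print(should_kill(120, 7200, "uploading_output"))  # False: job kept alive
print(should_kill(120, 7200, "running"))           # True: ratio ~0.017 < 0.05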

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • 72 files definitely lost on one of the disk servers of the lhcbdata pool. The list of files was given to lhcb and a post-mortem has already been prepared. Our data manager excludes, however, that these data can belong to users or were effectively written into the lhcbdata pool, because of the protections put in place there since before the 22nd of March. It should then be closely understood how these data came to be perceived as sitting in the lhcbdata pool and written on the 22nd of March as reported. This couldn't possibly be the case.
  • T1 sites issues:
    • NL-T1: the problem with the SE (GGUS: 57812) seems to have been traced to a variety of reasons (overloaded SRM, network, hardware and others).
    • CNAF: LHCb_RDST space token issue. The SLS sensor got stuck trying to parse the output of the metadata query over there, which contains an SRM error message. Mail sent to the local contact person.
       Space Reservation with token=499eb057-0000-1000-8366-b54ffea05e44
               StatusCode=SRM_INTERNAL_ERROR explanation=stage_diskpoolsquery
      returned error

  • T2 sites issues:
    • none

3rd May 2010 (Monday)

Experiment activities:

  • Started the 450 GeV data reconstruction this morning (FULL & EXPRESS, magup & magdown). This evening back to 3.5 TeV. Due to a DIRAC agent that had to be restarted, a huge backlog of data transfers from NL-T1 to pic has to be drained. About 10K files to be copied to pic.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0 site issues:
    • LFC:RO still overloaded by some users using old version of the framework (w/o CORAL patch) and/or not using the workaround put in place in DIRAC.
  • T1 site issues:
    • NL-T1: the problem with the SE is still present, also affecting reconstruction jobs attempting to upload output (GGUS: 57812).
    • IN2p3: going to upgrade their MSS to put tape protection in place, allowing only the lhcb production role to stage files in. Access to data staged on disk will occur through the dcap protocol w/o any authorization.
  • T2 sites issues:
    • UK T2: continuing the investigation of the data upload issue raised some time ago by LHCb.
    • UKI-LT2-IC-HEP: huge number of jobs failing there.
  -- RobertoSantinel - 29-Jan-2010

Revision 1 (2010-01-29) - unknown

Added:

May 2010 Reports


-- RobertoSantinel - 29-Jan-2010

 