August 2010 Reports


31st August 2010 (Tuesday)

Experiment activities:

  • Reconstruction, Monte Carlo jobs and high user activity.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN:
      • none
  • T1 site issues:
    • RAL: Due to the limited number of connections to a disk server, most merging jobs failed.

30th August 2010 (Monday)

Experiment activities:

  • NOTE: Reconstruction, Monte Carlo jobs and high user activity.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 4
  • T2: 1

Issues at the sites and services

  • T0 site issues:
    • CERN:
      • none
  • T1 site issues:
    • RAL: A disk server was down; solved after ~12 hours (GGUS:61625, opened by RAL).
    • CNAF: All jobs aborted at two CEs (GGUS:61633); the CEs were removed from production.
    • GRIDKA: All jobs aborted at two CREAM CEs (GGUS:61636); the CEs were removed from production.
    • IN2P3: SRM was down (GGUS:61634); solved 12 hours later ("The SRM was restarted").
    • CNAF: Problem with the 3D replication (GGUS:61646).

27th August 2010 (Friday)

Experiment activities:

  • NOTE: Very intense activity in WLCG these days, with backlogs of data to recover and with reprocessing and merging of old data to be delivered soon to our users (more details in the corresponding LHCb elog entry). The observed poor performance of disk servers/SRMs at the T1s is due to the exceptional number of merging jobs and the corresponding transfer of merged data (1 GB/s sustained) across the T1s.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 4
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN:
      • none
  • T1 site issues:
    • RAL: the reduced-capacity regime of their disk servers is affecting users, whose access requests are queued waiting for a slot.
    • CNAF: The FQAN /lhcb/Role=user is not supported on their CREAM CEs, which turns out to be the root cause of the problems observed this week (GGUS:61590).
    • CNAF: Transfers to CNAF are OK now (GGUS:61571).
    • SARA: huge backlog of jobs to run. Their BDII reports 3 jobs waiting for LHCb, which kills the rank and prevents us from submitting pilots to SARA (GGUS:61588). Fixed. (See the sketch after this list for how the published waiting-job count enters the rank.)
    • SARA: Yesterday the SRM was overloaded, affecting transfers to SARA (GGUS:61586), and then died (GGUS:61598). Now recovered.
    • SARA: still problems (initialization phase plus ConditionDB access) preventing jobs from running smoothly (GGUS:61609).
    • IN2P3: huge number of pilot jobs aborting against the CREAM CE; the service seems to be down (GGUS:61605).
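A purely illustrative sketch of the rank mechanism behind the SARA pilot issue above (this is not the actual gLite WMS rank evaluation nor DIRAC code; the site names, numbers and the rank formula are assumptions): when the rank penalises advertised waiting jobs, a BDII wrongly publishing waiting LHCb jobs makes the site lose every pilot-submission decision.

```python
# Illustrative only: a simplified stand-in for rank-based site selection,
# not the real gLite WMS rank machinery or DIRAC code.

# Hypothetical snapshot of what each site's BDII publishes for the LHCb VO.
published = {
    "CERN":   {"waiting_jobs": 0, "free_slots": 200},
    "SARA":   {"waiting_jobs": 3, "free_slots": 150},  # stale/incorrect entry
    "GRIDKA": {"waiting_jobs": 0, "free_slots": 100},
}

def rank(site_info):
    """Higher is better: penalise sites that advertise waiting jobs."""
    return site_info["free_slots"] - 100 * site_info["waiting_jobs"]

def choose_site(sites):
    """Pick the site with the best rank, as a broker might."""
    return max(sites, key=lambda name: rank(sites[name]))

if __name__ == "__main__":
    # SARA never wins while its BDII keeps advertising waiting jobs,
    # so no pilots flow there even though slots are actually free.
    print(choose_site(published))  # -> "CERN"
```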

26th August 2010 (Thursday)

Experiment activities:

  • Due to the exceptional amount of data to be processed/reprocessed and merged for users, and to backlogs forming at the pit (caused by the connection to CASTOR being limited to 1 Gb/s and by too-relaxed high-level-trigger cuts), LHCb is putting heavy load on many services (central DIRAC, grid T0 and T1 services).
  • (Rather a point for the T1 coordination meeting.) LHCb wonders whether there is a clear procedure for the case in which a disk server becomes unavailable for an (expected) long period (more than half a day). If nothing is already formally defined, a good procedure could be the following (see the sketch after this list):
    • The disk server crashes.
    • The severity of the outage is estimated.
    • The disk server will be unavailable for a long while (e.g. a week, sent back to the vendor).
    • Try to recover the data from tape (if possible) or from other disk servers (if replicas are available elsewhere) and copy them to another disk server of the same space token.
    • Announce the incident to the affected community through agreed channels (a GGUS ticket to LHCb support, for example, is how LHCb would like to receive these announcements) and transmit with this announcement the list of unrecoverable files.
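A minimal sketch of the proposed flow, assuming hypothetical helper functions (recover_from_tape, recover_from_replica, copy_to_space_token, announce_via_ggus) that stand in for whatever site-specific tools would actually be used; it only illustrates the order of the steps: try to recover each file, park the recovered copies on another server of the same space token, then announce the incident together with the list of unrecoverable files.

```python
# Sketch of the proposed disk-server outage procedure (illustrative only).
# The helpers below are hypothetical stand-ins for real site tools.

def recover_from_tape(lfn):
    """Placeholder: return the file content from tape, or None if it is not on tape."""
    return None

def recover_from_replica(lfn):
    """Placeholder: return the file content from a replica on another disk server, or None."""
    return None

def copy_to_space_token(lfn, data, space_token):
    """Placeholder: write the recovered copy to another server of the same space token."""

def announce_via_ggus(vo, subject, unrecoverable_files):
    """Placeholder: open a GGUS ticket to the VO's support unit listing the lost files."""
    print(f"[{vo}] {subject}: {len(unrecoverable_files)} unrecoverable files")

def handle_long_outage(failed_server, files, space_token):
    """Proposed flow for an outage expected to last more than half a day."""
    unrecoverable = []
    for lfn in files:
        data = recover_from_tape(lfn) or recover_from_replica(lfn)
        if data is None:
            unrecoverable.append(lfn)          # nothing left to recover it from
            continue
        copy_to_space_token(lfn, data, space_token)
    announce_via_ggus("lhcb",
                      f"Disk server {failed_server} unavailable ({space_token})",
                      unrecoverable)
    return unrecoverable
```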

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN:
      • Requested another space token (LHCb_DST, T0D1, no tape) to handle small un-merged files (which will be scrapped anyway after merging). This is to avoid migrating smallish files to tape, as happens now with M-DST (T1D1), which currently serves these files. It does not need to be open to the outside, but it must have good throughput to the WNs.
      • A patched version of the xroot redirector has been deployed on castorlhcb (which should fix the problem reported last week). The xroot:// protocol has been re-enabled on the lhcbmdst service class (GGUS:61184).
  • T1 site issues:
    • RAL: the lhcbUser space token is running out of disk space; as an emergency measure another disk server has been added.
    • RAL: Since yesterday two more disk servers are out of the game (gdss470 for the lhcbMdst and gdss475 for the lhcbUser space tokens). They join gdss468, reported last week, which still seems not to be fully recovered (it is working at reduced capacity).
      • From RAL: Due to the problems caused by high load on LHCb disk servers, we have reduced the number of job slots on all LHCb disk servers by 25%. The last two disk servers that went down, gdss470 (lhcbMdst) and gdss475 (lhcbUser), are back in production with their job slots reduced by 50%. We will monitor these machines and review these limits tomorrow.
    • IN2P3: Shared area: keeping this entry for traceability. Work in progress but problem still there. (GGUS:59880 and GGUS:61045)
    • CNAF: Issue with direct submission against the CREAM CE; it is OK now.
    • CNAF: FTS transfers to the CNAF LHCb_MC_M-DST space token are failing (GGUS:61571); the tokens are getting full there.

25th August 2010 (Wednesday)

Experiment activities:

  • Mainly reprocessing/reconstruction of real data, with huge analysis activity competing with it at the T1s. Low-level MC.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 3
  • T2: 1

Issues at the sites and services

  • T0 site issues:
    • LFC: observed a degradation of performance when querying the LFC (GGUS:61551).
    • xroot is not working even for the lhcbmdst space, where it had been decided to use this protocol (GGUS:61184). Strong commitment from the IT-DSS people to debug it.
  • T1 site issues:
    • RAL: SRM reporting incorrect checksum information (null) (GGUS:61532); see the checksum-comparison sketch after this list.
    • RAL: the faulty disk servers discussed in recent days still seem to have problems, with many gridftp timeouts observed.
    • pic: one of the problematic gLite WMSes reported yesterday was at pic; it was simply overloaded (GGUS:61517).
    • IN2P3: Shared area: keeping this entry for traceability. Work in progress but problem still there. (GGUS:59880 and GGUS:61045)
    • GRIDKA: the shifter judged that the number of failures in the last period is not so worrying, so no ticket has been issued.
    • GRIDKA: Some users observed instability with dCache (GGUS:61544). Prompt reaction from the sysadmins, who did not observe any particular issue on their side.
    • CNAF: The issue with direct submission against the CREAM CE reported yesterday seems to be due to a certificate. Under investigation.
    • SARA: Eagerly waiting for SARA to recover so that the remaining 900 merging jobs, which have to be delivered to users by the end of this week, can complete. Considering this top-priority activity and the situation with the SARA RAC, it has been decided to work around the DB dependency and run these merging jobs at SARA (where the data to be merged are) by pointing to another CondDB.
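For the RAL checksum item above, a minimal sketch of the kind of check that a null checksum breaks: compute the Adler-32 of the local copy and compare it with the value reported by the SRM/catalogue. Only zlib.adler32 is a real library call; the file path and the None value standing in for the "null" checksum are hypothetical.

```python
import zlib

def adler32_of(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file as an 8-digit hex string."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")

def verify(path, reported_checksum):
    """Return True only if the reported checksum is present and matches the local file."""
    if not reported_checksum:              # a null/empty checksum fails verification outright
        return False
    return adler32_of(path) == reported_checksum.lower().zfill(8)

if __name__ == "__main__":
    # Hypothetical example: an SRM answering "null" makes every file look bad,
    # regardless of whether the data on disk are actually fine.
    print(verify("/tmp/some_local_copy", None))  # -> False
```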

24th August 2010 (Tuesday)

Experiment activities:

  • The overall slowness in dispatching jobs reported yesterday was (most likely) due to concurrent problems with various gLite WMSes at different T1s (now removed from the mask), which gave a wrong picture of the real status of the jobs (reported as scheduled while they had actually completed) and thereby prevented DIRAC from submitting any further pilots (see the sketch after this bullet). These jobs will be analysed and hints will be reported to the service providers for further investigation.
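As a simplified illustration of why stale "Scheduled" statuses stall submission (this is not actual DIRAC code; the per-site cap and the status names are assumptions): a pilot director that limits the number of pilots it believes are still queued at a site stops submitting as soon as the WMS keeps reporting completed pilots as scheduled.

```python
# Simplified sketch of a pilot-director throttle (not actual DIRAC code).
# MAX_QUEUED and the status names are illustrative assumptions.

MAX_QUEUED = 2000  # keep at most this many pilots apparently queued per site

def pilots_to_submit(wms_statuses, max_queued=MAX_QUEUED):
    """How many new pilots to submit, given the statuses the WMS reports."""
    apparently_queued = sum(1 for s in wms_statuses if s in ("Scheduled", "Waiting"))
    return max(0, max_queued - apparently_queued)

if __name__ == "__main__":
    # Healthy WMS: completed pilots are reported as Done, so submission continues.
    healthy = ["Done"] * 1800 + ["Scheduled"] * 200
    print(pilots_to_submit(healthy))  # -> 1800

    # Broken WMS: pilots that actually finished are still reported as Scheduled,
    # so the director thinks the site is full and submits nothing.
    stale = ["Scheduled"] * 2000
    print(pilots_to_submit(stale))    # -> 0
```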

GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 4

Issues at the sites and services

  • T0 site issues:
    • All pilot jobs were aborting against ce203.cern.ch (GGUS:61423). This was the reason for the observed low submission rate; the problem seems to be due to a known bug in CREAM.
  • T1 site issues:
    • IN2P3: Shared area: problems again preventing software installation (GGUS:59880 and GGUS:61045).
    • GRIDKA: issue with one CREAM CE (high number of pilot jobs aborting in the last week). No GGUS ticket issued yet.
    • CNAF: The issue with direct submission against the CREAM CE reported yesterday seems to be due to a certificate. Under investigation.
    • CNAF: Due to VMs being switched off incorrectly, many transfers from other T1s were failing.
    • SARA: Could we please get an update on the status of the DBs at SARA? We have now been waiting five days for ANY news and we are desperate to conclude our reprocessing, for which we have 900 jobs at SARA.

23rd August 2010 (Monday)

Experiment activities:

  • Reprocessing, MonteCarlo and analysis jobs.
  • There is a need to adjust the trigger rate to cope with the very high throughput from ONLINE. If the LHC continues to provide such good fills and such high luminosity, we might run into trouble with backlogs forming. While the 60 MB/s rate in the computing model is not so rigid, at most 100 MB/s is acceptable. Internal discussion about cutting by a factor of 2 (without affecting physics data).
  • Observed a general reduction in the number of running jobs at the T1s: last week 8K were running concurrently, now fewer than 3K. At CERN (which is absorbing the load from SARA) the situation is worst: 3K waiting jobs vs 500 effectively running. We do not see anything particularly wrong with the pilot submission itself that might slow down/affect the number of running jobs (apart from a slightly higher failure rate due to the recent problems at Lyon). Has the fair share been hit?

GGUS (or RT) tickets:

  • T0: 2
  • T1: 1
  • T2: 1

Issues at the sites and services

  • T0 site issues:
    • Last Saturday several RAW files were discovered that could not be staged (GGUS:61346). It looks like it was a glitch with some disk server.
    • xrootd problem (16/08, GGUS:61184): now using the castor:// protocol on all space tokens apart from mdst (where we keep using xroot://). The issue seems to have gone away (because the service is used less heavily), but the question of why we met these instabilities with the xroot manager on castorlhcb remains open.
  • T1 site issues:
    • RAL: received notification that one of the disk servers serving lhcbMdst is unavailable (gdss468). The ~20K files there have already been marked as problematic and won't be used until the disk server recovers.
    • IN2P3: All jobs submitted to their CREAM CE were aborting. The service has been started again, but the reason is not clear (GGUS:61358).
    • CNAF: authorization issue with one of the CREAM CEs. The local contact has raised the problem with the service managers.
    • SARA: Oracle DB problem; NIKHEF and SARA are banned. Reconstruction jobs have been diverted to CERN (which is starting to become really overloaded too). Transfers from other T1s are also not going through (FTS unavailable). Only RAW data from CERN are coming in, but they cannot be processed because the ConditionDB is unavailable too. Any news?

20th August 2010 (Friday)

Experiment activities:

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 3

Issues at the sites and services

  • T0 site issues:
  • T1 site issues:
    • IN2P3: SRM down last night (solved).
    • SARA: Oracle DB problem; NIKHEF and SARA banned.
  • T2 site issues:

19th August 2010 (Thursday)

Experiment activities:

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
  • T1 site issues:
  • T2 site issues:

18th August 2010 (Wednesday)

Experiment activities:

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1

Issues at the sites and services

LFC problem:

Plots: 2010-08-17_Transfer_Errors.png, 2010-08-17_Transfer_Spike.png, 2010-08-17_Transfer_Succeed.png, 2010-08-17_Transfer_Succeed_1Week_SARA_CNAF.png

Ricardo Graciani: From these plots we can see how both CNAF and SARA were "released" at 11:45 and started to transfer after some time of not being able to do so. Since the CNAF LFC was not active (and we have agreed that the LFC was the cause of the jobs getting stuck), it is likely that the reason for the problem was the LFC at NIKHEF. This problem was suddenly cured at 11:45 UTC (yesterday) and the stuck jobs were released. At CNAF there was a huge number of affected jobs.

The last plot (one-week view for SARA and CNAF) shows that the problem with the NIKHEF LFC started on the morning of the 16th and was solved one day later. Some jobs at CNAF managed to execute since they are able to use other LFC instances.

The candidate LFC for having caused the problem is lfc-lhcb.grid.sara.nl.

  • T0 site issues:
  • T1 site issues:
  • T2 site issues:

17th August 2010 (Tuesday)

Experiment activities:

  • No data last day. Reprocessing (5K) and analysis (5K) jobs.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 1

Issues at the sites and services

  • T0 site issues:
  • T1 site issues:
  • T2 site issues:

16th August 2010 (Monday)

Experiment activities:

  • A lot of data. Reconstruction, merging.

GGUS (or RT) tickets:

  • T0: 3
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • First pass (EXPRESS) RAW file unavailable (GGUS:61157, solved).
    • xrootd problems (GGUS:61184).
    • Jobs failed on the same worker nodes (GGUS:61176).
  • T1 site issues:
    • IN2P3: LHCb cannot use the SRM (GGUS:61156, solved).
    • IN2P3: RAW files unavailable (GGUS:61195).
    • CNAF: many jobs failed at the same time.
  • T2 site issues:

13th August 2010 (Friday)

Experiment activities:

  • No data. Reconstruction, merging and MC.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0 site issues:
  • T1 site issues:
    • IN2P3: shared area problem (GGUS:61045): SAM tests failed, user jobs failed, software installation problems.
    • NIKHEF (SARA): timeouts while getting TURLs from the SE at NIKHEF (GGUS:60603, updated).
  • T2 site issues:

12th August 2010 (Thursday)

Experiment activities:

  • No data. Reconstruction, merging and MC.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
  • T1 site issues:
    • IN2P3: shared area problem (GGUS:61045): SAM tests failed, user jobs failed, software installation problems.
  • T2 site issues:

11th August 2010 (Wednesday)

Experiment activities:

  • Still no data after the power cut. Reconstruction and merging. No MC.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 4
  • T2: 0

Issues at the sites and services

  • T0 site issues:
  • T1 site issues:
    • IN2P3: SRM problem (GGUS:61023), solved by restarting the service.
    • IN2P3: shared area problem (GGUS:61045): SAM tests failed, user jobs failed, software installation problems.
    • CNAF: reconstruction jobs killed by the batch system (GGUS:61048). Under investigation.
    • RAL: most pilots aborted (GGUS:61052).
  • T2 site issues:

10th August 2010 (Tuesday)

Experiment activities:

  • No data today due to a power cut at the pit. Reconstruction and merging. No MC.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
  • T1 site issues:
  • T2 site issues:

9th August 2010 (Monday)

Experiment activities:

  • Data taken during the weekend and today. Reconstruction and merging. No MC.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 1

Issues at the sites and services

  • T0 site issues:
  • T1 site issues:
    • GRIDKA: Users reported timeouts accessing data; looks like a load issue with the dCache pools (GGUS:60887). No problems today.
    • RAL: after the faulty disk server in the LHCb_USER space token was recovered, users reported problems with file access during the weekend. No problems today.
  • T2 site issues:
    • nothing

6th August 2010 (Friday)

Experiment activities:

  • Reconstruction and merging. No MC. Due to the intervention on the accounting system at pic (migration to CERN), the SSB dashboard is empty.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • Many files reported to be unavailable (GGUS:60886). Disk server problem solved; the files have finally been migrated to tape.
  • T1 site issues:
    • CNAF: issue yesterday with Streams replication (both LFC and CondDB) timing out to CNAF. One of the two nodes serving the CNAF DB was unreachable. Replication was restored later in the evening.
    • GRIDKA: Users reporting timeouts accessing data; looks like a load issue with the dCache pools (GGUS:60887). This also seems to affect reconstruction activity at the site; any news?
    • RAL: the faulty disk server has been recovered; the LHCb_USER space token at RAL has been re-enabled in DIRAC.
  • T2 site issues:
    • nothing

5th August 2010 (Thursday)

Experiment activities:

  • Reconstruction/reprocessing of a huge amount of data (100 inverse nb, about 1/3 of the ICHEP statistics, in one night); a large fraction of jobs is failing because of internal application issues.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • Many files reported to be unavailable. GGUS:60886
  • T1 site issues:
    • CNAF: problems with both SAM tests and real-data transfers to StoRM. Problem fixed (certificates were in the wrong location) (GGUS:60875).
    • GRIDKA: Users reporting timeouts accessing data; looks like a load issue with the dCache pools (GGUS:60887).
    • RAL: We received the list of files on the faulty disk server of the LHCb_USER space token (GGUS:60847): 22K user files. We would like to know when the admins think the server will be back.
    • pic: accounting migration to CERN.
    • IN2P3: SAM tests again reporting failures and/or degradation of the shared area (GGUS:59880).
  • T2 site issues:
    • nothing

4th August 2010 (Wednesday)

Experiment activities:

  • Reconstruction/reprocessing is proceeding slowly due to a high failure rate caused by the application (high memory consumption, anomalous CPU time required). A new production will be launched today to merge the output, and a new reconstruction campaign will be run as soon as a fix is provided. Many user jobs. No MC production running.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0

Issues at the sites and services

3rd August 2010 (Tuesday)

Experiment activities:

  • Reconstruction of the large amount of data received. Many user jobs. Some MC production going on.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • nothing
  • T1 site issues:
    • GRIDKA: Users reporting problems accessing some data. Not necessarily correlated, but SAM tests against GridKA are failing (GGUS:60821).
    • Managed to run software installation jobs at IN2P3 (shared area).
  • T2 site issues:
    • nothing

2nd August 2010 (Monday)

Experiment activities:

  • Received a lot of data during weekend. Reconstruction, MonteCarlo and user analysis jobs.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 3

Issues at the sites and services

  • T0 site issues:
    • nothing
  • T1 site issues:
    • SARA: still problems with NIKHEF_M-DST.
    • IN2P3: software installation problem (shared area).
  • T2 site issues:
    • nothing
  -- RobertoSantinel - 29-Jan-2010
