April 2010 Reports

30th April 2010 (Friday)

Experiment activities:
  • 2 MC simulation activities running smoothly (3K jobs) and 2 data reconstruction activities (MagUp and MagDown). Discovered yesterday a problem, internal to LHCb, affecting many jobs (50% failure rate). Upload issues at SARA and (temporarily, yesterday afternoon) at CNAF

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 5
Issues at the sites and services

  • T0 site issues:
    • High memory consumption from one of the Streams queue monitor processes observed on lhcb downstream capture. Does not affect replication. Problem being investigated by Oracle.
    • The LHCb downstream capture for conditions has been stuck since midday yesterday; the DB people are looking at the problem. Any news?

  • T1 site issues:
    • CNAF: a temporary glitch on StoRM yesterday afternoon affected, for about one hour, SAM test jobs and a few reconstruction jobs uploading their data output. OK again since yesterday 16:00 UTC.
    • NL-T1: the problem with the SE is still present, also affecting reconstruction jobs attempting to upload their output (GGUS: 57812)
  • T2 sites issues:
    • UK T2s: continuing the investigation of the data upload issue reported some time ago by LHCb.
    • Shared area problems at 3 different sites: INFN-NAPOLI, UKI-LT2-UCL-CENTRAL, ESA-ESRIN

29th April 2010 (Thursday)

Experiment activities:
  • Reprocessing activity launched today with analysis activity at low level. MC production (10M events) launched.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0
Issues at the sites and services

  • T0 site issues:
    • SAM CE tests failing with a LISTMATCH error against all endpoints
  • T1 site issues:
    • RAL: at-risk downtime for an intervention on the WNs to install 32-bit CASTOR clients
    • NL-T1: storage at SARA has been failing all SAM tests since yesterday night, and users have also reported many problems uploading/downloading data there. It looks like SRM was switched off. (GGUS: 57812)

28th April 2010(Wednesday)

Experiment activities:
  • Reconstruction ongoing with analysis activity at low level. New LHCbDirac release put in production

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0 site issues:
    • none
  • T1 site issues:
    • pic: downtime finished (still some stalled jobs to be checked)
    • NIKHEF: CREAM CE problem has been fixed. Back in production.
    • GridKA: need to restart PNFS on dCache (ongoing)

27th April 2010(Tuesday)

Experiment activities:
  • Reconstruction ongoing with analysis activity at low level. Commissioning the new Brunel-Davinci workflow that will replace the current Brunel-Davinci-Brunel-Davinci one, removing the (now) redundant passes and reducing the CPU requirements by a factor of 2.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0 site issues:
    • none
  • T1 site issues:
    • IN2p3: AFS outage
    • pic: downtime
    • NIKHEF: turned off the CREAM CE. Reason (quoting Jeff): it does not have our TMPDIR patch, so all the LHCb jobs you submitted via it were running in $HOME, which is NFS mounted. This effectively killed our entire site. I had to kill all those jobs to get things working again. The CE will be restarted whenever this problem is addressed. The lcg-CEs are not affected. (A sketch of the TMPDIR fallback behaviour is given after this list.)
    • GridKA: need to restart PNFS on dCache. Agreed with contact person to drain currently running jobs and then restart it.
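
    A minimal sketch (hypothetical Python wrapper code, not the NIKHEF TMPDIR patch itself) of the behaviour described above: with TMPDIR exported by the batch system the job scratch space goes to a per-job scratch area, otherwise it falls back to the NFS-mounted $HOME.

      # Illustrative only: choose the job's scratch directory.
      import os
      import tempfile

      def job_scratch_dir():
          # Use $TMPDIR if the batch system exports it (the effect of the patch) ...
          base = os.environ.get("TMPDIR")
          if base is None:
              # ... otherwise fall back to the NFS-mounted home area,
              # which is what overloaded the site.
              base = os.path.expanduser("~")
          return tempfile.mkdtemp(prefix="lhcb_job_", dir=base)

      print("job working directory:", job_scratch_dir())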

26th April 2010(Monday)

Experiment activities:
  • During the weekend the increased luminosity brought in 1200 RAW files, which somewhat increased the load on CASTORLHCB (we took 3 times more data than had been integrated before). The increased production activities at the T0 range from data from the PIT, to data reconstruction (75% running at CERN), to data export to the T1s. In the last week a large fraction (80%) of jobs were stalling at all sites (mainly CERN) because they ran out of the max CPU time of the remote queues. Asked the ONLINE people to decrease the RAW file size from the current 2 GB to 1 GB. The underlying problem is the workflow, which runs the same steps twice due to the low retention factor with the current L0 trigger.
  • With the next release of GAUDI there will be a way to intercept signals from the batch system and end jobs gracefully when a SIGTERM is trapped (before the fatal SIGKILL is sent by the LRMS); a minimal sketch of such a handler is given below. A discussion is ongoing with IN2p3, whose BQS sends SIGUSR1 in advance instead of SIGTERM, as all other batch systems do.
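
    A minimal sketch (hypothetical handler in a Python job wrapper, not the actual GAUDI implementation) of trapping the batch-system signal and ending the job cleanly before the LRMS sends SIGKILL:

      # Illustrative only: trap SIGTERM (or SIGUSR1, as sent in advance by
      # BQS at IN2p3) and finalize the job before the LRMS sends SIGKILL.
      import signal
      import sys

      def graceful_stop(signum, frame):
          # Hypothetical clean-up: stop the event loop, close output files,
          # report the job as gracefully ended rather than stalled.
          print("caught signal %d, finalizing before SIGKILL" % signum)
          sys.exit(0)

      signal.signal(signal.SIGTERM, graceful_stop)  # most batch systems
      signal.signal(signal.SIGUSR1, graceful_stop)  # BQS at IN2p3
      signal.pause()  # stand-in for the event loop; interrupted by the signal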

GGUS (or RT) tickets:

  • T0: 0
  • T1: 3
  • T2: 3
Issues at the sites and services
  • T0 site issues:
    • Intervention on SRM to upgrade to 2.9 today 11-11:30. Preliminary tests on the PPS instance confirmed that the migration to this new version is not a problem.
    • Some slowness was reported during the weekend by a user. Offline discussions seem to point to the user's own activity: hundreds of parallel rfio requests were being run on the same restricted set of data in the RAW pool. The other users were not affected.
  • T1 sites issues:
    • CNAF: CREAM endpoint failing all pilots
    • SARA: CE still banned in the production mask, due to all jobs stalling because of the latency in contacting the CERN ConditionDB. The problem will be fixed once the new GAUDI release with the working Persistency patch is in place, in a few days' time.
    • IN2p3: SIGUSR1 signal instead of SIGTERM signal.
    • RAL: failing all jobs for a recent reconstruction production. Under investigation.
    • GRIDKA: M-DST space is completely full. GGUS ticket to be opened (57665)
  • T2 sites issues
    • UNINA-EGEE and GRISU-UNINA: shared area issue
    • EFDA-JET: no space left.

23rd April 2010(Friday)

Experiment activities:
  • Data reconstruction, stripping and user analysis continuing at low level. No new data taken for a couple of days due to ongoing LHC tests.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • No site issues to report.

22nd April 2010(Thursday)

Experiment activities:
  • Data reconstruction, stripping and user analysis continuing at low level. No new data taken for a couple of days due to ongoing LHC tests.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • No site issues to report.

21st April 2010(Wednesday)

Experiment activities:
  • Data reconstruction, stripping and user analysis continuing.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T1 sites issue:
    • SARA: currently in downtime and banned for LHCb. Problem with application configuration. Waiting on new release of software.
  • T2 sites issue:
    • SAM jobs stalled at INFN-TRIESTE

20th April 2010(Tuesday)

Experiment activities:
  • Data reconstruction, stripping and user analysis continuing.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1
Issues at the sites and services
  • T1 sites issue:
    • SARA: currently in downtime and banned for LHCb. GOC-DB announcement arrived hours after downtime had commenced.
  • T2 sites issue:
    • SAM jobs stalled at INFN-TRIESTE

15th April 2010 (Thursday)

Experiment activities:
  • Data reconstruction and user analysis continuing.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0 sites issues:
    • Data access problem with CASTOR (57243) seems to have been resolved although there is concern that the service can become overloaded.
  • T1 sites issue:
    • NIKHEF/SARA: dcap file access working with no problems.

14th April 2010(Wednesday)

Experiment activities:
  • Data reconstruction and user analysis continuing.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0 sites issues:
    • Data access problem with CASTOR (GGUS) causing user jobs to stall when requesting to open a TURL.
  • T1 sites issue:
    • NIKHEF/SARA: Still banned as dcap door not yet available. We will test as soon as it comes online.
    • GridKa is working again after upgrade to their NFS shared area.
    • CNAF: unbanned as new configuration (with right endpoints for CondDB) has been deployed.
    • RAL: unbanned as the new configuration (with the right endpoints for CondDB) has been deployed.

9th April 2010(Friday)

Experiment activities:
  • No data to be processed but user analysis activities continuing.
  • On Monday the reprocessing of 2009 real data will be launched, together with a few correlated MC productions.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0 sites issues:
  • T1 sites issue:
    • NIKHEF/SARA: gsidcap file access issue (GGUS ticket: 56909). Reproducible: it seems to be a clash deep in the libraries used by gsidcap and the ConditionDB libraries.
    • GRIDKA: shared area issue (GGUS ticket: 57030). Problem still present...
    • CNAF: banned until the new configuration (with right endpoints for CondDB) is deployed.
    • RAL: banned until the new configuration (with right endpoints for CondDB) is deployed.

8th April 2010(Thursday)

Experiment activities:
  • No data to be processed but user analysis activities continuing.
  • A workaround for the currently substandard Persistency interface has been put in place by LHCb/Dirac. It appears to be working.
  • Many problems at T1s being investigated.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0 sites issues:
  • T1 sites issue:
    • NIKHEF/SARA: banned for the gsidcap file access issue (GGUS ticket: 56909). Under investigation.
    • GRIDKA: banned for shared area issue (GGUS ticket: 57030). SAM jobs still failing.
    • CNAF: banned due to user jobs failing after the rename of the ConditionDB instance on the 30th of March. 42% failure rate over the past two days; under investigation with the local contact.
    • IN2P3: LHCb VO box had to be restarted.
    • RAL: ConditionDB unreachable
  • T2 sites issue:
    • none

7th April 2010(Wednesday)

Experiment activities:
  • Reconstruction (in total 140 jobs to reconstruct all data collected so far) and many users running their own analysis on these data. It is worth highlighting that two important T1s, NIKHEF/SARA and GRIDKA, have now been out of the LHCb production mask for close to a week.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0 sites issues:
    • CERN: the read-only LFC is unreachable, timing out all requests and causing many jobs to fail. This is again due to the sub-optimal Persistency LFC interface. A patch is about to come, pending some tests from LHCb. In the meantime the LHC stop will allow the use of a frozen ConditionDB, deployed at the sites as a local SQLite DB, so that in the coming days users will be able to analyze the close to 11 million events recorded so far.
  • T1 sites issue:
    • PIC: the lhcbweb.pic.es portal has been accidentally stopped. Mistake promptly recovered.
    • Problem still present at NIKHEF: banned for the file access issue (GGUS ticket: 56909). Under investigation.
    • Problem still present at GRIDKA: banned for the shared area issue (GGUS ticket: 57030). At first sight it seems to be a degradation of performance due to the concurrent heavy activity that the ATLAS software is putting on the shared area.
  • T2 sites issue:
    • none

6th April 2010(Tuesday)

Experiment activities:
  • Data reconstruction at the T1s proceeding smoothly apart from a few errors from ONLINE (sending the BEAM1 data type flag instead of COLLISION10) and a few stalled jobs. Reduced the size of the input raw data files to fit the available queue lengths. 8.7 million events written at the moment; about 10% of what is being processed is actually physics.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 2
  • T2: 2
Issues at the sites and services
  • T0 sites issues:
    • CERN: LFC read-only is unreachable timing out all requests and causing many jobs failing. (GGUS 57053)
  • T1 sites issue:
    • CNAF: all jobs for one batch of reconstruction have been declared stalled, consuming 0 CPU time. A shared area issue is suspected, but it is not clear at first glance.
    • RAL: network problem affecting data upload and the pickup of new jobs by RAL; announced only via the CIC.
    • Problem still present at NIKHEF: banned for the file access issue (GGUS ticket: 56909)
    • Problem still present at GRIDKA: banned for the shared area issue systematically preventing all jobs from setting up the environment (GGUS ticket: 57030)
  • T2 sites issue:
    • UKI-SCOTGRID-ECDF: all pilots aborting there
    • UKI-SCOTGRID-DURHAM: all pilots aborting there

5th April 2010(Monday)

Experiment activities:
  • Data reconstruction at T1's.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 1
Issues at the sites and services
  • T0 sites issues:
    • none
  • T1 sites issue:
    • CNAF: all jobs for one batch of reconstruction have been declared stalled, consuming 0 CPU time. A shared area issue is suspected, but it is not clear at first glance.
    • RAL: network problem affecting data upload and the pickup of new jobs by RAL. NO GOCDB DOWNTIME, only an announcement on the CIC on the 4th of April stating that the problem was fixed; the day after, the problem appeared once again.
    • Problem still present at NIKHEF: banned for the file access issue (GGUS ticket: 56909)
    • Problem still present at GRIDKA: banned for the shared area issue systematically preventing all jobs from setting up the environment (GGUS ticket: 57030)
  • T2 sites issue:
    • BG04-ACAD: jobs aborting.

4th April 2010(Sunday)

Experiment activities:
  • Reconstruction of data received from ONLINE all weekend long. One unprocessed file out of 14 for one production. A fraction (prod. 6220) of the reconstruction jobs (originally at CERN) have been resubmitted several times because they were stalling (exceeding the CPU time limit).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 3
Issues at the sites and services
  • T0 sites issues:
    • none
  • T1 sites issue:
    • NIKHEF: banned for the file access issue (GGUS ticket: 56909)
    • GRIDKA: banned for the shared area issue systematically preventing all jobs from setting up the environment (GGUS ticket: 57030)
  • T2 sites issue:
    • INFN-CATANIA: queue publication issue
    • INFN-MILANO-ATLASC: shared area issue
    • UKI-SOUTHGRID-OX-HEP: too many pilots aborting

1st April 2010 (Thursday)

Experiment activities:

  • "We are very proud to announce that we collected 1.3 million collision events last night in ‘nominal’ conditions, i.e. Velo fully closed and all detectors IN. The runs 69353,54,55 have been sent out for production and should become available for analysis later today"(O.Callot)
  • Reconstruction jobs will be running this afternoon and over the weekend.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 3
Issues at the sites and services
  • T0 sites issues:
    • For the last hour (since 14:30 CET) the LHCb CASTOR stager has been blocked by an inadvertent DOS attack. Any jobs that have attempted to access files in the last hour may have seen problems.
  • T1 sites issues:
    • CNAF: many users complain that the ConditionDB is unreachable because the connection string has changed. This causes their jobs to time out. The fix has to be applied on the LHCbApp side.
    • NL-T1: Ongoing data access issues with gsidcap for user analysis jobs.
  • T2 sites issues:
    • GRISU-UNINA: shared area issue
    • UNINA-EGEE: shared area issue
    • INFN-NAPOLI: shared area issue

-- RobertoSantinel - 29-Jan-2010
