September 2010 Reports

30th September 2010 (Thursday)

Experiment activities: Analysis, no particular issues.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • RAL: FTS transfers out of RAL are failing with checksum errors (checksums disabled for the time being).

29th September 2010 (Wednesday)

Experiment activities: Analysis, no particular issues.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • RAL: downtime for the CASTOR upgrade has finished.
    • RAL: FTS transfers out of RAL are failing (GGUS:62579).

28th September 2010 (Tuesday)

Experiment activities: Analysis, no particular issues.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • RAL: downtime for the CASTOR upgrade.
    • PIC: FTS credentials problem (GGUS:62490).

27th September 2010 (Monday)

Experiment activities: Analysis, no particular issues.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • RAL: downtime for the CASTOR upgrade.
    • SARA: LHCbUser space token full (GGUS:62447).

24th September 2010 (Friday)

Experiment activities: The Reco06-Stripping10 reconstruction production for the FULL stream in Magnet Up and the associated merging productions are running. Analysis, no particular issues.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • CNAF: issue with the CREAM CE (GGUS:62355). What is the status?
    • IN2P3: A huge backlog of merging jobs still has to run at IN2P3. The quota issue has been addressed, but other issues with the shared area remain (GGUS:62379).
    • RAL: Long tail of merging jobs still to run at RAL, which is now draining its queues.
    • PIC: Issue with the CREAM CE, not reported on the 22nd; now fixed (GGUS:62357).

23rd September 2010 (Thursday)

Experiment activities: The new Reco06-Stripping10 reconstruction production for the FULL stream in Magnet Up and the associated merging productions have started.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • IN2P3: back from downtime and re-enabled in the production mask. Problem with the shared software area when installing software (quota issue) (GGUS:62379).

22nd September 2010 (Wednesday)

Experiment activities: Awaiting real data and commissioning the workflow for the new stripping.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
  • T1 site issues:
    • RAL: The failures reported in the last few days seem to be due to the merging jobs/workflow. Other job types (including user jobs) run fine when spikes of merging jobs do not hang up the system. Requested to throttle, on the DIRAC side, the number of merging jobs at RAL to protect the whole site. Let's wait for CASTOR 2.1.9 and the optimization coming with internal gridftp and xrootd. RAL will start draining its LHCb queues on Friday, so the share of the site has been set to 0.
    • CNAF: Since last weekend the activity from CNAF, NIKHEF and RAL has saturated the session limit on the CNAF LHCb ConditionDB. The DBAs at CNAF have modified the DB parameters and scheduled a quick (mandatory) restart of the services to bring the new configuration online. The intervention took place on the LCG 3D system at 3pm CEST yesterday.

21st September 2010 (Tuesday)

Experiment activities: Awaiting real data and commissioning the workflow for the new stripping. Launched several MC productions. User analysis.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • none.
  • T1 site issues:
    • RAL: A lot of merging jobs are stalling while trying to access data stored on a few disk servers that cannot cope with the load (GGUS:62242). This is still an issue; jobs are not running as smoothly as we are used to. The contact person should look into it.

20th September 2010 (Monday)

Experiment activities: A few remaining merging productions + user analysis. Recovered some space at CERN. LHCb is very concerned about the status of the T1s with regard to the kernel patch to be applied, considering that data is coming soon.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
  • T1 site issues:
    • RAL: A lot of merging jobs are stalling while trying to access data stored on a few disk servers that cannot cope with the load. GGUS:62242 opened to request throttling of the number of concurrent jobs on the RAL batch system.

17th September 2010 (Friday)

Experiment activities: Mainly user analysis. Recovering space by cleaning up unnecessary reprocessed data. The PPG is currently defining the plan w.r.t. the trigger; therefore, even if LHCb takes data, the new workflow with Stripping10 (10th reprocessing) should first be commissioned. Towards a global shutdown of the Grid?

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 none
  • T1 site issues:
    • RAL: half of the jobs are stalling. The shifter will look at the problem.
    • SARA: 92% of the (production) jobs aborted while resolving the tURL. SAM also reflected the issue; the remote SRM was discovered to be in downtime.
    • PIC: 20% failure rate due to a variety of problems related to the storage. The shifter will look at the problem.

16th September 2010 (Thursday)

Experiment activities: User analysis and a few reprocessing productions running to the end. Smooth operations. Fixed the SRM critical test that was failing at PIC, CERN and CNAF: a first check on the availability of the space is now performed before running the test against (and then assessing) the endpoint (see the sketch below).
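
The following is only a minimal sketch of that pre-check logic, not the actual SAM probe; get_space_token_metadata() is a hypothetical helper standing in for an SRM space query.

    # Sketch only: pre-check the space token before assessing the endpoint.
    # get_space_token_metadata() is a hypothetical helper wrapping an SRM
    # space query; the real SAM probe may be implemented quite differently.

    def get_space_token_metadata(endpoint, space_token):
        """Return (total_bytes, free_bytes) for a space token (placeholder)."""
        raise NotImplementedError("site-specific SRM query goes here")

    def run_srm_critical_test(endpoint, space_token, min_free_bytes=1024**3):
        """Run (and assess) the endpoint only if the space token has free space."""
        try:
            total, free = get_space_token_metadata(endpoint, space_token)
        except Exception as exc:
            return ("UNKNOWN", "could not query space token: %s" % exc)
        if free < min_free_bytes:
            # Space token full: do not mark the endpoint CRITICAL for that.
            return ("WARNING", "%s full, %d bytes free" % (space_token, free))
        # ... the usual put/get/delete test against the endpoint would run here ...
        return ("OK", "space check passed, endpoint test executed")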

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues: Requested the CASTOR team to move 40 TB from the newly created space token lhcbdst to the space token lhcbmdst, which has suffered from a lack of space in recent weeks. Although some cleaning has been performed, lhcbmdst is still below the pledge (309 TB installed, 350 TB required for 2010), while lhcbdst has far too much (87 TB).
  • T1 site issues:
    • NIKHEF: Many pilot jobs aborting with exit status=999. This happens only against their CREAMCEs (GGUS:62132)

15th September 2010 (Wednesday)

Experiment activities: User analysis. Fighting with a shortage of disk space, deciding which reprocessing has to be cleaned up.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues: Issue with LFC replication (GGUS:62078). It had been forgotten that for the LFC replication a special 'real time downstream' optimization is turned on in order to minimize replication latency through the downstream database. It looks like during the recovery of SARA it disabled itself, as there were more capture processes running on the same database.
  • T1 site issues:
    • NIKHEF: Many pilot jobs aborting with exit status=999. This happens only against their CREAMCEs (GGUS:62132)

14th September 2010 (Tuesday)

Experiment activities: User analysis and a few MC productions.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues: The LFC replication tests are failing consistently at all sites; the 2+2 minute timeout is too short. Verified that after 26 minutes the information had still not been propagated down to the T1s (GGUS:62078). A sketch of such a replication-latency probe is given after this list.
  • T1 site issues:
    • IN2P3: SRM SAM tests were failing yesterday; it has been fixed. From the site: "The CRLs are in a shared area on AFS here. We had an overload on our AFS cell yesterday at the timestamps you gave."
    • GridKA: removed the hanging dcap movers, which solved the problem (GGUS:61999).
    • PIC: The site's network server is not as stable as it should be. Looking at the problem; work in progress (GGUS:62019).
    • NIKHEF: CREAM CE issue, pilot jobs aborting there (GGUS:62001). Working in close touch with the developers.
    • CNAF: All jobs aborted at ce07-lcg.cr.cnaf.infn.it (GGUS:62029). Tomcat was restarted.
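
As referenced above, a rough sketch of what such a replication-latency probe could look like, assuming the standard LFC client tools (lfc-ls, LFC_HOST); the entry creation on the master is left out, and the real SAM test may work differently.

    # Rough sketch of a replication-latency probe: after an entry has been
    # created on the master LFC at CERN, poll a T1 read-only replica until it
    # shows up or a timeout expires.  lfc-ls and LFC_HOST are standard LFC
    # client tools; the actual SAM test may be implemented differently.

    import os
    import subprocess
    import time

    def entry_visible(replica_host, path):
        """True if `path` is already visible on the given read-only replica."""
        env = dict(os.environ, LFC_HOST=replica_host)
        with open(os.devnull, "wb") as devnull:
            return subprocess.call(["lfc-ls", path], env=env,
                                   stdout=devnull, stderr=devnull) == 0

    def wait_for_replication(replica_host, path, timeout=240, poll=15):
        """With timeout=240 s (the '2+2 minutes'), a ~26 minute propagation
        delay would be reported as a failure."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if entry_visible(replica_host, path):
                return True
            time.sleep(poll)
        return False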

13th September 2010 (Monday)

Experiment activities:

  • Suffering a lot of problems due to various space tokens getting full at CERN and at different T1s. On Sunday some space was recovered by cleaning up un-merged DST files that were input to merging productions now finished. Recovered 50 TB (12 TB on lhcbmdst) at CERN.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 4
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • Due to our space tokens lhcbuser and lhcbmdst being full, the operator piquet was called twice this weekend and in turn notified the collaboration to clean up some space. Our data manager managed to free some space to make these spaces at least usable and avoid internal alarms.
    • Streams replication: some blocking issues (Streams processes were blocking each other) on Sunday night and the LFC replication was progressing really slowly. Fixed immediately.
  • T1 site issues:
    • CNAF: The ConditionDB at CNAF times out connections from other sites' WNs (and jobs get stalled) (GGUS:61989). Fixed promptly by the CNAF people by opening the listener to the general network.
    • CNAF: All jobs aborted at ce07-lcg.cr.cnaf.infn.it (GGUS:62029).
    • PIC: Files reported UNAVAILABLE by the local SRM (GGUS:62019).
    • GridKA: increased user and production job failure rate. Jobs are stalling while accessing data; it seems to be a problem with the dcap movers (GGUS:61999).
    • NIKHEF: CREAM CE issue, pilot jobs aborting there (GGUS:62001).

10th September 2010 (Friday)

Experiment activities:

  • Production jobs and user jobs were failing due to the corruption of our CONDDB database. The problem was fixed within LHCb. No huge production activity ongoing, mainly users.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN: CREAM CE failing all pilots (GGUS:61957).
    • We had the lhcbraw service class overloaded on Thursday due to a super user running a burst of jobs accessing EXPRESS data. For a couple of hours SLS was reporting the service unavailable.
  • T1 site issues:
    • CNAF: The ConditionDB at CNAF times out connections from other sites' WNs (and jobs get stalled) (GGUS:61989).
    • IN2P3: Transfers to lhcb_dst are failing (GGUS:61904). Disks were moved from another space token to allow jobs to write into it.
    • RAL (and SARA): Observed failures in the Oracle Streams apply process for the LFC. The problem seems to be due to some entries in the T1 instance at RAL (and SARA as a consequence) that do not have corresponding ones in the central catalog at CERN. Digging into the details, it looks like some DNs and FQANs were added manually to the users and groups tables locally on the site to allow the UKI NGI to test the LFC (a ticket dealing with this problem is GGUS:60618). It is worth recalling that any update done on a read-only LFC instance at the T1s creates an inconsistency in the replication and would compromise the whole replication of information, and therefore has to be avoided. Such additions should instead be requested from the central LHCb operations managers, as was done for IN2P3.

8th September 2010 (Wednesday)

Experiment activities:

  • Production jobs and user jobs are failing due to the corruption of our CONDDB database.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN:
  • T1 site issues:
    • RAL: we confirm that we accept the CASTOR upgrade proposed for the 27th of September.

7th September 2010 (Tuesday)

Experiment activities:

  • Lots of jobs failing because our Bookkeeping service was overloaded and was not able to serve requests.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 4
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN: Some VOBOXes were blocked by the SAM people without any warning or discussion with LHCb.
  • T1 site issues:

6th September 2010 (Monday)

Experiment activities:

  • Finishing reprocessing and merging of data at RAL and SARA.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 3
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN: none
  • T1 site issues:
    • RAL: another disk server (gdss379, lhcbuser space token) crashed (GGUS:61825). The reason is not clear; it is back in production again. Requested to decrease the number of job slots, as we are again seeing a very poor success rate of 25% (see picture) (GGUS:61798).
    • GridKA: a lot of user jobs failing over the last week as the watchdog identifies them as stalled. Looking at the logs of these jobs, it appears they are stalling while reading the input data via the dcap servers. A ticket has been raised (GGUS:61841).
    • RAL: pilots aborting against one CE (GGUS:61846).
    • PIC: shared area overloaded.

3rd September 2010 (Friday)

Experiment activities:

  • Merging and data reprocessing + user analysis. ~40K jobs run (at the T1s only) over the last 24 hours.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 4
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN:
      • The LHCb_DST space token (pure disk for un-merged files) has been delivered. The pool ACLs still have to be sorted out.
  • T1 site issues:
    • SARA: Oracle server hosting the ConditionDB down. GGUS:61799
    • SARA: for the last few days all production jobs have been failing to set up the application environment (timeout) (GGUS:61795). The local ConditionDB is suspected.
    • RAL: Oracle server hosting the ConditionDB down. GGUS:61800
    • RAL: very high failure rate since their LRMS was restored to full capacity. We are killing their disk servers: plenty of transfer failures, jobs failing to access data and all activities affected. Opened ticket GGUS:61798 to track this down and to ask the site to throttle the number of job slots to see if it has a positive effect. The plot (24hoursatRAL.png) shows the production jobs at RAL in the last 24 hours, with failures dominated by input data resolution.

2nd September 2010 (Thursday)

Experiment activities:

  • Merging and data reprocessing. 8K jobs concurrently at T1's (quite impressive). Activities dominated by user analysis.
  • Held a meeting with LFC developers and LHCb to address some strange access patterns observed in the server's logs.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN:
      • FAILOVER space full. Space managed by LHCb (RAL backlog to be recovered).
  • T1 site issues:
    • RAL: another disk server (gdss473, lhcbmdst service class) developed hardware problems two hours after being put back into production. The list of affected files was provided to the LHCb data manager.
    • PIC: MC_M-DST, M-DST and DST space tokens got full. The pledge is provided; the space tokens have been banned for writing.
    • CNAF: MC_M-DST and MC_DST space tokens are getting full.
    • IN2P3: Many pilots aborting against the CREAM CE (GGUS:61766).

1st September 2010 (Wednesday)

Experiment activities:

  • Reconstruction, Monte-Carlo jobs and high users activity.
  • LFC: observed some degradation in performance (as originally reported in ticket GGUS:61551). On the DIRAC side all suggestions and improvements from the LFC developers have been put in place, but it seems there is no way to improve further. Closing a session when the server is overloaded simply does not work. The information from SLS does not always reflect the real situation.
  • PIC and CNAF report a critical SRM SAM test failing because the USER space token is full. They provide the pledge for this space token, so the availability of these sites should not be affected by this failure. The SAM tests check neither whether the space is full nor whether the site provides the pledge (see the sketch after this list).
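
Purely as an illustration of the check being asked for; the assessment logic and the example figures below are assumptions, not the actual SAM implementation.

    # Illustration only: a full USER space token should not mark a site
    # unavailable if the installed capacity already meets the pledge.
    # How total/free space and the pledge are obtained is assumed here.

    def assess_user_space(total_tb, free_tb, pledged_tb):
        """Classify a USER space-token result for a site."""
        if free_tb > 0:
            return "OK"
        if total_tb >= pledged_tb:
            # Token full, but the site delivers its pledge: not the site's fault.
            return "WARNING: space token full, pledge provided"
        return "CRITICAL: space token full and below pledge"

    # Example (hypothetical numbers): pledge met, but the token has filled up.
    print(assess_user_space(total_tb=30.0, free_tb=0.0, pledged_tb=30.0))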

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN:
      • (LFC degradation).
      • The intervention on the CASTOR name server has finished OK this morning (upgrade to 2.1.9-8).
  • T1 site issues:
    • RAL: Due to the limited number of connections per disk server, almost all activities (FTS jobs, data uploads, read connections) are failing with timeouts (waiting for a slot to be freed). It is fine to protect the disk servers, but we cannot survive like that.
    • SARA/NIKHEF: user jobs filling up the disk space on the WNs. Chasing this up with the culprits. Apart from that, access to the ConditionDB is still problematic; using other T1s' databases for the time being.
    • CNAF: 3D replication problem: what is the status? (GGUS:61646)
    • GridKA: CREAM CE problem (GGUS:61636) still under investigation. Requested the GridKA people to indicate which endpoint should be re-enabled in the LHCb production mask to verify the solution.
    • PIC: many users report timeouts when setting up jobs, a typical issue related to the shared area. Confirmed that the NFS server was under heavy load due to some ATLAS activity.

-- RobertoSantinel - 29-Jan-2010
