List of old problems

  • Major problems transferring to GridKA. Johan will chase it (gone away?)
  • Problems registering a file in LFC using VOMS groups other than the default "lhcb". Joel, for instance (as a member of the VOMS /lhcb/lhcbprod group), did not manage to register a new entry. Solution: either have secondary groups supported by LFC or add a new ACL in the production LFC (Remedy ticket # CT372673 opened). By the end of this week FIO people will modify the ACL. The ACL has been updated accordingly.
  • GRIDKA: problem of software installation. Joel reinstalled the software from scratch and there now seems to be a problem with the SRM endpoint at GridKA (GGUS #14331). Gone away.
  • PIC : is missing (was a problem with the software area at PIC; GGUS ticket from Joel)
  • CNAF : In the transformation agent there are no files available for the current reconstruction (prodID 1451, 1452) (to be checked which files have been transferred to CNAF). Reprocessing of the data has since started successfully.
  • IN2P3 : jobs failed but appear running on the monitor page (under investigation). The problem seems to be connected with a problematic SRM service running at Lyon (GGUS #14330)
  • DIRAC: It seems that the GUID of each DIGI in the POOL XML slice generated by DIRAC is not the internal GUID of the DIGI (from POOL, also stored in the file catalog) but comes from random generation by the DIRAC wrapper. Any further reference to these DIGI files in the Event Tag Collection file from DaVinci (for the subsequent stripping phase) is then impossible because of this wrong internal reference. The DIRAC job wrapper was not picking up GUIDs from the catalog when generating XML slices. After starting to use genCatalog for ancestors the problem will disappear, but even without ancestors the GUIDs will be correct. The POOL application reads the GUID of each file from the XML slice, so the slice must reflect the real GUID. Solution in the test system: Joel and Andrei will carry out several tests, and reconstruction will start running with this new version of DIRAC once it is proved to work as expected. In the meantime Markus Frank and Joel will start replacing the GUID in each file with the correct one (from POOL) in LFC. This implies reopening each rDST, modifying it, and re-uploading it to T1 tape. DIRAC v2r12 (r11p11 is the candidate) should fix this problem. The workflow tested by Joel over modified rDSTs (with the right GUID) works for stripping; however, there are still a couple of issues that prevent restarting reconstruction and stripping:
    • upload of files: lcg-cp and lcg-cr are not found in the new DIRAC while they were successfully found on the test WMS (a problem of the site where the tests were running)
    • need to have a standard location of the POOL XML slice from Brunel.
    • Test the new format of the ETC files (from DaVinci) used again by DaVinci for reprocessing rDSTs.
    • New algorithm tested and ready to use
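The GUID fix described above can be sketched in a few lines: rewrite the `File ID` attributes of a POOL XML slice so they match the GUIDs held by the file catalog. This is a minimal illustration assuming the standard POOL XML catalog layout; the `catalog` dict is a stand-in for LFC, and none of the names come from the actual DIRAC code.

```python
# Sketch: replace wrapper-generated GUIDs in a POOL XML slice with the
# "real" GUIDs registered for each LFN in the file catalog (here a dict).
import xml.etree.ElementTree as ET

def fix_guids(slice_xml: str, catalog: dict) -> str:
    """Rewrite each <File ID="..."> using the GUID the catalog holds for its LFN."""
    root = ET.fromstring(slice_xml)
    for f in root.findall("File"):
        lfn_el = f.find("logical/lfn")
        if lfn_el is None:
            continue
        real_guid = catalog.get(lfn_el.get("name"))
        if real_guid and f.get("ID") != real_guid:
            f.set("ID", real_guid)  # overwrite the randomly generated GUID
    return ET.tostring(root, encoding="unicode")

# Example slice with a wrapper-generated (wrong) GUID:
slice_xml = """<POOLFILECATALOG>
  <File ID="RANDOM-GUID-FROM-WRAPPER">
    <physical><pfn filetype="ROOT_All" name="srm://example/lhcb/file.digi"/></physical>
    <logical><lfn name="/lhcb/production/file.digi"/></logical>
  </File>
</POOLFILECATALOG>"""
catalog = {"/lhcb/production/file.digi": "POOL-GUID-0001"}
fixed = fix_guids(slice_xml, catalog)
```

For the already-written rDSTs the same mapping has to be applied to the files themselves, which is why reopening and re-uploading them is so costly.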
  • CERN AFS issue. The AFS volume (managed by GD and holding the software run by the batch jobs at CERN) needs to be replicated as soon as possible. This means following a special procedure for installing and making effective changes to the updated experiment software. If no action is taken, the risk is losing the O(1000) jobs running each day at the T0 and also all data in this volume, which is not backed up. Joel will follow this aspect with Rainer. Another question remains: why is the CERN choice driven toward AFS for serving the software area?
    • It doesn't scale with the number of WNs (otherwise we wouldn't be asked to follow this special procedure)
    • It doesn't (yet) allow for installing software via normal jobs, as happens at the other sites.
    • It forces us (even when gssklog is fully operative) to deal with the problem of installing software at CERN in a special way.
The gssklog mechanism in the new gLite CE at CERN has been successfully tested and the AFS replication procedure has been followed for LHCb. AFS will be kept to serve the software area, and the read-only replica will help tackle the scalability issues; lhcbsgm is the AFS volume administrator, so it can run "vos release" to synchronize the master area with the replica(s). The transparency of CERN as a Grid site could be ensured by running this command via a cron job.
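The cron-based synchronization could look like the fragment below. This is a hypothetical sketch: the volume name and log path are assumptions, not the actual CERN configuration; only `vos release` itself comes from the text above.

```shell
# Hypothetical crontab entry for the lhcbsgm account: every hour, release
# the read-write master volume to its read-only replicas so that software
# updates become visible on the batch nodes. Volume name is an assumption.
0 * * * * /usr/sbin/vos release p.lhcb.software_area >> /var/log/vos_release.log 2>&1
```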
  • October 10th: CNAF transfers failing: "Failed on SRM put: Failed SRM put on httpg:// id=529007723 call, no TURL retrieved". GGUS ticket #13989, top priority. Problem suddenly went away.
  • October 10th: Joel cannot upload and register a file. The problem seems to sit in the catalog (GGUS ticket #13991 plus #13958, #13966 and #13967). VOMS mapping on LFC: LFC doesn't support secondary groups, so when the /lhcb/lhcbprod group is presented as primary FQAN (as Joel and Roberto were doing) registration of the file in LFC fails. Change the ACL on LFC.
  • October 9th CERN: Special service at CERN for sandboxes. Medium-high priority. Another special storage is required for storing DIRAC sandboxes. This service should be merged with the existing logs storage element and has then to rely on a file system (CASTOR would not be a good solution). In the meantime it should also be easily extensible (i.e. hardware upgrades completely transparent to the user). DPM? FIO set up a CASTOR disk storage that could be used for this purpose, but it has to be decoupled from the current logs SE until someone (Roberto?) makes a cgi-bin application that allows browsing files within CASTOR. Details on how to access this storage will be circulated soon. A classical disk-based storage element will no longer be supported by FIO. They set up a special disk-only service class. They should also set up a dedicated SRM endpoint to let everybody access the sandboxes stored there. The proposal is to use the gridftp server directly against the only disk server hosting this CASTOR area ( /castor/ ) : gsi The problem went away definitively with the new cgi-bin application that allows browsing files on CASTOR-based backends.
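The kind of cgi-bin log browser mentioned above boils down to rendering an HTML index of the files in a storage area. A minimal sketch follows; the real application talks to a CASTOR backend, while here a plain directory stands in for it, and all names are illustrative.

```python
# Minimal sketch of a file browser page: render an HTML index for a
# directory of log files. A plain directory stands in for the CASTOR area.
import html
import pathlib
import tempfile

def render_index(directory: pathlib.Path) -> str:
    """Return an HTML page listing the files in `directory` as links."""
    rows = "\n".join(
        f'<li><a href="{html.escape(p.name)}">{html.escape(p.name)}</a></li>'
        for p in sorted(directory.iterdir()) if p.is_file()
    )
    return (f"<html><body><h1>{html.escape(directory.name)}</h1>"
            f"<ul>\n{rows}\n</ul></body></html>")

# Usage: build a fake log area and render its index.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "job.log").write_text("ok\n")
page = render_index(tmp)
```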
  • October 6th CERN: High priority. Ask for an LHCb-dedicated gLite RB to be used in production. GD people are following it up. The machine will be used in production, so it should reflect the same quality of service offered by the CMS machine. Delivery of the new hardware is not expected before mid-October, and in order to allow LHCb tests (see the tests section) we explicitly asked for rb101 (an ATLAS-dedicated machine, currently unused) to be installed and configured (and temporarily used by LHCb). Got it (ATLAS claimed rb101 back!) with the same configuration as rb102; awaiting the new dedicated hardware to be used as a gLite WMS.
  • On October 4th CNAF was declared fully operative and transfers started going through CNAF CASTOR2. High-priority issue - CNAF cannot reconstruct data (site admins aware of the problem). It also seems to affect ongoing analysis activities. Problems using CASTOR2 at CNAF: this is a long-standing issue too. Basically CNAF uses a different configuration than CERN, where for each VO there is a dedicated instance of the DB and LSF. After migrating to CASTOR2 (Philippe also suggested downgrading to CASTOR1 at CNAF until the problem is fully understood) LHCb did not manage to run DC06 jobs at CNAF in a continuous and smooth way. There are several reasons behind this: the single disk server serving LHCb requests from LSF was not enough; there was also a limit on the max number of jobs per disk server, increased to 300 (fixed). The DB is overloaded (deadlocks) and all requests to the stager get stuck. The pure disk pool (no garbage collector) seems to have problems accessing files once it becomes full (with consequent pending jobs overloading the LSF queue). During the summer some of the CASTOR SQL code was patched to fix the deadlock problems, but deadlocks still occur sometimes. The CNAF system had quite a lot of hot fixes. For this reason, following the recommendations of the CERN CASTOR development team, they decided to cleanly upgrade to the newest recommended stable release (2.1.0-6 for servers and 2.1.0-8 for clients). The upgrade started on September 18 and was planned to end on September 20. At the same time they completed the migration of VOs from CASTOR 1 to CASTOR 2. The net result was an almost complete block of the system for more than one week: operability was only recovered on September 29.
During this period the search for the problem focused on the Oracle database back-end, which indeed performed quite poorly: the CERN CASTOR development team performed a deep fine-tuning of the database with small improvements. The cause of the slowness of the database was actually the presence of spurious and duplicated entries referring to the wrong name server (actually an alias). When, at last (September 29), someone from the CASTOR operations group was able to look into this problem, they quickly understood it and restored the operability of the system. Moreover, as a consequence of the upgrade, tape server activity was also stuck for one week, due mostly to a packaging error. In detail: the standard Linux tmpwatch package removes old files and directories from /var/tmp, while tape servers use the directory /var/tmp/RTCOPY when they mount a cartridge, and the absence of this directory leaves the tape server unable to work. On our tape servers the directory /var/tmp/RTCOPY was removed by tmpwatch. At CERN the standard tmpwatch has been replaced with a customized package which preserves the RTCOPY directory from removal, but this special tmpwatch package (actually only a configuration file change) was not present in the CASTOR 2 RPM repository and the tape server RPMs have no dependency on this special tmpwatch. CNAF access to data is still problematic (analysis and reconstruction are failing). Transfers to CNAF after the intervention (upgrading to the same CASTOR version used at CERN) were still failing. These transfers were originally thought to be failing because the DNS at CERN didn't resolve the SRM endpoint at CNAF; it is now evident they are failing for some trickier problem on the CNAF CASTOR side (GGUS #13121 for these transfer issues). As regards the more generic access problems, they have already been reported to the site managers, who are looking into them.
  • October 2nd: IN2P3: seems to be missing on some of the WNs of the site (GGUS #13493). The compat-* libraries have been added. Problem with AFS?
  • October: Problem in transfer data to RAL (Andrew is looking at that with RAL managers).
  • September 28 : GridKa network connection down. FTS to GridKa stopped. (It is back up again)
  • September 28 : it seems we were running out of tape space at PIC. Ricardo fixed this problem and now PIC tape is again OK
  • September 27 : Failing reconstruction jobs at GRIDKA (GGUS #12320)
  • September 26 : Failing transfers to RAL (GGUS #13208) - solved. The GridFTP doors had crashed again.
  • September 26: Overloaded rb108 (used as fallback RB during the intervention on the RAID 5 of rb107) with too many threads submitting there. Although able to sustain up to 5K jobs, the increased number of threads generated a huge backlog with high load. Rolled back to a less stressful use of rb108, now balanced with the use of rb107 (which is back in production). rb108 is now used with just 2 threads (being shared with other VOs) while rb107 (some of whose services originally weren't running fine) starts to receive 5 threads.
  • September 26: Transfers failing at Lyon (GGUS #13215, #13050, #13298): transfers temporarily problematic.

  • July 13: Many FTS transfers to CNAF are failing - solved? Stress testing the system now
  • July 12: Problem with FTS transfers to CNAF. Being investigated by Angelo.
  • July 08: Problem with all FTS transfers. One transfer to PIC (ID : 2731) has stalled. All other transfers (IDs : 2734 - 2754) are in the "waiting" state.
    1. The transfers have been stopped (the rates for CERN, RAL, PIC and IN2P3 are set to 0) : Restarted now
    2. The agent_runsv has been stopped on lxgate34 : Restarted now
    3. Someone has to be informed - who? email sent to lhcb-dirac-developers
    4. Problem with the AFS area yesterday, which was fixed. Did this have any effect? Do not know, but things seem to be back for now.
    5. Transfers to PIC are still slow. Not doing any further file registrations there for now.
  • July 07: Problem at PIC. They keep transferring data but jobs are failing.
    1. The shared area where the srm files are was not online. Fixed now.
  • July 06: Overload of the Castor2 LSF scheduler with ~1.3M requests due to a bug introduced in the scheduler software
  • July 06: Many transfers failed (and are still failing) with a "Resource Busy" message from CASTOR (@CERN and @PIC): this was/is due to corrupted entries (from previous transfers that timed out or failed) that CASTOR (for consistency) refuses to overwrite.
  • July 06: Many transfers against CERN CASTOR (SRM endpoint) were failing using srmcp: tcsh (to which lhcbprod is mapped on the CASTOR machines) wasn't set. This was the second time LHCb ran into this problem in a few months.
  • July 06: local SE problems at CNAF; unable to access data out of SE - jobs running but ticket not closed yet
  • July 06: GridKa - new CE flickering in the information system means that agents are not starting at GridKa (problem now solved by splitting the various services running on the CE)
  • July 06: RAL: Very slow data access. The solution to this problem (overload of the disk server once too many requests were coming simultaneously) has been to provide LHCb with a dedicated disk server that is now able to handle the LHCb needs.
  • July-August 06: Issues with slow transfer rates into CERN. The transfer efficiency degraded gradually until none of the transfers to CASTOR at CERN (either through castorgrid or to SRM) succeeded. The problem has been understood and was due to a combination of several problems:
    1. Faulty disk server
    2. There was a 'black hole' effect caused by an inconsistent state on LSF which was a ripple from the DNS problems at CERN.
    3. Misconfiguration of the HTAR routing the traffic from T2s to CERN (CMS was also affected by this same problem) (investigation by Olof)
  • August 06: Extremely poor performance of the new LCG RBs. The high-end RB for LHCb (with 4 GB of memory, dual 3 GHz Xeon processors and RAID-5 disk) got screwed up with just ~1.5K jobs in the belly. The same has been reported by CMS. There could be something weird in the software. Maarten and others found that there was a cron job running on these machines that was generating excessive load. After stopping this (useless) cron job the performance of rb107 went back to much more expected values (5000 concurrent jobs easily sustainable). In the meantime, given the high I/O-wait CPU usage typical of RBs, it also became evident that big improvements can be obtained by optimizing the usage of RAID 5; tuning has been applied on rb108 and tests are under way. It is expected that these high-end machines will be able to handle up to 10K concurrent jobs.
  • August 06: the special SE for log files enters a deadly loop when more than ~1-2000 connections are open simultaneously. It enters a regime where memory gets exhausted and it starts swapping; since the clients keep trying to connect and new connections are accepted, the load of the machine increases until it dies. It is urgent to limit the number of simultaneous connections a gridftp server can accept. A couple of possible solutions:
    1. use of uberftp (which, via the concept of a session, keeps the connection open for further transfers)
    2. creation of archives of logs to be shipped in one go to the gridftp server instead of transmitting the 20-25 files for each job.
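Option 2 above is straightforward on the job side: bundle the 20-25 log files into a single tarball so that only one gridftp connection is needed per job. The sketch below uses only the standard library; file names and paths are illustrative, not the actual DIRAC layout.

```python
# Sketch of option 2: pack all log files of a job into one gzipped tarball,
# so a single transfer replaces 20-25 separate gridftp connections.
import pathlib
import tarfile
import tempfile

def archive_job_logs(log_dir: pathlib.Path, job_id: str) -> pathlib.Path:
    """Pack every file in log_dir into one tarball and return its path."""
    archive = log_dir.parent / f"{job_id}_logs.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for f in sorted(log_dir.iterdir()):
            tar.add(f, arcname=f.name)
    return archive

# Usage: create a fake job log directory with 25 files and archive it.
tmp = pathlib.Path(tempfile.mkdtemp())
logs = tmp / "job_0001"
logs.mkdir()
for i in range(25):
    (logs / f"step_{i:02d}.log").write_text("log line\n")
tarball = archive_job_logs(logs, "job_0001")
```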
  • August 06: Overall request to make the deployment of lcg-utils much lighter, with an officially maintained tarball distribution that could be shipped with LHCb jobs. Discussion with the developers: the best way to certify the software is to put it in the hands of the experiments for real use. Uncertified releases of new clients are available under a special AFS area or through web pages. There are no release notes nor special announcements of their availability because these RPMs are intended for the certification team. Only a close interaction between end-users and developers can ensure the right level of awareness.
  • August: Andrew's (actually Raja's) long proxy on myproxy-fts expired and all transfers to active centres (RAL, IN2P3 and PIC) were failing with the message "Failed to get proxy certificate from . Reason is ERROR from server: requested credentials have expired". DC06 was basically stopped for 12 hours.
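A simple guard would have caught this before transfers started failing: warn when the delegated proxy has less lifetime left than some safety margin. The sketch below is an assumption about how such a check could be written, not part of any FTS tooling; the 24-hour threshold is illustrative.

```python
# Sketch of a proxy-lifetime guard: flag a proxy for renewal when its
# remaining lifetime drops below a safety margin (threshold is illustrative).
from datetime import datetime, timedelta

def proxy_needs_renewal(expiry: datetime, now: datetime,
                        min_left: timedelta = timedelta(hours=24)) -> bool:
    """True if the proxy expires within min_left (or has already expired)."""
    return expiry - now < min_left

# Usage with hypothetical timestamps:
now = datetime(2006, 8, 15, 12, 0)
ok = proxy_needs_renewal(datetime(2006, 8, 20), now)        # days of lifetime left
late = proxy_needs_renewal(datetime(2006, 8, 15, 18), now)  # only 6 hours left
```

Run periodically (e.g. from cron) against the credential stored in MyProxy, this turns a silent expiry into an early warning.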
  • August: PIC unable to process data for DC06. Outage of the SE? Installation of the LHCb software? Both were causes. PIC went back to good shape after reinstalling the software and after some intervention on their SRM server. What screwed up the installation is still to be investigated.
  • September: the largest IN2P3-CC queue length doesn't fit LHCb simulation job needs. LHCb needs 216000 seconds on a 500 SI2K machine; Lyon was publishing SI2K=892 and a MaxCPUTime=1751. The problem has been fixed by increasing the CPU time limit to 2500.
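The arithmetic behind the fix can be made explicit, assuming the usual SI2K normalization (CPU time scales inversely with CPU power, limits expressed in minutes): a job needing 216000 s on a 500 SI2K reference CPU needs about 2018 minutes on Lyon's 892 SI2K nodes, which exceeds the published 1751-minute limit but fits under the new 2500-minute one.

```python
# CPU-time scaling, assuming SI2K normalization: time on a node scales as
# reference_power / node_power. Limits are compared in minutes.
def scaled_cpu_minutes(ref_seconds: float, ref_power: float,
                       node_power: float) -> float:
    """Minutes of CPU time a job needs on a node of the given SI2K power."""
    return ref_seconds * ref_power / node_power / 60.0

need = scaled_cpu_minutes(216000, 500, 892)  # ~2018 minutes on Lyon's nodes
```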
  • September: RAL hanging transfers. Their gridftp doors fell over. This was leading to stuck transfers on the LHCb tape pools, preventing other transfers from happening. They killed those hanging transfers and implemented automatic killing of any transfer with which they have had no contact for about 30 minutes (to avoid the problem in the future). Still working on the cause of the gridftp door crashes.
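RAL's mitigation amounts to a stale-transfer reaper: periodically scan the active transfers and kill those with no contact for 30 minutes. The sketch below keeps the bookkeeping in a plain dict as an illustration; the real state lives inside the dCache/gridftp services.

```python
# Sketch of a stale-transfer reaper: identify transfers whose last contact
# is older than 30 minutes, so they can be killed and their pools freed.
import time

STALE_AFTER = 30 * 60  # seconds without contact before a transfer is killed

def stale_transfers(last_contact: dict, now: float) -> list:
    """Return the IDs of transfers idle for more than STALE_AFTER seconds."""
    return sorted(tid for tid, t in last_contact.items()
                  if now - t > STALE_AFTER)

# Usage with hypothetical transfer timestamps:
now = time.time()
transfers = {"t1": now - 10,        # active
             "t2": now - 45 * 60,   # idle 45 min -> stale
             "t3": now - 31 * 60}   # idle 31 min -> stale
to_kill = stale_transfers(transfers, now)
```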
  • August/September: Problem at GridKa (GGUS #11599). Apparently the impossibility of reprocessing data at GridKa was due to an "overloaded gridftp server" problem. Further investigation seemed to converge on a communication problem between the WNs and the CERN top BDII currently used at every site by LHCb jobs. However, very recent tests show that the overwhelming majority of the failures come not from the information system but from the gridftp doors, which sometimes seem to close their sockets. Doris cured the problem by rolling back to a previous dCache configuration and old settings. No more information, although the problem seems to be correlated to the one at RAL (previous point).
  • September: FTS got stuck for 4 hours on Friday 15th. Tomcat was down and no alarm was triggered; all clients were hanging. The web server was restarted by hand; work is ongoing on the procedures.
  • September: Joel's use case. Joel is both production manager and Experiment Software Manager. He is carrying on LHCb production with the SGM account, but site administrators are not keen on that: they roll back the high priority assigned to the sgm account on their sites to allow more than 1 concurrent job running (NIKHEF). One solution is to pass to VOMS and run SFT jobs (used for installation) explicitly by issuing voms-proxy-init --voms lhcb:/lhcb/sgm instead of a normal grid-proxy-init, which could be used indifferently for production jobs. VOMS seems to be ready to cope with this use case; we need to prove it by checking the status of the Grid in this regard. The adopted solution is to propagate group-only mappings to the sgm and prd accounts (without extra roles) by changing on each site the groups.conf YAIM file used for creating the LCMAPS grid-mapfile. In this way Joel can be production manager and software manager on demand and doesn't need to run production with his sgm account (while still allowing highest priority to sgm jobs).
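A groups.conf along the following lines would express that mapping. This is a hedged sketch: the exact field layout varies between YAIM versions and should be checked against the site's own YAIM documentation; only the idea of mapping FQANs to the sgm/prd pool accounts comes from the text above.

```shell
# Hypothetical YAIM groups.conf fragment (field layout is an assumption;
# verify against the deployed YAIM version): map the software-manager and
# production FQANs to the sgm and prd pool accounts, plain members to the
# default pool.
"/lhcb/ROLE=lcgadmin":::sgm:
"/lhcb/ROLE=production":::prd:
"/lhcb"::::
```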
  • September: data access problem at GridKa: 12 jobs running at GridKa were failing because of some problem on the storage system (GGUS #13016). Temporary problem; it simply went away...
  • 19-20 September: FTS: it seems that sometimes the authentication process fails with an error returned along the lines of "not authorized to use service" (GGUS #12953). This is a load problem on the service side and the developers are setting up another FTS server to add to the load-balanced service. The problem seems to be related to this new node, which had a wrong DN --> user mapping. Requests ending up on this machine were then failing. The new (faulty) node was configured to (still) belong to the validation testbed (instead of production), which explains why it was accepting only DTEAM. Furthermore, the info provider was generating too high a load on the old server. The combination of the two fixes brought the central FTS service back to fully operational status.
Topic revision: r28 - 2006-11-16 - unknown