Integration Incident Tracking

This wiki page describes problems found during the preparation of production, preproduction and staged rollout releases. The aim is to understand and learn from our errors in order to improve the current process and avoid similar scenarios in the future.

Production releases

Item Date Update Description of the problem Why did this happen? Solution for the future
1 29/04/09 44 3.1 Reported by Maarten Litmaath. Production update didn't appear as HIGH for UI and VOBOX in the gLite web pages even though it contained PATCH:2928, which was marked with priority HIGH. Therefore, sites didn't apply it immediately. The release scripts are run with an argument that defines the priority of the update as a whole. However, this can't be done at patch/service level. In that case, the label HIGH has to be manually applied to the affected services. In the production checklist there's a new step to check whether individual patches are of priority HIGH. Special attention has to be paid to those updates that are not priority HIGH as a whole but that contain individual patches that are HIGH. This has to be taken into account when announcing the release in production, and the priority announced in the mail and in the gLite news box has to be agreed with Operations.
2 04/05/09 45 3.1 Reported by GGUS ticket 48397 -> yaim-clients 4.0.7-3 was missing in the generic and WN 32bit repos. Another problem was reported by Di by mail -> GFAL 1.11.4 and lcg_utils 1.7.2 were missing in the WN 64bit repo. yaim-clients -> We think patch 2921 wasn't applied to the production lists and therefore the repo was not updated with this rpm. This is not detected by the release notes creation. GFAL -> We think there was a disk space problem and the rpms were not copied. This caused inconsistencies in the repo, the tarballs, and the rpm lists in html and txt format. It also affected update 46, which was being previewed. Since update 46 was based on update 45, it inherited the errors and the same corrections had to be made. We need to check carefully that we are releasing all the patches that have been requested and all the rpms included in those patches. This is already stated in the production checklist. A link to the check scripts that Maria has written will be added.
3 07/05/09 45 3.1 Update 46, which was ready in the prepare area, was copied by mistake into the production location. When trying to fix the problem reported in item 2, the prepare area was used. Update 46 was already available there. Communication problem. The check scripts were not run either to check what was being released. Create a check list for special situations. If an rpm is detected to be missing in the production repo, the prepare area shouldn't be used. Instead, a new copy of production should be made in the prepare area and that copy should be used to fix the problems. The prepare area contents are unpredictable.
4 08/07/09 49 3.1 Reported by GGUS ticket 50110 -> glite-HYDRA updates were not included. The list of services that is used to create the gLite web pages didn't contain glite-HYDRA. Make sure the list is correct and that it contains all the services that are currently in production.
5 09/07/09 03 3.2 Reported by Bug 53013 -> BDII rpms are not signed. The make-release script is probably not copying packages from the certified repo where rpms are actually signed. Since this is only detected when actually installing the middleware, it was not detected (we do deployment tests without actually installing the middleware). Make sure the new rpms that reach the 3.2 production repository are signed. Otherwise the scripts need to be modified.
6 23/07/2009 02 3.2 Reported by GGUS ticket 50487 -> A missing package in the glite-WN rpm list. This is very strange since these lists are taken from the output of a yum install and the package is correctly installed there. Monitor the RPM lists a bit more in the future to detect possible bugs in the scripts.
7 28/07/2009 02 3.2 Reported by GGUS ticket 50584 -> The new repos structure (GENERIC + node types) hasn't been announced. We can't change something like this without announcing it. This was changed from update 01 to update 02 and in any case it's different from 3.1. Sys admins try to find packages in the usual places and they are not there, especially when in update 01 things are as usual and we change them in update 02 without saying anything. This is now explained in the Generic Install Guide.
8 29/07/2009 04 3.2 Reported by GGUS ticket 50630 -> wrong URLs. The links to the new packages listed in each service update page were wrong. This is because the release scripts were not available in 3.2. Previous releases were done manually and the pages probably contained errors. Now the scripts are adapted for 3.2 and I guess some more testing is needed. I've fixed the code where this was not properly generated and I think this will be OK for future releases. A quick review of the URLs should be done as well.
9 18/08/2009 52 3.1 Reported by Sophie Lemaitre to Integration team. The RPM lists in txt format do not contain the full URL pointing to DAG, jpackage and SL packages. This is due to changes in the release scripts to make them work in SL5. There was an error in one IF that was the cause of this problem. This is now fixed for upcoming releases and the rpm lists are fixed manually.
10 17/09/2009 05 3.2 Detected by Maria Alandes. When preparing update 05, the 3.2 production repository was overwritten by the new production 3.2 repository. When preparing a production release, the new repository is created in the production location as R3.X_new. This is a manual step. If the _new suffix is forgotten, disaster happens. A broadcast was sent and the repository was unavailable for one hour. Be very careful when copying things into the production location and contact the Helpdesk to find out how to stop the synchronisation with the real production repository and how to retrieve backup copies.
11 22/09/2009 05 3.2 Detected by Andreas Unterkircher, also reported in the LCG-ROLLOUT and in GGUS ticket 51771. The 32bit versions of GFAL and lcg_util were not included in the release. The 3.2 release process now signs rpms from the patch repositories and this is what is used later to create the production repo. The script move-patch -s cert doesn't take into account packages that need to be included also for 32bit in a 64bit patch, because at certification time we only check 64bit. In this case the 32bit versions of GFAL and lcg_util were not included in the patch repo and therefore they weren't included later in the production repo. When doing a check_prod_all this wasn't detected either. The move-patch script needs to be modified to take this into account. This problem didn't give any error when doing a first installation, so it wasn't detected when creating the release. Maybe we need to think of a better way to check the packages we actually release and take into account the cases where we need 32bit and 64bit versions for the same update. Again, certification is far from production as far as repositories are concerned.
12 23/09/2009 05 3.2 Detected by Maria Alandes. The package vdt_globus_jobmanager_common-VDT1.10.1x86_64_rhap_5-3.x86_64.rpm was copied by mistake into the production repo. This has happened because of the external packages reorganisation that has taken place in this update. External packages were not properly copied into RPMS.externals in the previous 3.2 updates due to a misuse of our integration repository. Since external packages should not be signed in 3.2, it's good to fix the location of packages and make sure they are always installed in RPMS.externals, where we don't sign rpms. This has happened because in this update there's been a manual intervention unlikely to happen again.
13 28/09/2009 3.1 and 3.2 Detected by Steve Traylen. URLs pointing to the SL5 dag repository were incorrect. The gLite web pages contain links to packages provided by the DAG repository, SLC repositories and jPackage repositories. The release scripts which create these URLs contained a wrong base URL for the SL5 dag repository. Probably this was changed by mistake when implementing the changes needed for 3.2. Pablo created a script to automatically update the URLs to point to the correct ones and he has also investigated a tool, called linkchecker, to check that the gLite web pages don't contain broken links.
14 30/09/2009 05 3.2 after recreating repo due to incidents 11 and 12 Reported in GGUS ticket 51865. WN failed to install. The technical reason for this problem is explained in the GGUS ticket. This is difficult to detect since it only gives problems when installing a TORQUE client before a WN on the same SL5 machine. Not everybody would experience this problem and in fact we didn't detect it in certification. Our scripts now include an option to recreate the whole repository at the end of the process where new packages and metapackages are added to the repo. Moreover, if we split the repository, this problem can't happen again.
15 13/01/2010 60 3.1 Reported by GGUS ticket 54613. Wrong tarballs. Tarballs were copied from the beta repository as described in our procedure. The problem is that the tarballs in the beta repository came from a staged rollout release that was actually rolled back. They were not deleted and we used them by mistake. To fix this we have copied the tarballs from the previous production update, since update 60 just introduces a new version of lcg-vomscerts that doesn't actually affect the tarballs. We need to make sure we clean up the necessary things when doing a rollback.
16 13/01/2010 60 3.1 Reported by GGUS ticket 54648. Tarballs for 64bit WNs were missing. Probably the steps to create the tarballs were not executed for SL4 64bit. The check-links script was probably not executed either and that's why this wasn't detected in the release pages links. To be diagnosed with the integrator to know more details.
17 13/01/2010 dCache release Reported by GGUS ticket 54633. dCache metapackages not linked properly. After splitting up the repos, the last production version of the dCache metapackages and some of their dependencies were not included and linked properly in the new split repos. The URLs in the release pages were also pointing to generic-dcache, which no longer exists. This happened because for some reason, the rpm lists of the last production version of the dCache metapackages contained generic-dcache in the URL instead of the metapackage name. The script creating the split repo failed because of this but we didn't detect it because the install tests were OK. Of course, they were installing the version before the last one, but we were not checking it. Hopefully this will not happen again since the repos are now split.
18 25/01/2010 60 3.1 glite-MPI_utils has a bug, and has to be installed using a workaround. Therefore, the release notes scripts are failing and we have to skip the creation for MPI_utils and do it by hand. The release notes were copied from the staged rollout (and the ones in the staged rollout from an old version) and modified to fit the new release, but there were some old things that were not changed. Take special care when dealing with MPI_utils release notes.
19 15/02/2010 08 3.2 WN Tarball wasn't correct as reported by GGUS ticket 55560. We think it's because of quota problems. But it's strange that no error was given when copying the tarball. It may be worth installing the tarballs as soon as they are created to test that they are correct.
20 09/02/2010 08 3.2 WN and UI rpm lists are not complete. WN is missing vdt_globus_jobmanager_common and UI is missing gsiopenssh. This has been reported by Maarten and Cristina in the EMT mailing list. I opened GGUS ticket 56298. This is due to a reincarnation of bug #56200. In SL5, this wasn't a problem, at least with yum 3.2.19. But we used new virtual machines for release 3.2 08 where yum was 3.2.22. This version of yum seems to display the yum install output in shorter lines, and when the name/version of a package is very long, the line is broken in two. Our script is not able to parse these broken lines and some packages are not taken into account. Since we don't do deployment tests with the rpm lists it's difficult to detect these errors. However, the scripts could probably be improved to deal with these special cases. We need to review the scripts so that they don't depend so much on the output of the yum install. To be investigated...
21 30/03/2010 09 3.2 Some i386 rpms ended up in the glite-SE_dpm_mysql, glite-VOBOX and glite-LFC_mysql repositories. These rpms should only go to glite-UI and glite-WN, and therefore were creating conflicts when installing the affected services. The problem was tracked with GGUS ticket 56803. We are not sure yet but probably it's an error in the script that decides when to copy the compatibility mode rpms. This should be done only for glite-UI and glite-WN, but in this patch other services were also affected and the i386 packages were copied even though they shouldn't have been. To be investigated...
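
The line-wrapping problem described in item 20 can be illustrated with a small parser. This is only a sketch, not the actual release script: the function name parse_yum_packages, the set of architectures and the sample output (including the glite-WN and gsiopenssh versions) are made up for illustration. It assumes that yum prints an over-long package name alone on an indented line and puts the remaining columns on the next line.

```python
# Architectures expected in the second column (assumption, extend as needed).
ARCHES = {"i386", "i686", "x86_64", "noarch"}


def parse_yum_packages(output):
    """Return (name, arch, version) tuples from a yum transaction table.

    Rows that yum wrapped onto two lines (name alone, then the other
    columns) are joined back together instead of being dropped.
    """
    packages = []
    pending_name = None  # a name seen alone, waiting for its columns
    for raw in output.splitlines():
        fields = raw.split()
        if not fields:
            pending_name = None
            continue
        # Continuation line: starts with the architecture column.
        if pending_name is not None and len(fields) >= 2 and fields[0] in ARCHES:
            packages.append((pending_name, fields[0], fields[1]))
            pending_name = None
            continue
        pending_name = None
        if len(fields) >= 3 and fields[1] in ARCHES:
            # Normal row: name, arch, version all on one line.
            packages.append((fields[0], fields[1], fields[2]))
        elif len(fields) == 1 and raw.startswith(" "):
            # Indented single token: assume a wrapped package name.
            pending_name = fields[0]
    return packages


# Hypothetical sample of wrapped yum output.
SAMPLE = "\n".join([
    "Installing:",
    " glite-WN                 x86_64   3.2.6-0                    glite-WN   2.1 k",
    "Installing for dependencies:",
    " vdt_globus_jobmanager_common",
    "                          x86_64   VDT1.10.1x86_64_rhap_5-3   glite-WN    12 k",
    " gsiopenssh               x86_64   VDT1.10.1x86_64_rhap_5-1   glite-UI   1.1 M",
])

if __name__ == "__main__":
    for name, arch, version in parse_yum_packages(SAMPLE):
        print(name, arch, version)
```

With this approach the wrapped vdt_globus_jobmanager_common row is reported together with the rows that fit on one line. A more robust alternative would be to avoid parsing the human-readable yum output at all, as suggested in the item.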

Staged rollout releases

Item Update Date Description of the problem Why did this happen? Solution for the future
1 03 3.2 01/12/09 Reported by task 12869. The SL5 preview pages contained a series of errors: the 3.2 main updates page was referencing 3.1 all the time; the metapackage update pages contained wrong URLs pointing to production instead of the beta repo; the external packages of all the metapackages involved in the Bundle were not included in the web pages; a wrong repo file called beta-glite-GENERIC.repo was part of the repo files although it was not necessary, which created confusion among users. We think all these errors happened because it was the first time we were releasing a 3.2 Bundle to Staged rollout (the previous bundle was rolled back). Our scripts were obviously not tested enough to produce correct web pages for 3.2 Staged rollout. Moreover, the release was finished on Friday and there was little time left for extra verification. This was a bad idea. We have already identified the problems in the scripts and fixes have been committed. The web pages have been regenerated with the new version of the scripts fixing all the existing problems, so this won't happen again in the future. In order to detect this type of error before we announce the release to the users, we have opened internal bugs 59908 and 59914 to improve the release process and automate some checks which would have helped us to detect this type of error earlier.
2 03 3.2 03/12/09 Reported by bug 59965. VDT packages always fail to update due to their versioning scheme. This wasn't properly documented in the release notes (why was this not detected in certification? That's another issue). While looking at this we realised we didn't include 32bit versions of vdt_globus_essentials, voms-api-c and voms-api-cpp in the UI and WN repositories. Once we added the 32bit versions of these packages to the repo, we forgot to sign them and we needed to recreate the repo once more after signing them. This is due to bug 56188 and the fact that CREAM patch 3260 introduced new versions of vdt_globus_essentials, voms-api-c and voms-api-cpp, which we didn't realise. This would have required us to manually copy the 32bit versions of those packages into the patch repo, which we didn't do. We forgot to sign the rpms since we did an emergency intervention for which we don't have any check list and we forgot one important step. I've created bug 60045 to improve our install tests and be able to discover this type of mistake earlier. Moreover, we really need to fix bug 56188 to avoid creating wrong repositories already in the certification phase. We have also contacted certification to make sure this is identified there as well. The script that creates the staged rollout repo is different from the PPS and prod script. In that case, we would have got the 32bit packages. The philosophy of the scripts is different and now we do things in a different way that is more efficient. We rely more on the patch repos from where we copy the rpms, that's why we need them to be correct and contain also the 32bit packages. We also need a clear procedure for emergency interventions.
3 03 3.2 20/01/10 The i386 rpms for libdcap-* were copied into the VOBOX RPMS.externals repository by mistake. This created dependency problems during an upgrade. For more information see GGUS ticket 54785 and Task 13419. This wasn't detected because we did clean installation tests, but we didn't do upgrade tests. #META tags were also included by mistake in the VOBOX dependency list. These have been removed so it won't happen again in the future. We should run upgrade tests as well.
4 03 3.2 10/02/2010 Danica has reported a checksum error when trying to install the FTS (after adding the latest patch for FTS 2.2.3). We think it's because the preview repo wasn't copied properly from the prepare area. We copied the new metapackages but forgot to copy the new repodata as well. It's best to copy the full repository to make sure we also copy the new repodata.
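
The missing 32bit packages in item 2 (and in production item 11) suggest a check that could run against a patch or release repository before it is published. The following is a minimal sketch under stated assumptions, not one of the existing check scripts: the function name missing_arches, the list of package names and the version numbers in the usage example are hypothetical, and it assumes rpm files follow the usual name-version-release.arch.rpm naming with i386 as the 32bit architecture.

```python
import re

# name-version-release.arch.rpm (version and release contain no hyphen).
RPM_RE = re.compile(
    r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)"
    r"\.(?P<arch>[^.]+)\.rpm$"
)


def missing_arches(rpm_files, multiarch_names, required=("i386", "x86_64")):
    """Return {package name: [missing architectures]}.

    rpm_files       -- rpm file names found in the repository
    multiarch_names -- packages that must exist for every required arch
    """
    found = {}
    for filename in rpm_files:
        match = RPM_RE.match(filename)
        if match:
            found.setdefault(match.group("name"), set()).add(match.group("arch"))

    report = {}
    for name in multiarch_names:
        missing = set(required) - found.get(name, set())
        if missing:
            report[name] = sorted(missing)
    return report


if __name__ == "__main__":
    # Hypothetical repository content; the versions are invented.
    repo = [
        "voms-api-c-1.9.10-2.x86_64.rpm",
        "voms-api-c-1.9.10-2.i386.rpm",
        "voms-api-cpp-1.9.10-2.x86_64.rpm",
    ]
    wanted = ["voms-api-c", "voms-api-cpp", "vdt_globus_essentials"]
    for name, arches in missing_arches(repo, wanted).items():
        print(name, "is missing for", ", ".join(arches))
```

The list of packages that need both architectures has to be maintained by hand in this sketch; deriving it from the UI and WN metapackage dependency lists would be a natural next step.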

Preproduction releases

Item Update Date Description of the problem Why did this happen? Solution for the future
1 46 3.1 28/04/09 Reported by Esteban Freire. PPS update 46 had the lcg-CE pps repository not pointing to the generic one. Therefore the last PPS-lcg-CE metapackage couldn't be found under generic. PPS mirrors always point to generic and in this update they couldn't find the newest PPS-lcg-CE metapackage. There was a missing symlink from lcg-CE to generic in /afs/ Why? No idea. But this breaks the whole pps release process since the scripts rely on these symlinks. This wasn't detected with the check scripts since PPS update 46 contained a large number of new metapackages and probably it wasn't detected that the lcg-CE one was missing. Make sure the prepare area contains all the symlinks to generic. WN and dcache are an exception here. The check scripts have to be improved.
2 47 3.1 28/05/09 Reported by Farida Naz in bug #50983. glite-info-provider-service package was not defined as a dependency in PPS-lcg-CE. The integration script that retrieves the metapackage changes from a patch is not very reliable. Sometimes it works and sometimes it doesn't. In update 47 it failed in almost all the patches with metapackage changes. This was detected and fixed manually. However, patch #2841 was missed. If the metapackage changes are not properly retrieved, we miss dependencies. This wasn't detected by the check scripts since the package itself was copied into the repo. This can only be detected by doing a configuration of the lcg CE. Investigate why the script is failing and make it work under all circumstances. Moreover, I've added a step in the pps checklist to make sure metapackage changes are properly applied when moving a patch into pps.
3 02 3.2 27/05/2009 Reported by Farida Naz. Patch 2875 SL5 UI, is rejected due to several problems. Bug 50923 detects we are using a wrong version of voms-admin. This is not detected in certification because we don't test voms-admin. Bug 51148 detects a missing path in PYTHONPATH. This is not detected in certification because the python GFAL and lcg_util are not tested. Missing tests in certification. Bad communication with developers? (Why were we using a wrong voms-admin?) Regression tests have been included to test voms-admin and python GFAL and lcg_util
3 03 3.2 08/06/2009 Reported by Farida Naz. Bug 50923 detects we are still using a wrong version of voms-admin! Patch 3035 defines the correct version of voms-admin. When applying the patch to the pps dependency lists, the version was not properly updated; in this case it was downgraded. The check_scripts should have detected this since voms-admin 2.0.8-1, the right version, is missing from the repo. This was not detected in certification since the patch repository was properly created containing the right dependency. The problem was when creating the PPS metapackage. A quick deployment test should have also detected this.
4 06 3.2 Reported by Goncalo Borges in GGUS ticket 51435. Package vdt_globus_jobmanager_common was missing in the PPS repo. I can't understand why this package wasn't copied to the repo and why our installation tests didn't detect this missing dependency. It's as if yum doesn't complain when a mandatory package of a group install is not installed. The package has been copied into the repo and the repo has been recreated. This has to be further investigated to understand why the failure to copy that package wasn't detected. However, the last PPS releases in 3.2 have been a bit complicated, with many manual steps to consolidate the scripts and the signature of rpms. It is also possible that this package was left out by mistake during a manual copy.
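
The voms-admin incident above (the second item 3) notes that a check should have caught a dependency version that is absent from the repo. The sketch below shows one way such a comparison could look. It is an assumption-laden illustration rather than the real check_scripts: the function name unsatisfied_dependencies, the representation of a dependency list as (name, version-release) pairs and the file names in the usage example are all hypothetical. Only voms-admin 2.0.8-1 comes from the incident itself.

```python
def unsatisfied_dependencies(dependencies, rpm_files):
    """Return the (name, version_release) pairs with no matching rpm file.

    dependencies -- pairs taken from a metapackage dependency list
    rpm_files    -- rpm file names present in the repository
    A dependency is considered satisfied when some file is named
    <name>-<version_release>.<arch>.rpm.
    """
    missing = []
    for name, version_release in dependencies:
        prefix = f"{name}-{version_release}."
        satisfied = any(
            filename.startswith(prefix) and filename.endswith(".rpm")
            for filename in rpm_files
        )
        if not satisfied:
            missing.append((name, version_release))
    return missing


if __name__ == "__main__":
    # Hypothetical dependency list and repository content.
    deps = [
        ("voms-admin", "2.0.8-1"),
        ("glite-info-provider-service", "1.0.0-1"),
    ]
    repo = [
        "voms-admin-2.0.7-1.noarch.rpm",
        "glite-info-provider-service-1.0.0-1.noarch.rpm",
    ]
    for name, version_release in unsatisfied_dependencies(deps, repo):
        print("not in repo:", name, version_release)
```

A check like this only compares lists, so it complements rather than replaces the quick deployment test mentioned in the item.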

-- MariaALANDESPRADILLO - 29 Apr 2009

Topic revision: r26 - 2010-03-30 - unknown