HammerCloud for CMS

Introduction

HammerCloud is a testing system developed by CERN IT initially for ATLAS and then adapted to CMS and LHCb to run custom tests on Grid sites. It has been adapted to run CRAB jobs with these goals:
  • continuously submit test jobs to be used as functional tests
  • allow sites to run stress tests on their infrastructure
  • make performance studies of CMSSW
This page is not the full documentation; it is intended to give CMS sites and users the minimum information needed to navigate the HammerCloud web site and find the essentials. Users are still strongly encouraged to read the official documentation for a more complete picture of the HammerCloud architecture, functionality and interface.

HammerCloud is accessed via a web interface, which by default shows the existing tests. Functional tests are continuously generated by HammerCloud according to a set of templates and run at all CMS sites up to Tier-2's (Tier-3's may be included on demand, if so desired). A non-authenticated user may access all pages, but only an authenticated user can modify templates or create and submit new tests. Authentication is password-based and, for the moment, entirely separate from both NICE and Hypernews. Site contacts who would like to submit their own tests should get a login by submitting a GGUS CMS ticket to the CMS Hammercloud support unit.

Basic concepts

Here we briefly explain the basic HammerCloud concepts to get you started as quickly as possible, also in relation to how they are used in CMS.
  • Tests. A test is a set of jobs with certain properties (e.g. version of CRAB and CMSSW, input dataset, job splitting parameters, CMSSW configuration) to be submitted to a certain set of sites with certain throttling parameters (e.g. maximum number of running and queued jobs). All these properties are defined in a template, and a test is the instantiation of a template.
  • Templates. Templates are entities that can be created or modified by privileged users and that define all the parameters relevant to a test. In particular, they define whether a test is a functional test or a stress test.
  • Metrics. Metrics are quantities extracted from the job outputs that HammerCloud stores in its database so that they can be plotted. Examples are CPU time, CPU efficiency, wallclock time and job efficiency. Overall, about 40 metrics (mostly I/O-related) are collected from the FJR, which makes it possible to carry out e.g. performance studies. Since the migration to CRAB3, metric plots are not currently available.
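
As a rough illustration of how tests and templates relate, the following Python sketch models a template and a test instantiated from it. All class names, field names and example values are hypothetical and are not taken from the HammerCloud code base.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Template:
    """Hypothetical model of a template: the parameters relevant to a test."""
    name: str
    category: str           # "functional" or "stress"
    crab_version: str
    cmssw_version: str
    input_dataset: str
    max_running_jobs: int   # throttling parameter
    max_queued_jobs: int    # throttling parameter


@dataclass
class HCTest:
    """Hypothetical model of a test: a template instantiated for some sites."""
    template: Template
    sites: List[str] = field(default_factory=list)

    def is_functional(self) -> bool:
        return self.template.category == "functional"


# Placeholder values, for illustration only.
template = Template(
    name="example-functional",
    category="functional",
    crab_version="CRAB3",
    cmssw_version="CMSSW_X_Y_Z",
    input_dataset="/Example/Dataset/TIER",
    max_running_jobs=10,
    max_queued_jobs=5,
)
hc_test = HCTest(template=template, sites=["T2_XX_Example"])
print(hc_test.is_functional())
```

The template is frozen in this sketch so that a test cannot silently change the parameters it was instantiated from; that is a design choice of the example, not a statement about HammerCloud itself.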

Web interface

The main web interface lists the currently running and scheduled tests in a table. Currently, the CMS functional tests are organized as follows:
  • one test including all Tier-1's and the Tier-0 (identified by having KIT, PIC, CC-IN2P3, ... as region)
  • one test for each region, including all Tier-2's in that region (identified by the Tier-1 in the region)
  • one test for each T1/T2 site, where jobs can run anywhere but access via xrootd a dataset located at that site.
For each of them you can see at a glance the number of submitted, running, completed and failed jobs. Note that a test lives for 24 hours, after which it is terminated by HC and a new, identical test is submitted.

Test summary

Clicking on the row of a test brings you to the 'Overall' tab of the test summary page, which shows some test parameters and a few basic plots. Below these there are three tabs: 'Sites' (showing mostly test parameters and numbers of jobs), 'Backend reasons' (not functional) and 'Summary' (not functional). The 'Logs' link (NOT the Logs tab!) lets you see the Ganga log file. The 'jobs' subdirectory contains the Python configuration files for the jobs/CRAB-tasks, which specify the dataset name, the lumi sections to process in a job, the timeout, etc. (Currently 220 jobs cover the dataset and are submitted as one CRAB-task. The CRAB server releases one job of each task every 5 minutes.)
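
The figures quoted on this page (220 jobs per task, one job released every 5 minutes, and a 24-hour test lifetime) imply how long it takes for all jobs of a task to be released. A minimal arithmetic sketch, assuming that the first job is released immediately and that the release rate stays constant (the variable names are illustrative):

```python
JOBS_PER_TASK = 220        # jobs covering the dataset
RELEASE_INTERVAL_MIN = 5   # one job of the task released every 5 minutes
TEST_LIFETIME_H = 24       # a test is terminated after 24 hours

# Assumption: the first job is released at t = 0, so the last one is
# released after (N - 1) intervals.
minutes_to_release_all = (JOBS_PER_TASK - 1) * RELEASE_INTERVAL_MIN
hours_to_release_all = minutes_to_release_all / 60

print(f"All jobs released after {minutes_to_release_all} min "
      f"({hours_to_release_all:.2f} h)")
print("Within test lifetime:", hours_to_release_all < TEST_LIFETIME_H)
```

Under these assumptions the last job is released after 1095 minutes, i.e. a little over 18 hours into the 24-hour life of the test.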

The other tabs at the top of the page are:

  • Sites: shows the metrics distribution over all sites in the test ("Overall") and by site (scroll down and click on the desired tab);
  • Metrics: for each metric, shows its distribution overall and on each individual site;
  • Jobs: a table showing each individual job, which can be filtered according to the column values (e.g. to list all jobs run at a given site);
  • Evolution: shows the number of submitted, running, completed and failed jobs vs. time overall and by site;
  • Alarms: not functional;
  • Logs: not functional.

Tests

This menu has three options: 'All functional', 'All stress' and 'All'. They show tables of tests, but the only useful parameter to filter on is the cloud (aka region).

Reports

Not functional.

Robot

This menu has the following options:
  • Historical: a graphical representation of the daily success rate; the colour is related to the value according to the legend that appears by clicking on '?'. Clicking on a cell shows the exact success rate and the numbers of jobs;
  • Incidents: records relevant events from the test life cycles;
  • Job errors: this page shows, for every site, the total numbers of failed jobs in a given time interval (the default is the latest 7 days). It can be restricted to certain sites or regions. The columns 'Grid failed jobs (aborted)' and 'Application failed jobs' are not functional.

More HC...

This page can be ignored.

Administration

This page lets you view, add and modify all the various configuration entities in HammerCloud. Most of them should be left alone. The only ones with some relevance for a "normal" user are:
  • Dspatterns: to add new datasets (or dataset patterns)
  • Job templates: CRAB configuration parameters (in particular for job splitting)
  • Templates: templates for tests (see above)
  • Test options: configurations to choose e.g. the version of CMSSW and the VOMS role
  • Tests: show all the configuration parameters of existing and past tests
  • User codes: correspond to CMSSW configurations

Usage of the HammerCloud information

The success rate of the HC jobs is extracted by the Dashboard and inserted in the Site Status Board to be used as a quality metric by the Site Readiness algorithm. The HC jobs used for this purpose publish 'hctest' as their activity name, so that they can be distinguished from normal analysis jobs and from other types of HC jobs.

The success rate published to the Site Status Board and fed to the Site Readiness is extracted directly from the Dashboard job monitoring rather than from HammerCloud. As a consequence, the success rates shown by HammerCloud differ slightly from those extracted from the Dashboard, but this is not a problem. The script used to extract the HC success rates is jr_successrate.pl; the Site Readiness and how it is used in CMS operations are documented here.
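
To make the role of the activity name concrete, here is a small Python sketch that selects only the 'hctest' jobs from a list of job records and computes a per-site success rate. It is not the jr_successrate.pl script: the record layout, the field names and the site names are invented for the example, and the rate is a simplified successes-over-total fraction.

```python
from collections import defaultdict

# Hypothetical job records; the field names are illustrative only.
jobs = [
    {"site": "T2_XX_Example", "activity": "hctest", "succeeded": True},
    {"site": "T2_XX_Example", "activity": "hctest", "succeeded": False},
    {"site": "T2_XX_Example", "activity": "analysis", "succeeded": False},
    {"site": "T2_YY_Example", "activity": "hctest", "succeeded": True},
]


def hc_success_rates(records, activity="hctest"):
    """Return {site: fraction of successful jobs} for one activity."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        if rec["activity"] != activity:
            continue  # skip normal analysis jobs and other HC job types
        total[rec["site"]] += 1
        ok[rec["site"]] += int(rec["succeeded"])
    return {site: ok[site] / total[site] for site in total}


rates = hc_success_rates(jobs)
print(rates)
```

In this made-up sample the failed 'analysis' job at T2_XX_Example does not affect the result, because only records carrying the 'hctest' activity are counted.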

Operational procedures (for operators)

They are found here.

User support

For CMS users, the easiest way to get support is to submit a CMS ticket via GGUS, selecting 'CMS Hammercloud' as the support unit.

Presentations

Frequently Asked Questions

  • Q.: why is the success rate I see in the Dashboard different from the success rate calculated by HammerCloud?
  • A.: the Dashboard calculates the success rate based on the jobs that it could assign to a site, which can happen only if the jobs are submitted and assigned to a CE. If, for any reason, HammerCloud cannot submit a job or the WMS cannot match a CE to the job, HammerCloud considers the job "failed" because its attempt at running it failed. So, in general, the success rates in HammerCloud are expected to be lower than those calculated by the Dashboard. For this reason, using the success rate from the Dashboard effectively eliminates the impact of some WMS problems when measuring site quality, and is an improvement with respect to the Job Robot.
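
The effect of the two denominators can be seen in a short Python sketch. The formulas are deliberately simplified (they are not the actual Dashboard or HammerCloud calculations), and the function names and job counts are made up for the example.

```python
def dashboard_rate(succeeded, failed_at_site):
    """Simplified success rate over jobs that were assigned to a site."""
    return succeeded / (succeeded + failed_at_site)


def hammercloud_rate(succeeded, failed_at_site, never_assigned):
    """Simplified success rate that also counts never-assigned jobs as failed."""
    return succeeded / (succeeded + failed_at_site + never_assigned)


# Made-up counts for one site over one day.
succeeded, failed_at_site, never_assigned = 90, 5, 5

print(f"Dashboard-like rate:   {dashboard_rate(succeeded, failed_at_site):.3f}")
print(f"HammerCloud-like rate: "
      f"{hammercloud_rate(succeeded, failed_at_site, never_assigned):.3f}")
```

With these counts the HammerCloud-like rate is 0.900 and the Dashboard-like rate is about 0.947: the five jobs that never reached a CE lower only the former.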

  • Q.: how is the success rate calculated by the Dashboard?
  • A.: a detailed description of the success rate calculation is available.

  • Q.: what is the meaning of the Grid error code UNDEFINED?
  • A.: The Job Errors table shows the failed jobs submitted by HC. A Grid error labelled UNDEFINED means that HC could not deduce the error from the job. Usually, this happens when the job cannot be submitted because the WMS did not respond to the crab -submit command within 1 hour. The rule is that when a job has been in state 'Created' for more than 1 hour, it is considered a Grid failure. The tagging of this kind of error should be improved.
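
The one-hour rule described in this answer can be written down as a small Python predicate. This is only a sketch of the rule as stated here, not HC code; the function name and arguments are hypothetical.

```python
from datetime import datetime, timedelta

CREATED_TIMEOUT = timedelta(hours=1)  # threshold quoted in the answer above


def is_grid_failure(state, created_at, now):
    """Apply the rule: 'Created' for more than 1 hour counts as a Grid failure."""
    return state == "Created" and (now - created_at) > CREATED_TIMEOUT


now = datetime(2012, 3, 27, 12, 0)
print(is_grid_failure("Created", now - timedelta(minutes=90), now))  # True
print(is_grid_failure("Created", now - timedelta(minutes=30), now))  # False
print(is_grid_failure("Running", now - timedelta(minutes=90), now))  # False
```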

  • Q.: I have a new CE at my site. How can I have it tested by HC?
  • A.: For a CE to be tested by HC, it has to be in production, visible in the BDII and registered in the SiteDB. Once this is done, HC jobs may end up on that CE, but given that the gLite WMS assigns jobs to CEs based on the CE rank, it might take some time before the new CE has a favourable rank. Currently, there is no practical way to force HC to run on a specific CE (nor should there be, given that HC should simulate analysis jobs, and analysis users select sites, not CEs).

  • Q.: I don't see any HC job in my site. What's wrong?
  • A.: There can be various causes, and only some examples are given here. The site might be in downtime; the CE information (including CMS software tags) might not be correctly published; the site might be overloaded with other types of jobs, so that the HC jobs never get a chance to run before being killed by HC after 24 hours; the HC submission engine might be broken, or HC itself might be down. The first thing to do is to find out whether there are HC jobs scheduled at the site, by using the Dashboard Interactive View starting from this link and looking for jobs at the relevant site. If scheduled jobs are found, it's just a matter of waiting; otherwise, it's worth checking whether there are 'analysis' jobs (by selecting the appropriate activity name). If no jobs are found, then most likely there is a site problem preventing all CRAB jobs from running. Otherwise, it's advised to send a Savannah ticket to the hammercloud squad for further investigation.
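
The checks in this answer form a short decision sequence, sketched below in Python. The function and its two boolean inputs are hypothetical: in practice both answers come from looking at the Dashboard Interactive View by hand.

```python
def diagnose_missing_hc_jobs(hc_jobs_scheduled, analysis_jobs_found):
    """Suggest a next step when no HC jobs appear to run at a site."""
    if hc_jobs_scheduled:
        # HC jobs are queued at the site: they have simply not started yet.
        return "wait"
    if not analysis_jobs_found:
        # Neither HC nor analysis jobs: probably no CRAB job can run there.
        return "suspect a site problem"
    # Analysis jobs reach the site but HC jobs do not: ask HC support.
    return "open a ticket for the hammercloud squad"


print(diagnose_missing_hc_jobs(True, False))
print(diagnose_missing_hc_jobs(False, False))
print(diagnose_missing_hc_jobs(False, True))
```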

  • Q.: the HC jobs at my site abort with the error "BrokerHelper: no compatible resources". What does it mean?
  • A.: literally, this means that the gLite WMS could not find a suitable CE for the jobs. Given that CRAB already checks at submission time whether there are eligible CEs for the jobs, this is a rather abnormal situation. A common explanation is that the jobs were submitted via a WMS different from the one used to make the CE existence check, and the second WMS is not in good shape. It is much less likely to be a site problem.

  • Q.: I see that my site is failing HC jobs. How can I see the full logs (WMS logging info, job outputs, etc.) for the failed jobs?
  • A.: HC stores a copy of the WMS logging info for aborted jobs and the full output for all jobs that terminate (either correctly or with a CMSSW error). It is possible to see them; the procedure is currently somewhat convoluted, but effective:
    • Go to the HC page;
    • Select the running test containing your site (one can tell from the region, which is the parent Tier-1);
    • Scroll down and click on Sites;
    • Click on the desired site (the link is on the right of the table);
    • Filter only the failed jobs by writing status:f in the search box;
    • Click on the J number of a failed job to see the complete output of the full task, or on the Sj number to see the output of that job only;
    • If you clicked on the J number, you can see the logging info of aborted jobs (under jobs/) and the output and FJR files of the jobs that ran (under res/).

  • Q.: I have DPM and all HC jobs fail with error 8020 and this error message: send2nsd: NS002 - send error : client_establish_context: Could not find or use a credential. Why?
  • A.: because a special setup is needed to run CMSSW with DPM, as explained on this page. In particular, check that step 3 has been followed properly.

-- AndreaSciaba - 27-Mar-2012

Topic revision: r20 - 2017-12-14 - StephanLammel
 