Introduction

This document describes the design of changes to the APEL client software to accommodate HTCondor-CE.

This solution stems from work done at the T1 at PIC. PIC had already deployed HTCondor-CE in production. For accounting, they forked the standard APEL client software and made a functional solution. Unfortunately, the changes were not immediately portable. The portability problems arose because the HTCondor batch log parser was changed to use a new format which clashed with existing uses, and an alternative CE log parser was written which used a novel data format instead of the BLAH standard that the APEL client software is designed for. PIC print all the log data into one file that is read twice; once to parse out the CE data, and again to parse out the batch data. This is not the standard APEL client sequence, which is to use two files (a CE log in BLAH format and a batch system log in a proprietary format), each of which is parsed once. Hence the file handling logic also had to be changed as a special case for HTCondor-CE; and we deprecate special cases.

Goals

The goals were to incorporate the essential elements of the PIC solution while maintaining compatibility with all existing uses, for example CREAM. We wanted no special cases and we wanted to preserve the existing architecture of the APEL client. In the event, this was easier than we anticipated. For those who are interested, we wrote these design notes to describe the choices we made.

There was no need for the CE log parser to use a novel format, since the print format of HTCondor-CE is highly flexible and can be customised to write out BLAH log data directly. So we used a custom data extractor script to output BLAH, then fed that to the existing BLAH log parser. And there was no need to use a new format for the HTCondor batch log parser, since the existing standard parser, used at numerous CREAM/HTCondor sites, is already adequate, with only slight changes.
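
To illustrate the idea, here is a minimal sketch of such an extractor, in Python. It is not the actual script; the ClassAd attribute names, the ceID value and the exact BLAH field set are assumptions that would need checking against a real HTCondor-CE.

# Minimal sketch of a BLAH-format extractor for HTCondor-CE (illustrative only).
import json
import subprocess

# Pull completed jobs from the CE schedd. condor_history has supported -json
# output for some years; adjust for older HTCondor releases.
raw = subprocess.check_output(
    ["condor_history", "-json", "-constraint", "JobStatus == 4"])

for ad in json.loads(raw):
    fields = [
        ("timestamp", ad.get("CompletionDate")),       # illustrative mapping
        ("userDN",    ad.get("x509userproxysubject")),
        ("ceID",      "ce.example.com:9619/htcondorce-condor"),  # made-up value
        ("jobID",     ad.get("GlobalJobId")),
        ("lrmsID",    ad.get("RoutedToJobId")),        # id of the routed batch job
        ("localUser", ad.get("Owner")),
    ]
    # One quoted key=value pair per field, space separated, one line per job,
    # which is the shape the standard BLAH parser expects.
    print(" ".join('"%s=%s"' % (k, v) for k, v in fields))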

The one change needed was to incorporate an element of the PIC solution into the existing HTCondor batch parser. We added an optional final field, called cputmult (for cputime multiplier), which is used to apply a scaling factor to the runtime durations. The scaling factor comes from the node that ran the job, and it defaults to 1 if it is not supplied, hence the change is portable for existing uses. Version 1.8.0-1 and later of the APEL parser software contains this change.
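
To make that concrete, the parsing logic amounts to something like the following sketch (not the actual apel code; the field positions and record width are illustrative).

# Sketch of the optional trailing cputmult field (illustrative, not the real
# apel htcondor.py code; field positions are made up).
def parse_batch_line(line, expected_fields=9):
    parts = line.strip().split("|")
    wall = int(parts[2])    # wall clock seconds, illustrative position
    cpu = int(parts[3])     # cpu seconds, illustrative position
    # cputmult is an OPTIONAL final field; defaulting to 1.0 means logs from
    # existing sites, which lack the field, parse exactly as before.
    cputmult = float(parts[expected_fields]) if len(parts) > expected_fields else 1.0
    return {"WallDuration": int(round(wall * cputmult)),
            "CpuDuration": int(round(cpu * cputmult))}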

Scope

The HTCondor batch system has been integrated with the APEL accounting system in various ways. Support for HTCondor varies according to the CE used and other factors. At the server/portal, HTCondor APEL accounting is much the same as for any other grid system. But there are variations in the APEL client software, which I will describe here. (NB: I will not cover the ARC + HTCondor scenario, since NorduGrid ARC ships with its own accounting log file parser/sender, called Jura, distinct from the APEL client. That scenario is described in detail here: https://www.gridpp.ac.uk/wiki/Example_Build_of_an_ARC/Condor_Cluster). So this text will focus on two scenarios:

  • HTCondor used in conjunction with the separate CREAM CE product or similar;

  • HTCondor used with the HTCondorCE product.

Note: HTCondorCE is a HTCondor-based CE that natively supports the HTCondor batch system, but which can also be used with some non-HTCondor batch configurations.

The APEL Client Architecture

Before I go into the various configurations, I'll discuss the APEL Client Architecture. For those who are unfamiliar with it, it works like this.

The APEL client is designed to take input from a matched pair of applications - a CE and a Batch System. A CE provides a "grid frontend", i.e. a generic interface that hides the detailed structure of how compute work is done. It deals (typically) with job reception, insertion of jobs into some batch system of compute resources, communication with job sender frameworks, dealing with input and output files, and handling authentication and authorization. These are standard functions that all CEs must provide. The Batch System, on the other hand, is used by the CE to actually do the computing work. It is (typically) not grid aware. It will simply take jobs and run them for the CE. Users of a CE do not need to be aware of Batch System details (in theory). So these systems are layered in two tiers. The CE is a standard gateway, which (for purposes of communication) uses (as one of its parts) a generic interface known as BLAH. Thus all CEs use the same interface protocols to the outside world. On the other hand, each batch system has its own novel command and data interface that is hidden to a great extent from the outside world (since the CE is placed in front of it).

apel.png

* Picture of the two tiers here, and the way data is pulled by apel client *

The implications of these arrangements for APEL are as follows. The APEL client needs to read outcome data (accounting) from both the CE (top tier) and its batch system (tier below). Hence, for a single job, information is taken from the BLAH logs (produced by the CE) and the Batch System logs (produced by the batch system). Since every CE speaks BLAH, only one parser is needed to deal with ALL CEs. And since every Batch System is novel, a separate parser is needed for each batch system.
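
For orientation, a BLAH accounting record is a single log line of quoted key=value pairs, along these lines (the values are invented and the field set is indicative only):

"timestamp=2018-12-18 16:00:00" "userDN=/DC=ch/DC=cern/OU=Users/CN=someuser" "userFQAN=/atlas/Role=NULL" "ceID=ce.example.com:8443/cream-condor-atlas" "jobID=CREAM123456789" "lrmsID=1234.0" "localUser=atlas001"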

So, in summary, each run of the APEL client always makes use of the "one and only" BLAH parser, and it has a choice from a multitude of batch parsers. This is good systems engineering, since it creates re-use. These parsers write into the BlahdRecords and EventRecords tables, respectively. The database then joins the two record sets to create JobRecords, which are suitable for sending to the central portal.

CREAM/HTCondor

The support for CREAM with HTCondor (or similar) has been available for a long time. I won't describe it in detail (that information is available elsewhere), but I'll give an example section of a parser.cfg that shows how the two parsers (blah and batch) are configured for a typical CREAM/HTCondor APEL parser setup. I hope that suffices for now. (This example was modified from an original PBS setup donated by John Hill, Cambridge, but I think it is correct for HTCondor now.)

[blah]
enabled = true
dir = /var/log/cream/accounting
filename_prefix = blahp.log
subdirs = false

[batch]
enabled = true
reparse = false
# Batch system specific options. Valid types are LSF, PBS, SGE, SLURM etc.
type = htcondor
dir = /somedir/
filename_prefix = accounting
subdirs = false

The first section shown, [blah], specifies how the standard CE data (from CREAM, a standard CE) will be brought in with the standard BLAH parser. The second section, [batch], specifies how the HTCondor batch system data will be brought in with the special, existing htcondor batch parser. Since the batch data is novel in form, a "type" parameter is required in the [batch] section to tell the APEL client what type of parser to employ.

HTCondorCE

HTCondorCE is a CE developed by the HTCondor team that works natively with a HTCondor Batch System, and which can also work with a variety of other Batch Systems. The various setups are shown here.

http://opensciencegrid.org/docs/compute-element/htcondor-ce-overview/

The setups still have batch system functionality headed by a CE gateway. The pair-wise architecture (CE + Batch System) is not much changed, see below. There is still an identifiable CE layer and an identifiable Batch System layer, but (for now) I've ignored the output of log data (blah, batch, whatever). I'll explain that in a minute.

HTCodnor_CE_logsundef.png

* Picture of HTCondorCE ... still has two tiers but logs "undefined" *

Options for creating and ingesting HTCondorCE log data

As you can see in the picture, I have left the log file format for HTCondorCE undefined for the time being. This is because, within limits, HTCondorCE can write any type of log file we like, including BLAH. That is a config option, essentially.

Two ways have emerged recently for creating and ingesting HTCondorCE accounting log data, one used at PIC and potentially other sites, and another way proposed for Liverpool. I'll deal with each in turn and discuss the trade-offs. But before I do that, I'm afraid we have to understand Scaling Factors.

HTCondor(CE) Scaling Factor Considerations

The matter is slightly complicated by the issue of scaling factors.

Scaling factors are used to handle the differences in power between the varied worker nodes in a cluster. They are discussed in some detail in Appendix 2. I'll discuss two ways in which scaling factors are applied to job times in HTCondor settings. The first type is where the runtimes of the job are automatically adjusted to some common value. This is done at (e.g.) RAL and Liverpool, who use an ARC CE. The method uses features of ARC CE (hence it is not appropriate for HTCondorCE) and it is described here:

https://www.gridpp.ac.uk/wiki/Example_Build_of_an_ARC/Condor_Cluster#The_Setup_with_ARC.2FCONDOR

The second type is where the runtimes of the job are not adjusted. Instead, the job report also contains a scaling factor that can be applied by those who wish to use the values. At Liverpool, RAL and some others, the runtimes are adjusted before they are reported, so it is not necessary to include a scaling factor in the output report. PIC HTCondorCE reports DO contain a scaling factor, hence the times are real (unscaled) times, but need to have the factor applied to adjust them to the general site value.

The upshot of this is that, for some sites, scaling factor data is already applied (it is implicit in the accounting data) and for other sites (e.g. PIC) the scaling factor has to be explicitly applied. Hence, at PIC, some code is used to apply the factor at the time the data is ingested (PIC has elected to make changes to the input end of the APEL client to do the scaling conversion). The scaling factor is provided in the batch system data. (Aside: this is not a standard feature of the existing HTCondor batch parser, so PIC had to modify that code. In doing so, it lost compatibility with existing users of the HTCondor batch parser, so we'll have to do something about that.)

Method at PIC

The developers at PIC took a copy of the standard APEL client code and got it to work at their T1 site with a HTCondorCE setup similar to that pictured above.

To do so, they extracted one log file from HTCondorCE (in a novel format, unused elsewhere) using one run of the condor_history command. This single log contains both BLAH data and batch data (albeit in a format that can be handled with neither the standard BLAH parser nor the existing htcondor batch parser). Hence PIC needed to write two new parsers: one to replace the standard BLAH parser, and another to replace the standard htcondor batch parser. And, since the current APEL client architecture requires exactly one design of BLAH parser (see above), the architecture had to be modified to allow a choice of two designs of BLAH parser (the standard one and a special HTCondorCE one). Also, the PIC method arranges for the same log file to be parsed twice (prohibited in the normal version); once to extract the BLAH info and again to extract the batch info, putting each into the respective tables in the APEL client database. Further, smaller changes were needed to option parsing.

NB: One problem with the coding is that the new htcondor batch system parser became incompatible with existing CREAM (say) and HTCondor sites. This would have to be addressed if the patch were applied, either by renaming the parser and then changing all CREAM/HTCondor sites to the new config, or by changing PIC's config when they install the new code.

PIC_BACKEND.jpg

* Picture of PIC version of HTCondorCE backend *

Proposed Method at Liverpool

For Liverpool, the code changes are smaller. I intend to make one change to the existing htcondor batch parser. This will allow me to pass the scaling factor in, similar to how things are done at PIC. I have to do this because I will be using HTCondorCE, not ARC, and it was an ARC feature that added the scaling factor. That will no longer be possible, so I'll use the same idea as PIC. There will be no further changes to the APEL client code. Instead, I will write two log files using two runs of the condor_history command (see the data flow below and contrast with the PIC flow above). The first file will be in BLAH format, since HTCondorCE can write files in any format (within reason). The second file will be in the standard (existing) htcondor batch log file format. The only change to the parsers will be the PIC change to make the scaling factor adjustment. If no scaling factor is provided, no scaling is done. Furthermore, this method could work at PIC, if they wish to adopt it, and it is compatible with all existing CREAM/HTCondor sites. It could even replace Jura in an ARC/HTCondor setup.

LIV_BACKEND.png
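
For completeness, the parser.cfg for this scheme would look much like the CREAM/HTCondor example above; only the directories and prefixes change (the paths and prefixes below are illustrative, not a tested configuration):

[blah]
enabled = true
dir = /var/log/condorce_apel/
filename_prefix = blah
subdirs = false

[batch]
enabled = true
reparse = false
type = htcondor
dir = /var/log/condorce_apel/
filename_prefix = batch
subdirs = false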

Support for HTCondorCE using non-HTCondor batch systems.

Since the PIC method extracts all its data in one run of condor_history, it cannot be deployed in any HTCondorCE/non-HTCondor batch system configuration. Liverpool, on the other hand, retains the separation of BLAH and Batch logs, so the following deployment would be possible (assuming a SLURM batch parser is produced, TBD).

slurm.jpg

Infrastructure Description

To recap, data for accounting Job Records comes from two sources: the CE system (called Blah records for historical reasons) and the Batch system (called event records, for similar historical reasons). These are joined to make Job Records which go into APEL.

One field in the Job Records is called "InfrastructureDescription". It appears to be a three-part combined field made up from parts of the overall dataflow. Example values follow:

  • APEL-CREAM-HTCONDOR: data flows through APEL client. Job used a CREAM CE and was processed on a htcondor batch system.

  • JURA-ARC-CONDOR: data flows through JURA client. Job used an ARC CE and was processed on a condor batch system.

The trouble is that the field value is generated only in the batch system parser (hence it resides in the Event Records). And the batch system parser does not know (or care) what the CE is. It could be anything. I regard this as a bug. The current APEL HTCONDOR parsing software assumes it is APEL-CREAM-HTCONDOR ... which is liable to be wrong at least as often as it is right, once new CE systems (e.g. HTCondorCE) come into use. We'll have to deal with this systematically.

One way would be this: the client software (the "JURA" client, the "APEL" client, the "Whatever" client) would supply the prefix, the suffix would be supplied by the batch system parser, and the middle would be provided, of course, by the CE. Another way would be to read the whole value from the general [site-info] section of the config file (I like this because it is very easy; see the sketch below). A third option would be to ignore this field and do nothing.
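
The [site-info] variant, for instance, would only need a few lines in the parser. A sketch follows; the infrastructure_description option name is invented for illustration, and no such option exists yet.

# Sketch of reading InfrastructureDescription from the config file
# (the infrastructure_description option name is invented for illustration).
try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import ConfigParser  # Python 2, the APEL client's era

cfg = ConfigParser()
cfg.read("/etc/apel/parser.cfg")
if cfg.has_option("site-info", "infrastructure_description"):
    infra = cfg.get("site-info", "infrastructure_description")
else:
    infra = "APEL-CREAM-HTCONDOR"  # the current hard-coded assumption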

Anyway, I can implement this any way we want, so I'm asking around for ideas. It's not that important anyway.

Appendix 1 - Background information

HTCondorCE APEL Accounting

The need for an APEL accounting solution for the HTCondorCE was noted by several delegates at a HTCondor workshop at RAL.

https://indico.cern.ch/event/733513/

However, PIC are already using the APEL software, with certain patches, to accomplish this at their T1.

The unpatched APEL code is maintained by Adrian Coveney here:

The PIC level of support has been added by Jordi Casals, to parse HTCondorCE files. His repo (a clone forked off the main repo above) is maintained here:

There is also another version of a similar change, listed as a pull request:

Jordi Casals' branch (above) is known to work at PIC. But I (sjones) am doing tests to see if the patch is portable and suitable to be merged in without interfering with any other uses. That would give the option of using HTCondorCE as an alternative to (say) ARC. Some testing has been done, but it has not been conclusive yet. The results of the most recent tests are below.

Positive things:

I've obtained a patch to the APEL packages for this task from PIC (thanks, Jordi Casals) and I've made a test environment that mimics a HTCondorCE setup. I've shown that the patched APEL software parses log files produced by condor_history and populates a local APEL client database. I have deferred the accuracy tests of the process (against an independent method) until I've had time to do more investigation. Note that I can rely on existing SSM software to transfer the data to the central APEL portal in the normal way (i.e. no need to test this).

Compatibility concerns:

The patch totally reimplements the existing htcondor.py parser using a different field delimiter (was pipe, now semi-colon (not sure about this ...)) and other differences. While the patch works at PIC (we assume), it will not (or may not?) work at other sites still relying on the old parser input file format. Perhaps there are some? I may need to rework the patch to ensure it works independently, i.e. with no risk of interference. I'm not sure how to do this yet ... I'm still trying things out. Note that if we can guarantee that no site is using the current HTCondor parser, there is no need to consider these issues. But that would be hard to know. Perhaps the easiest thing would be to have the htcondor.py parser itself detect what sort of delimiter is in use, and hence assume the appropriate format for that file type. This would be backward compatible with existing setups ...
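
Such detection could be as simple as the following sketch (illustrative only; the sample record is made up):

# Sketch of per-line delimiter detection for htcondor.py (illustrative only).
def detect_delimiter(line):
    # The old format is pipe-delimited; the PIC patch appears to use
    # semi-colons. Choose whichever occurs more often in the record.
    return "|" if line.count("|") >= line.count(";") else ";"

line = "1234.0|someuser|3600|3500|1545146400|1545150000"  # made-up old-format record
parts = line.split(detect_delimiter(line))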

Appendix 2 - APEL Accounting Scaling (and Normalisation)

For this article, I won't mention multicore - just multiply or divide by 8 as appropriate for the type of job.

For accounting, we need to measure the amount of work done by a job. We find out the work done by a job by multiplying the clock time by the power of the slot. Sysadmins measure the power of a slot in a worker node using the HEPSPEC06 benchmark program (HEPSPEC06 is both a standard benchmark method and a unit of compute power.)

Most workernodes give out around 10 HEPSPEC06 per slot. Some give 8, some give 16. But they don't stray very far from 10 HEPSPEC06. (BTW in this discussion, 10 HEPSPEC06 is exactly equal to 2500 SpecInt2k, by definition.) 10 HEPSPEC06 is hence a reasonable, ballpark guess for any slot's value.

For simplicity, we might like all our slots to be the same, each delivering 10 HEPSPEC06, but they are not; we use many different types of workernode. But we can make them seem "the same" by assuming they all give out some standard power and (behind the scenes) adjusting the clocktime such that the work done is correct. Example: if we had an 8 HEPSPEC06 slot, the scaling factor would be 0.8. The work done is found by multiplying the clocktime by 0.8 and multiplying that by 10 HEPSPEC06. If we had a 16 HEPSPEC06 slot, the scaling factor would be 1.6, etc.

So we apply a scaling factor to the clocktimes, and send the clocktimes into APEL; and to allow the second multiplication to be done, we also tell APEL the normalisation factor. The normalisation factor is (in this example) 10 HEPSPEC06 (although it could be anything BTW). Important note: we apply the scaling factor at the site; the normalisation factor is applied by the portal. So two multiplications are done to find work done; but at different places.
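
Putting numbers on it, the two multiplications look like this (a worked example with the figures from above):

# Worked example: one hour on an 8 HEPSPEC06 slot at a 10 HEPSPEC06 site.
raw_clocktime = 3600            # seconds of wall clock on the slot
scaling_factor = 8 / 10.0       # slot power / site benchmark = 0.8

scaled_clocktime = raw_clocktime * scaling_factor  # 2880 s, applied at the site
norm_factor = 10.0              # HEPSPEC06; sent to the portal with the record

work = scaled_clocktime * norm_factor  # 28800 HEPSPEC06-seconds, computed at the portal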

In summary, when the values arrive at the portal, the portal determines the work done for the job - it has both the scaled clocktime, and the normalization factor, in the record. The clock times have been scaled to make them consistent with our 10 HEPSPEC06 site normalisation factor.

Another thing to discuss is where the scaling factor is applied at the site. Some batch systems, including TORQUE/PBS, have a facility to apply the scaling implicitly in the batch system itself, so the process is wholly transparent to the APEL data processing. But some batch systems, such as HTCondor, do not have this feature, and we have to implement it externally to the batch system. How this is done may vary from system to system. This is slightly outside the scope of this article, but I'll say this. On our HTCondor-CE/HTCondor set-up, each worker node, when it runs a job, is configured to put its own scaling factor into the job output. The batch system data extraction script (written as part of this work) pulls this factor out and puts it in the batch system logs, and the HTCondor parser (modified as part of this work) reads these logs and applies the node's scaling factor dynamically as the records are written into the APEL database (see above). Some background material on the general topic is presented here:

https://www.gridpp.ac.uk/wiki/Publishing_tutorial

Appendix 3 - Scaling FAQs

do we have in the portal the raw value, not scaled?

In the example in Appendix 2, we considered a typical batch system grid site that sends scaled (not raw) wall clock times of jobs to the portal, and a "site benchmark normalisation factor" of 10 HEPSPEC06. Hence, the portal does not receive raw wall clock time from the site. The portal then computes work = scaled wall clock time * normalisation factor.

There are other situations. In VAC, for example, each worker node has its own dedicated, fully independent APEL submission client - using completely different software. VAC sends raw wall clock time to the portal, and uses the node's real power as the "normalisation factor". Hence, with VAC, the portal does receive raw wallclocktime from the site. And even though worker nodes differ in slot power, the calculation is still correct: work = wallclocktime * normalisation factor (since each VAC record is consistent for that particular worker node - there is no general "site benchmark normalisation factor" with VAC, hence no scaling).

the scaling, is it done by all the batch systems or not?

Some; not all. TORQUE/PBS has the feature to scale in the batch system. HTCondor does not. Hence it is applied in a different way, but always before going into the APEL client database.

do you have different fields for raw, scaled, and normalized?

In a typical TORQUE/PBS setup, the scaling factor is applied in the batch system. The raw wall clock times are never seen outside the batch system. When data emerges, it is already scaled, internally, transparently (however, it may be possible to determine raw wall clock time, since the starttime and endtime of the job are also provided. Obviously, due to scaling, they no longer match the scaled wallclocktime...) So we never see the raw wall clock times in the batch system logs (i.e. the input for APEL client) with TORQUE/PBS.

For HTCondor, the raw wall clock times do emerge from the system, in the batch system logs. For consistency with (e.g.) TORQUE/PBS, it is therefore necessary, on parsing the batch system logs, to apply the scaling factor (which is also written in the batch system logs).

So, in summary, when using scaling to make workernodes homogeneous (i.e. when not using VAC) the scaling factor is always applied before, or at the same time that, the wallclocktimes are read into the APEL client system. APEL client has no field for raw wallclocktime. And there is no field for normalized data in the APEL client system. Instead, the normalisation factor for the CE in question is stored. That is provided to the portal. Hence the portal calculates work = scaled wallclocktime * norm_factor.

-- SteveJones - 2018-12-18
