In this section we need to describe all the transformations applied to the accounting data at every level.

How CMS and ATLAS Dashboard convert raw wallclock into HS06

For both ATLAS and CMS, the historical views also include plots of HEPSPEC06 consumption over pledges. These are produced as follows:

- Once per month we collect the installed capacity of every site from REBUS: the number of logical CPUs and the total HEPSPEC06. Dividing the total HEPSPEC06 by the number of logical CPUs gives the average HEPSPEC06 power of one CPU core at that site. We store this value in our database, keeping the complete monthly history.
- In the CPU/WC HEPSPEC06 plot, we multiply the CPU/WC consumption by the HEPSPEC06 normalisation factor calculated above and then convert the result to hours.
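The two dashboard steps described above can be sketched in a few lines of Python. This is only an illustration, not the dashboard code: the function names and the site numbers are hypothetical, and the raw consumption is assumed to arrive in seconds.

```python
def hs06_per_core(total_hs06: float, logical_cpus: int) -> float:
    """Average HS06 power of one core: installed HS06 divided by logical CPUs."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return total_hs06 / logical_cpus


def hs06_hours(consumption_seconds: float, factor: float) -> float:
    """Multiply raw CPU/WC seconds by the per-core factor and convert to hours."""
    return consumption_seconds * factor / 3600.0


# Hypothetical site: 10000 HS06 installed over 1000 logical CPUs.
factor = hs06_per_core(10000.0, 1000)   # 10.0 HS06 per core
print(hs06_hours(7200.0, factor))       # 2 hours on one core -> 20.0 HS06-hours
```

Because the factor is stored per month, a historical plot would look up the factor for the month in which the consumption was recorded.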

Note from Pepe

- Since the farms are composed of many CPU types, one selects a reference and scales all of the CPU times and wall times to it. This is the CPUScalingReferenceSI00 value. These "seconds" or "hours" are not real time but scaled time, which goes to APEL and then to the EGI accounting portal. When one selects "CPU Time", the results obtained at different sites cannot be compared, since each site uses a different reference. Hence the sum of "times" provided by the EGI accounting portal, and the relative percentages, are simply WRONG, since the values cannot be compared at all. This table is useless:

- The EGI accounting portal uses CPUScalingReferenceSI00/250 to scale these values and obtain HS06·hours, which it calls the Normalised CPU (or Elapsed) time. This is correct: the different references are taken into account, so the values at different sites can be compared and the relative contributions derived.

- NOTE: For VOs running pilots, sites report the accounting values of the pilot jobs, which are the 'jobs' that the farms see. The experiment dashboards show the information coming from the payloads. So one cannot simply take the core-hours from the dashboard and compare them to the scaled times reported by EGI: the two measure different things on different scales (jobs running at the sites report the execution time on the node, with no scale factors applied).

- We can argue about whether HS06 is the best benchmark we have, but with MJF (Machine/Job Features) enabled at the sites it would be pretty trivial for each payload to report its consumed HS06·hours to the dashboard. This would be great.
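To illustrate why the scaled times cannot be compared across sites while the normalised times can, here is a minimal Python sketch of the CPUScalingReferenceSI00/250 scaling described in the notes above. The two sites and their numbers are hypothetical, and the function name is made up for this example.

```python
SI00_PER_HS06 = 250.0  # conversion used in these notes: 1 HS06 = 250 SI00


def normalised_hs06_hours(scaled_hours: float, reference_si00: float) -> float:
    """Scaled time multiplied by the site reference expressed in HS06."""
    return scaled_hours * reference_si00 / SI00_PER_HS06


# Two hypothetical sites that delivered the same amount of work but
# publish different references, so their scaled hours differ by a factor 2.
site_a = normalised_hs06_hours(1000.0, 2500.0)  # reference of 10 HS06/core
site_b = normalised_hs06_hours(500.0, 5000.0)   # reference of 20 HS06/core
print(site_a, site_b)
```

Summing the scaled hours (1000 and 500) would suggest site A did twice the work of site B, whereas the normalised values are equal.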

At PIC we have a reference of 3050 SI00. Our average reported power (GlueHostBenchmarkSI00) is GlueHostProcessorOtherDescription.Benchmark (12.1205) * 250 = 3030. As of today, however, we have switched on more machines and the average is 13.1 HS06/core, which corresponds to 13.1 * 250 = 3275 SI00. This is a small effect, and we can check whether the sites are publishing this information correctly by simply querying the BDII with ldapsearch:

ldapsearch -x -h lcg-bdii.cern.ch -p 2170 -LLL -b o=grid | grep "HEP-SPEC06" | grep "Benchmark" | cut -d"=" -f3 | cut -d"-" -f1 | sort -n | uniq -c
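As an alternative to the grep/cut pipeline, the published Benchmark values can be extracted and summarised in Python. This is a sketch under assumptions: the regular expression assumes the `Benchmark=<value>-HEP-SPEC06` layout seen in the BDII output, the function names are hypothetical, and apart from the first line the sample entries are invented for illustration.

```python
import re

# Matches e.g. "Benchmark=5.19-HEP-SPEC06" and captures the numeric value.
BENCHMARK_RE = re.compile(r"Benchmark=([0-9]+(?:\.[0-9]+)?)-HEP-SPEC06")


def benchmark_values(ldif_text: str) -> list[float]:
    """Return every published HS06/core value found in the ldapsearch output."""
    return [float(m.group(1)) for m in BENCHMARK_RE.finditer(ldif_text)]


def fraction_below(values: list[float], threshold: float) -> float:
    """Fraction of the values that are strictly below the threshold."""
    if not values:
        raise ValueError("no benchmark values found")
    return sum(v < threshold for v in values) / len(values)


# Illustrative sample; in practice this would be the saved ldapsearch output.
sample = """\
GlueHostProcessorOtherDescription: Cores=12,Benchmark=5.19-HEP-SPEC06
GlueHostProcessorOtherDescription: Cores=8,Benchmark=12.12-HEP-SPEC06
GlueHostProcessorOtherDescription: Cores=16,Benchmark=236-HEP-SPEC06
GlueHostProcessorOtherDescription: Cores=4,Benchmark=9.8-HEP-SPEC06
"""
values = benchmark_values(sample)
print(fraction_below(values, 20.0))  # 3 of the 4 sample entries are below 20
```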

95% of the published CEs have GlueHostProcessorOtherDescription.Benchmark < 20:

There are CEs with values that are REALLY big and, to me, these should be corrected if we want to draw conclusions from this:

The most dangerous value to get wrong is CPUScalingReferenceSI00, since it directly affects the EGI accounting portal if the times are not scaled correctly. Looking at CPUScalingReferenceSI00/250, this is what I observe in the BDII:

98% of the published values (there are multiple entries per CE, but here we only look at how the numbers are distributed) are < 20. The remaining 2% of the reference values seem unreasonable, with a maximum of 236 HS06/core! If these sites are really referencing such high values, their reported scaled CPU time in APEL should be really small, since they are referencing a very powerful (non-existent!) CPU type. Let's see if this is indeed the case.

The site reporting CPUScalingReferenceSI00=59000 is ULAKBIM. For May 2016, the EGI "Sum CPU time" value was 278,893 "hours" and the EGI "Sum Normalised CPU time" was 65,818,773 HS06·hours. The ratio is 236 (59000/250), consistent with this finding.
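The ratio quoted above can be reproduced with a short calculation. The sketch assumes the "Sum CPU time" figure is read as 278893 hours (the dot being a thousands separator), which is the only reading consistent with the stated ratio; the variable names are hypothetical.

```python
reference_si00 = 59000        # CPUScalingReferenceSI00 published by the site
sum_cpu_hours = 278893        # EGI "Sum CPU time", May 2016 (scaled hours)
sum_normalised = 65818773     # EGI "Sum Normalised CPU time", HS06-hours

expected_factor = reference_si00 / 250            # reference in HS06/core
observed_factor = sum_normalised / sum_cpu_hours  # factor applied by the portal

print(expected_factor, round(observed_factor, 3))
```

The two factors agree to within rounding, which confirms that the portal applies the published reference as is.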

So, is the reported "scaled" CPU time correct, or is the reported reference incorrect? The impact on the accounting if this is wrong is clear: the site might be reporting a factor of 10 more CPU time than it actually used. How can this be verified?

[jflix@ui03 ~]$ ldapsearch -x -h kalkan1.ulakbim.gov.tr -p 2170 -LLL -b o=grid | egrep "CPUScalingReferenceSI00|GlueHostBenchmarkSI00|GlueHostProcessorOtherDescription" | sort | uniq
GlueCECapability: CPUScalingReferenceSI00=59000
GlueHostBenchmarkSI00: 59000
GlueHostProcessorOtherDescription: Cores=12,Benchmark=5.19-HEP-SPEC06

The site reports an average power of 5.19 HS06/core while referencing 236 HS06/core. The GlueHostBenchmarkSI00 value is also incorrect: it should be 5.19 * 250 = 1297.5, i.e. about 1298.
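A check of this kind could be automated for every site. The following Python sketch compares a published GlueHostBenchmarkSI00 with the value implied by the published HS06/core benchmark; the function names and the 5% tolerance are assumptions made for the example, not an agreed criterion.

```python
SI00_PER_HS06 = 250.0  # conversion used in these notes


def expected_benchmark_si00(hs06_per_core: float) -> float:
    """GlueHostBenchmarkSI00 implied by the published HS06/core benchmark."""
    return hs06_per_core * SI00_PER_HS06


def is_consistent(hs06_per_core: float, published_si00: float,
                  rel_tol: float = 0.05) -> bool:
    """True if the published SI00 is within rel_tol of the implied value.

    The 5% default tolerance is an arbitrary choice for this sketch.
    """
    expected = expected_benchmark_si00(hs06_per_core)
    return abs(published_si00 - expected) <= rel_tol * expected


# Values quoted in the text: ULAKBIM publishes 5.19 HS06/core with 59000 SI00,
# PIC publishes 12.1205 HS06/core with 3030 SI00.
print(is_consistent(5.19, 59000))     # False: 59000 is far from 1297.5
print(is_consistent(12.1205, 3030))   # True: 3030 is close to 3030.125
```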

So, at this point, we need to think about how to proceed. To me, since the reference is arbitrary, it would be really useful for all sites to reference the same CPU type, or the same value. This would simplify things a lot and would make it much easier to detect publication errors, which have a direct impact on the delivered HS06·hours reported in the EGI accounting portal.

Also, if MJF were enabled and the payloads reported their HS06·hours to the dashboard, it would be really, really useful.

-- JuliaAndreeva - 2016-06-23


Topic revision: r1 - 2016-06-23 - JuliaAndreeva
 