Experiment parameters
physical PC, pcitsdc04 with centos7-cern
Identical, unburdened AI nodes: ikadochn-es-c, ikadochn-es-c3, nova-large (4 CPU, 8GB) with slc6
Unless noted otherwise, aggregation queries shown are done from ikadochn-es-c3
Queries were constructed in Python to test different aggregations. The model aggregation (SRC) is the same as the one at
https://twiki.cern.ch/twiki/bin/view/ArdaGrid/ElasticSearchEvaluation#Aggregation_query
For each aggregation and time range, the query was performed 12 times.
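The repetition described above can be sketched roughly as follows. This is not the original test script: `time_query` and `run_query` are hypothetical names, and `run_query` stands for any callable that sends one aggregation request and returns the parsed response containing the "took" field.

```python
import time


def time_query(run_query, repeats=12):
    """Run the query `repeats` times and record timings for each run.

    `run_query` is a hypothetical callable that performs one request and
    returns the parsed response (a dict with a "took" field, in milliseconds).
    Returns a list of (total_seconds, took_ms) tuples, one per run.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        response = run_query()
        total_seconds = time.perf_counter() - start
        samples.append((total_seconds, response["took"]))
    return samples
```

Keeping both the wall-clock time and the "took" value for every run makes it possible to compare them later.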
Results
- Performance on the 3 tested clients is very similar
- Performance of similar SRC and DST aggregations is almost identical
- Indexing load occasionally delays aggregation queries by 1-2 seconds, but the overall effect is small
- SRC input and output size is linear with date range
- SRC aggregation time is linear with respect to input size and output size
- Matrix output size is not linear with date range
- Matrix aggregation time is NOT linear with respect to either input size or output size
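One rough way to check statements like "linear" or "not linear" is to fit a straight line and look at the coefficient of determination. The sketch below is an assumption about how such a check could be done, not the method used for the plots on this page; `linear_fit` is a hypothetical helper.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept.

    Returns (slope, intercept, r_squared). An r_squared close to 1 suggests
    a linear relationship; a clearly lower value suggests it is not linear.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared
```

Here `xs` could be the date range, record count or bucket count, and `ys` the measured aggregation time. The coefficient alone is only an indicator, so the plots remain the main evidence.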
More detailed aggregation timings
Model SRC aggregation times plotted against time range, bucket count and record count.
The same for the smaller range to see the realistic use-case better:
Not exactly linear here.
Total time vs ES reported time:
Unexpectedly linear! Does transfer time fluctuation correlate with aggregation time fluctuation? Or is most of the difference not transfer time, but JSON serialization time?
What does ES report:
The time reported by Elasticsearch in the "took" field is the time that it took Elasticsearch to process the query on its side. It doesn't include:
- serializing the request into JSON on the client
- sending the request over the network
- deserializing the request from JSON on the server
- serializing the response into JSON on the server
- sending the response over the network
- deserializing the response from JSON on the client
In our case, the aggregation request is small, and client-side deserialization is not included in the total time (unlike in the linked discussion). This leaves server-side serialization and transfer time to account for the difference.
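The difference between total time and ES reported time can be computed per query as sketched below. `non_es_overhead` is a hypothetical helper; the only fact relied on is that "took" is reported in milliseconds, while the wall-clock time here is assumed to be measured in seconds.

```python
def non_es_overhead(total_seconds, took_ms):
    """Time not accounted for by the "took" field, in seconds.

    total_seconds: wall-clock time measured on the client for one request.
    took_ms: the "took" value from the response, which is in milliseconds.
    """
    return total_seconds - took_ms / 1000.0


def overheads(samples):
    """Apply non_es_overhead to a list of (total_seconds, took_ms) pairs."""
    return [non_es_overhead(total, took) for total, took in samples]
```

Under the reasoning above, this remainder would mostly consist of server-side serialization and transfer time.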
Different client machines
This test checks that changing the client the tests run from does not drastically change the outcome. VMs may have different network performance than the PC the initial tests ran from, which could affect aggregation time.
Also, virtual machines might have less variance because they are idle and have no GUI running, or they might have more variance because they run the puppet agent and the fluctuating load on other VMs might affect them.
Changing the client machine has almost no effect. Either the network conditions and performance are close for all clients, or the client has a comparatively small effect on aggregation performance.
Plots with standard deviation to see if variance changes noticeably.
No apparent difference here.
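The per-client mean and standard deviation behind such a comparison could be computed along these lines. This is a minimal sketch: `summarize` is a hypothetical helper, and the input is assumed to be a dict mapping a client name to its list of measured times in seconds.

```python
import statistics


def summarize(timings_by_client):
    """Return {client: (mean, sample standard deviation)} for each client.

    timings_by_client: dict mapping a client name to a list of at least
    two measured aggregation times.
    """
    return {
        client: (statistics.mean(timings), statistics.stdev(timings))
        for client, timings in timings_by_client.items()
    }
```

With 12 runs per aggregation and time range, the sample standard deviation (`statistics.stdev`) seems a more appropriate choice than the population one.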
SRC vs DST
A simple check that equivalent aggregation by DST instead of SRC is similar in performance:
Effect of indexing load on cluster performance
To check how the index update operation affects aggregation requests, repeated indexing of 3 days of data was started from ikadochn-es-c, then aggregation timings were measured from ikadochn-es-c3.
Indexing appears to have no effect on average aggregation times, but occasionally produces slower outliers.
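Slow outliers of this kind can be picked out by comparing each run against the median rather than the mean, since the median is barely moved by a few slow runs. The sketch below is an assumption about how to flag them; `slow_outliers` is a hypothetical helper and the 1-second threshold is only chosen to match the 1-2 second delays seen here.

```python
import statistics


def slow_outliers(timings, threshold_seconds=1.0):
    """Return the timings that exceed the median by more than the threshold.

    timings: measured aggregation times in seconds.
    threshold_seconds: assumed cut-off, not a value taken from the tests.
    """
    median = statistics.median(timings)
    return [t for t in timings if t - median > threshold_seconds]
```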
Different SRC aggregations
Comparison of different SRC aggregations:
- src_plot: hourly bins, aggregations from the top of hierarchy: SRC_DOMAIN, IS_REMOTE_ACCESS, IS_TRANSFER, ACTIVITY, PERIOD_END_TIME
- src_plot_10m: SRC, but with 10-minute bins (no time aggregation at all, only sum over DST)
- src_plot_10m_nohist: Instead of histogram aggregation, use terms aggregation for PERIOD_END_TIME
- src_plot_daily: SRC, but with 24-hour bins
- src_plot_timefirst: SRC, but change aggregation order to PERIOD_END_TIME, SRC_DOMAIN, IS_REMOTE_ACCESS, IS_TRANSFER, ACTIVITY
- src_plot_minimal: hourly bins, aggregations are: SRC_DOMAIN, PERIOD_END_TIME, leading to fewer buckets
- src_plot_minimal_timefirst: reverse the previous to PERIOD_END_TIME, SRC_DOMAIN
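The variants above differ only in the list of fields, their order, the bin size and whether the time field uses a histogram. A nested aggregation body for them could be generated roughly as sketched below. This is not the original query code: `build_aggregation` is a hypothetical helper, metric sub-aggregations (the sums) are left out, and the "interval" parameter name is an assumption that matches Elasticsearch versions of that period.

```python
def build_aggregation(fields, interval="1h", time_field="PERIOD_END_TIME",
                      use_histogram=True):
    """Nest one bucket aggregation per field, outermost field first.

    The time field becomes a date_histogram with the given interval unless
    use_histogram is False, in which case it is a plain terms aggregation.
    Note: terms aggregations return only 10 buckets by default, so a real
    query would also need a "size" setting (omitted in this sketch).
    """
    nested = None
    # Build from the innermost level outwards.
    for field in reversed(fields):
        if field == time_field and use_histogram:
            level = {"date_histogram": {"field": field, "interval": interval}}
        else:
            level = {"terms": {"field": field}}
        if nested is not None:
            level["aggs"] = nested
        nested = {field: level}
    return nested
```

For example, `build_aggregation(["SRC_DOMAIN", "PERIOD_END_TIME"])` would correspond to src_plot_minimal, and reversing the list to src_plot_minimal_timefirst.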
Number of records traversed only depends on time range aggregated:
Size of returned buckets depends on aggregation structure, but not interval, as expected:
The number of buckets returned depends on aggregation and interval, but not aggregation order, as expected:
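The bucket count of a nested response can be obtained by walking the "buckets" lists recursively. The sketch below assumes that buckets at every nesting level are counted, which may differ from the counting used for the plot; `count_buckets` is a hypothetical helper.

```python
def count_buckets(node):
    """Count the buckets at every nesting level of an aggregation response.

    Works on the parsed response (or its "aggregations" part) for bucket
    aggregations that return "buckets" as a list, such as terms and
    date_histogram.
    """
    total = 0
    if isinstance(node, dict):
        buckets = node.get("buckets")
        if isinstance(buckets, list):
            total += len(buckets)
            for bucket in buckets:
                total += count_buckets(bucket)
        else:
            # Not an aggregation result itself: look inside its values
            # for nested sub-aggregations.
            for value in node.values():
                total += count_buckets(value)
    return total
```

Since the function only looks at the structure of the response, the order of nesting changes which buckets sit at which level, but the leaf buckets stay the same set of field combinations.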
As a result, data size over time range is different for all aggregations:
Timing results:
Matrix aggregations
Matrix aggregations are a curious case because data does not grow for longer aggregation periods.
Two matrix aggregations were compared, with examples of SRC aggregations shown for scale:
- matrix_full: all parameters except PERIOD_END_TIME, so SRC_DOMAIN, DST_DOMAIN, IS_REMOTE_ACCESS, IS_TRANSFER, ACTIVITY
- matrix_minimal: just SRC_DOMAIN, DST_DOMAIN
Bucket size depends on aggregation time, on a scale comparable to SRC:
Bucket count for the matrix is not linear:
As a result, data size is also not linear:
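The reason the matrix output behaves differently can be illustrated with toy data: buckets that include PERIOD_END_TIME in their key keep multiplying as the range grows, while buckets keyed only on SRC_DOMAIN and DST_DOMAIN are bounded by the number of distinct pairs. The records below are entirely synthetic and the site names are placeholders; `distinct_bucket_keys` and `toy_records` are hypothetical helpers.

```python
def distinct_bucket_keys(records, key_fields):
    """Number of distinct combinations of key_fields in the records,
    which is the number of innermost buckets such an aggregation returns."""
    return len({tuple(record[field] for field in key_fields)
                for record in records})


def toy_records(hours, sites=("site_a", "site_b", "site_c")):
    """Synthetic data: every site pair has one record in every hour."""
    return [
        {"PERIOD_END_TIME": hour, "SRC_DOMAIN": src, "DST_DOMAIN": dst}
        for hour in range(hours)
        for src in sites
        for dst in sites
    ]
```

In this toy case, doubling the number of hours doubles the (SRC_DOMAIN, PERIOD_END_TIME) key count but leaves the (SRC_DOMAIN, DST_DOMAIN) count unchanged. Real data would sit somewhere in between, as new pairs keep appearing for a while.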
Matrix time is not linear with respect to date range (input data size):
Matrix time is not linear with respect to bucket count (which is proportional to output data size):
Matrix time is not linear with respect to output data size:
--
IvanKadochnikov - 2015-05-18