Streams Performance and Scalability Tests
Goals
This wiki page documents and shares the results of the streams performance and scalability testing in progress at CERN. The aim is a scalable and robust replication system for data sharing between Oracle databases at CERN and at Tier 1 sites outside CERN, and also within CERN (online to offline data transfer).
People involved
CERN: Dirk, Eva, Luca
Oracle: Patricia, Graeme
Test Architecture and setup
Replication has been set up between two Oracle RAC databases in the CERN datacenter on the same Gbps LAN. This reduces complexity by excluding network latency issues and heterogeneous configurations. Two-node Oracle RAC clusters on Linux are used (see
Architecture_description.pdf and
Installation_verbose ).
Streams replication is set up as described in
CONFIG_STREAMS_for_TEST_10Jul.txt
The test workload is a simple insert into a single table replicated from the source (d3r database) to the destination (test1 database). This workload and the table structure (short rows composed mainly of numeric fields) have been chosen to reflect the expected behaviour of an important class of applications at CERN, where 'condition data' (numeric detector parameters and similar) is logged into a single table.
The test script can be found in
test_script.py.txt
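The attached Python script is the actual harness; purely as an illustration of the workload pattern, an equivalent PL/SQL sketch might look as follows (the table STREAM_TEST_TAB and its columns are hypothetical, and the row count, transaction size and delay are those of the tests below):

DECLARE
  v_total_rows   CONSTANT PLS_INTEGER := 1000000;  -- 1M rows, as in Test 10-Jul
  v_rows_per_txn CONSTANT PLS_INTEGER := 100;      -- transactions of 100 rows
  v_delay_sec    CONSTANT NUMBER      := 0.02;     -- 2 cs between transactions
BEGIN
  FOR i IN 1 .. v_total_rows LOOP
    -- short numeric rows, mimicking the 'condition data' style of workload
    INSERT INTO stream_test_tab (id, val1, val2)
    VALUES (i, MOD(i, 1000), DBMS_RANDOM.VALUE);
    IF MOD(i, v_rows_per_txn) = 0 THEN
      COMMIT;
      DBMS_LOCK.SLEEP(v_delay_sec);  -- throttle the insert rate
    END IF;
  END LOOP;
  COMMIT;
END;
/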
Test 10-Jul, Luca
The test is: insert 1M rows into a single table with one index (the PK). Insert statements are grouped in transactions of 100 rows. A delay of 2 cs (20 ms) has been added between transactions.
Notes
- The delay is new; it has been added to slow down the insert rate and bring it just above the measured replication speed.
- The test now runs for about 5 minutes, instead of 1 min.
- Apply parallelism is set to 1 (see the sketch after this list)
- Replication source is d3r, target is test1
- on both DBs node 1 is the owner of the streams queues during the test (KST traces were gathered on node 1)
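For reference, the apply parallelism is controlled with DBMS_APPLY_ADM.SET_PARAMETER on the target database; a minimal sketch (the apply process name STREAMS_APPLY is an example, not the actual name in this setup):

BEGIN
  -- apply parallelism = 1 for this test; the apply name is an example
  DBMS_APPLY_ADM.SET_PARAMETER(
    apply_name => 'STREAMS_APPLY',
    parameter  => 'parallelism',
    value      => '1');
END;
/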
Collected logs and test sequence:
- scripts for KST traces are prepared (see KST_trace.txt)
- the test is started from a client on the same LAN as the DB servers (Mon Jul 10 19:27:42 2006)
- a pause of a few seconds with no further actions
- AWR snapshots are taken (on d3r and test1; see the snapshot sketch after this list)
- strmmon data collection is started (d3r and test1)
- KST trace data is collected
- The test program stops inserting (reached 1M rows) at Mon Jul 10 19:31:40 2006
- streams replication continues until 2006-07-10 19:32:45
- AWR reports are taken
- strmmon collection is stopped
- streams health checks (HC) are gathered
- test10Jul_d3r.tar.gz: source DB trace files (test 10jul)
- test10Jul_test1.tar.gz: target DB trace files (10Jul)
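For reference, the AWR snapshots and reports in the sequence above are taken with the standard AWR calls; a minimal sketch (run on each database, d3r and test1):

-- snapshot at the start of the run
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();
-- ... run the workload ...
-- snapshot at the end of the run
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();
-- then generate the report between the two snapshot ids
@?/rdbms/admin/awrrpt.sql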
Measured throughput:
- in rows/lcrs: 1M rows in 5 min -> 3.3K rows/sec
- in bytes: 50 MB in 5 min -> 10 MB/min, ~170 KB/sec
- Note: this result is consistent with what was measured and discussed previously for similar tests
Notes on Performance and scalability:
- The measured performance of 'Test 10-Jul' is already usable by some of the applications in development and test at CERN
- However, some write-intensive applications that we foresee in our environment will most likely need more than that.
- For this reason we would like to better understand the scalability limits of streams replication in its current form and possibly tune for performance and scalability.
- Other observations related to scalability and 'bottleneck hunting':
- We see that system resource utilization (CPU, IO) is not the bottleneck
- From this and other tests we see that 'capture' is not the bottleneck
- From other tests we have seen that increasing the apply parallelism does not increase the throughput for this particular workload (the peculiarity here is that we insert into a single table); see the monitoring query sketch after this list
- During the test we could see that the replication pace was slowly falling behind the insert (test harness) workload, so I believe that a steady state can be reached at a somewhat lower workload than what was used for Test 10-Jul.
- Note 11 Jul: this has been confirmed by further tests:
- delay = 1 cs (was 2 cs in Test 10-Jul) -> workload generator run time = 2 min -> streams replication run time = 5 min
- delay = 3 cs (was 2 cs in Test 10-Jul) -> workload generator run time = 5 min -> streams replication run time = 5 min (it stops about 10 seconds after the workload generator)
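For the 'bottleneck hunting' above, a simple way to see which Streams component is busy is to poll the Streams dynamic views during a run; a minimal sketch (10gR2 view names; first query on the source, second on the target):

-- on the source database (d3r): capture state, e.g. 'PAUSED FOR FLOW CONTROL'
SELECT capture_name, state, total_messages_captured
  FROM v$streams_capture;

-- on the target database (test1): per-server apply activity
SELECT apply_name, server_id, state, total_messages_applied
  FROM v$streams_apply_server;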
Test 12-Jul, Luca, Eva
- similar to Test 10-Jul, with the following differences, as requested by Patricia:
- KST tracing events changed:
ALTER SYSTEM SET trace_enabled = TRUE;
ALTER TRACING DISABLE "ALL";
ALTER TRACING ENABLE "10304:255:ALL:NORMAL";
ALTER TRACING ENABLE "10005:255:ALL:NORMAL";
- collected ash report too
- added a script to collect pstack info for the Apply Server in a loop (100 times with a delay of 1 sec)
- Test12Jul_d3r.tar.gz: test12jul - logs and reports from the source DB
- Test12Jul_test1.tar.gz: test12jul - logs and reports from the destination DB
Reply from Patricia (22-Jul):
We would like to see more detailed information for the apply server process
(P001) session id 505 from the ASH information:
Just FYI... you can specify the specific session_id by defining the target_session_id variable within ashrpt.sql. Make a copy of the ashrpt.sql file from the rdbms/admin directory, define the target_session_id variable in this copy, and use the modified file to generate the ASH report.
If the ASH data is no longer available for this AWR report and you need to rerun the load,
can you supply some sar data that shows the disk activity during the run - and
identify the disks where the database files and logfiles are stored. Also, the
KST tracing should not be collected until you see the apply consistently showing
as a bottleneck (at least 6-10 times in a row).
Just to clarify... we will need to see 2 ASH reports for the target database.
The first one must be run without the specific target_session_id setting. Then,
generate a 2nd ASH report with the APPLY SERVER sid specified, so we can get
more detailed information. Basically we want detailed event info on any session
that is taking significant time on the database (other than the propagation
receiver). In the earlier example, session id 509 (the first session listed in
the table) is the propagation receiver.
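A minimal sketch of the procedure Patricia describes (the copy name and session id are examples; the exact place for the define depends on the script version):

-- make a private copy of $ORACLE_HOME/rdbms/admin/ashrpt.sql (e.g. ashrpt_sid.sql)
-- and in the copy define the target session before the report is generated:
define target_session_id = 505
-- then run the modified script from SQL*Plus on the target database:
@ashrpt_sid.sql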
Added 27-Jul: can you add pstack output for the reader process of the CAPTURE at the source database to the collected data. Take multiple pstack outputs at the source. In addition, obtain the ashrpt for the capture reader process session at the source database. There have been some additional discussions on the output supplied so far, and additional info has been requested.
Added 4-Aug: I just want to let you know that the last report showed the logminer reader to be 90% active. Based on this information, it doesn't seem likely that the apply is the bottleneck. If you are going to do a re-run, please make sure to get an ASH report targeting the logminer reader session on the source as well.
Actually, I think we can make the information collection a bit simpler on this re-run. Just give us the STRMMON and the ASH report for the source and destination. Also, re-run the ASH for the logminer reader and the apply server session. In other words, it is not necessary to generate the KST and the AWR reports.
Test 04-Aug, Luca, Eva
- repeated the same test (insert into 1 table and replication from source to destination RAC on a LAN; 1 million rows inserted in transactions of 100 rows, with a delay of 2 cs between transactions; see also the python script)
- test04aug.zip: streams replication performance test - test04 aug
- note on the ASH reports: these were run a few seconds (estimated up to 1 min) after replication had finished, with start = -9 min relative to the current time (this introduces some 'idle time' into the calculation).
- note on apply parallelism: it was set to 1
- comment: we don't see capture (logminer) as the bottleneck.
Test 15-Aug Web Conference
Test 12-Nov
- From Patricia: alter system set events '26749 trace name context forever, level 2';
- Test with apply parallelism = 2 (parallelism = 3 and 4 were also tested: contention reduces throughput; parallelism = 1 also gives lower throughput, for a different reason)
- using the same test script as in the July tests above (with delay = 0.1 cs)
- in rows/lcrs: 1M rows in 160 sec -> 6.2K rows/sec
- capture is still blocked for flow control most of the time, and the capture (OS) process is active with 20% of CPU (on a dual-CPU machine) -> capture is not the bottleneck
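The flow-control behaviour noted above can be cross-checked by looking at how full the buffered streams queue is while the run is in progress; a minimal sketch (10gR2 view names, run on the source database):

-- number of buffered messages and messages spilled to disk, per streams queue
SELECT queue_schema, queue_name, num_msgs, spill_msgs
  FROM v$buffered_queues;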
CMS Online/Offline replication tests, Marcin
The test is: Streaming 7 ecalpedestals objects in a single commit
Test conditions and setup
- Streams setup described in Test Architecture and setup (above)
- Inserting data into 2 tables: st_ecalpedestals (1 row per object) and st_ecalpedestals_item (child)
- Each object's size is 1.63 MB; after relational mapping it takes 61200 rows in the st_ecalpedestals_item table
- We have 7 x 61200 row inserts in a single commit
- The number of LCRs is 428,400
- It must be 7 objects in a single commit; otherwise there is a risk of data corruption
Test procedure
PL/SQL
streamtest procedure generates the random data and does the table inserts (a simplified sketch is given below).
During the tests we monitor the LCR flow, the redo generated on the source and destination databases, and the streams pool size.
AWR reports are taken.
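A much simplified sketch of what a streamtest-style run does (table column names are illustrative only; the real procedure and schema belong to the CMS conditions setup, and the real code may use bulk binds):

CREATE OR REPLACE PROCEDURE streamtest_sketch (p_num_objects IN PLS_INTEGER) AS
  c_rows_per_object CONSTANT PLS_INTEGER := 61200;
BEGIN
  FOR obj IN 1 .. p_num_objects LOOP
    -- one master row per ecalpedestals object (column names are illustrative)
    INSERT INTO st_ecalpedestals (iov_value_id, iov_time)
    VALUES (obj, SYSDATE);
    -- 61200 child rows with random payload per object
    FOR i IN 1 .. c_rows_per_object LOOP
      INSERT INTO st_ecalpedestals_item (iov_value_id, pos, mean_x12, rms_x12)
      VALUES (obj, i, DBMS_RANDOM.VALUE, DBMS_RANDOM.VALUE);
    END LOOP;
  END LOOP;
  COMMIT;  -- all objects go in a single commit, as required by the test
END;
/
-- e.g. for the test above: EXEC streamtest_sketch(7);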
Performance
- Performance is very low: the replication time is 300-350 seconds
- Both processors run at 50%
- 100-120 secs of "Streams AQ: enqueue blocked due to flow control" wait events on the destination database
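The flow-control wait time quoted above can also be read from the system-level wait statistics on the destination (the same figure appears in the AWR report); a minimal sketch:

-- cumulative waits for the flow-control event on the destination database
SELECT event, total_waits, time_waited / 100 AS time_waited_sec
  FROM v$system_event
 WHERE event = 'Streams AQ: enqueue blocked due to flow control';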
Unresolved or partially unresolved problems
We've noticed that for a single run of streamtest(7) the redo volume generated by the destination database is much higher than on the source (716 MB vs. 60 MB, a 12x factor).
The redo traffic distribution is also interesting. Redo activity is around 3.5 MB/s until the apply writer starts, and 1.5 MB/s during the apply (as indicated by strmmon: AXXX process activity changes to a non-zero value); see
lcrs.png and
redo.png attached.
Having mined through the logs on the destination database, we've discovered a lot of inserts into STREAMS$_APPLY_SPILL_MSGS_PART; precisely, the number of inserts into that table equals the number of inserts into st_ecalpedestals_item, which is 428,400 (7 x 61,200).
We've dumped the log and observed that the redo entry size for STREAMS$_APPLY_SPILL_MSGS_PART is 591 bytes vs 75 bytes for an st_ecalpedestals_item insert redo entry. Each replicated row therefore generates 591 + 75 = 666 bytes of redo on the destination, a factor of 666/75 = 8.88. So the redo volume generated by the st_ecalpedestals_item and STREAMS$_APPLY_SPILL_MSGS_PART inserts is around 300 MB (428,400 x 666 bytes), which leaves about 400 MB of redo unaccounted for (we've only checked the most frequent redo entries).
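For reference, a minimal sketch of the LogMiner session used for this kind of redo check (the log file path is only an example):

BEGIN
  -- register the log file to analyse (path is an example)
  DBMS_LOGMNR.ADD_LOGFILE(
    logfilename => '/path/to/archived_log.arc',
    options     => DBMS_LOGMNR.NEW);
  DBMS_LOGMNR.START_LOGMNR(
    options => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
END;
/

-- count redo records per segment (run in the same session that started LogMiner)
SELECT seg_owner, seg_name, operation, COUNT(*)
  FROM v$logmnr_contents
 GROUP BY seg_owner, seg_name, operation
 ORDER BY COUNT(*) DESC;

EXEC DBMS_LOGMNR.END_LOGMNR;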
We suspect that the inserts into STREAMS$_APPLY_SPILL_MSGS_PART are related to the TXN_LCR_SPILL_THRESHOLD apply parameter. We've tried setting this parameter to values other than the default of 10,000. With the parameter set to values > 80,000 the performance dropped slightly and the source database redo volume increased by a factor of 2 (so far we haven't mined through those logs). The performance drop could be observed in the strmmon output: there were 10-20 sec gaps in LCR propagation (the source and destination queues as well as the propagation indicated 0 LCRs/sec). Setting the parameter to 300,000 caused spilling from the queues on the destination database, which increased the destination redo volume to 1.2 GB and also increased the redo on the source side. The free streams pool size hasn't dropped below 30% (of 1.2 GB).
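For reference, a minimal sketch of how the apply parameter is changed and checked on the destination (the apply process name STREAMS_APPLY is an example, not the actual name in this setup):

BEGIN
  -- default is 10000; values > 80000 and 300000 were tried in these tests
  DBMS_APPLY_ADM.SET_PARAMETER(
    apply_name => 'STREAMS_APPLY',
    parameter  => 'txn_lcr_spill_threshold',
    value      => '80000');
END;
/

-- check the current setting
SELECT apply_name, parameter, value
  FROM dba_apply_parameters
 WHERE parameter = 'TXN_LCR_SPILL_THRESHOLD';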