Skip to content

Commit 3bd325a

Browse files
author
Jakub Lopuszanski
committed
Bug#36072163 innodb.doublewrite test has race condition between perl and restart
Bug#36072349 innodb_autoextend_dml_1 and innodb_autoextend_dml_2 use expect file in wrong way Added sanity checks to: * mysql-test/include/check-testcase.test (which is a script executed before and after each test case) that there is no $MYSQLTEST_VARDIR/tmp/mysqld.$_server_id.expect file, as that would indicate that somebody expect the server to be shutdown at the beginning or the end of the test. This check is not done on Windows on which unlink($expect_file) done by the mysql-test-run.pl itself observes Permission Denied for several minutes, forcing us to give up to not timeout the whole test suite. * mysql-test/include/start_mysqld.inc (which is a script used to start up a server after it was shutdown/crashed) that there is an $MYSQLTEST_VARDIR/tmp/mysqld.$_server_id.expect file, and that it contains "wait", because if it doesn't exist it means the test wasn't expecting the server to be shutdown, and if it contains "restart.*" instead of "wait" then it means the author of the test wanted the server to be automatically restarted for them (so calling start_mysqld.inc would be a race). Fixed following issues with our tests, as the result of investigation: * Removed binlog.binlog_crash_safe_master_checksum_myisam test, which contained a snippet of code extracted from another test which involved MyISAM and crash_commit_after_prepare fault injection - taken out of this context, this small snippet makes no sense, and can't cause the crash * The comment in binlog_gtid.binlog_flush_logs_error claimed that the test uses crash_commit_before to cause a crash, but actually it always used simulate_flush_commit_error which doesn't cause a crash but an error * The innodb.alter_crash tried to manually write "restart" to expect file, then called start_mysql.inc, too, which caused a race. Changed that to use expect_crash.inc instead. * Same in innodb.doublewrite * Same in innodb_fts.sync * The innodb.session_temp_tablespaces thought it is starting the server, but actually it was already up and running already. Also, it looks like the intent could have been to restart with a different value of innodb_temp_tablespaces_dir, ".", but the reality is that InnoDB would still silently re-interpret it as ./#innodb_temp, so there's no real change from the default. I've simply removed the spurious call to start_mysqld.inc as it was doing nothing. * The innodb.zlob_rollback_crash_2 was never causing a crash, (and thus --error had 0 on the list of possible errors), because the crashpoint was unreachable due to mutually contradicting conditions. Later developement of BLOBs code made the code even "more unreachable" because it introduced optimization for short blobs. For more details see http://RB/rb/r/23299/#header-reply115008 * The main.dd_upgrade_error_cs was trying to start the server at the end, because it wrongly assummed the server was shut down before conducting several manual calls to --exec $MYSQLD. In reality the server was up and running the whole time. This was fine because the two operate on different data-dirs. I've chosen a safer approach in which the server is kept off throught the testcase, and only started at the end. * The main.persisted_variables_bugs tried to start the server while it was already up. Moreover it attempted to use --exec $MYSQLD_CMD to start another server from the same data-dir, which doesn't look like a good idea. I've fixed both problems by adding a shutdown_mysqld.inc before attempt to --exec $MYSQLD_CMD. However, this alone didn't work, because turns out that our assert_grep.inc needs connection to a server to execute some string concatenation via SELECT. So, I've reordered several lines in the code, so that we first start the server again, and only then assert the temporary error log file contains the pattern sought. * The main.restart_server_win had a severe race condition in that the author seemed to believe that RESTART is a synchronous operation. In reality the command returns immediately, but the restart itself takes several seconds. Meanwhile the test runner asked client to send another queries like SELECT 1, and SHUTDOWN, which were both accepted by the "old" server instance before it finished. Once the old instance finished, the supervisor was restarting it. Thus when the test attempted to call mysqld_start.inc, the server was already up and running thanks to the supervisor alone. I've fixed it by adding proper waits for the connection to be dropped and reestablished after RESTART. * The comment in encryption.upgrade says it will restart the server, but instead it tried to start it (while it was already up and running). I've changed it to actual restart_mysqld.inc. * The innodb.innodb_autoextend_dml_1 first wrote "restart" to the expect file, then tried to call start_mysqld.inc with a different set of options, which was a race. I've fixed this by using expect_crash.inc instead. * Same in innodb.innodb_autoextend_dml_2 * the portability_tablespace_linux, according to the code comment, tried to increase the wait_counter to permit for longer restart time, but in practice the value of 4000 which it used was smaller than the typically used 6000. The test often timed out, so I've changed it to 8000. * some tests tried to access t1.ibd file after observing error 2013, but on Windows the file could still be in use, so I've added retry argument to --remove_file commands * on slow Jenkins workers, it took several seconds to completely close the connection, so for some time it lingered in PROCESSLIST as KILLED, which triggered errors from --check-testcases. I've fixed this by changing the mysqld_check.sql to ignore KILLED connections. Many of the tests failed the sanity check reporting the expect file to still exist after the test on Windows in pb2. This appeared to be the same problem that was already reported as Bug#28921712 MTR SOMETIME FAILS TO REMOVE THE .EXPECT FILES IN WINDOWS, and for which mysql-test-run.pl already has a work-around, in the form of a retry loop. But, this loop made just one retry. Turns out even 439 loops aren't enough sometimes. OTOH, increasing it to 600 made the whole test suite time out in Jenkins after 5h, which indirectly indicates how frequent the problem is. As a trade-off, this patch increases it to 30. Above change in turn means that it can now take a bit longer for the server to be started again by the test runner, because test runner waits longer (before giving up) for the removal of the expect file. This was causing a new pattern of failures in which wait_until_connected_again.inc timed out, because it only had patience for 1000 x 0.1s. To match above change I've increased the maximum number of loops to 6000 by default. The innodb.doublewrite was failing for yet another reason: it tried to modify t1.ibd file immediately after the connection dropped due to a crash. The perl script couldn't open the file though, on Windows - probably again because of antivirus. I've reused corrupt_page.inc and added a retry loop to it. A similar issue was already reported and fixed in Bug#35595808 for mysql-8.4 and mysql-trunk, but in mysql-8.0 it is still not fixed. This patch includes a backport of that fix (calling wait_until_disconnected.inc) which in itself is not sufficient, as disconnecting from the server only means the server has closed connections which happens before its process ends, and file handles are released. Thus this patch extends the solution in corrupt_page.inc to include a retry loop (600 x 1sec) in case of Permission Denied. Change-Id: Ia5129fc4d846c0121f86b480a5c409d877d68b37
1 parent 038b026 commit 3bd325a

34 files changed

+163
-362
lines changed

mysql-test/include/check-testcase.test

Lines changed: 21 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,26 @@
1111
# not have any unwanted side affects.
1212

1313
--disable_query_log
14+
--let $_server_id = `SELECT @@server_id`
15+
# There should be no expect file before nor after the test, because
16+
# existence of this file means author expects the server to be off,
17+
# while it is an assumption that the server is up and running at
18+
# the start and end of the test (in particular this script queries it).
19+
# Unfortunately, we can't verify this on Windows, because there the
20+
# unlink($expect_file) sometimes fails due to Permission Denied
21+
# despite numerous retries. In such case the test runner leaves the
22+
# file around, in hope it will be cleaned up later or not cause too
23+
# much trouble to the next test case. The trade-off here is that if
24+
# we insist on checking this on Windows, it will fail a lot without
25+
# revealing any real problem in how the test script is written, and
26+
# inceasing number of retries makes whole test suite time out, as this
27+
# happens quite frequently during whole test suite. Verifying on Linux
28+
# is enough to catch bugs in how the test scenario is writen.
29+
--source include/check_windows.inc
30+
if (!$have_windows)
31+
{
32+
--list_files $MYSQLTEST_VARDIR/tmp mysqld.$_server_id.expect
33+
}
1434

1535
# We want to ensure all slave configuration is restored. But SHOW
1636
# REPLICA STATUS returns nothing for servers not configured as slaves,
@@ -20,7 +40,7 @@
2040
# no error, and a few other invariants. So we issue SHOW REPLICA
2141
# STATUS, and if it returns no rows we fake the result.
2242
--let $tmp= 0
23-
if (`SELECT @@server_id <> 0`) {
43+
if ($_server_id) {
2444
--let $tmp= query_get_value(SHOW REPLICA STATUS, Replica_IO_Running, 1)
2545
--let $tmp= `SELECT '$tmp' = 'No such row'`
2646
}

mysql-test/include/check_windows.inc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,4 +7,4 @@
77
# --source include/check_windows.inc
88
# # Now check if $have_windows is 0 or 1
99

10-
--let $have_windows = `SELECT CONVERT(@@version_compile_os using latin1) IN ("Win32", "Win64", "Windows")`
10+
--let $have_windows = `SELECT CONVERT(@@version_compile_os using latin1) IN ('Win32', 'Win64', 'Windows')`

mysql-test/include/mtr_check.sql

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -200,7 +200,7 @@ BEGIN
200200
-- not give consistent result either.
201201
--
202202
SELECT /*+SET_VAR(use_secondary_engine=OFF)*/ USER, HOST, DB, COMMAND, INFO FROM INFORMATION_SCHEMA.PROCESSLIST
203-
WHERE COMMAND NOT IN ('Sleep', 'Daemon')
203+
WHERE COMMAND NOT IN ('Sleep', 'Daemon', 'Killed')
204204
AND USER NOT IN ('unauthenticated user','mysql.session', 'event_scheduler')
205205
ORDER BY COMMAND;
206206

mysql-test/include/not_asan_windows.inc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
let $have_asan_windows = `SELECT VERSION() LIKE '%asan%' AND CONVERT(@@version_compile_os using latin1) IN ("Win32", "Win64", "Windows")`;
1+
let $have_asan_windows = `SELECT VERSION() LIKE '%asan%' AND CONVERT(@@version_compile_os using latin1) IN ('Win32', 'Win64', 'Windows')`;
22
if ($have_asan_windows)
33
{
44
--skip Test requires 'not_asan_windows'

mysql-test/include/start_mysqld.inc

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,36 @@ if (!$restart_parameters)
77
# replace_result works on --exec
88
--exec echo "# $restart_parameters"
99

10+
# A common bug in our tests was that people tried to call start_mysqld.inc
11+
# while the expect file already existed and contained something other than "wait",
12+
# i.e "restart", "restart_abort" etc.
13+
# This was a race condition between server being restarted due to the inclusion
14+
# of this script, and restart because the test runner notices something other
15+
# than "wait".
16+
# The proper way of using start_mysqld.inc is to call it when you know the
17+
# server is not running - this is tricky, so as a compromise you should at
18+
# least ensure you are not connected. How? You've used a script which waits
19+
# for the server to shut down or at least disconnect your connection, or
20+
# you've executed a query which triggered a fatal error and your client
21+
# got disconnected (--error CR_SERVER_LOST, 2013).
22+
# And you don't want the test runner to ressurect the server, so the expect
23+
# file should exists and contain "wait" in it.
24+
# If it doesn't exist, then the test runner will panic, that server went
25+
# away and you didn't expect it. If it exists and has "restart" or
26+
# "restart_abort" in it, then the test runner will restart the server for
27+
# you, so you shouldn't attempt to also start it in parallel by using this
28+
# script.
29+
# Another, more rare bug was to try to use start_mysqld.inc when the expect
30+
# file doesn't even exit. This implies that the server is up and running, too,
31+
# as otherwise the test runner would panic not seeing the expect file.
32+
# Trying to start the server while it is already started indicates a logic bug.
33+
# Thus, we assert that the expect file exists and contains "wait" in it.
34+
35+
--let $_sm_old_wait_timeout=$wait_timeout
36+
--let $wait_timeout=0
37+
--source include/wait_for_wait_in_expect_file.inc
38+
--let $wait_timeout=$_sm_old_wait_timeout
39+
1040
# Include this script only after using shutdown_mysqld.inc
1141
# or kill_mysqld.inc or expect_crash.inc
1242
# where $_expect_file_name was initialized.

mysql-test/include/wait_for_wait_in_expect_file.inc

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,6 @@
3232
if ($_expect_file_name == '') {
3333
--die ERROR: $expect_file_name parameter needed for `wait_for_wait_in_expect_file.inc`
3434
}
35-
--echo include/wait_for_wait_in_expect_file.inc
3635
--let _WFWIEF_EXPECT_FILE_NAME = $_expect_file_name
3736
--let _WFWIEF_WAIT_TIMEOUT = $wait_timeout
3837
--perl

mysql-test/include/wait_until_connected_again.inc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@
1313
--disable_result_log
1414
--disable_query_log
1515

16-
--let $counter= 1000
16+
--let $counter= 6000
1717
if ($explicit_default_wait_counter)
1818
{
1919
--let $wait_counter= $explicit_default_wait_counter

mysql-test/include/windows.inc

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
1-
if (`select convert(@@version_compile_os using latin1) IN ("Win32","Win64","Windows") = 0`)
1+
--source include/check_windows.inc
2+
if (!$have_windows)
23
{
34
skip Need windows;
45
}

mysql-test/mysql-test-run.pl

Lines changed: 18 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -5990,17 +5990,24 @@ ($$)
59905990
delete $mysqld->{'restart_opts'};
59915991
}
59925992

5993-
# Attempt to remove the .expect file. If it fails in
5994-
# windows, retry removal after a sleep.
5995-
my $retry = 1;
5996-
while (
5997-
unlink($expect_file) == 0 &&
5998-
$! == 13 && # Error = 13, Permission denied
5999-
IS_WINDOWS && $retry-- >= 0
6000-
) {
6001-
# Permission denied to unlink.
6002-
# Race condition seen on windows. Wait and retry.
6003-
mtr_milli_sleep(1000);
5993+
# Attempt to remove the .expect file. It was observed that on Windows
5994+
# 439 are sometimes not enough, while 440th attempt succeeded. The exact
5995+
# cause of the problem is unknown, as trying to list the processes with
5996+
# open handle using sysinternal tool named Handle, caused the tool
5997+
# itself to crash. We make only 30 iterations here, as a trade-off,
5998+
# which seems to be enough in most cases, but waitng longer seems to
5999+
# risk timing out the whole test suite. Not removing a file might impact
6000+
# the next test case, thus we emit a warning in such case.
6001+
my $attempts = 0;
6002+
while (unlink($expect_file) == 0) {
6003+
if ($! == 13 && IS_WINDOWS && ++$attempts < 30){
6004+
# Permission denied to unlink.
6005+
# Race condition seen on windows. Wait and retry.
6006+
mtr_milli_sleep(1000);
6007+
} else {
6008+
mtr_warning("attempt#$attempts to unlink($expect_file) failed: $!");
6009+
last;
6010+
}
60046011
}
60056012

60066013
# Instruct an implicit wait in case the restart is intended

mysql-test/r/dd_upgrade_error_cs.result

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,8 +18,6 @@ Pattern "Error in parsing Routine 'test'.'procedure_groupby_desc' during upgrade
1818
Pattern "Error in parsing View 'test'.'view_groupby_desc' during upgrade" found
1919
Pattern "Data Dictionary initialization failed." found
2020
# Remove copied files
21-
# Restart the server with default options.
22-
# restart
2321
########################################################################
2422
#
2523
# Bug#28204431: mysql upgrade from "version < 8.0.13" to "version >= 8.0.13"

mysql-test/r/persisted_variables_bugs.result

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -434,9 +434,9 @@ DROP USER 'bug28749668'@'%';
434434
#
435435
# config file is in wrong json format
436436
# server should fail to start
437-
include/assert_grep.inc [Server should fail to start as mysqld-auto.cnf file is with wrong JSON format.]
438437
# start server with all defaults
439438
# restart:
439+
include/assert_grep.inc [Server should fail to start as mysqld-auto.cnf file is with wrong JSON format.]
440440
#
441441
# Bug #32761053: SET PERSIST_ONLY ... = DEFAULT IS BROKEN FOR ENUM VARIABLES
442442
#

mysql-test/suite/binlog/r/binlog_crash_safe_master_checksum_myisam.result

Lines changed: 0 additions & 18 deletions
This file was deleted.

0 commit comments

Comments
 (0)