Prof. Farnsworth: "Good news, everyone!"
Bender: "Uh-oh! I don't like the sound of that."
Prof. Farnsworth: "I have two resolutions and one clue ..."
Bender: "Here it comes ..."
Prof. Farnsworth: "... and two apologies"
Bender: "And I'm outta here!"
Apology #1: The shoemaker's wife has no shoes.
Dr. Petoi recently posted a video of a test of a T_SKILL_DATA command with the maximum number of elements that would fill up the newCmd buffer to demonstrate it could handle the data. In my reply I asked if he could increase the data amout in order to force the buffer overflow code to execute.
I wasn't thinking. I could have done that myself with the test software I wrote. My bot wasn't working and had a hardware problem but I have been testing it with the battery off for some commands for a while. I could have easily written the stress test to overflow the newCmd buffer. But I wasn't thinking.
So my apologies for asking you to do something I could have done myself.
Apology #2: Do the whole job, not just half.
The test is fairly trivial but the results ... well, that's getting ahead of things.
Here's my first version of the test
TEST_F(ftfBittleXProtocol, newCmd_overflow) {
const unsigned BUFF_LEN{ 2507 }; // 1524 =125*20+7=2507
vector < int8_t > symphony(BUFF_LEN, 32);
cmd_def_t command{ BEEP, "b", "BEEP" };
command += symphony;
EXPECT_TRUE(on_command(command));
}
When I ran the test I got a long beep and the output (trimmed for space)
1>[ RUN ] ftfBittleXProtocol.newCmd_overflow
1>ftBittleX::ftfBittleX::on_send
1>TX command : b32 32 32 32 32 ... 32 32 32 32
1>write count : 7522
1>G:\AppDev\Robotics\Petoi\test\ftBittleX\ftBittleX.cpp(423):
error : Value of: result
1> Actual: false
1>Expected: true
1>ftSerial.write FAILED expected: 7522 actual: 2304
1>expected : 7522
1>actual : 2304
1>G:\AppDev\Robotics\Petoi\test\ftBittleX\ftBittleX.cpp(654):
error : Value of: result
1> Actual: false
1>Expected: true
1>G:\AppDev\Robotics\Petoi\test\ftBittleX\ftprotocol.cpp(565):
error : Value of: on_command(command)
1> Actual: false
1>Expected: true
1>[ FAILED ] ftfBittleXProtocol.newCmd_overflow (308 ms)
I was surprised by the number of bytes sent and we saw the write failure message before in a previous test. First I cut down the data size to get the byte count closer to the buffer size (I'd figure out why the size was large later). But once I got the byte count down to just below 2507 I was still getting the write failure. So I needed to look more closely into that.
I recalled the posts about the problem of BittleX reading the Bluetooth input and how Bluetooth would breakup the data into chunks that caused a BittleX read timeout error. Dr. Petoi said the solution was to increase the timeout value for those commands and asked if I could try that and play with the value. Well, I did try it and matched the BittleX TIMEOUT_LONG value in the OpenCat_serial code and tested it and didn't see any effect.
But I didn't "play" with the value. This time I did by increasing the timeout value by 10 ms. And the result ... no more write failure!
So, my apologies for slacking off and not doing as I'm told.
Resolutions
From my POV, this issue is on the BittleX side where the UART microcode needs to manage the input queue during read. The write timeout is the time to wait for the receiving device to signal ready to receive the next byte. But it could be on the Windows side where it manages the output queue during write.
With that resolved I could explain the size of the data being sent. For an ASCII command like 'b' each data element is a number. An ASCII number has one byte for each digit in the number. I used a 2 digit number (32) for all elements. There is a space between each number. Thus I need 3 bytes for each element. The maximum number of elements I can have is then BUFF_LEN/3. That test gives
1>[ RUN ] ftfBittleXProtocol.newCmd_overflow
1>ftBittleX::ftfBittleX::on_send
1>TX command : b32 32 32 32 ... 32 32 32 32
1>write count : 2506
1>expected : 2506
1>actual : 2506
1>ftBittleX::ftfBittleX::on_response
1>Command completed (normal)
1>description : BEEP
1>cmd : b
1>id : 1
1>data : 32 32 32 32 ... 32 32 32 32
1> 12984 ms : RX_latency
1> 748 ms : RX_elapsed
1>response : 2 lines
1>Changing volume to 10/10
1>
1>b
1>
1>response : end
//...
1>[ OK ] ftfBittleXProtocol.newCmd_overflow (14792 ms)
To force the overflow condition I add 1 to the buffer size.
vector < int8_t > symphony(BUFF_LEN / 3 + 1, 32); // overflow
to get the output I'm looking for
1>[ RUN ] ftfBittleXProtocol.newCmd_overflow
1>ftBittleX::ftfBittleX::on_send
1>TX command : b32 32 32 32 ... 32 32 32 32
1>write count : 2509
1>expected : 2509
1>actual : 2509
1>ftBittleX::ftfBittleX::on_response
1>Command completed (normal)
1>description : BEEP
1>cmd : b
1>id : 1
1>data : 32 32 32 32 ... 32 32 32 32
1> 46 ms : RX_latency
1> 750 ms : RX_elapsed
1>response : 2 lines
1>OVFb
1>
1>response : end
This is progress. I see the overflow response and I hear the error beep. I also see confirmation of my calculation of the response time (750 ms). The RX_latency value is an indicator of how long it the code to detect the overflow.
But what about the rest of it? The overflow code replaces the command in error with the "standup" command and lets it get executed.
In order to get that I need act as if I sent the "standup" command and I'm looking for the response.
if (!HasFailure()) { // process overflow standup command
cmd_def_t cmd_standup{ SKILL, "k", "SKILL" };
cmd_standup += string("up");
EXPECT_TRUE(on_response(cmd_standup));
}
to get the output:
1>ftBittleX::ftfBittleX::on_response
1>Command completed (normal)
1>description : SKILL
1>cmd : k
1>id : 9
1>data : up
1> 0 ms : RX_latency
1> 23 ms : RX_elapsed
1>response : 3 lines
1>up
1>
1>k
1>
1>k
1>
1>response : end
We're not done yet. This demonstrates that an ASCII command overflows properly. It should be identical for a binary command. If so, then the test for the binary test should be identical to the ASCII test and trivial to write and has identical behavior,
TEST_F(ftfBittleXProtocol, BIN_newCmd_overflow) {
const unsigned BUFF_LEN{ 2507 }; // 1524 =125*20+7=2507
vector < int8_t > symphony(BUFF_LEN + 1, 32); // overflow
cmd_def_t command{ BEEP_BIN, "B", "BEEP" };
command += symphony;
unsigned latency{ 14000 };
EXPECT_TRUE(on_command(command, latency));
if (!HasFailure()) { // process overflow standup command
cmd_def_t cmd_standup{ SKILL, "k", "SKILL" };
cmd_standup += string("up");
EXPECT_TRUE(on_response(cmd_standup));
}
}
the results:
1>ftBittleX::ftfBittleX::on_send
1>TX command : 4220202020... 202020207e
1>write count : 2510
1>expected : 2510
1>actual : 2510
1>ftBittleX::ftfBittleX::on_response
1>Command completed (normal)
1>description : BEEP_BIN
1>cmd : B
1>id : 3
1>data : 32323232 ... 32323232
1> 46 ms : RX_latency
1> 749 ms : RX_elapsed
1>response : 2 lines
1>OVFB
1>
1>response : end
1>ftBittleX::ftfBittleX::on_response
1>Command completed (normal)
1>description : SKILL
1>cmd : k
1>id : 9
1>data : up
1> 0 ms : RX_latency
1> 228 ms : RX_elapsed
1>response : 3 lines
1>up
1>
1>k
1>
1>k
1>
1>response : end
More progress. I was also able to expose a known issue in the binary data stream. A binary command is terminated by the tilde character ('~'). I believe the documentation cautions you to avoid using a tilde as a data value. The obvious reason is that the tilde will be interpreted as the end of the command but the command processing will see it as a data value and "strange" things may happen.
As it turns out, my first version of the binary beep test used the same data length and values as the ASCII version
TEST_F(ftfBittleXProtocol, ASC_newCmd_overflow) {
const unsigned BUFF_LEN{ 2507 }; // 1524 =125*20+7=2507
vector < int8_t > symphony(BUFF_LEN / 3 + 1, 32); // no overflow
cmd_def_t command{ BEEP_BIN, "B", "BEEP_BIN" };
command += symphony;
//...
This won't overflow the newCmd buffer but is a malformed command. The buzzer tone never stops.
The data for the beep command comes in pairs, "note duration", We can calculate if the test data is a properly formed data stream by subtracting the command suffix ('~') from the length of the data stream we provide, (BUFF_LEN / 3 + 1). The result must be even value for a properly formed data stream. (BUFF_LEN / 3 + 1) - 1. = 836 -1 = 835.
This is more directly shown by simply using BUFF_LEN - 1 for the data
vector < int8_t > symphony(BUFF_LEN - 1, 32); // malformed
This yields 3 data points for testing:
vector < int8_t > malformed_symphony(BUFF_LEN - 1, 32);
vector < int8_t > no_overflow_symphony(BUFF_LEN, 32);
vector < int8_t > overflow_symphony(BUFF_LEN + 1, 32);
I don't recommend the running the malformed test. The buzzer never ends and "annoys the pig."
Here's the exciting footage of the T_BEEP buffer overflow test: https://youtu.be/vbEehKzfEt4
These results resolve two issues:
1. Why the Windows WriteFile call fails? (Thanks to Doc Petoi)
2. What happens when you do overflow newCmd buffer?
The Last Test
We need to do one last test before we put this issue to rest. We need to test when the T_SKILL_DATA command overflows the newCmd buffer the program doesn't crash. Logically, this should no different than the binary BEEP command but it is the issue under test so we want to explicitly test it. We just have to change the command we send; the data or its format doesn't matter. All that matters is the size of the data. The response should be the same as that for the binary BEEP command. There's an interesting twist to this test
TEST_F(ftfBittleXProtocol, SKILL_DATA_newCmd_overflow) {
const unsigned BUFF_LEN{ 2507 }; // 1524 =125*20+7=2507
vector nonsense(BUFF_LEN + 1, 32); // overflow
cmd_def_t command{ SKILL_DATA, "K", "SKILL_DATA" };
command += nonsense;
unsigned latency{ 14000 };
EXPECT_TRUE(on_command(command, latency));
if (!HasFailure()) { // process overflow standup command
cmd_def_t cmd_standup{ SKILL, "k", "SKILL" };
cmd_standup += string("up");
EXPECT_TRUE(on_response(cmd_standup));
}
}
The results confirm our analysis but uncover a problem in processing some commands
1>[ RUN ] ftfBittleXProtocol.SKILL_DATA_newCmd_overflow
1>ftBittleX::ftfBittleX::on_send
1>TX command : 4b20202020 ... 202020207e
1>write count : 2510
1>expected : 2510
1>actual : 2510
1>ftBittleX::ftfBittleX::on_response
1>Command completed (toggle)
1>description : SKILL_DATA
1>cmd : K
1>id : 10
1>data : 32323232 ... 32323232
1> 45 ms : RX_latency
1> 1027 ms : RX_elapsed
1>response : 4 lines
1>OVFK
1>
1>up
1>
1>k
1>
1>response : end
1>ftBittleX::ftfBittleX::on_response
1>ftBittleX::ftfBittleX::on_response elapse timeout!
1>description : SKILL
1>cmd : k
1>id : 9
1>data : up
1> 0 ms : RX_latency
1> 2028 ms : RX_elapsed
1>response : 1 lines
1>k
1>
1>response : end
When most commands complete they echo the command to the serial monitor. However, some commands, like T_SKILL_DATA, do some software "sleight-of-hand" and echo back one or more lowercase values of the command. The T_SKILL_DATA is one of these and normally responds with two 'k' values.
But it doesn't do that here. For overflow it responds with "OVFK". It then issues the "kup" command which does respond with two 'k' valuss when it completes. So the test harness becomes confused about the responses. This is part of why I'm not fond of behavior I didn't ask for. It's inconsistent. I think an argument can be made for the error handling to be elsewhere. But that's a design, not a testing, issue. The tests here confirm the expected behavior and not a crash.
Here's the dramatic and astounding video of the T_SKILL_DATA buffer overflow test:
https://youtu.be/DIS5Efdx258
A Possible Clue
As a result of the diagnosis of a possible hardware failure, I started to think about how to test the hardware to confirm the diagnosis. This brought back memories of my early days working in a computer manufacturing plant when I worked with various hardware components and microprocessors. It's been so long since then (40 years) that I've forgotten the day-to-day, simple tasks you do when a hardware issue comes up. But I suddenly recalled one of them, a task so simple you do it by reflex: visual inspection. You look at the component to see if there's a visible flaw. It takes a trained eye to see some flaws but the folks who make them know where to look and can spot them at a glance. The component under suspicion here is the power supply circuit. The documentation identifies where that circuit is located on the board.
So I took a picture of Malfunctioning Eddie's board:
Here's a enlarged view of the power circuit (if I've located it correctly):
My eye isn't educated enough to know if there's a visible flaw. My eye is drawn to the one of the two circular soldered connections. The one on the right; the upper right hand corner seems odd to me; as if a connection point has been uncovered due to solder fatigue or something. But Idunno.
But I do think there's someone who could quickly tell if there's an obvious flaw.