Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pynq_driver] fix device early return #7

Merged
merged 1 commit into from
May 1, 2020

Conversation

zhanghaohit
Copy link
Contributor

Bug

VTADeviceRun may return early before completion of the device execution

Solution

sleep a while before checking

@zhanghaohit
Copy link
Contributor Author

@tqchen @tmoreau89 Could you kindly help review this PR? Thanks.

@huajsj
Copy link
Contributor

huajsj commented Apr 29, 2020

@zhanghaohit, thanks for the contribution, do we know the root cause for this early return issue? a sleep solution may not always work and would impact performance, we better to avoid that.

@tmoreau89
Copy link
Contributor

Thanks @zhanghaohit for the fix. I'm curious when the early completion occurs as @huajsj pointed out.

Ideally the driver shouldn't rely on polling and instead use a more robust mechanism like triggering an interrupt.

@remotego
Copy link

remotego commented Apr 30, 2020

Thanks @huajsj @tmoreau89 for pointing out the potential performance issue for this fix.

The root cause of the issue is because on line 124, we instruct the VTA to start calculation by writing to the VTA_START bit of control register. The FPGA will then respond by changing its VTA_COMPUTE_DONE bit of control register from '1' to '0', indicating a "busy" state of the VTA hardware. And VTA will change it back to '1' after the completion of the calculation.

The original code created a racing condition here between the processor code and the FPGA hardware, by attempting to check the VTA_COMPUTE_DONE bit immediately after the issuing of "start" command. If we got a more powerful processor, the checking of VTA_COMPUTE_DONE bit will happen before FPGA could ever change it to '0'.

As @tmoreau89 mentioned, we are using a pooling approach here. And in my opinion, a short sleep before checking is quite common in the pooling world. Here we proposed a 1us sleep, which translates to around ~250 clock cycles on FPGA, it would give FPGA sufficient time to change VTA_COMPUTE_DONE from '1' to '0'. As 250 cycles are way shorter than any meaningful calculations, I believe there should not be any performance impact.

I agree with @tmoreau89 that interrupt is a more robust solution compare to pooling, but it is quite a huge task involving changes from both HW and driver archs.

@tmoreau89
Copy link
Contributor

Thank you @remotego for this logical explanation. The fix makes sense, and is highly likely not to affect performance negatively.

@@ -126,6 +127,10 @@ class VTADevice {
VTAWriteMappedReg(vta_compute_handle_, 0x0, VTA_AUTORESTART);
VTAWriteMappedReg(vta_store_handle_, 0x0, VTA_AUTORESTART);

// Allow device to respond
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add more explanation here to make sure that this sleep is well understood for posterity (the summarized comment by @remotego should suffice)

@huajsj
Copy link
Contributor

huajsj commented Apr 30, 2020

@remotego, Thanks for the detailed explain, but for this problem I think you may have a glance on related pynq driver logic, actually there is no "racing condition" between "processor" and "FPGA" , because they are checking on different register.

on line 124 the VTA_START get write into register with offset "0", and in line 134 the logic is go to check register with offset VTA_COMPUTE_DONE_RD_OFFSET which is "24", i think the work ground may work in some scenario but if we can not find root cause it may not reliable.

@remotego
Copy link

remotego commented Apr 30, 2020

@huajsj Thank you for the reply. Let me explain more on this issue.

The original name of 0x18 (24) register of Compute Module is XCOMPUTE_CONTROL_BUS_ADDR_DONE_O_DATA, it is a output from the FPGA hardware. From the point of view of S/W, it is a read-only register. Thus there is no way for software (i.e. driver) code to change the content of it. And this register is fully controlled by FPGA H/W.

impl/ip/drivers/compute_v1_0/src/xcompute_hw.h

// 0x18 : Data signal of done_o
//        bit 31~0 - done_o[31:0] (Read)

If we trace the controlling of this register in h/w code, we could see the register will be set to '0' at the beginning compute module, and sets to '1' only when a FINISH instruction is encountered.

void compute(
...
  // Set done value
  done = 0;
  // Perform action based on opcode
  if (insn.generic.opcode == VTA_OPCODE_FINISH) {
    // Set done flag if we reach a FINISH instruction
    done = 1;
  }

When we start the VTA hardware by this code block,

    VTAWriteMappedReg(vta_fetch_handle_, 0x0, VTA_START);
    VTAWriteMappedReg(vta_load_handle_, 0x0, VTA_AUTORESTART);
    VTAWriteMappedReg(vta_compute_handle_, 0x0, VTA_AUTORESTART);
    VTAWriteMappedReg(vta_store_handle_, 0x0, VTA_AUTORESTART);

The fetch module will dispatch the first compute instruction to compute module, and module will then set the Done register to '0'.

However, at the same time, the driver code will attempt to check the value of Done register in a pooling loop. And if the Done register is equal to 1. the driver will break and return.

    if (flag == VTA_DONE) break;

Thus we have a racing condition here:

|START| ->|FPGA Compute module start|->|Done -> '0'|->|Other Computations...|->|Done -> '1'|
   |                                                       |
   ->|Driver code attempts to check "done" register ??ms |->

@huajsj
Copy link
Contributor

huajsj commented Apr 30, 2020

@remotego , thanks for the detailed explain, the logic make sense, code LGTM.

Copy link
Contributor

@tmoreau89 tmoreau89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed response and explanation. This fix is good to go.

@tmoreau89 tmoreau89 merged commit 21937a0 into apache:master May 1, 2020
@remotego remotego deleted the bugfix/earlyret branch December 15, 2020 09:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants