Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tape read failure #4

Open
larsbrinkhoff opened this issue Nov 9, 2016 · 10 comments
Open

Tape read failure #4

larsbrinkhoff opened this issue Nov 9, 2016 · 10 comments
Labels

Comments

@larsbrinkhoff
Copy link
Member

larsbrinkhoff commented Nov 9, 2016

There seems to be some problem reading from tapes. This is an extract from my configuration file (based on run/ksits/klh10-pi.ini):

devdef mta0 rh1.0 tm02 fmtr=tm03 type=tu45
devmo mta0 minsys.tape
load @.nsalv-260-u
go

minsys.tape is a tape image created with itstar. It works fine with SIMH. @.nsalv-260-u is the ITS disk salvager from run/ksits.

When I type go and then mark<ESC>g, the disk formatting procedure works ok. (Answer 0, y, 0, n, 3000, foobar.)

But next, if I type tran<ESC>g to restore files from a tape, things sometimes go badly. (Answer 0, y.)

On some of my host machines, the start of a file is found, but then I get

Tape offline or controller not ready
*** ERROR *** THE SYSTEM MAY NOT BE BROUGHT UP

On some other host machines, sometimes all files are restored, or the error appears in the middle of the tape.

If we look into the source code of NSALV 260, we see

MREAD2: MOVEI B,<%TMD16_8.>+<%TMFCD_4.>+TMBTS
        IOWRI B,%TMTC           ;Tell TM03
        MOVEI B,%TMNOP          ;You may need this to set status values
        IOWRI B,%TMCS1          ; but I'm not really sure
        IODRI B,%TMCS1
        TRNE B,%TM1GO           ;Wait till done
         JRST .-2
        IODRI B,%TMFS           ;Get status
        TRNE B,%TMSOL           ;On line?
         TRNN B,%TMSFR          ;Formatter ready?
          LOSE "~&Tape offline or controller not ready"
  • %TMTC is the tape control register at 772432.
  • %TMCS1 is control and status 1 at 772400.
  • %TMFS is formatter status at 772412.

So presumably, the tape controller is ready, but the tape formatter is not.

@Rhialto
Copy link
Member

Rhialto commented Nov 9, 2016

Does the problem occur on faster machines? I'm asking because in the past there has already been a problem with that (with other devices though) with a sort of workaround, as in commit 79e1a9e .
Although, on the other hand, the network, disk and tape use a separate process and InterProcessCommunication to handle the work, so this may work out differently.

@larsbrinkhoff
Copy link
Member Author

larsbrinkhoff commented Nov 9, 2016

It may be worst on my fastest machine.

I rebuilt KLH10 with those changed options, but I don't think it made any difference.

Using NSALV to restore files works very reliably in SIMH, so I'll try to add some debug logging to both emulators to see what the difference is.

@larsbrinkhoff
Copy link
Member Author

larsbrinkhoff commented Nov 9, 2016

Looking at the TM03 tape controller right before the failure. This is from KLH10 with debug=1 passed to the mta0 devdef.

Tape #1
.;DSKDMP RP06
[tm03_rdreg: r1/ 14400]
[tm03_wrreg: r11/ 2000 = 2000]
[tm03_wrreg: r0/ 4070 = 1]
[tm_cmdxct: 01]
[tm03_rdreg: r0/ 4000]
[tm03_rdreg: r1/ 14400]
Tape offline or controller not ready
*** ERROR *** THE SYSTEM MAY NOT BE BROUGHT UP

The last rdreg is interesting. It has the "tape online" bit set, but the "formatter ready" bit is off. Compare with SIMH:

Tape #1
.;DSKDMP RP06
TU read MTFS(r5): 10650
TU write MTTC(r15) 2000
TU write MTCS1(r0) 1
TU read MTCS1(r0): 4200
TU read MTFS(r5): 10650

There is a 6 in the last line where there was a 4 in KLH10. That't the "formatter ready" bit.

@larsbrinkhoff
Copy link
Member Author

larsbrinkhoff commented Nov 9, 2016

So apparently, the TM_WCF and TM_RDF case in tm_cmdxct turns off the ready bit, TM_SRDY. Now, if I just comment out that line, everything works fine! Not saying that is the right fix, of course.

@Rhialto
Copy link
Member

Rhialto commented Nov 9, 2016

I have just pushed a fix so that you can compile with
-DKLH10_DEV_DPTM03=0 (in mk/top-ks-its.mk). This eliminates the device
process for the tape and changes a lot of timings and such. You should
give that a try. It never turns off the TM_SDRY bit, if I read the code
correctly:

    sts |= TM_SDRY;     /* Assume "slave" always there */

Apart from that, in this code that you mentioned,

    case TM_WCF:    /* Write Check Forward (same as Read Forward) */
    case TM_RDF:    /* Read Forward */
...

#if KLH10_DEV_DPTM03
    if (!tm_io(tm, 1)) {        /* Errors handled by tm_io now */
        TMREG(tm, RHR_CSR) &= ~TM_1GO;  /* Turn off GO */
    }
    TMREG(tm, RHR_STS) &= ~TM_SDRY; /* Drive busy, note GO still on! */
#else
    (void) tm_io(tm, 1);    /* Errors handled by tm_io now */
    tm_cmddon(tm);      /* Do post-xct stuff, includes clearing GO */
#endif
    return;     /* IO operations don't trigger ATTN when done */

it smells like the clearing of TM_SDRY races with tm_io(), where it
causes messages to be exchanged with the device process. It may well be
that the command is executed before the bit is cleared, and then it
likely won't be set again (although I didn't trace that in detail).

When the device process signals, with SIGUSR1, that the command is
completed, tm03_evhsdon() is called, which calls tm_cmddon() which
calls ts_ssta() which sets TM_SDRY. I'm not sure if this is called
actually from the signal handler, i.e. asynchronously.

In any case, it looks safer to invert the order:
try putting the line that clears TM_SDRY before the call to tm_io().
There are other places where it is done in that order, and multiple
places where the order seems suspect.

@Rhialto
Copy link
Member

Rhialto commented Nov 10, 2016

Alternatively, if the signal handler doesn't directly call the
completion routine, in which case I'm looking for cases where GO is
turned off, and TM_SDRY is not turned back on by that time.

It is in theory possible that tm_io() returns 0, in which case GO is
cleared (indicating the command has completed) (and no message to the DP
is sent), but also TM_SDRY is cleared. And that one won't be set again
for a while (since the DP won't signal anything back). At least not
until tm_ssta() is called.

So in this case, the code (also in a few instances) should be changed
such that it reads

#if KLH10_DEV_DPTM03
if (!tm_io(tm, 1)) { /* Errors handled by tm_io now /
TMREG(tm, RHR_CSR) &= ~TM_1GO; /
Turn off GO /
} else {
TMREG(tm, RHR_STS) &= ~TM_SDRY; /
Drive busy, note GO still on! */
}
#else

(which of course is incompatible with my previous suggestion to try, but
it has the advantage that the comment "note GO still on" is more
accurate in this case)

Now it seems unlikely that tm_io() randomly returns 0 from time to time.
That's a weak point of this theory.

@larsbrinkhoff
Copy link
Member Author

Some more testing with an unmodified KLH10:

  • On my fast home machine, it fails consistently at the first tape file.
  • In a Linux running in VirtualBox, it sometime works ok, sometimes fails at the first file, and more rarely fails in the middle.

This seems consistent with a timing-dependent race condition.

I don't quite have the knowledge to see what the best fix would be. Are there some tests you want me to run? I can insert logging to view the state changes for the GO and SRDY bits; in both the failing and successful scenarious.

@Rhialto
Copy link
Member

Rhialto commented Nov 10, 2016

When the DP (device process) has processed a command it sends a signal.
Given that I tend to think now that the signal just sets a flag, which gets checked in some specific locations (with INSBRKTEST()) there can't really be a race condition as in comment 259556278 but more likely something like the next one.
So one thing to log could be all places where GO gets turned off and check if SRDY is not on.

The normal completion path would be something like

  • Signal SIGUSR1 gets received
  • Some INSBRKTEST() detects this and apr_check() gets called, which calls dev_evcheck().
  • This calls the registered handler in dvtm03,c: tm03_evhsdon().
  • tm03_evhsdon() sets tm->tm_state = TM03_ST_READY; (which should later cause the SRDY bit to be turned on) and calls tm_cmddon().
  • For TM_RDF which is probably the most used command in this scenario, basically only tm_flsbuf() gets called.
  • tm_flsbuf() eventually calls tm_ssta()
  • and finally tm_ssta() detects that tm->tm_state == TM03_ST_READY and sets the SDRY bit. The idea seems to be that SDRY tracks if the device proc is busy processing a command. If in hardware the SDRY doesn't track if the formatter is busy doing something, this may be a flawed idea, but I can't imagine KLH would do it that way then.
  • Back in tm_cmddon(), at the end, the GO bit gets turned off.

Somewhere this process must go wrong.

Now in the tape reading code you show way above, if I read it correctly, the actual command isn't RDR but NOP. NOP always turns off GO right away and doesn't even involve the device process, and doesn't mess with SRDY. So it might be some interplay between the RDR command and the NOP
It can't be that the NOP command gets given before RDR completes, because you can't even write a new command while GO is still on from the previous one (https://github.com/Rhialto/klh10/blob/master/src/dvtm03.c#L1154).

@Rhialto
Copy link
Member

Rhialto commented Nov 10, 2016

I see that the TM_ER3 command gets sent to the device proc, but on completion, tm_cmddon() doesn't call tm_ssta() for it. So this command would seem to leave the SRDY bit off.

@larsbrinkhoff
Copy link
Member Author

larsbrinkhoff commented Nov 10, 2016

-DKLH10_DEV_DPTM03=0 works fine, so I'm happy to use that as a workaround. I'm just not sure how to pass it to the build system.

  • I tried configure CPPFLAGS=-DKLH10_DEV_DPTM03=0, but CPPFLAGS is overridden by CONFFLAGS on the gcc command line.
  • I tried export CONFFLAGS_AUX=-DKLH10_DEV_DPTM03=0 before running configure, but that's ignored.

EDIT: fixed by adding the new CONFFLAGS_USR thing.

@Rhialto Rhialto added the bug label May 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants