implement automatic interlinear for strong-versions #12

chriswep · 2019-01-31T16:08:49Z

@david Instone-Brewer you mentioned in private chat that we can be sure that in any translation the order of instances of the same strong would be the same as in the original. i'm still not convinced by this - but since you are the pro i give it the benefit of the doubt: if we work with that assumption i wonder if we can go a different route altogether concerning morphology and strongs. what about having one original version as the only source of truth (which would be compiled of TANTT + TOTHT)? Doing that we wouldn't need to save morphology with the version and we wouldn't need to save a strongs-index with the version. If we would then update the original it would automatically update all translations that have strongs.

David Instone-Brewer [3 months ago]
@chris Metz Perhaps I should add a few caveats about the assertion that we can assume the same order in text and translation.
I'm referring to words with the same Strongs number in a text and in its translation.
So, if a verse uses the same word twice, we'd expect the translation to use them in the same order as the text. For a silly example: "He said to Jesus, 'Lord Jesus, help me'"
Now, a translation COULD have "'Lord Jesus, help me', he said to Jesus." but it would be strange. And this is an extreme example, where the two words are very close to each other, in phrases that could be swapped round.
BUT I haven't tested this idea, so I don't know how well it works in practice.
So I too am allowing it the benefit of the doubt - though I think it is a fair bet.
The big exception is when there is a variant - when part of a verse is missing in some MSS. This DOES result in some identical words getting mixed up - though in practice this only affects words such as "the" and "his" - ie words that are likely to occur frequently in a single verse.
This means it is fairly important to identify the Greek text behind a translation. This kind of problem doesn't occur in the Hebrew OT.
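
A minimal sketch of the same-order assumption, in Python: the n-th tagged occurrence of a Strong's number in the translation is paired with the n-th occurrence of that number in the source. The function name, the tuple layout and the morphology codes are illustrative assumptions, not the format of TANTT/TOTHT or of bibleengine.

```python
from collections import defaultdict

def align_by_occurrence(source, translation):
    """Pair each tagged translation word with the source word that has the
    same Strong's number and the same occurrence index within the verse.

    source:      list of (strong, morphology) tuples in source-text order
    translation: list of (word, strong_or_None) tuples in translation order
    """
    # Remember where each Strong's number occurs in the source, in order.
    positions = defaultdict(list)
    for i, (strong, _morph) in enumerate(source):
        positions[strong].append(i)

    seen = defaultdict(int)  # occurrences of each number consumed so far
    aligned = []
    for word, strong in translation:
        if strong is None or seen[strong] >= len(positions[strong]):
            aligned.append((word, strong, None))  # untagged or unmatched
            continue
        src_index = positions[strong][seen[strong]]
        seen[strong] += 1
        aligned.append((word, strong, source[src_index][1]))
    return aligned

# "He said to Jesus, 'Lord Jesus, help me'" with made-up morphology codes.
source = [("G3004", "V-1"), ("G2424", "N-DSM"), ("G2962", "N-VSM"),
          ("G2424", "N-VSM"), ("G997", "V-2")]
translation = [("He", None), ("said", "G3004"), ("to", None),
               ("Jesus", "G2424"), ("Lord", "G2962"), ("Jesus", "G2424"),
               ("help", "G997"), ("me", None)]
aligned = align_by_occurrence(source, translation)
```

Here the first "Jesus" receives the first G2424 morphology and the second "Jesus" the second; if a translation swapped the two phrases, the pairing would silently be wrong, which is exactly the caveat above.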

David Instone-Brewer [3 months ago]
I do like your idea of having a single OT+NT text, but there are some BIG problems wrt morphology, because of variants. The NT texts not only have different wording, but very often different morphology for the same words.

Dan Bennett [3 months ago]
Oooh, tricky

Chris Metz [3 months ago]
I see. I had a longer look at the TANTT dataset. If we take the list of strongs in a translation for a given verse we should be able to infer the type of text that was used using the information in the dataset, right? Can we assume that a translation follows only one text type within a verse? I think i came across an example when they didn’t.. though I’m not sure.

David Instone-Brewer [3 months ago]
Bibles tend to fall into two camps: those that use the so-called Textus Receptus (ie the best text available to the KJV translators) and those that follow modern texts (ie NA, SBLGNT or THGNT which all more-or-less agree about the original text). Translations also take some extra decisions about whether to include things like the end of Mark and the forgiven adulteress in John 8. A few perversely use the modern texts but also fill in the so-called 'missing verses' which were duplications in older MSS.
So it would be fairly easy to construct some rules to figure out which of these options a translation was using.

chriswep · 2019-01-31T16:18:22Z

@DavidIB i'm putting this slack-conversation up here so that we can catch up where we left on this. in between you posted:

"I'm working on something Chris put me onto: A set of lexical and morphology tags with rules so that any tagged NT (in any language) can get morphology added."

what is the state on this? is there some new ruleset coming up that we can use to detect the source-text-type to match translation-word+strong => source-word+morph/lemma/etc.?

DavidIB · 2019-02-01T15:05:28Z

I agree that this idea is not foolproof, though it is worth looking at to see how many potential instances of problems actually exist - I suspect there aren't too many.
Here's the type of examples where a problem could occur:

There is a separate Strongs value for similar words if they have a different grammatical form. So (to use English) the noun "belief" could be 1000 and the verb "believe" could be 1001, but "believer" would also be 1001 because this is a participle from the verb "to believe". So if we have a sentence like "a believer believes this belief" the numbers would be 1001, 1001, 1000. In some languages the verb comes before the subject. This happens so often that there's a term for it: VSO - Verb Subject Object - and of course Hebrew is VSO. In that case the two 1001 words swap places, so morphology assigned in order would land on the wrong word.

To see if this is a common problem, we need to look for places where the same Strong number occurs twice, next to or close to each other, with a different grammatical force.
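
One way to run that check, as a rough sketch: scan each source verse for pairs of words that share a Strong number, differ in morphology, and sit within a few words of each other. The `max_distance` window, the function name and the morphology labels are assumptions made for illustration.

```python
from itertools import combinations

def risky_repeats(verse, max_distance=3):
    """Return (i, j, strong) for every pair of positions where the same
    Strong number recurs within max_distance words with a different
    morphology.

    verse: list of (strong, morphology) tuples in source order
    """
    hits = []
    for (i, (s1, m1)), (j, (s2, m2)) in combinations(enumerate(verse), 2):
        if s1 == s2 and m1 != m2 and j - i <= max_distance:
            hits.append((i, j, s1))
    return hits

# "a believer believes this belief", using the made-up numbers 1000/1001.
verse = [("1001", "participle"), ("1001", "verb"), ("1000", "noun")]
hits = risky_repeats(verse)
```

Counting the hits over the whole source text would show how often the problem can arise at all.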

Sometimes a word may occur three or more times in a verse but it is translated fewer times to reduce the repetition. For example, if a verse said:
"The man found him and the man saw him hit a man" most translations would contract this to
"The man found him and saw him hit a man"
The morphology would have "man" as being a Subject twice and Object the third time, but a translation would have only two Strongs numbers for "man", so if the morphology went in order, they'd both be Subjects.

To see if this is a common problem, we could look at a few tagged Bibles and look for instances where the number of tagged words with the same number is reduced from the original.
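
A small sketch of that comparison: count each Strong number in the source verse and in the tagged translation, and report the numbers that occur fewer times in the translation. The tag names below are placeholders, not real Strong numbers.

```python
from collections import Counter

def dropped_strongs(source_strongs, translation_strongs):
    """Map each Strong number to how many of its source occurrences are
    missing from the translation."""
    source_counts = Counter(source_strongs)
    translation_counts = Counter(translation_strongs)
    # Counter subtraction keeps only the positive differences.
    return dict(source_counts - translation_counts)

# "The man found him and the man saw him hit a man"
source_strongs = ["MAN", "FIND", "MAN", "SEE", "HIT", "MAN"]
# "The man found him and saw him hit a man"
translation_strongs = ["MAN", "FIND", "SEE", "HIT", "MAN"]
dropped = dropped_strongs(source_strongs, translation_strongs)
```

Running this over a few tagged Bibles and tallying the results per Strong number would show which words are affected most.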

Despite these two problems, I think your idea is a VERY good one.
The first issue is, I think, likely to exist very rarely. I'll have to check.
The second issue will be very common with definite articles and pronouns, so I'd suggest that these should remain deliberately untagged in this automatic tagging. No-one will miss this because if they were interested in the grammar of the articles, they'd be looking at a Greek text, which would be tagged properly.
There will still likely be some problems, esp with particles, because they aren't always translated. We DO want to tag these because people want to know, for example, if "in" is a translation of "εν" (in) or "εις" (which has more of a force of 'into').
There is, I think, a way to avoid this confusion: we can add missing tags at the right relative position for words that are missing, by putting the tag round a space. The process to figure out where to put the tag isn't too difficult. The following should work:

look for verses with too few instances of a particular Strong number (ignoring articles & pronouns)
note the Strong number of the next noun in the original (which should almost always match the next noun in the translation) or note that it comes after the last noun. This will tell you which of the originals is untagged in the translation
add the tag around a space in front of the noun where it is absent, or at the end of the verse if it is after the last noun.

This should work for all words except articles and pronouns - which in any case we don't want to tag in an automatic process.
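
The three steps could be sketched roughly as follows. This is only one reading of the procedure: it assumes the source carries a part-of-speech label per word, that nouns in the translation can be recognised by their Strong number, and that an untagged occurrence is represented by a `(" ", strong)` placeholder token. All names are hypothetical.

```python
from collections import Counter

def next_noun(tokens, start, noun_strongs):
    """Strong number of the first noun after position `start`, or None."""
    for _word, strong in tokens[start + 1:]:
        if strong in noun_strongs:
            return strong
    return None

def add_placeholders(source, translation, skip=frozenset()):
    """source: list of (strong, part_of_speech) in source order
    translation: list of (word, strong_or_None) in translation order
    skip: Strong numbers to leave alone (articles, pronouns)"""
    noun_strongs = {s for s, pos in source if pos == "noun"}
    src_tokens = [(None, s) for s, _pos in source]  # same shape as translation
    out = list(translation)
    # Step 1: Strong numbers with too few instances in the translation.
    deficit = Counter(s for s, _ in source) - Counter(s for _, s in translation if s)
    for strong in deficit:
        if strong in skip:
            continue
        cursor = 0
        for i, (s, _pos) in enumerate(source):
            if s != strong:
                continue
            # Step 2: the noun that follows this occurrence in the source.
            anchor = next_noun(src_tokens, i, noun_strongs)
            j = next((k for k in range(cursor, len(out)) if out[k][1] == strong), None)
            if j is not None and next_noun(out, j, noun_strongs) == anchor:
                cursor = j + 1  # this occurrence is present in the translation
                continue
            # Step 3: tag a space in front of that noun, or at the verse end.
            if anchor is None:
                pos = len(out)
            else:
                pos = next((k for k in range(cursor, len(out)) if out[k][1] == anchor), len(out))
            out.insert(pos, (" ", strong))
            cursor = pos + 1
    return out

# Source "in house and in city", translated as "in the house and city".
source = [("G1722", "prep"), ("G3624", "noun"), ("G2532", "conj"),
          ("G1722", "prep"), ("G4172", "noun")]
translation = [("in", "G1722"), ("the", None), ("house", "G3624"),
               ("and", "G2532"), ("city", "G4172")]
patched = add_placeholders(source, translation)
```

In this example the second "in" is missing from the translation, so a tagged space is placed directly in front of "city".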

chriswep · 2019-02-01T16:03:36Z

thanks @DavidIB - would you be willing to do some more research on the "To see if this is a common problem, we need to look ..." issues so that we have a more detailed understanding of the problem space before implementing it?

Apart from this, as far as i understand there is still the problem of different source texts, right?

chriswep · 2019-02-01T16:09:03Z

first idea that comes to mind:

compile a list of the strongs in the translation verse
compare this list with the list of strongs for each variant in TANTT dataset
choose the source type of the variant that has all of the translations strongs (including duplicates) and has the least additional ones

Would that work?
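
As a rough sketch of those three steps (the shape of the per-variant data is an assumption; the TANTT dataset itself is not laid out like this):

```python
from collections import Counter

def choose_source_type(translation_strongs, variants):
    """Pick the source type whose verse contains every strong of the
    translation (duplicates included) with the fewest additional ones.

    variants: dict mapping source type -> list of strongs for the verse
    Returns None if no variant covers the translation.
    """
    wanted = Counter(translation_strongs)
    best = None
    for source_type, strongs in variants.items():
        available = Counter(strongs)
        if wanted - available:  # some translation strong is not covered
            continue
        extra = sum((available - wanted).values())
        if best is None or extra < best[1]:
            best = (source_type, extra)
    return best[0] if best else None

# Placeholder strongs; the "TR" verse has two words the "NA" verse lacks.
variants = {"NA": ["G1", "G2", "G3"], "TR": ["G1", "G2", "G3", "G4", "G5"]}
long_reading = choose_source_type(["G1", "G2", "G3", "G4"], variants)
short_reading = choose_source_type(["G1", "G2"], variants)
```

The first call can only be satisfied by "TR"; the second is covered by both, so the variant with fewer additional strongs ("NA") wins.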

DavidIB · 2019-02-02T10:26:22Z

Yep, but I'd suggest something more generic, cos I'm thinking that we could add a vocabulary+morphology feature to untagged texts - like in the web STEPBible. That is, if we knew which text the translation is based on, we could give the original+context sensitive gloss+morphology for every word in the verse, so readers could figure for themselves which word translates which.
To identify the text type generically, I'd use comparison of verse length, similar to the method for identifying versification.

chriswep · 2019-02-02T11:24:05Z

Currently i save the sourcetype anyway for every verse that was matched by a v11n rule. So if you would complete the ruleset to capture any verse in any bible (with nothing in the action column), every bible imported into bibleengine would be completely sourcetype-tagged :-)

We could make the ruleset smaller by defining default values (on testament, Book or verse level).

Would a ruleset like this be possible?

chriswep · 2019-02-02T11:27:06Z

I'm wondering: verse numbering and source text used are not necessarily the same, right? Something to look out for?

chriswep · 2019-02-02T11:28:18Z

Also: would assigning a source type to an equivalence translation even make sense?

DavidIB · 2019-02-05T09:34:43Z

These are great ideas. If I understand you correctly you are suggesting:

Marking source text along with versification type.
Marking up the source a book at a time, rather than verse-by-verse
Look out for problems wrt versification differences in sources
Remember that tagging may not be the same for a different source
Fortunately I already have the data that will avoid the last two problems. I have already figured out all the major variants and the verse division differences. There's rather a lot of these verse division problems because the first work on this was done by marking up a text while riding on horseback. Later editors found that the divisions were sometimes illogical - I like to imagine the horse was to blame for a lot of that!
I think it makes sense to mark source by Testaments - in that most Bibles will be translated from one particular text for OT and one for NT. However, people tend to cheat. A lot of Bibles based on NA include bits from other sources - eg some missing phrases in Hebrew that are in the LXX, the end of Mark and story of the forgiven adulteress - and on the whole I agree with this policy. But it means we have to test these passages individually. I already have a list of them.
I'm not sure about storing the information with the versification data, cos they aren't likely to cover the same chunks of text. But I'll leave you to figure that out.
The main issue is finding time! I'm rather behind with my work on auto-tagging of Swahili, and the deadline for my next book is coming up (I'm currently on my third title in a deal where I write one every 4 months - I hit the last two deadlines but this one on Bible & science is taking longer). But what you are suggesting is really exciting. I must find some time for it!

chriswep · 2019-02-05T10:41:58Z

Marking source text along with versification type.

since we have a rule-system and an implementation in place, yes this would be the idea. however as i questioned above: is v11n-source and text-source (in the sense of the actual source words) compatible? i can imagine translators using a specific text-source in a verse while they number the verse according to a different versification scheme to the text-source.

Marking up the source a book at a time, rather than verse-by-verse

As you mentioned there might be verse-level exceptions but most verses within a translation are probably the same source type. so our rule-schema should support defining a rule for a testament or a book (or chapter?) as well as verses. the most specific matching rule for a verse will be chosen then.
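
A minimal sketch of "most specific matching rule wins", assuming rules are keyed by a scope string such as "NT", "1Jn", "1Jn.5" or "1Jn.5.7" (the key format and book abbreviations are made up for the example):

```python
def source_type_for(testament, book, chapter, verse, rules, default=None):
    """Return the source type of the most specific rule covering a verse.

    rules: dict mapping a scope (verse, chapter, book or testament)
           to a source type
    """
    # Ordered from most specific to least specific.
    candidates = [
        f"{book}.{chapter}.{verse}",
        f"{book}.{chapter}",
        book,
        testament,
    ]
    for scope in candidates:
        if scope in rules:
            return rules[scope]
    return default

rules = {"NT": "NA", "1Jn.5.7": "TR"}
exception = source_type_for("NT", "1Jn", 5, 7, rules)
ordinary = source_type_for("NT", "1Jn", 5, 6, rules)
```

With a testament-level default and a handful of verse-level exceptions the rule table stays small, while every verse still resolves to exactly one source type.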

Fortunately I already have the data that will avoid the last two problems. I have already figured out all the major variants and the verse division differences. There's rather a lot of these verse division problems because the first work on this was done by marking up a text while riding on horseback. Later editors found that the divisions were sometimes illogical - I like to imagine the horse was to blame for a lot of that!

do you mean that verse boundaries can shift between translations by a few words or a sentence? so by a comparison of verse lengths you would be able to tell which original words are in a verse?

I think it makes sense to mark source by Testaments - in that most Bibles will be translated from one particular text for OT and one for NT. However, people tend to cheat. A lot of Bibles based on NA include bits from other sources - eg some missing phrases in Hebrew that are in the LXX, the end of Mark and story of the forgiven adulteress - and on the whole I agree with this policy. But it means we have to test these passages individually. I already have a list of them.

questions that come to my mind concerning the rules:

how would we be able to figure out the "base-source" of a translation (is it like looking at a few common places and comparing verse lengths there?)
wouldn't we need to have a rule/test for each (major) variant that exists (since translators could have chosen a different source-texts in any of those cases potentially)?
i can understand how you can test for verse existence or compare verse lengths. But a lot of variants in the source wouldn't affect verse length right? (replaced word, different tense)

I'm not sure about storing the information with the versification data, cos they aren't likely to cover the same chunks of text. But I'll leave you to figure that out.

well i just say that IF v11n and text source are compatible categories then it might make sense to consider that - you would "just" need to complete the dataset. However i think i am not competent to answer that - and i have the feeling (as mentioned above) that there are differences and it might be better to separate the whole issue.

chriswep · 2019-02-05T10:52:34Z

@DavidIB i updated the comment above so please look at the current one on github

DavidIB · 2019-02-05T13:06:23Z

is v11n-source and text-source (in the sense of the actual source words) compatible? i can imagine translators using a specific text-source in a verse while they number the verse according to a different versification scheme to the text-source.

You are right to be cautious. I don't think there is much overlap between these two. However, I guess that you are marking EVERY verse with a versification type (Standard, Hebrew, Latin or Greek). So when there is no rule about a verse, it is marked as Standard by default.
BTW, don't bother to mark them with combinations such as "Hebrew+Latin". Mark them as one, in the order: Standard, Hebrew, Latin, Greek, other.
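
Read literally, that ordering could be applied like this (a sketch; the function name and the handling of unknown labels are assumptions):

```python
PRIORITY = ["Standard", "Hebrew", "Latin", "Greek"]

def single_v11n_type(label):
    """Reduce a combined label such as "Hebrew+Latin" to one type,
    using the order Standard, Hebrew, Latin, Greek, other."""
    parts = [part.strip() for part in label.split("+")]
    for name in PRIORITY:
        if name in parts:
            return name
    return "other"

hebrew_latin = single_v11n_type("Hebrew+Latin")
greek_latin = single_v11n_type("Greek+Latin")
```

The result depends only on the priority list, not on the order in which the types are written in the combined label.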

our rule-schema should support defining a rule for a testament or a book (or chapter?) as well as verses. the most specific matching rule for a verse will be chosen then.

Yep, but like I said, many Bibles use NA throughout and then deviate for a couple of verses - esp. at places like 1John.5.7-8. So you prob want an array with EVERY verse.

do you mean that verse boundaries can shift between translations by a few words or a sentence? so by a comparison of verse lengths you would be able to tell which original words are in a verse?

Yep, it really is as messy as that. Fortunately, these differences aren't usually a matter of "pick'n'choose". Translations normally follow one set or another. So we can look for an instance where there is a big difference and assume this applies to places where there are small differences.

how would we be able to figure out the "base-source" of a translation (is it like looking at a few common places and comparing verse lengths there?)

Yep.

wouldn't we need to have a rule/test for each (major) variant that exists (since translators could have chosen a different source-texts in any of those cases potentially)?

For major ones, yes we want to test them individually.

i can understand how you can test for verse existence or compare verse lengths. But a lot of variants in the source wouldn't affect verse length right? (replaced word, different tense)

Yep - a lot of variants (eg "anger" or "compassion" at Mark 1.41) depend on one word. There's not much we can do about this except assume they are following the text they 'normally' follow - based on tests in verses that we can test.

well i just say that IF v11n and text source are compatible categories then it might make sense to consider that - you would "just" need to complete the dataset. However i think i am not competent to answer that - and i have the feeling (as mentioned above) that there are differences and it might be better to separate the whole issue.

the issue will be separate, but I think the key concept that you introduced is that we should try to create rules and store the result in a similar way to the versification process.

chriswep · 2019-02-05T13:59:16Z

You are right to be cautious. I don't think there is much overlap between these two. However, I guess that you are marking EVERY verse with a versification type (Standard, Hebrew, Latin or Greek). So when there is no rule about a verse, it is marked as Standard by default.
BTW, don't bother to mark them with combinations such as "Hebrew+Latin". Mark them as one, in the order: Standard, Hebrew, Latin, Greek, other.

so, is my understanding correct that you confirmed that we have two distinct types of source: v11n-source-type (Standard, Hebrew, Latin, Greek) and text-source (i guess this would be variant families / greek texts like NA, Tyndale, Majority, etc..?)

concerning the v11n source: if there is "Greek+Latin" in the ruleset, did i understand you right that i should choose "Latin" in that case, since it has higher priority (and not the first mentioned)?

the issue will be separate, but I think the key concept that you introduced is that we should try to create rules and store the result in a similar way to the versification process.

so the conclusion for now is that you will - whenever you find the time - create a ruleset that will enable us to determine the source text for any given verse in any translation. correct?

something like this?

NT        NA    Joh.1.5 > Joh.1.6 & Rom.3.2 < Rom.2.1
NT        TY    Joh.1.5 < Joh.1.6 & Rom.3.2 > Rom.2.1
Mt.2.5    NA    Mt.2.5 > Mt.3.3
Mt.2.5    TY    Mt.2.5 < Mt.3.3

We could / should also define a default text (NA?) that is assumed when no rules match, which will also reduce the number of rules.
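
A sketch of how such rows might be evaluated, assuming each row is (scope, source type, condition), that `A > B` means "verse A is longer than verse B in this translation", and that length is measured in characters. None of this is an agreed format; the verse texts below are dummy strings of a chosen length.

```python
import re

CLAUSE = re.compile(r"\s*(\S+)\s*([<>])\s*(\S+)\s*")

def normalise(ref):
    # Treat "Joh.1:5" and "Joh.1.5" as the same reference.
    return ref.replace(":", ".")

def condition_holds(condition, verse_text):
    """True if every 'A > B' / 'A < B' clause holds for the verse lengths.

    verse_text: dict mapping a reference such as "Joh.1.5" to its text
    """
    for clause in condition.split("&"):
        match = CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f"cannot parse clause: {clause!r}")
        left, op, right = match.groups()
        a = len(verse_text.get(normalise(left), ""))
        b = len(verse_text.get(normalise(right), ""))
        if not (a > b if op == ">" else a < b):
            return False
    return True

def detect_source(rules, verse_text, scope, default="NA"):
    """Return the source type of the first rule for `scope` whose
    condition holds, or the default text if none matches."""
    for rule_scope, source, condition in rules:
        if rule_scope == scope and condition_holds(condition, verse_text):
            return source
    return default

rules = [
    ("NT", "NA", "Joh.1:5 > Joh.1.6 & Rom.3.2 < Rom.2.1"),
    ("NT", "TY", "Joh.1:5 < Joh.1.6 & Rom.3.2 > Rom.2.1"),
]
verse_text = {"Joh.1.5": "a" * 40, "Joh.1.6": "a" * 60,
              "Rom.3.2": "a" * 80, "Rom.2.1": "a" * 50}
detected = detect_source(rules, verse_text, "NT")
```

With these dummy lengths only the second row matches; when no row matches, the default text is returned, which is what keeps the rule list short.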

DavidIB · 2019-02-06T15:16:05Z

so, is my understanding correct that you confirmed that we have two distinct types of source: v11n-source-type (Standard, Hebrew, Latin, Greek) and text-source (i guess this would be variant families / greek texts like NA, Tyndale, Majority, etc..?)

Yep.
The OT source types can be restricted to MT and LXX (because all the places where translations fill in gaps in Hebrew can be said to be from LXX).
The NT source types can be restricted to TR (which is the same as KJV and derivatives) and NA (which is virtually identical to SBL & TH). The Byz text is important for some scholars but no Bible actually translates it - though the WEB Bible follows it somewhat, and I guess that Orthodox translations may be based on it. OK, perhaps we should include Byz as a third. Other editions/MSS aren't used by translators.

concerning the v11n source: if there is "Greek+Latin" in the ruleset, did i understand you right that i should choose "Latin" in that case, since it has higher priority (and not the first mentioned)?

There is no order intended when writing "Greek+Latin" or "Latin+Greek" - though this might be a good idea.

so the conclusion for now is that you will - whenever you find the time - create a ruleset that will enable us to determine the source text for any given verse in any translation. correct?
something like this?

NT        NA    Joh.1.5 > Joh.1.6 & Rom.3.2 < Rom.2.1
NT        TY    Joh.1.5 < Joh.1.6 & Rom.3.2 > Rom.2.1

Yep. And I suggest it is NA/TR
I don't think we need to use SBL for copyright reasons, cos we aren't publishing the text.

chriswep · 2019-02-06T16:32:13Z

The NT source types can be restricted to TR (which is the same as KJV and derivatives) and NA (which is virtually identical to SBL & TH). The Byz text is important for some scholars but no Bible actually translates it - though the WEB Bible follows it somewhat, and I guess that Orthodox translations may be based on it. OK, perhaps we should include Byz as a third. Other editions/MSS aren't used by translators.

but they may follow certain variants in a verse that can't be identified by one of those two (or three) texts, right?

I don't think we need to use SBL for copyright reasons, cos we aren't publishing the text.

but we might want to display the corresponding greek text in the app. would that be a problem?

DavidIB · 2019-02-06T17:14:02Z

There's no problem displaying portions of NA. The words themselves aren't copyright (after all, they are ancient) but the exact choice of which words over the whole text, and the apparatus, are copyright.
It IS possible that a translation may follow a variant of an individual word which is not identified in NA or TR or indeed any edition. There isn't much we can do about that.
So what we have to do is give notice that we are tagging in accordance to the Greek of NA with TR.

chriswep assigned DavidIB Jan 31, 2019
