Some incomplete coordinates for sentence elements #811

Closed
kermitt2 opened this issue Aug 6, 2021 · 6 comments

@kermitt2
Owner

kermitt2 commented Aug 6, 2021

For this example (preprint):

Uploading document_sentence_segmentation_issues.pdf…

we have some incomplete bounding boxes in the sentence-level coordinates; see the last 5 sentences of this paragraph:

          <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head coords="23,54.00,212.69,163.38,11.14">ODD-luciferase activity assaay</head>
                <p>
                    <s coords="23,54.00,240.29,504.00,11.14;23,54.00,267.89,158.47,11.14">The ODD-luciferase construct with pcDNA3.1 plasmid vector was constructed as previously described 
                        <ref type="bibr" coords="23,108.63,267.89,92.50,11.14" target="#b42">(Safran et al. 2006</ref>).
                    </s>
                    <s coords="23,215.06,267.89,342.94,11.14;23,54.00,295.49,504.01,11.14;23,54.00,323.09,222.14,11.14">The proline p402 and p564 present within the oxygen degradation domain (ODD) of HIF1α, when hydroxylated by HIF-PHDs, allow its binding to the VHL protein that target it for proteasomal degradation.</s>
                    <s coords="23,279.87,323.09,278.13,11.14;23,54.00,350.69,354.71,11.14">In this way, the stabilization of ODD can be used as a marker of HIF1α stability 
                        <ref type="bibr" coords="23,197.48,350.69,95.96,11.14" target="#b42">(Safran et al. 2006</ref>
                        <ref type="bibr" coords="23,293.44,350.69,115.26,11.14" target="#b48">, Smirnova et al. 2010</ref>.
                    </s>
                    <s coords="23,408.71,350.69,11.34,11.14">Because of the luciferase tagged with ODD, the increase in ODD stability leads to a proportional increase in the luciferase activity and this provides a very good way of measuring the HIF1α stability in a quantitative manner with a wide dynamic range.</s>
                    <s coords="23,423.35,350.69,46.70,11.14">To this end, we used SH-SY5Y cells stably expressing ODDluciferase.</s>
                    <s coords="23,473.35,350.69,10.01,11.14">These cells were made by co-transfecting ODD-luciferase plasmid along with a puromycin resistance plasmid in SH-SY5Y cells and stably transfected cells were positively selected in presence of 4μg/ml of puromycin.</s>
                    <s coords="23,486.66,350.69,71.34,11.14">Luciferase activity was measured by luciferase assay kit (promega) using an LMaxII TM microplate luminometer (molecular Devices).</s>
                    <s coords="23,54.00,378.29,36.70,11.14">ODDluciferase activity was normalized to the protein content of each well measured by Bio-Rad DC TM protein assay kit.</s>
                </p>
            </div>
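
A rough way to spot such truncated boxes automatically is sketched below. It is not part of GROBID: `CoordsCheck`, `looksIncomplete` and the `minWidthPerChar` threshold are hypothetical names and values chosen for illustration. It reads each box as page, x, y, width, height, which is what the values in the example suggest (treat that field order as an assumption), and flags a sentence whose summed box width is implausibly small for the length of its text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of GROBID: sanity-checks a TEI coords attribute. */
class CoordsCheck {

    /** One bounding box; field order assumed from the example: page,x,y,width,height. */
    static final class Box {
        final int page;
        final double x, y, width, height;

        Box(int page, double x, double y, double width, double height) {
            this.page = page;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Splits "p,x,y,w,h;p,x,y,w,h;..." into boxes. */
    static List<Box> parse(String coords) {
        List<Box> boxes = new ArrayList<>();
        for (String chunk : coords.split(";")) {
            String[] f = chunk.trim().split(",");
            if (f.length != 5) {
                throw new IllegalArgumentException("Expected 5 fields, got: " + chunk);
            }
            boxes.add(new Box(Integer.parseInt(f[0].trim()),
                    Double.parseDouble(f[1]), Double.parseDouble(f[2]),
                    Double.parseDouble(f[3]), Double.parseDouble(f[4])));
        }
        return boxes;
    }

    /** Sum of the widths of all boxes attached to one element. */
    static double totalWidth(String coords) {
        double sum = 0.0;
        for (Box b : parse(coords)) {
            sum += b.width;
        }
        return sum;
    }

    /**
     * Heuristic only: true when the boxes are too narrow for the text.
     * minWidthPerChar is an arbitrary threshold in coordinate units per character.
     */
    static boolean looksIncomplete(String coords, String text, double minWidthPerChar) {
        return totalWidth(coords) < text.length() * minWidthPerChar;
    }

    public static void main(String[] args) {
        String text = "To this end, we used SH-SY5Y cells stably expressing ODDluciferase.";
        // Single short box, as in the output above
        System.out.println(looksIncomplete("23,423.35,350.69,46.70,11.14", text, 2.0));
        // Two boxes spanning the end of one line and the start of the next
        System.out.println(looksIncomplete(
                "23,241.74,433.49,316.26,11.14;23,54.00,461.09,54.70,11.14", text, 2.0));
    }
}
```

The threshold of 2.0 units per character is a guess for this font size; a real check would have to take font size and line breaks into account.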
@kermitt2 kermitt2 self-assigned this Aug 6, 2021
@kermitt2 kermitt2 added the bug From Hemiptera and especially its suborder Heteroptera label Aug 6, 2021
@kermitt2 kermitt2 removed their assignment Aug 6, 2021
@kermitt2
Owner Author

kermitt2 commented Aug 6, 2021

@lfoppiano
Collaborator

It seems that the PDF is not reachable 😺

@kermitt2
Owner Author

Sorry poor internet connection :(

document_sentence_segmentation_issues.pdf

@kermitt2
Owner Author

Normally the text to be segmented includes the references (all text including descendant elements):

// in xom, the following gives all the text under the element, for the whole subtree

and we only keep track of the positions of the references to pass the "forbidden positions" to the segmenter:

SentenceUtilities.getInstance().runSentenceDetection(text, forbiddenPositions, curParagraphTokens, new Language(lang));
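
To illustrate what "forbidden positions" means here, the following is a toy sketch and not the GROBID segmenter: `ForbiddenSplitSketch`, its naive punctuation rule and the `[start, end)` span representation are all assumptions made for the example. The idea is that a candidate sentence boundary falling inside a reference span is ignored, so the period in "et al." cannot split the sentence.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy splitter (hypothetical, not GROBID code) that never splits inside forbidden spans. */
class ForbiddenSplitSketch {

    /** True if offset lies inside any [start, end) span. */
    static boolean isForbidden(int offset, List<int[]> forbidden) {
        for (int[] span : forbidden) {
            if (offset >= span[0] && offset < span[1]) {
                return true;
            }
        }
        return false;
    }

    /** Splits after '.', '!' or '?' followed by whitespace or end of text, except inside forbidden spans. */
    static List<String> split(String text, List<int[]> forbidden) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == '!' || c == '?';
            boolean atBoundary = i + 1 == text.length()
                    || Character.isWhitespace(text.charAt(i + 1));
            if (terminator && atBoundary && !isForbidden(i, forbidden)) {
                String s = text.substring(start, i + 1).trim();
                if (!s.isEmpty()) {
                    sentences.add(s);
                }
                start = i + 1;
            }
        }
        // Keep any trailing text that has no final terminator
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "It was described (Safran et al. 2006). This is next.";
        List<int[]> forbidden = new ArrayList<>();
        // Span of the reference marker, end offset exclusive
        forbidden.add(new int[] {text.indexOf("(Safran"), text.indexOf("2006") + 4});
        System.out.println(split(text, forbidden));         // reference kept in one sentence
        System.out.println(split(text, new ArrayList<>())); // naive split breaks at "al."
    }
}
```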

Up to and including the sentence detection step, it seems to work fine: the sentence texts look good.

The problem probably arises when we then try to group the LayoutToken objects corresponding to each sentence in segmentedParagraphTokens. The segmented text comes from the XML and is a bit different from the LayoutToken text (de-hyphenation, some spaces removed), so the alignment can be challenging.
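
As a minimal sketch of this kind of alignment (an illustration only, not the code in GROBID or in the fix): the hypothetical `TokenAlignmentSketch` below represents layout tokens as plain strings and compares both sides after ignoring whitespace and hyphens, so de-hyphenation and removed spaces do not shift the match. Each sentence consumes tokens until its count of remaining characters is covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical alignment of sentence strings to a token stream, ignoring spaces and hyphens. */
class TokenAlignmentSketch {

    /** Counts the characters kept after normalization: everything except whitespace and '-'. */
    static int significantLength(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isWhitespace(c) && c != '-') {
                n++;
            }
        }
        return n;
    }

    /** Returns one token group per sentence, consuming the tokens in order. */
    static List<List<String>> align(List<String> sentences, List<String> tokens) {
        List<List<String>> groups = new ArrayList<>();
        int pos = 0;
        for (String sentence : sentences) {
            int needed = significantLength(sentence);
            int covered = 0;
            List<String> group = new ArrayList<>();
            // Leading whitespace tokens add nothing to 'covered' and end up in this group
            while (pos < tokens.size() && covered < needed) {
                String token = tokens.get(pos++);
                covered += significantLength(token);
                group.add(token);
            }
            groups.add(group);
        }
        return groups;
    }

    public static void main(String[] args) {
        // "ODD-\nluciferase" in the layout, "ODDluciferase" in the XML text
        List<String> tokens = Arrays.asList("We", " ", "used", " ", "ODD", "-", "\n",
                "luciferase", ".", " ", "It", " ", "works", ".");
        List<String> sentences = Arrays.asList("We used ODDluciferase.", "It works.");
        for (List<String> group : align(sentences, tokens)) {
            System.out.println(group.size() + " tokens");
        }
    }
}
```

With this counting approach, any character present in the tokens but absent from the XML text (or the reverse) shifts every following group, which is why the two texts need to agree apart from the normalized characters.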

@kermitt2
Owner Author

PR #821 fixes the problem, which was caused by a leftover of the reference pattern (the year pattern) that was missing from the XML.

All the coordinates for sentence elements now look good:

           <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head coords="23,54.00,212.69,163.38,11.14">ODD-luciferase activity assaay</head>
                <p>
                    <s coords="23,54.00,240.29,504.00,11.14;23,54.00,267.89,158.47,11.14">The ODD-luciferase construct with pcDNA3.1 plasmid vector was constructed as previously described 
                        <ref type="bibr" coords="23,108.63,267.89,92.50,11.14" target="#b42">(Safran et al. 2006</ref>).
                    </s>
                    <s coords="23,215.06,267.89,342.94,11.14;23,54.00,295.49,504.01,11.14;23,54.00,323.09,222.14,11.14">The proline p402 and p564 present within the oxygen degradation domain (ODD) of HIF1α, when hydroxylated by HIF-PHDs, allow its binding to the VHL protein that target it for proteasomal degradation.</s>
                    <s coords="23,279.87,323.09,278.13,11.14;23,54.00,350.69,366.05,11.14">In this way, the stabilization of ODD can be used as a marker of HIF1α stability 
                        <ref type="bibr" coords="23,197.48,350.69,95.96,11.14" target="#b42">(Safran et al. 2006</ref>
                        <ref type="bibr" coords="23,293.44,350.69,120.93,11.14" target="#b48">, Smirnova et al. 2010)</ref>.
                    </s>
                    <s coords="23,423.35,350.69,134.65,11.14;23,54.00,378.29,504.01,11.14;23,54.00,405.89,504.01,11.14;23,54.00,433.49,185.14,11.14">Because of the luciferase tagged with ODD, the increase in ODD stability leads to a proportional increase in the luciferase activity and this provides a very good way of measuring the HIF1α stability in a quantitative manner with a wide dynamic range.</s>
                    <s coords="23,241.74,433.49,316.26,11.14;23,54.00,461.09,54.70,11.14">To this end, we used SH-SY5Y cells stably expressing ODDluciferase.</s>
                    <s coords="23,114.49,461.09,443.52,11.14;23,54.00,488.69,504.01,11.14;23,54.00,516.29,245.80,11.14">These cells were made by co-transfecting ODD-luciferase plasmid along with a puromycin resistance plasmid in SH-SY5Y cells and stably transfected cells were positively selected in presence of 4μg/ml of puromycin.</s>
                    <s coords="23,304.37,516.29,253.64,11.14;23,54.00,543.89,205.57,11.14;23,259.57,542.76,11.58,7.46;23,276.88,543.89,244.71,11.14">Luciferase activity was measured by luciferase assay kit (promega) using an LMaxII TM microplate luminometer (molecular Devices).</s>
                    <s coords="23,527.35,543.89,30.65,11.14;23,54.00,571.49,492.45,11.14;23,546.45,570.36,11.58,7.46;23,54.00,599.09,90.05,11.14">ODDluciferase activity was normalized to the protein content of each well measured by Bio-Rad DC TM protein assay kit.</s>
                </p>
            </div>

@kermitt2 kermitt2 added the implemented The issue has been implemented label Aug 22, 2021