Don't require cproject structure #21

Open
blahah opened this issue Mar 28, 2016 · 12 comments

@blahah
Member

blahah commented Mar 28, 2016

It's not clear to me from the command-line --help, or from the README, whether norma requires a cproject structure, but it seems to.

What most users will want to do is:

norma --nlm2html something.xml > something.html

I think we shouldn't require a specific input filename or enforce a specific output filename, because doing so restricts what the user can do and creates work for them. There are a lot of ways of getting NLM XML files without using contentmine tools. Using the contentmine project conventions should be an additional option.

@petermr
Member

petermr commented Mar 28, 2016

Norma accepts single filenames in the form:

norma -i foo.xml -o bar.html --transform nlm2html

It can also accept a single directory as a CTree:

norma -q mytree/ -i foo.xml -o bar.html --transform nlm2html

(Note that --nlm2html has never been an option. Since there are at least 6
different transforms, we have gathered them all under --transform.)

This is mainly a question of documentation.

Note also that norma accepts wildcards, e.g.

-i PMC*.xml

The problem with not using directories and reserved names is that the
output gets messy very quickly. If people want it we can do it easily. But
they will need to support the file management lower down the chain.

Note also that wrapping norma in a loop will probably mean relaunching the JVM each time.

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

@tarrow
Contributor

tarrow commented May 12, 2016

I just tested this and for me it doesn't work.

tom@pisces PMC4683095 % norma -i fulltext.xml -o text.html --transform nlm2html
0    [main] WARN  org.xmlcml.cmine.args.DefaultArgProcessor  - no --project given; using --output
1    [main] WARN  org.xmlcml.norma.NormaArgProcessor  - No current CTree

@tarrow tarrow added the bug label May 12, 2016
@petermr
Member

petermr commented May 13, 2016

@tarrow what were you expecting? fulltext.xml is a reserved filename. What happens with foo.xml?

@tarrow
Contributor

tarrow commented May 13, 2016

I was expecting it to be fine since we didn't specify it was in a CProject. Reserved name shouldn't matter if you aren't using it in a project environment.

In any case using foo.xml is even worse:

tom@pisces PMC4811621 % norma -i foo.xml -o text.html --transform nlm2html     
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:1049)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runMethodsOfType(DefaultArgProcessor.java:946)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:927)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:1111)
        at org.xmlcml.norma.Norma.run(Norma.java:23)
        at org.xmlcml.norma.Norma.main(Norma.java:18)
Caused by: java.lang.RuntimeException: Input must be reserved file; found: foo.xml
        at org.xmlcml.norma.NormaArgProcessor.checkAndGetInputFile(NormaArgProcessor.java:282)
        at org.xmlcml.norma.NormaTransformer.transform(NormaTransformer.java:114)
        at org.xmlcml.norma.NormaArgProcessor.runTransform(NormaArgProcessor.java:202)
        ... 10 more
0    [main] DEBUG org.xmlcml.cmine.args.DefaultArgProcessor  - option in exception  or --transform; (1,2147483647); parseTransform; STRING: null / []; nlm2html; [nlm2html]
java.lang.RuntimeException: invoke runTransform fails
        at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:1052)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runMethodsOfType(DefaultArgProcessor.java:946)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:927)
        at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:1111)
        at org.xmlcml.norma.Norma.run(Norma.java:23)
        at org.xmlcml.norma.Norma.main(Norma.java:18)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:1049)
        ... 5 more
Caused by: java.lang.RuntimeException: Input must be reserved file; found: foo.xml
        at org.xmlcml.norma.NormaArgProcessor.checkAndGetInputFile(NormaArgProcessor.java:282)
        at org.xmlcml.norma.NormaTransformer.transform(NormaTransformer.java:114)
        at org.xmlcml.norma.NormaArgProcessor.runTransform(NormaArgProcessor.java:202)
        ... 10 more
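
The behaviour argued for in this comment (enforce reserved names only when a CTree is in play) could be sketched roughly as follows. This is an illustration in Python, not norma's actual Java code: the function `check_input`, its `ctree` parameter, and the contents of `RESERVED_NAMES` are all assumptions made for the example. Only the error message text is copied from the trace above.

```python
# Hypothetical sketch: reserved names are required only inside a CTree.
# RESERVED_NAMES is an assumed subset based on names mentioned in this thread.
RESERVED_NAMES = {"fulltext.xml", "fulltext.html"}


def check_input(filename, ctree=None):
    """Return filename if it is acceptable, otherwise raise ValueError.

    ctree is the (hypothetical) CTree directory; None means "no CTree given".
    """
    if ctree is not None and filename not in RESERVED_NAMES:
        # Same wording as the RuntimeException in the trace above.
        raise ValueError(f"Input must be reserved file; found: {filename}")
    return filename


# Standalone use: any filename is accepted.
print(check_input("foo.xml"))
# Inside a CTree: a reserved name is accepted.
print(check_input("fulltext.xml", ctree="PMC4683095"))
# Inside a CTree: a non-reserved name is rejected.
try:
    check_input("foo.xml", ctree="PMC4683095")
except ValueError as err:
    print(err)
```

Under this rule the first failing command in this comment would pass the filename check, while the check would still protect CTree-based workflows.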

@tarrow
Contributor

tarrow commented May 17, 2016

So, to clarify: I think the answer is that one cannot use Norma on just a single input file. It can only be used to convert a single input file into a CProject with a CTree containing fulltext.xml (or html etc.), which you can then run norma on (again) to do the actual conversion.

@petermr do you think this is the case? (i.e. norma -i foo.xml -o bar.html --transform nlm2html shouldn't work)

@petermr
Member

petermr commented May 17, 2016

Without reading the docs, I'd say
norma -q foo -i fulltext.xml -o scholarly.html --transform nlm2html
SHOULD work.
Note the use of "-q" for the parent directory of "fulltext.xml". It may or
may not work for non-reserved names ("foo.xml").
-q is how we started. In fact you can write
norma -q foo -i bar[2-7]file.xml
and that will process bar2file.xml, bar3file.xml ... (not sure what the output is
called),
but it's flaky and should be replaced by more explicit structure.

There is certainly the logic to build a CProject from a list of files, but
the syntax is badly overloaded.
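
As a workaround for a single arbitrary file, one could stage it under the reserved name before calling norma. The sketch below is only that: it assumes a CTree is nothing more than a directory holding fulltext.xml, the helper names `stage_as_ctree` and `norma_command` are made up for illustration, and the command is built but not run because, as noted above, it is not confirmed to work.

```python
import shutil
import tempfile
from pathlib import Path

RESERVED_INPUT = "fulltext.xml"     # reserved input name mentioned in this thread
RESERVED_OUTPUT = "scholarly.html"  # output name used in the command above


def stage_as_ctree(source, ctree_dir):
    """Copy one XML file into ctree_dir under the reserved input name."""
    ctree = Path(ctree_dir)
    ctree.mkdir(parents=True, exist_ok=True)
    target = ctree / RESERVED_INPUT
    shutil.copyfile(source, target)
    return target


def norma_command(ctree_dir, transform="nlm2html"):
    """Build (but do not run) the command line quoted above."""
    return ["norma", "-q", str(ctree_dir), "-i", RESERVED_INPUT,
            "-o", RESERVED_OUTPUT, "--transform", transform]


# Demo in a throwaway directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "foo.xml"
    source.write_text("<article/>", encoding="utf-8")
    staged = stage_as_ctree(source, Path(tmp) / "foo")
    print(staged.name)
    print(" ".join(norma_command("foo")))
```

The resulting argument list could be passed to `subprocess.run` once norma is on the PATH. Each such call starts a new JVM, so for many files a single norma invocation over a whole project is likely to be cheaper.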

@tarrow
Contributor

tarrow commented May 17, 2016

So, it isn't possible to run norma without a -q foo? I'm slightly unclear as to whether norma interprets this as a CTree or a CProject.

Should it be possible to do a conversion without being in a CTree? (i.e. I don't believe it is possible now; is this a bug or a feature?)

@tarrow
Contributor

tarrow commented May 18, 2016

So, the situation is that it isn't currently possible. I think, like Richard, that it should be possible: perhaps not as a particularly high-priority task, but it should be possible. The hurdle I see is that requiring a CProject, or creating one, is currently an integral part of the "main control loop" that can be found in org.xmlcml.cmine.args.DefaultArgProcessor.

@petermr
Member

petermr commented May 18, 2016

So what's the use case for this? A major part of norma/ami is that it
takes away the responsibility of managing the control. It's possible to
wrap the point operations with Unix tools or any other language. The whole
lot can be done, with effort, without using Cmine. XML2HTML can be done
with native XSL transformers, etc.

But how many people want to transform just one file? And what can you do
with it when you've got it? AMI can't be used at all because it emits
results/ and there is no ctree to act as parent. So it's ONLY useful in
norma. Who is going to use norma for just one file and just one
transformation?

@psychemedia

I'm exploring whether or not I can use contentmine tools to help extract stuff from differently unstructured PDF based reports, many of them containing arbitrarily located charts and tables, in an attempt to then normalise some of the outputs from those docs. This will often require testing against a single document? (Or is contentmine not appropriate for this? In which case, I want to see how far I can appropriate it/subvert it!;-))

Another use case is just learning how to use the contentmine tools? Or folk wanting to start building processors for new journal style documents, where it makes sense to test one at a time, at least in the beginning?

@petermr
Member

petermr commented Jun 8, 2016

The problem is that tables in PDF are very hard. No-one has solved it. TabulaPDF went some way, CM goes a different partial way. It's often possible to do a single source but not generalize:

Is    this
a     table
or    just
some prose

Even recognising tables is hard.
I wrote a lot of this - you are welcome to it, but it's neither complete nor maintained.
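
The four-line example above can be turned into a small experiment. The sketch below uses a deliberately naive heuristic (columns are separated by two or more spaces, and a table is any block whose lines all have the same number of columns). It is not how TabulaPDF or the contentmine code works; the function names are invented for illustration.

```python
import re


def split_columns(line):
    """Split a line into cells wherever two or more whitespace characters occur."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def looks_like_table(lines, min_columns=2):
    """Naive test: every non-empty line has the same number (>= min_columns) of cells."""
    counts = [len(split_columns(line)) for line in lines if line.strip()]
    return bool(counts) and counts[0] >= min_columns and len(set(counts)) == 1


# The four lines from the comment above, with their original spacing.
sample = ["Is    this", "a     table", "or    just", "some prose"]

print(looks_like_table(sample[:3]))  # True: three aligned two-cell rows
print(looks_like_table(sample))      # False: the last line has a single space
```

The verdict flips on the width of one gap in one line, which is the kind of fragility that makes a rule tuned to a single source hard to generalize.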

@psychemedia

I've started looking at Tabula via the R tabulizer package, which wraps the command line. I also started pondering cribs for Tabula, e.g. giving it keywords for things it might expect to find in a table heading to help it get its eye in!

I saw you'd done some work extracting data from line charts - is that part of the contentmine toolset?
