Question on input data format #2

Open
poorboy44 opened this issue Jul 29, 2015 · 0 comments


example.sh describes the input format as:

This file provides information about running the Dynamic Topic Model
or the Document Influence Model.  It gives two command-line examples
for running the software and several example commands in R for reading
output files.

Dynamic topic models and the influence model have been implemented
here in c / c++.  This implementation takes two input files:

 (a) foo-mult.dat, which is one-doc-per-line, each line of the form

   unique_word_count index1:count1 index2:count2 ... indexn:countn

   where each index is an integer corresponding to a unique word.

 (b) foo-seq.dat, which is of the form

   Number_Timestamps
   number_docs_time_1
   ...
   number_docs_time_i
   ...
   number_docs_time_NumberTimestamps

   - The docs in foo-mult.dat should be ordered by date, with the first
     docs from time1, the next from time2, ..., and the last docs from
     timen.
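
To make format (a) concrete, here is a minimal Python sketch that parses one line of a foo-mult.dat-style file. It is only an illustration of the description above, not code from the project; the function name parse_mult_line is hypothetical, and the check that the leading number equals the number of index:count pairs is an assumption based on the field name unique_word_count.

```python
def parse_mult_line(line):
    """Parse 'unique_word_count index1:count1 ...' into (n_unique, {index: count})."""
    fields = line.split()
    n_unique = int(fields[0])
    counts = {}
    for pair in fields[1:]:
        index, count = pair.split(":")
        counts[int(index)] = int(count)
    # Assumption: the leading number is the number of distinct word indices.
    if len(counts) != n_unique:
        raise ValueError(
            f"expected {n_unique} unique words, found {len(counts)}"
        )
    return n_unique, counts


# One of the lines shown in this issue for test-mult.dat.
example = "11 288:1 1248:1 5:1 1063:2 269:1 654:1 656:2 532:1 373:1 1247:1 543:1"
n_unique, counts = parse_mult_line(example)
print(n_unique, sum(counts.values()))  # unique words, total tokens in the doc
```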

test-mult.dat looks like this (1000 lines):

28 12:1 44:1 75:10 76:1 77:1 78:1 79:1 80:1 81:2 82:1 83:1 84:1 85:1 86:1 87:2 88:4 89:1 90:1 91:1 92:1 93:1 94:1 95:1 96:1 97:1 98:1 99:2 100:1
60 771:1 388:1 98:1 134:1 8:1 908:1 1037:1 600:1 405:1 1046:1 516:1 27:2 773:1 37:1 1137:1 1138:1 302:1 433:2 51:1 59:1 999:1 1119:1 224:1 67:1 69:1 71:1 584:1 330:1 77:1 269:1 337:1 83:1 1112:1 349:2 1118:1 1125:1 1120:1 1121:1 1122:1 1123:1 1124:1 101:2 1126:1 1127:1 488:1 1129:1 618:3 1131:1 1132:1 1133:1 1134:1 1135:1 1136:2 1128:1 114:4 1139:1 1140:1 1141:1 631:1 1130:1
17 257:1 546:2 547:1 548:1 549:1 6:1 551:1 552:1 553:1 554:1 550:2 418:1 174:1 433:1 315:1 92:1 415:1
11 288:1 1248:1 5:1 1063:2 269:1 654:1 656:2 532:1 373:1 1247:1 543:1
25 909:1 407:1 797:1 543:1 555:1 693:1 823:4 569:1 1226:1 1227:1 1228:2 1229:1 1230:1 1231:4 1232:1 1233:4 1234:1 1235:1 1236:1 1237:3 1238:1 1239:1 1106:1 113:1 243:1

test-seq.dat looks like this (10 lines):

10
25
50
75
100
100
100
100

I don't understand how the time correspondence is defined between test-mult.dat (which has one document per line) and test-seq.dat, which gives the number of docs per time period (in this case 10 time periods). Can someone clarify how the input data should be formatted? Are we assuming that the first 10 documents in test-mult.dat correspond to time period 1, the next 25 documents to time period 2, and so on?
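
For reference, here is a minimal Python sketch of one literal reading of the example.sh description quoted above: the first line of the seq file is Number_Timestamps, each following line is the number of docs in one time period, and the docs in the mult file are assigned to periods in order. Under that reading, the leading 10 in test-seq.dat would be the number of time periods and 25 would be the size of time period 1, but this is an interpretation of the text, not something confirmed by the maintainers. The function names and the toy input are hypothetical.

```python
def parse_seq(text):
    """Return docs-per-time-period from seq-file text (first number = period count)."""
    numbers = [int(tok) for tok in text.split()]
    n_times, docs_per_time = numbers[0], numbers[1:]
    # Assumption: exactly Number_Timestamps counts follow the first line.
    if len(docs_per_time) != n_times:
        raise ValueError(
            f"header says {n_times} periods, found {len(docs_per_time)} counts"
        )
    return docs_per_time


def time_slice_of_each_doc(docs_per_time):
    """Map each line of the mult file (in order) to a 1-based time period."""
    slices = []
    for t, n_docs in enumerate(docs_per_time, start=1):
        slices.extend([t] * n_docs)
    return slices


# Hypothetical toy input (NOT the contents of test-seq.dat):
# 3 time periods containing 2, 1, and 3 docs.
toy_seq = "3\n2\n1\n3\n"
docs_per_time = parse_seq(toy_seq)
slices = time_slice_of_each_doc(docs_per_time)
print(slices)  # [1, 1, 2, 3, 3, 3]
```

With this reading, the counts after the first line should sum to the number of lines in the mult file, which gives a quick consistency check on any pair of input files.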
