Question on input data format #2

poorboy44 · 2015-07-29T19:39:48Z

example.sh describes the input format as:

This file provides information about running the Dynamic Topic Model
or the Document Influence Model.  It gives two command-line examples
for running the software and several example commands in R for reading
output files.

Dynamic topic models and the influence model have been implemented
here in C / C++.  This implementation takes two input files:

 (a) foo-mult.dat, which is one-doc-per-line, each line of the form

   unique_word_count index1:count1 index2:count2 ... indexn:countn

   where each index is an integer corresponding to a unique word.

 (b) foo-seq.dat, which is of the form

   Number_Timestamps
   number_docs_time_1
   ...
   number_docs_time_i
   ...
   number_docs_time_NumberTimestamps

   - The docs in foo-mult.dat should be ordered by date, with the first
     docs from time1, the next from time2, ..., and the last docs from
     timen.
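
To make my reading of that description concrete, here is a minimal Python sketch of how I assume the two files line up: the counts in foo-seq.dat are consumed in order, so each line of foo-mult.dat falls into exactly one time slice. The function names (`parse_seq`, `time_slice_of`) and the tiny seq contents are my own, made up for illustration; they do not come from the DTM code or from test-seq.dat.

```python
from bisect import bisect_right
from itertools import accumulate


def parse_seq(text):
    """Parse the contents of a foo-seq.dat style file.

    The first integer is the number of time stamps; it is followed by
    one document count per time stamp.
    """
    values = [int(tok) for tok in text.split()]
    num_timestamps, counts = values[0], values[1:]
    if len(counts) != num_timestamps:
        raise ValueError(
            f"header says {num_timestamps} time stamps, found {len(counts)} counts"
        )
    return counts


def time_slice_of(doc_index, counts):
    """Map a 0-based line index in foo-mult.dat to a 0-based time slice.

    Assumes the documents are ordered by date, as the description requires.
    """
    boundaries = list(accumulate(counts))  # running totals of docs per slice
    if not 0 <= doc_index < boundaries[-1]:
        raise IndexError("document index outside the range covered by the seq file")
    return bisect_right(boundaries, doc_index)


# Hypothetical seq file: 3 time stamps holding 2, 3 and 1 documents.
counts = parse_seq("3\n2\n3\n1\n")
slices = [time_slice_of(i, counts) for i in range(sum(counts))]
```

With these made-up counts, lines 1-2 of the mult file would belong to the first time stamp, lines 3-5 to the second, and line 6 to the third. If that interpretation is right, the sum of the counts should also equal the number of lines in foo-mult.dat.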

test-mult.dat looks like this (1000 lines):

28 12:1 44:1 75:10 76:1 77:1 78:1 79:1 80:1 81:2 82:1 83:1 84:1 85:1 86:1 87:2 88:4 89:1 90:1 91:1 92:1 93:1 94:1 95:1 96:1 97:1 98:1 99:2 100:1
60 771:1 388:1 98:1 134:1 8:1 908:1 1037:1 600:1 405:1 1046:1 516:1 27:2 773:1 37:1 1137:1 1138:1 302:1 433:2 51:1 59:1 999:1 1119:1 224:1 67:1 69:1 71:1 584:1 330:1 77:1 269:1 337:1 83:1 1112:1 349:2 1118:1 1125:1 1120:1 1121:1 1122:1 1123:1 1124:1 101:2 1126:1 1127:1 488:1 1129:1 618:3 1131:1 1132:1 1133:1 1134:1 1135:1 1136:2 1128:1 114:4 1139:1 1140:1 1141:1 631:1 1130:1
17 257:1 546:2 547:1 548:1 549:1 6:1 551:1 552:1 553:1 554:1 550:2 418:1 174:1 433:1 315:1 92:1 415:1
11 288:1 1248:1 5:1 1063:2 269:1 654:1 656:2 532:1 373:1 1247:1 543:1
25 909:1 407:1 797:1 543:1 555:1 693:1 823:4 569:1 1226:1 1227:1 1228:2 1229:1 1230:1 1231:4 1232:1 1233:4 1234:1 1235:1 1236:1 1237:3 1238:1 1239:1 1106:1 113:1 243:1
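
As a sanity check on the mult format, the leading number on each line appears to be the number of index:count pairs that follow, not the total number of tokens in the document. Below is a small Python sketch that parses one line under that assumption; `parse_mult_line` is a hypothetical helper of mine, not part of the DTM code.

```python
def parse_mult_line(line):
    """Parse one foo-mult.dat line into a {word_index: count} dict.

    Raises ValueError if the leading unique_word_count does not match
    the number of distinct index:count pairs on the line.
    """
    tokens = line.split()
    declared = int(tokens[0])
    counts = {}
    for tok in tokens[1:]:
        index, count = tok.split(":")
        counts[int(index)] = int(count)
    if declared != len(counts):
        raise ValueError(
            f"line declares {declared} unique words but has {len(counts)} pairs"
        )
    return counts


# Fourth sample line from test-mult.dat as shown above.
sample = "11 288:1 1248:1 5:1 1063:2 269:1 654:1 656:2 532:1 373:1 1247:1 543:1"
doc = parse_mult_line(sample)
total_tokens = sum(doc.values())
```

For this sample line the parser finds 11 distinct word indices, matching the leading 11, while the per-word counts add up to 13 tokens.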

test-seq.dat looks like this (10 lines):

I don't understand how the time correspondence is defined between test-mult.dat (which has one document per line) and test-seq.dat (which has the number of docs per time period, in this case 10 time periods). Can someone clarify how the input data should be formatted? Are we assuming the first 10 documents in test-mult.dat correspond to time period 1, the next 25 documents correspond to time period 2, etc.?
