Add convert function #2407
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -149,3 +149,57 @@ def reader(): | |
yield line | ||
|
||
return reader | ||
|
||
|
||
def convert(output_path, | ||
reader, | ||
num_shards, | ||
name_prefix, | ||
max_lines_to_shuffle=10000): | ||
import recordio | ||
import cPickle as pickle | ||
Please add cPickle installation with a fixed version in https://github.com/PaddlePaddle/Paddle/blob/develop/Dockerfile#L55 .
Python 2.7 and Python 3.4 include the pickle and cPickle modules already. I couldn't find an official statement, but there is an answer
Great, thanks! |
||
import random | ||
""" | ||
Convert data from reader to recordio format files. | ||
|
||
:param output_path: directory in which output files will be saved. | ||
:param reader: a data reader, from which the convert program will read data instances. | ||
:param num_shards: the number of shards that the dataset will be partitioned into. | ||
:param name_prefix: the name prefix of generated files. | ||
:param max_lines_to_shuffle: the maximum number of lines to shuffle before writing. | |
""" | ||
|
||
assert num_shards >= 1 | ||
assert max_lines_to_shuffle >= 1 | ||
|
||
def open_writers(): | ||
w = [] | ||
for i in range(0, num_shards): | ||
n = "%s/%s-%05d-of-%05d" % (output_path, name_prefix, i, | ||
num_shards - 1) | ||
w.append(recordio.writer(n)) | ||
|
||
return w | ||
|
||
def close_writers(w): | ||
for i in range(0, num_shards): | ||
w[i].close() | ||
|
||
def write_data(w, lines): | ||
random.shuffle(lines) | ||
for i, d in enumerate(lines): | ||
d = pickle.dumps(d, pickle.HIGHEST_PROTOCOL) | ||
w[i % num_shards].write(d) | ||
|
||
w = open_writers() | ||
lines = [] | ||
|
||
for i, d in enumerate(reader()): | ||
lines.append(d) | ||
if i % max_lines_to_shuffle == 0 and i >= max_lines_to_shuffle: | ||
write_data(w, lines) | ||
lines = [] | ||
continue | ||
|
||
write_data(w, lines) | ||
close_writers(w) |
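The buffering and round-robin sharding in the diff above can be tried out without recordio. The following is a minimal sketch, not the PR's implementation: `shard_records` is a hypothetical helper, plain in-memory lists stand in for the recordio writers, Python 3's `pickle` replaces `cPickle`, and the buffer is flushed as soon as it holds `max_lines_to_shuffle` records.

```python
import pickle
import random


def shard_records(reader, num_shards, max_lines_to_shuffle=10000, seed=None):
    """Hypothetical stand-in for convert(): shuffle records in bounded
    buffers and spread them round-robin over in-memory shards."""
    assert num_shards >= 1
    assert max_lines_to_shuffle >= 1

    rng = random.Random(seed)
    # Assumption: lists replace the recordio writers used in the PR.
    shards = [[] for _ in range(num_shards)]

    def flush(buffer):
        # Shuffle only within the current buffer, then deal records out.
        rng.shuffle(buffer)
        for i, record in enumerate(buffer):
            shards[i % num_shards].append(
                pickle.dumps(record, pickle.HIGHEST_PROTOCOL))

    buffer = []
    for record in reader():
        buffer.append(record)
        if len(buffer) >= max_lines_to_shuffle:
            flush(buffer)
            buffer = []
    flush(buffer)  # write whatever is left over
    return shards
```

Because shuffling happens per buffer, records are only mixed within windows of `max_lines_to_shuffle` lines, so the parameter trades shuffle quality against peak memory.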
Maybe 1000 would be better. If each image is 200KB, 10000 of them take about 1.9GB, which might be too much memory.
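The estimate can be checked with a few lines of arithmetic. This is a rough sketch that assumes 200KB means 200 * 1024 bytes per record and ignores Python object overhead; `shuffle_buffer_bytes` is a hypothetical helper used only for illustration.

```python
KIB = 1024
GIB = 1024 ** 3


def shuffle_buffer_bytes(num_records, bytes_per_record):
    """Lower bound on the memory held by the shuffle buffer (payload only)."""
    return num_records * bytes_per_record


for n in (10000, 1000):
    size = shuffle_buffer_bytes(n, 200 * KIB)
    print("%5d records -> %.2f GiB" % (n, size / GIB))
```

Under these assumptions, 10000 records come to roughly 1.9 GiB and 1000 records to roughly 0.19 GiB.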