
Add convert function #2407

Merged: 15 commits, Jun 13, 2017

44 changes: 44 additions & 0 deletions in python/paddle/v2/dataset/common.py

@@ -149,3 +149,47 @@ def reader():
                    yield line

    return reader


def convert(output_path, eader, num_shards, name_prefix):
Contributor:
eader -> reader

Contributor Author:
Done

    import recordio
    import cPickle as pickle
Contributor:
Please add cPickle installation with a fixed version in https://github.com/PaddlePaddle/Paddle/blob/develop/Dockerfile#L55.

Contributor Author:
Python 2.7 and Python 3.4 already include the pickle and cPickle modules. I couldn't find an official statement, though, only an informal answer.

Contributor (@helinwang, Jun 12, 2017):
Great, thanks!
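
For reference, a minimal sketch of a version-agnostic import (standard-library only; this is an illustration, not code from this PR):

try:
    import cPickle as pickle  # Python 2: C-accelerated pickle module
except ImportError:
    import pickle  # Python 3: the C implementation is built into pickle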

"""
Convert data from reader to recordio format files.

:param output_path: directory in which output files will be saved.
:param reader: a data reader, from which the convert program will read data instances.
:param num_shards: the number of shards that the dataset will be partitioned into.
:param name_prefix: the name prefix of generated files.
"""

    def open_needs(idx):
Contributor:
What does open_needs mean in English? Maybe a more descriptive name would be better.

Contributor Author:
Done


n = "%s/%s-%05d" % (output_path, name_prefix, idx)
Contributor:
Should follow the format in the design doc:

random_images-00000-of-00099
random_images-00001-of-00099
...
random_images-00099-of-00099

Contributor Author:
Done
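
A minimal sketch of a helper producing that naming scheme (the helper name and signature are my own illustration, not code from this PR):

def shard_filename(output_path, name_prefix, shard_id, num_shards):
    # e.g. output_path/random_images-00000-of-00099 when num_shards == 100
    return "%s/%s-%05d-of-%05d" % (
        output_path, name_prefix, shard_id, num_shards - 1)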

        w = recordio.writer(n)
        f = open(n, "w")
Contributor:
Why do we need f? writer.Close already closes the file it opens.

Contributor Author:
Done


        idx += 1

        return w, f, idx

    def close_needs(w, f):
        if w is not None:
            w.close()

        if f is not None:
            f.close()

    idx = 0
    w = None
    f = None

    for i, d in enumerate(reader()):
Contributor:
Sorry, I should have mentioned this earlier; please consider this issue: #1915

To randomize, maybe we could have a shuffle_buffer_size as an optional parameter: read until the buffer is full, shuffle, and then write to RecordIO.

Contributor Author:
Done
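
A rough sketch of that buffered-shuffle idea (my reading of the suggestion, not the merged implementation; shuffle_buffer_size bounds memory and yields a windowed shuffle, not a full-dataset one):

import random

def shuffle_reader(reader, shuffle_buffer_size):
    def shuffled():
        buf = []
        for d in reader():
            buf.append(d)
            if len(buf) >= shuffle_buffer_size:
                # Buffer full: shuffle this window and flush it.
                random.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever remains at the end of the stream.
        random.shuffle(buf)
        for item in buf:
            yield item
    return shuffled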

        if w is None:
            w, f, idx = open_needs(idx)

        w.write(pickle.dumps(d, pickle.HIGHEST_PROTOCOL))

        if i % num_shards == 0 and i >= num_shards:
Contributor:
I don't quite follow the logic here. Suppose there are N shards in total: as written, we move on to the next shard every N records. Shouldn't each record go to the next shard in turn?

How about something like this:

var writers []writer
// fill writers
writers[i%num_shards].Write(record)
// Close all writers once everything is done. Don't close and create a new writer frequently.

Contributor Author:
Done. I had it wrong at first: I took num_shards to mean the interval at which records are written out to a file. Oops!

Contributor Author:
My initial thought was that if the number of files is large but the number of records is small, empty files would be generated.
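
A minimal Python sketch of the round-robin scheme suggested above (the recordio.writer, write, and close calls mirror those in this diff; the function name and the rest are my own illustration):

def convert_round_robin(output_path, reader, num_shards, name_prefix):
    import recordio
    import cPickle as pickle

    # One writer per shard; record i goes to shard i % num_shards.
    writers = [
        recordio.writer("%s/%s-%05d-of-%05d" % (
            output_path, name_prefix, i, num_shards - 1))
        for i in range(num_shards)
    ]
    for i, d in enumerate(reader()):
        writers[i % num_shards].write(
            pickle.dumps(d, pickle.HIGHEST_PROTOCOL))
    # Close every writer once at the end instead of reopening per record.
    for w in writers:
        w.close()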

            close_needs(w, f)
            w, f, idx = open_needs(idx)

    close_needs(w, f)
16 changes: 16 additions & 0 deletions in python/paddle/v2/dataset/tests/common_test.py

@@ -57,6 +57,22 @@ def test_cluster_file_reader(self):
        for idx, e in enumerate(reader()):
            self.assertEqual(e, str("0"))

    def test_convert(self):
        def test_reader():
            def reader():
                for x in xrange(10):
Contributor:
I think this test will break when 10 is changed to 4, according to line 191: if i % num_shards == 0 and i >= num_shards:

Contributor Author:
Done

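To spell out the arithmetic behind that comment (my own check, not from the thread): with 10 records and num_shards = 4, the rotation condition i % 4 == 0 and i >= 4 fires at i = 4 and i = 8, producing the 3 files the test asserts; with only 4 records, i never reaches 4, everything lands in a single file, and the assertion fails.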

                    yield x

            return reader

        path = tempfile.mkdtemp()

        paddle.v2.dataset.common.convert(path,
                                         test_reader(), 4, 'random_images')

        files = glob.glob(path + '/random_images-*')
        self.assertEqual(len(files), 3)


if __name__ == '__main__':
    unittest.main()