Add convert function #2407
Conversation
A few takeaways on writing a Design Doc
python/paddle/v2/dataset/common.py
Outdated
@@ -149,3 +149,47 @@ def reader():
        yield line

    return reader


def convert(output_path, eader, num_shards, name_prefix):
eader -> reader
Done
python/paddle/v2/dataset/common.py
Outdated
    """

    def open_needs(idx):
        n = "%s/%s-%05d" % (output_path, name_prefix, idx)
Should follow the format in the design doc:
random_images-00000-of-00099
random_images-00001-of-00099
...
random_images-00099-of-00099
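Read literally, those examples put the current shard index and the highest shard index (num_shards - 1) in the name, zero-padded to five digits. A minimal sketch of that formatting (the helper name is illustrative, not from the PR):

def shard_path(output_path, name_prefix, shard_idx, num_shards):
    # e.g. shard_path(".", "random_images", 1, 100) -> "./random_images-00001-of-00099"
    return "%s/%s-%05d-of-%05d" % (output_path, name_prefix, shard_idx, num_shards - 1)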
Done
python/paddle/v2/dataset/common.py
Outdated
    :param name_prefix: the name prefix of generated files.
    """

    def open_needs(idx):
What does open_needs mean in English? Maybe a more descriptive name would be better.
Done
python/paddle/v2/dataset/common.py
Outdated
    def open_needs(idx):
        n = "%s/%s-%05d" % (output_path, name_prefix, idx)
        w = recordio.writer(n)
        f = open(n, "w")
Why do we need f? writer.Close already closes the file it opens.
Done
python/paddle/v2/dataset/common.py
Outdated
    w = None
    f = None

    for i, d in enumerate(reader()):
Sorry, I should have mentioned this earlier; please consider this issue: #1915
To randomize, maybe we could add shuffle_buffer_size as an optional parameter: read until the buffer is full, shuffle, and then write to RecordIO, roughly as sketched below.
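A rough Python sketch of that buffered shuffle; shuffled_records and shuffle_buffer_size are illustrative names, not part of this PR:

import random

def shuffled_records(reader, shuffle_buffer_size=1000):
    buf = []
    for record in reader():
        buf.append(record)
        if len(buf) >= shuffle_buffer_size:
            random.shuffle(buf)
            for r in buf:
                yield r
            buf = []
    # flush whatever is still buffered at the end
    random.shuffle(buf)
    for r in buf:
        yield r

The conversion loop could then iterate over shuffled_records(reader) instead of reader() before writing to RecordIO.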
Done
python/paddle/v2/dataset/common.py
Outdated

def convert(output_path, eader, num_shards, name_prefix):
    import recordio
    import cPickle as pickle
Please add a cPickle installation with a pinned version in https://github.com/PaddlePaddle/Paddle/blob/develop/Dockerfile#L55 .
Python 2.7 and Python 3.4 already include the pickle and cPickle modules. I couldn't find an official statement, but there is an answer confirming this.
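For reference, the usual portable pattern is a fallback import; this is a generic sketch, not code from this PR:

try:
    import cPickle as pickle  # Python 2: C-accelerated pickle
except ImportError:
    import pickle  # Python 3: cPickle was folded into the pickle module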
Great, thanks!
python/paddle/v2/dataset/common.py
Outdated

        w.write(pickle.dumps(d, pickle.HIGHEST_PROTOCOL))

        if i % num_shards == 0 and i >= num_shards:
I don't quite follow the logic here. Suppose there are N shards in total; this code writes to the next shard once every N records. Shouldn't every record go to the next shard? How about something like this (see the Python sketch below):
var writers []writer
// fill writers
writers[i%num_shards].Write(record)
// close all writers once everything is done; don't close and re-create writers frequently.
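A Python sketch of the same round-robin idea, assuming the recordio.writer objects used in the diff expose write() and close(); the helper name and the -of- naming are illustrative:

import cPickle as pickle
import recordio

def write_shards(output_path, reader, num_shards, name_prefix):
    # open every shard writer once, up front
    writers = [
        recordio.writer("%s/%s-%05d-of-%05d" %
                        (output_path, name_prefix, i, num_shards - 1))
        for i in range(num_shards)
    ]
    for i, record in enumerate(reader()):
        # record i goes to shard i % num_shards
        writers[i % num_shards].write(
            pickle.dumps(record, pickle.HIGHEST_PROTOCOL))
    # close every writer once, after everything is written
    for w in writers:
        w.close()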
Done. I had it wrong at first: I mistook num_shards for the interval at which records are written to each file. Oops!
My initial concern was that if the number of files is large while the number of records is small, empty files would be generated.
    def test_convert(self):
        def test_reader():
            def reader():
                for x in xrange(10):
I think this test will break when 10 is changed to 4, given line 191: if i % num_shards == 0 and i >= num_shards:
Done
Great! Only one minor comment.
python/paddle/v2/dataset/common.py
Outdated
            reader,
            num_shards,
            name_prefix,
            max_lines_to_shuffle=10000):
Maybe 1000 would be better. Say we have images of 200KB each; 10000 of them is about 1.9GB, which might be too much memory consumption.
Fix convert library