Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_json/to_jsonl does not have encoding options when exporting file? #70

Closed
chslink opened this issue May 9, 2016 · 7 comments
Closed
Assignees
Labels
Milestone

Comments

@chslink
Copy link

chslink commented May 9, 2016

when using to_json/to_jsonl to export files.I found the methon no encoding option.I read the code,and both functions using json.dump and utf-8 charset.
And thanks for your awesome work,I really love this functional toolkits.

@EntilZha
Copy link
Owner

EntilZha commented May 9, 2016

Thanks @chslink for the bug report and thanks! I'll look into it, but if you have a test case that would be helpful to reproduce the bug with.

@EntilZha EntilZha added the bug label May 9, 2016
@EntilZha EntilZha added this to the 0.7.0 milestone May 9, 2016
@EntilZha EntilZha self-assigned this May 9, 2016
@chslink
Copy link
Author

chslink commented May 10, 2016

That's strange,maybe i found another bug...

My test environment:

Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32

My test code:

testlist = [
        {"id": 1, "key": u"some unicode 哈哈哈 1"},
        {"id": 2, "key": u"some unicode 哈哈哈 2"}
    ]

data = seq(testlist)
print data.find(lambda x: x['id'] == 1)
print data.drop_while(lambda x: ['id'] == 1)
data.to_json("r.json")
data.to_jsonl("r.jsonl")

Output:

{'id': 1, 'key': u'some unicode \u54c8\u54c8\u54c8 1'}
[{'id': 1, 'key': u'some unicode \u54c8\u54c8\u54c8 1'}, {'id': 2, 'key': u'some unicode \u54c8\u54c8\u54c8 2'}]

I thought two files will be encoded with utf-8. But r.jsonl file was encoded with my system default charset.
The other bug seems drop_while() not working?

@EntilZha
Copy link
Owner

In your code, you wrote lambda x: ['id'] instead of lambda x: x['id']. Since the array is non-empty, it would return true. After fixing that, things seem to work as you would expect.

I took a closer look at the code, and your bug report is correct. Currently, the default json writer is used on the to_json call which by default uses utf 8. The jsonl writer uses the builtin open and its default encoding which is the system default.

My proposal to fix this issue would be:

  1. Change the default encoding of to_jsonl to utf8
  2. Include an encoding parameter for both to_json and to_jsonl to let user change this if needed.

The outcome is that jsonl and json outputs are consistent, but it allows users to change these if needed just as with to_file.

@chslink
Copy link
Author

chslink commented May 11, 2016

Sorry,I was too careless...
My question is this:

data = seq(1,2,3,4)
print data.drop(1)
print data

Output:

[2, 3, 4]
[1, 2, 3, 4]

Obviously, element isn't really drop from the sequence...

EntilZha added a commit that referenced this issue May 11, 2016
@EntilZha
Copy link
Owner

EntilZha commented May 11, 2016

The behavior you describe is actually exactly the intended behavior. Once a seq is initialized it is for most purposes an immutable stream.

It would be better to think about the function calls as a declarative sequence of transformations rather than procedural directives. For example, I would interpret your example as:

  1. data.drop(1): Give me data after one element has been dropped
  2. data: Give me data

This can be seen a little more clearly by doing this:

print(data.drop(1)._lineage)
# Lineage: sequence -> drop(1)
print(data._lineage)
# Lineage: sequence

In other words, any mutations to a sequence are not in place, but instead describe a pipeline of transformations to apply to the original sequence.

On the bug, I just pushed a fix in 4a3735b

UPDATE: I decided to remove the encoding parameter/option. In python 3 json gets loaded/dumped as unicode so it makes sense to keep consistency with this (python 2 has the encoding option, but that is removed/deprecated in python 3).

Here is the code output now:

$ python
Python 2.7.11 (default, Jan 19 2016, 14:37:25)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from functional import seq
>>> testlist = [
...         {"id": 1, "key": u"some unicode 哈哈哈 1"},
...         {"id": 2, "key": u"some unicode 哈哈哈 2"}
...     ]
>>>
>>> data = seq(testlist)
>>> data.to_json("r.json")
>>> data.to_jsonl("r.jsonl")
>>> exit()
$ cat r.json
[{"id": 1, "key": "some unicode \u54c8\u54c8\u54c8 1"}, {"id": 2, "key": "some unicode \u54c8\u54c8\u54c8 2"}]
$ cat r.jsonl
{"id": 1, "key": "some unicode \u54c8\u54c8\u54c8 1"}
{"id": 2, "key": "some unicode \u54c8\u54c8\u54c8 2"}

It looks like it works, but could you verify on your end to be sure? The best way to do this would be to do one of these inside a virtualenv:

  1. pip install git+git://github.com/EntilZha/PyFunctional.git
  2. clone the github repo and do python setup.py install

@chslink
Copy link
Author

chslink commented May 12, 2016

I'd known what you explaining about drop() behavior.
I'd already tested,nice work.

@EntilZha
Copy link
Owner

Sounds like everything is good so resolving this issue. Feel free to open another issue if you find another bug. Thanks for the bug report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants