Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

swf: Fall back to WINDOWS-1252 for text decoding #1499

Closed

Conversation

CounterPillow
Copy link
Contributor

Older Flash files (in this case, Mata Nui Online Game) use WINDOWS-1252 encoding for text. By trying this encoding using the encoding_rs crate if utf-8 decoding fails, the little guards at the lava gate have something to say again.

I don't know Rust please don't bully I just wanna play my lego games

Older Flash files (in this case, Mata Nui Online Game) use
WINDOWS-1252 encoding for text. By trying this encoding using the
encoding_rs crate if utf-8 decoding fails, the little guards at the
lava gate have something to say again.
@CounterPillow CounterPillow changed the title core: Fall back to WINDOWS-1252 for text decoding swf: Fall back to WINDOWS-1252 for text decoding Nov 5, 2020
@Herschel
Copy link
Member

Herschel commented Nov 5, 2020

Thanks and welcome!

This should be based on the SWF version -- if self.version >= 6, then the text encoding is UTF-8; otherwise it is locale-dependent (Windows 1252).

Eventually we want to provide options to specify the locale, and this should also apply to the AVM1 reader, but that will be for future PRs (I'm working on this currently by switching the SWF reader to return byte slices).

@CounterPillow
Copy link
Contributor Author

I don't see the SWF version exposed to SwfRead in any way; of the three implementations of SwfRead, only two have a version, so passing it along is also not trivial to do.

In short, I do not know this codebase or Rust so I don't know what the appropriate rearchitecturing is to make this work as you described.

@Herschel
Copy link
Member

Herschel commented Nov 6, 2020

For now, let's remove read_c_string from SwfRead and make it a method of swf::Reader to check the version.

  • avm2 is always SWF9+, so it is always UTF-8 encoding and never relies on the version. It uses its own read_string method.
  • avm1 does depend on the version, however it has its own read_c_string method that returns a str slice from the input, so dealing with this will be a bigger change (decoding gives you a Cow). Eventually we should make this return our own str-like type around a [u8] slice.

@rtpg
Copy link
Contributor

rtpg commented Nov 30, 2020

I've looked at this problem a bit, and I agree that if we want to cleanly support the ANSI and SHIFT_JIS locales, &str isn't going to cut it. Mainly that str is supposed to point to a unicode string and... well, Shift-JIS ain't unicode.

Would the next step forwards on this for avm1 be to have read_c_string return some specialized string object?

Two options I can think of:

  • some shim around [u8] slices that track the locale it's meant to be treated as. This is a nice API, though now you have a locale bit everywhere

  • instead, really have it just be [u8] slices everywhere and all usages also get an added Locale parameter that you have to pass in (Since, according to my understanding at least, everything ends up being the same encoding in a single run of the system)

    So for example any time you would need to actually decode the strings you would need to get the locale....somehow.

The first one definitely feels easier to manage overall.

@rtpg
Copy link
Contributor

rtpg commented Nov 30, 2020

I typed this up, thought about how (for example) string length doesn't get affected by the underlying encoding, and am now thinking that it's actually totally reasonable for us to not bundle the locale with the string type, save the byte, and just require callers/users to provide locale info each time. Especially given how many C strings would be floating in a standard run.

@Herschel
Copy link
Member

Herschel commented Nov 30, 2020

Yes, my thought is to define an SwfStr type that wraps around &[u8]. This both handles invalid UTF-8 and alternate locales, and I also think it should guarantee no interior null bytes to match C string behavior that occurs throughout the Flash Player.

I wasn't sure about bunding the encoding with the SwfStr -- on one hand, the encoding will be the same for an entire SWF, so it makes sense to save the bytes. On the other hand, bundling it may be more ergonomic and allow for impls for Display,

I have work on my branch that will help with this by converting the swf Reader to return byte slices for its types; I'll make a PR this week.

@Herschel
Copy link
Member

Herschel commented Nov 30, 2020

Also, although the specs only specifically mention ANSI and Shfit-JIS, I believe we actually need to support any encoding (Flashpoint specifically has quite a few Big5-encoded SWFs). So the encoding parameter could be an encoding_rs::Encoding and instead of trying to make an enum with various encoding types. Thankfully encoding_rs is already a dependency of core thanks to a recent PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants