Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Complete implementation of NSScanner #2582

Merged
merged 4 commits into from
Apr 27, 2017

Conversation

ms-jihua
Copy link
Contributor

  • Implemented the following functions/properties:
    • locale
    • copyWithZone:
    • localizedScannerWithString:
    • scanDecimal:
    • scanHexDouble:
    • scanHexFloat:
  • Added unit tests for NSScanner (there were none, including for previously implemented features)
  • Fixed a number of edge cases that were incorrect

Fixes #2083

if (!scannedPeriod) {
scannedPeriod = true;
} else {
goto done_scanning_hex_double; // Stop scanning once a second period has been scanned
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since break inside a switch in a for loop breaks the switch and not the outer loop, this unfortunately seemed to be the best out of a group of bad options:

  • Add a done bool that is checked on every iteration: this causes ++i to occur once more than necessary unless it's the end of the string, making the set of _scanLocation = 1 at the end more complicated than necessary
  • Move the switch to a helper function: would need to manipulate both mantissa and exponent through out-pointers, seems unnecessarily messy
  • Add a continue to every 'good' case, then add a break outside the switch that the 'bad' cases can fall too: wwwwaayyy too messy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'm not sure if our coding standards handle gotos, but I think if this is the best option then we should make it follow the macro-style naming convention of ALL_CAPS so it really stands out.

} else if (!hasSign && unicode == '+') {
sign = 1;
hasSign = TRUE;
hasSign = YES;
} else if (unicode >= '0' && unicode <= '9') {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iswdigit?

} else if (unicode >= '0' && unicode <= '9') {
if (!hasOverflow) {
int c = unicode - '0';

// Inspired by http://www.math.utoledo.edu/~dbastos/overflow.html
if ((long_long_MAX - c) / 10 < value)
hasOverflow = TRUE;
if ((longLongMax - c) / 10 < value)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aaaa my eyes they are burning

nit: please fix these weirdos 😄

if ([_skipSet characterIsMember:unicode])
state = STATE_SPACE;
else if (unicode == '0') {
case STATE_START:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can these states be unified?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(the first two)

}

return TRUE;
return YES;
}

/**
@Status Interoperable
*/
- (BOOL)scanHexInt:(unsigned*)valuep {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we implement scanHexInt in terms of scanHexLongLong? we'd have to accoutn for overflow behaviour, but it would kill half our hex reading code

value = check;
else {
value = -1;
if (0xF000000000000000 & value) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this need ULL at the end? hmm

bool mantissaIsNegative = false;
bool exponentIsNegative = false;
bool scannedPeriod = false;
bool hasValue = false;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: your new code has hasValue as a bool but the old code has it as BOOL; standardize?


done_scanning_hex_double:; // Allows breaking out of the loop from inside the switch
if (hasValue) {
if (result) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there an overflow case to be had for decimals?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean besides the exponent getting too large and causing ldexp to return HUGE_VAL or 0?

if the mantissa becomes too precise to represent correctly, the latter (least significant) bits get chopped off during the double operations in STATE_MANTISSA

return StubReturn();
}
- (BOOL)isAtEnd {
@synchronized(self) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

none of the other APIs are @synchronized... should they be?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NSScanner is in the list of thread-unsafe Foundation classes.


static const double c_floatingPtTolerance = 0.00001;

// Helper template function that calls a scan_______ function specified by the selector and does the nececssary casts

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: nececssary

}

TEST(NSScanner, ScanHexLongLong) {
testScanHexIntegral<unsigned long long>(@selector(scanHexLongLong:));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see an additional test for something that fits in 64 but not 32; this doesn't have that


} else if (unicode >= 'a' && unicode <= 'f') {
if (scannedPeriod) {
mantissa += (mantissaDigitPlace /= 16) * (unicode - 'a' + 10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use the public struct NSDecimal to keep track of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be longer code, and we don't need NSDecimal's greater precision, since we only need exactly as much precision as would fit in a double's mantissa.

if (result) {
*result = static_cast<float>(doubleValue);
}
return YES;
}

/**
@Status Interoperable
@Notes
*/
- (BOOL)scanUnsignedLongLong:(unsigned long long*)pValue {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: no implicit int

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsigned and long are just modifiers, not actual types, so saying unsigned long long makes the compiler infer the type which by default is int. It's common to elide the int, but I'm not a fan 🐋

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, that. we as well as the docs forgo the int, so I'm leaving as is.

} else {
*valuep = long_long_MIN;
*valuep = std::numeric_limits<long long>::min();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be lowest()?

}

TEST(NSScanner, ScanLongLong) {
testScanSignedIntegral<long long>(@selector(scanLongLong:));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: they're everywhere 😿

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

???

- (BOOL)scanHexInt:(unsigned int*)intValue;
- (BOOL)scanHexLongLong:(unsigned long long*)result;
- (BOOL)scanInteger:(NSInteger*)value;
- (BOOL)scanInt:(int*)intValue;
- (BOOL)scanLongLong:(long long*)longLongValue;
- (BOOL)scanString:(NSString*)string intoString:(NSString* _Nullable*)stringValue;
- (BOOL)scanString:(NSString* _Nonnull)string intoString:(NSString* _Nullable*)stringValue;
- (BOOL)scanUnsignedLongLong:(unsigned long long*)unsignedLongLongValue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: they're everywhere 😿

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

@@ -185,45 +150,43 @@ - (BOOL)scanInteger:(int*)valuep {
}
}

return TRUE;
return YES;
}

/**
@Status Interoperable
*/
- (BOOL)scanLongLong:(long long*)valuep {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh. I am pretty sure this works, but why not use strtoll here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was not sure how we felt about swapping out large chunks of the previous code. there are also unichar -> char conversion implications I haven't thought about yet, but it should be fine I think...

std::vector<char> charsToScan;
charsToScan.reserve(_string.length - startLocation);
for (NSUInteger i = startLocation; i < length; ++i, ++pScan) {
if ((*pScan >= '0') && (*pScan <= '9')) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(*pScan >= '0') && (*pScan <= '9' [](start = 12, length = 33)

isdigit?

continue;
} else {
return FALSE;
if (_charactersToBeSkipped) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_charactersToBeSkipped [](start = 8, length = 22)

This method is going to be particularly slow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default case (skip whitespace, string is space separated) has low n, so it might not be as bad as you think. That said, is there an improvement you'd like to suggest? (Fetching more of the string at once?)

}

/**
@Status Stub
@Status Interoperable
@Notes
*/
- (BOOL)scanHexDouble:(double*)result {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scanHexDouble [](start = 8, length = 13)

could sscanf_s be used for the heavylifting here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sscanf doesn't return where it stopped scanning. you can sort of work around that by specifying "%a%s" as the format, but that scans the whole rest of the string unnecessarily... you could also do a bunch of preprocessing to give it a prepped state, but then it just ends up being the same, but with sscanf instead of ldexp

@bviglietta
Copy link
Contributor

  • (BOOL)scanInteger:(int*)valuep {

NSInteger


Refers to: Frameworks/Foundation/NSScanner.mm:137 in 658f544. [](commit_id = 658f544, deletion_comment = False)

@@ -174,7 +139,7 @@ - (BOOL)scanInteger:(int*)valuep {

// This assumes sizeof(long long) >= sizeof(int).
if (![self scanLongLong:&scanValue]) {
return FALSE;
return NO;
} else if (valuep) {
if (scanValue > LONG_MAX) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LONG_MAX [](start = 24, length = 8)

NSIntegerMax, etc.


pScanStart += _location;
char* pScan = (char*)[_string UTF8String] + startLocation;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(char*)[_string UTF8String] + startLocation [](start = 18, length = 43)

This math won't work out well if there are non-ASCII characters in the string.

// copy a scannable range of the string to a new buffer, adjusting for these constraints
char decimalSeparator;
if ((!_locale) ||
(![[_locale objectForKey:NSLocaleDecimalSeparator] getCString:&decimalSeparator maxLength:1 encoding:NSASCIIStringEncoding])) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NSASCIIStringEncoding [](start = 109, length = 21)

In some locales the decimal separator won't be representable as an ASCII character.


// Helper function for implementing a scan function for a numeric type using a CRT function such as wcstoll
template <typename TNumeric>
static BOOL __scanNumeric(NSScanner* self,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: this could really be c++ bool.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, scanner instead of self?

template <typename TNumeric>
static BOOL __scanNumeric(NSScanner* self,
TNumeric* valuep,
TNumeric (*localeFunc)(const wchar_t*, wchar_t**, int, _locale_t), // eg: _wcstoll_l
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you don't need a locale function here, you could use lambda to bind the locale when you need it (at the calling site).

_locale = nil;
// Overload of __scanNumeric for CRT functions that do not take a base parameter
template <typename TNumeric>
static BOOL __scanNumeric(NSScanner* self,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: lambdas to bind base when needed?

result = NO;
break;
const NSUInteger length = _string.length;
NSUInteger i = [self _indexOfNextUnskippedCharacter];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this could go inside the initializer part for the for statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it could in this instance, but there are other instances in the file where i needs to be available outside the for loop and I'd rather keep them in line

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. You can ignore any and all the nit comments.

range.length = [string length];
int oldLocation = _location;
- (BOOL)scanString:(NSString* _Nonnull)string intoString:(NSString* _Nullable*)stringp {
if (string.length == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the method uses string.length, and [string length], would be good to stick to one convention. also, may even make sense to stick it into a constant.

state = STATE_P;

} else {
goto DONE_SCANNING_HEXDOUBLE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of the goto here, please set a flag (doneScanningDouble) and outside the switch, check for the flag and break out of the for loop. We don't like goto's for a reason and this is easy to not use it here.

double mantissaDigitPlace = 1;

const NSUInteger length = _string.length;
NSUInteger i = [self _indexOfNextUnskippedCharacter];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, can be for initializer.

const NSUInteger length = _string.length;
NSUInteger i = [self _indexOfNextUnskippedCharacter];
for (; i < length; ++i) {
unichar unicode = _stringChars[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ch or even c is more appropriate than unicode.

goto DONE_SCANNING_HEXDOUBLE; // Stop scanning once a second period has been scanned
}

} else if (unicode >= '0' && unicode <= '9') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could extract the value of the current character first for digits and hex first and then do the scannedPeriod check. Better yet, you could have a multiplier value which is 16 before you scan period and 1/16 after you scannedPeriod, then you don't need the if checks at all. also, isdigit instead of actual comparison here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leaving the comparison since it better contextualizes the arithmetic and is more in line with the other checks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if checks would still be necessary even using that multiplier value, since the mantissa advances in different directions before/after the period ie:

0x12, scan 3 = 0x12 * 16 + 0x3 = 0x123 (multiplier applies to previous number, is constant)
0x1.2, scan 3 = 0x1.2 + 0x3 / 16^2 = 0x1.23 (multiplier applies to new digit, changes)

I could combine scanned period and the multiplier into one var, but I think that would hurt readability.


switch (state) {
case STATE_START:
if (unicode == '-') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are meant to be UTF16 unichars, then need L'-' etc in front of the chars. Not sure if that is automatic.

return _isCaseSensitive;
+ (instancetype)localizedScannerWithString:(NSString* _Nonnull)string {
NSScanner* ret = [self scannerWithString:string];
ret.locale = [NSLocale currentLocale];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do it via an init,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no equivalent public init, seems like unnecessary frills to make one just for this

state = STATE_SIGN;
mantissaIsNegative = true;
} else if (unicode == '+') {
state = STATE_SIGN;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

   state = STATE_SIGN; [](start = 12, length = 27)

you could just use the NSDecimal struct and if the precision is not too high, it would only take up the first slot of the mantissa array.
Also it would make the code much simpler, with tracking all the states.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm maybe not, given this is being looped over.


In reply to: 113068748 [](ancestors = 113068748)

case STATE_ZERO:
if (unicode == 'x' || unicode == 'X') {
state = STATE_MANTISSA;
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else [](start = 18, length = 4)

nit: can just return NO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? the top if case would fall through...


case STATE_P:
if (!hasValue) {
goto DONE_SCANNING_HEXDOUBLE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goto [](start = 20, length = 4)

goto :(

@ms-jihua
Copy link
Contributor Author

Ping


// Helper function for implementing a scan function for a numeric type using a CRT function such as wcstoll
template <typename TNumeric>
static BOOL __scanNumeric(NSScanner* scanner,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: make templated on lambda so we don't have to use std::function, and then just inline the lambdas into the __scanNumeric calls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would rather have it be strongly-typed - if it turns out we are bottlenecked by std::function overhead, we can make this change then

std::function<long(const wchar_t*, wchar_t**)> scanFunc = [self](const wchar_t* scanStart, wchar_t** scanEnd) {
return _crtLocale ? _wcstol_l(scanStart, scanEnd, 10, _crtLocale) : wcstol(scanStart, scanEnd, 10);
};
return __scanNumeric(self, reinterpret_cast<long*>(valuep), scanFunc);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may not be a valid cast on all platforms. prefer a temporary that you copy into NSInteger for truncation purposes

- Implemented the following functions/properties:
	- locale
	- copyWithZone:
	- localizedScannerWithString:
	- scanDecimal:
	- scanHexDouble:
	- scanHexFloat:
- Added unit tests for NSScanner (there were none, including for previously implemented features)
- Fixed a number of edge cases that were incorrect

Fixes microsoft#2083
…g CRT functions

- Changed NSScanner to prefetch all characters in the string at init
- Added tests that check for differing behaviors between 32-bit and 64-bit types
- Misc CR feedback
- Renamed the states in scanHexDouble to reflect the current expected state rather than previous state
@ms-jihua ms-jihua merged commit dcdc18d into microsoft:develop Apr 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants