pack TEMPLATE,LIST
Takes a LIST of values and converts it into a string using the
rules given by the TEMPLATE. The resulting string is the
concatenation of the converted values. Typically, each converted
value looks like its machine-level representation. For example,
on 32-bit machines an integer may be represented by a sequence
of 4 bytes, which will in Perl be presented as a string that's 4
characters long.
See perlpacktut for an introduction to this function.
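As a minimal sketch of the idea above, the following packs a
single integer with the fixed-width "l" format and looks at the
resulting string. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Pack the integer 1 as a signed 32-bit value ("l" is always 32 bits).
my $packed = pack("l", 1);

# The result is a string of 4 characters, one per byte.
my $length = length $packed;
print "packed length: $length\n";

# unpack with the same template recovers the original value.
my ($value) = unpack("l", $packed);
print "round trip: $value\n";
```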
The TEMPLATE is a sequence of characters that gives the order and
type of values, as follows:
    a  A string with arbitrary binary data, will be null padded.
    A  A text (ASCII) string, will be space padded.
    Z  A null-terminated (ASCIZ) string, will be null padded.
    b  A bit string (ascending bit order inside each byte,
       like vec()).
    B  A bit string (descending bit order inside each byte).
    h  A hex string (low nybble first).
    H  A hex string (high nybble first).
    c  A signed char (8-bit) value.
    C  An unsigned char (octet) value.
    W  An unsigned char value (can be greater than 255).
    s  A signed short (16-bit) value.
    S  An unsigned short value.
    l  A signed long (32-bit) value.
    L  An unsigned long value.
    q  A signed quad (64-bit) value.
    Q  An unsigned quad value.
         (Quads are available only if your system supports 64-bit
          integer values _and_ if Perl has been compiled to support
          those. Raises an exception otherwise.)
    i  A signed integer value.
    I  An unsigned integer value.
         (This 'integer' is _at_least_ 32 bits wide. Its exact
          size depends on what a local C compiler calls 'int'.)
    n  An unsigned short (16-bit) in "network" (big-endian) order.
    N  An unsigned long (32-bit) in "network" (big-endian) order.
    v  An unsigned short (16-bit) in "VAX" (little-endian) order.
    V  An unsigned long (32-bit) in "VAX" (little-endian) order.
    j  A Perl internal signed integer value (IV).
    J  A Perl internal unsigned integer value (UV).
    f  A single-precision float in native format.
    d  A double-precision float in native format.
    F  A Perl internal floating-point value (NV) in native format.
    D  A float of long-double precision in native format.
         (Long doubles are available only if your system supports
          long double values _and_ if Perl has been compiled to
          support those. Raises an exception otherwise.
          Note that there are different long double formats.)
    p  A pointer to a null-terminated string.
    P  A pointer to a structure (fixed-length string).
    u  A uuencoded string.
    U  A Unicode character number. Encodes to a character in
       character mode and UTF-8 (or UTF-EBCDIC in EBCDIC platforms)
       in byte mode.
    w  A BER compressed integer (not an ASN.1 BER, see perlpacktut
       for details). Its bytes represent an unsigned integer in
       base 128, most significant digit first, with as few digits
       as possible. Bit eight (the high bit) is set on each byte
       except the last.
    x  A null byte (a.k.a. ASCII NUL, "\000", chr(0)).
    X  Back up a byte.
    @  Null-fill or truncate to absolute position, counted from the
       start of the innermost ()-group.
    .  Null-fill or truncate to absolute position specified by
       the value.
    (  Start of a ()-group.
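To make the "w" entry in the table concrete, here is a small
sketch of the base-128 encoding it describes. The value 300 is
chosen only for illustration.

```perl
use strict;
use warnings;

# 300 = 2 * 128 + 44, so two base-128 digits are needed.
# The first byte has the high bit set (0x80 | 2 = 0x82);
# the last byte (44 = 0x2C) does not.
my $ber = pack("w", 300);

# Values below 128 fit in a single byte with the high bit clear.
my $small = pack("w", 127);

# unpack reverses the encoding.
my ($n) = unpack("w", $ber);
print "decoded: $n\n";
```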
One or more modifiers below may optionally follow certain
letters in the TEMPLATE (the second column lists letters for
which the modifier is valid):
    !   sSlLiI     Forces native (short, long, int) sizes instead
                   of fixed (16-/32-bit) sizes.
    !   xX         Make x and X act as alignment commands.
    !   nNvV       Treat integers as signed instead of unsigned.
    !   @.         Specify position as byte offset in the internal
                   representation of the packed string. Efficient
                   but dangerous.
    >   sSiIlLqQ   Force big-endian byte-order on the type.
        jJfFdDpP   (The "big end" touches the construct.)
    <   sSiIlLqQ   Force little-endian byte-order on the type.
        jJfFdDpP   (The "little end" touches the construct.)
The ">" and "<" modifiers can also be used on "()" groups to
force a particular byte-order on all components in that group,
including all its subgroups.
The following rules apply:
* Each letter may optionally be followed by a number
indicating the repeat count. A numeric repeat count may
optionally be enclosed in brackets, as in "pack("C[80]",
@arr)". The repeat count gobbles that many values from the
LIST when used with all format types other than "a", "A",
"Z", "b", "B", "h", "H", "@", ".", "x", "X", and "P", where
it means something else, described below. Supplying a "*"
for the repeat count instead of a number means to use
however many items are left, except for:
    * "@", "x", and "X", where it is equivalent to 0.
    * ".", where it means relative to the start of the string.
    * "u", where it is equivalent to 1 (or 45, which here is
      equivalent).
One can replace a numeric repeat count with a template
letter enclosed in brackets to use the packed byte length of
the bracketed template for the repeat count.
For example, the template "x[L]" skips as many bytes as in a
packed long, and the template "$t X[$t] $t" unpacks twice
whatever $t (when variable-expanded) unpacks. If the
template in brackets contains alignment commands (such as
"x![d]"), its packed length is calculated as if the start of
the template had the maximal possible alignment.
When used with "Z", a "*" as the repeat count is guaranteed
to add a trailing null byte, so the resulting string is
always one byte longer than the byte length of the item
itself.
When used with "@", the repeat count represents an offset
from the start of the innermost "()" group.
When used with ".", the repeat count determines the starting
position to calculate the value offset as follows:
    * If the repeat count is 0, it's relative to the current
      position.
    * If the repeat count is "*", the offset is relative to
      the start of the packed string.
    * And if it's an integer *n*, the offset is relative to
      the start of the *n*th innermost "( )" group, or to the
      start of the string if *n* is bigger than the group
      level.
The repeat count for "u" is interpreted as the maximal
number of bytes to encode per line of output, with 0, 1 and
2 replaced by 45. The repeat count should not be more than
65.
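The repeat-count rules above can be sketched as follows. The
character values assume an ASCII platform, and the variable
names are illustrative.

```perl
use strict;
use warnings;

# A numeric count consumes that many values from the list.
my $three = pack("C3", 65, 66, 67);        # "ABC"

# Brackets around the count are equivalent.
my $bracket = pack("C[3]", 65, 66, 67);    # "ABC"

# "*" uses however many items are left.
my $all = pack("C*", 72, 105);             # "Hi"

# With "Z", "*" always appends a trailing null byte.
my $z = pack("Z*", "abc");                 # "abc\0"

# "x[L]" emits as many null bytes as a packed "L" occupies (4).
my $skip = pack("x[L]");

# For "u", the count is the number of input bytes per output line;
# a bare "u" behaves like "u45".
my $uu = pack("u", "abc");
my ($decoded) = unpack("u", $uu);
```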
* The "a", "A", and "Z" types gobble just one value, but pack
it as a string of length count, padding with nulls or spaces
as needed. When unpacking, "A" strips trailing whitespace
and nulls, "Z" strips everything after the first null, and
"a" returns data with no stripping at all.
If the value to pack is too long, the result is truncated.
If it's too long and an explicit count is provided, "Z"
packs only "$count-1" bytes, followed by a null byte. Thus
"Z" always packs a trailing null, except when the count is
0.
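A short sketch of the padding, truncation, and stripping rules
for "a", "A", and "Z" described above (variable names are
illustrative):

```perl
use strict;
use warnings;

# Packing: short values are padded to the requested length.
my $a_pad = pack("a5", "hi");            # "hi\0\0\0"  (null padded)
my $A_pad = pack("A5", "hi");            # "hi   "     (space padded)
my $Z_pad = pack("Z5", "hi");            # "hi\0\0\0"  (null padded)

# Packing: long values are truncated; "Z" keeps room for the null.
my $a_cut = pack("a5", "hello world");   # "hello"
my $Z_cut = pack("Z5", "hello world");   # "hell\0"

# Unpacking: "A" strips trailing spaces and nulls, "Z" stops at the
# first null, and "a" returns the data untouched.
my ($A_out) = unpack("A5", "hi   ");     # "hi"
my ($Z_out) = unpack("Z5", "hi\0xx");    # "hi"
my ($a_out) = unpack("a5", "hi\0\0\0");  # "hi\0\0\0"
```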
* Likewise, the "b" and "B" formats pack a string that's that
many bits long. Each such format generates 1 bit of the
result. These are typically followed by a repeat count like
"B8" or "B64".
Each result bit is based on the least-significant bit of the
corresponding input character, i.e., on "ord($char)%2". In
particular, characters "0" and "1" generate bits 0 and 1, as
do characters "\000" and "\001".
Starting from the beginning of the input string, each
8-tuple of characters is converted to 1 character of output.
With format "b", the first character of the 8-tuple
determines the least-significant bit of a character; with
format "B", it determines the most-significant bit of a
character.
If the length of the input string is not evenly divisible by
8, the remainder is packed as if the input string were
padded by null characters at the end. Similarly during
unpacking, "extra" bits are ignored.
If the input string is longer than needed, remaining
characters are ignored.
A "*" for the repeat count uses all characters of the input
field. On unpacking, bits are converted to a string of 0s
and 1s.
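The bit-order difference between "b" and "B" is easiest to see
side by side. This sketch builds the byte 0x41 both ways; the
input strings are illustrative.

```perl
use strict;
use warnings;

# "B": the first input character is the most-significant bit.
my $msb_first = pack("B8", "01000001");   # 0x41

# "b": the first input character is the least-significant bit,
# so the same byte is written with the bit string reversed.
my $lsb_first = pack("b8", "10000010");   # 0x41

# Input shorter than a full byte is padded with zero bits at the end:
# "0100000" is packed as if it were "01000000" (0x40).
my $padded = pack("B*", "0100000");

# Unpacking produces strings of 0s and 1s.
my ($bits_B) = unpack("B8", "\x41");      # "01000001"
my ($bits_b) = unpack("b8", "\x41");      # "10000010"
```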
* The "h" and "H" formats pack a string that many nybbles
(4-bit groups, representable as hexadecimal digits, "0".."9"
"a".."f") long.
For each such format, "pack" generates 4 bits of result.
With non-alphabetical characters, the result is based on the
4 least-significant bits of the input character, i.e., on
"ord($char)%16". In particular, characters "0" and "1"
generate nybbles 0 and 1, as do bytes "\000" and "\001". For
characters "a".."f" and "A".."F", the result is compatible
with the usual hexadecimal digits, so that "a" and "A" both
generate the nybble "0xA==10". Use only these specific hex
characters with this format.
Starting from the beginning of the input string, each
pair of characters is converted to 1 character of output.
With format "h", the first character of the pair determines
the least-significant nybble of the output character; with
format "H", it determines the most-significant nybble.
If the length of the input string is not even, it behaves as
if padded by a null character at the end. Similarly, "extra"
nybbles are ignored during unpacking.
If the input string is longer than needed, extra characters
are ignored.
A "*" for the repeat count uses all characters of the input
field. For "unpack", nybbles are converted to a string of
hexadecimal digits.
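The nybble ordering of "h" and "H" can be sketched in the same
way. The byte values 0x41 and 0x42 are chosen only for
illustration.

```perl
use strict;
use warnings;

# "H": the first character of each pair is the high nybble.
my $high_first = pack("H4", "4142");    # bytes 0x41 0x42

# "h": the first character of each pair is the low nybble.
my $low_first = pack("h4", "1424");     # bytes 0x41 0x42

# Upper- and lower-case hex digits are equivalent.
my $mixed = pack("H*", "4a4B");         # bytes 0x4A 0x4B

# An odd number of nybbles is padded with a zero nybble.
my $odd = pack("H3", "414");            # bytes 0x41 0x40

# Unpacking yields a string of hexadecimal digits.
my ($hex_H) = unpack("H*", "\x41\x42"); # "4142"
my ($hex_h) = unpack("h*", "\x41\x42"); # "1424"
```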
* The "p" format packs a pointer to a null-terminated string.
You are responsible for ensuring that the string is not a
temporary value, as that could potentially get deallocated
before you got around to using the packed result. The "P"
format packs a pointer to a structure of the size indicated
by the length. A null pointer is created if the
corresponding value for "p" or "P" is "undef"; similarly
with "unpack", where a null pointer unpacks into "undef".
If your system has a strange pointer size--meaning a pointer
is neither as big as an int nor as big as a long--it may not
be possible to pack or unpack pointers in big- or
little-endian byte order. Attempting to do so raises an
exception.
* The "/" template character allows packing and unpacking of a
sequence of items where the packed structure contains a
packed item count followed by the packed items themselves.
This is useful when the structure you're unpacking has
encoded the sizes or repeat counts for some of its fields
within the structure itself as separate fields.
For "pack", you write *length-item*"/"*sequence-item*, and
the *length-item* describes how the length value is packed.
Formats likely to be of most use are integer-packing ones
like "n" for Java strings, "w" for ASN.1 or SNMP, and "N"
for Sun XDR.
For "pack", *sequence-item* may have a repeat count, in
which case the minimum of that and the number of available
items is used as the argument for *length-item*. If it has
no repeat count or uses a '*', the number of available items
is used.
For "unpack", an internal stack of integer arguments
unpacked so far is used. You write "/"*sequence-item* and
the repeat count is obtained by popping off the last element
from the stack. The *sequence-item* must not have a repeat
count.
If *sequence-item* refers to a string type ("A", "a", or
"Z"), the *length-item* is the string length, not the number
of strings. With an explicit repeat count for pack, the
packed string is adjusted to that length. For example:
    This code:                               gives this result:
    unpack("W/a", "\004Gurusamy")            ("Guru")
    unpack("a3/A A*", "007 Bond  J ")        (" Bond", "J")
    unpack("a3 x2 /A A*", "007: Bond, J.")   ("Bond, J", ".")
    pack("n/a* w/a","hello,","world")        "\000\006hello,\005world"
    pack("a/W2", ord("a") .. ord("z"))       "2ab"
The *length-item* is not returned explicitly from "unpack".
Supplying a count to the *length-item* format letter is only
useful with "A", "a", or "Z". Packing with a *length-item*
of "a" or "Z" may introduce "\000" characters, which Perl
does not regard as legal in numeric strings.
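One common use of "/" is a sequence of length-prefixed strings.
The sketch below is an illustration only: it wraps "n/a*" in a
"()" group with a "*" repeat count so the same template can
pack and unpack any number of strings.

```perl
use strict;
use warnings;

# Each string is stored as a 16-bit big-endian length followed
# by that many bytes of data.
my $template = "(n/a*)*";

my $packed  = pack($template, "foo", "quux");
my @strings = unpack($template, $packed);

# The length fields are consumed by unpack and not returned.
print join(",", @strings), "\n";

# The first example from the table above: the count byte 4
# selects the first four characters.
my ($guru) = unpack("W/a", "\004Gurusamy");
```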
* The integer types "s", "S", "l", and "L" may be followed by
a "!" modifier to specify native shorts or longs. As shown
in the table above, a bare "l" means exactly 32 bits,
although the native "long" as seen by the local C compiler
may be larger. This is mainly an issue on 64-bit platforms.
You can see whether using "!" makes any difference this way:
printf "format s is %d, s! is %d\n",
length pack("s"), length pack("s!");
printf "format l is %d, l! is %d\n",
length pack("l"), length pack("l!");
"i!" and "I!" are also allowed, but only for completeness'
sake: they are identical to "i" and "I".
The actual sizes (in bytes) of native shorts, ints, longs,
and long longs on the platform where Perl was built are also
available from the command line:
$ perl -V:{short,int,long{,long}}size
shortsize='2';
intsize='4';
longsize='4';
longlongsize='8';
or programmatically via the "Config" module:
use Config;
print $Config{shortsize}, "\n";
print $Config{intsize}, "\n";
print $Config{longsize}, "\n";
print $Config{longlongsize}, "\n";
$Config{longlongsize} is undefined on systems without long
long support.
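The relationship between the "!" forms and the "Config" values
can be checked directly. This is a sketch; the zero arguments
are there only to avoid packing an empty list.

```perl
use strict;
use warnings;
use Config;

# Fixed-width formats: always 2 and 4 bytes.
my $short_fixed = length pack("s", 0);
my $long_fixed  = length pack("l", 0);

# Native formats: whatever the C compiler used to build Perl chose.
my $short_native = length pack("s!", 0);
my $long_native  = length pack("l!", 0);
my $int_native   = length pack("i", 0);

printf "s=%d s!=%d l=%d l!=%d i=%d\n",
    $short_fixed, $short_native, $long_fixed, $long_native, $int_native;
```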
* The integer formats "s", "S", "i", "I", "l", "L", "j", and
"J" are inherently non-portable between processors and
operating systems because they obey native byteorder and
endianness. For example, a 4-byte integer 0x12345678
(305419896 decimal) would be ordered natively (arranged in
and handled by the CPU registers) into bytes as
0x12 0x34 0x56 0x78 # big-endian
0x78 0x56 0x34 0x12 # little-endian
Basically, Intel and VAX CPUs are little-endian, while
everybody else, including Motorola m68k/88k, PPC, Sparc, HP
PA, Power, and Cray, are big-endian. Alpha and MIPS can be
either: Digital/Compaq uses (well, used) them in
little-endian mode, but SGI/Cray uses them in big-endian
mode.
The names *big-endian* and *little-endian* are comic
references to the egg-eating habits of the little-endian
Lilliputians and the big-endian Blefuscudians from the
classic Jonathan Swift satire, *Gulliver's Travels*. This
entered computer lingo via the paper "On Holy Wars and a
Plea for Peace" by Danny Cohen, USC/ISI IEN 137, April 1,
1980.
Some systems may have even weirder byte orders such as
0x56 0x78 0x12 0x34
0x34 0x12 0x78 0x56
These are called mid-endian, middle-endian, mixed-endian, or
just weird.
You can determine your system endianness with this
incantation:
printf("%#02x ", $_) for unpack("W*", pack L=>0x12345678);
The byteorder on the platform where Perl was built is also
available via Config:
use Config;
print "$Config{byteorder}\n";
or from the command line:
$ perl -V:byteorder
Byteorders "1234" and "12345678" are little-endian; "4321"
and "87654321" are big-endian. Systems with
multiarchitecture binaries will have "ffff", signifying that
static information doesn't work and that one must use runtime
probing.
For portably packed integers, either use the formats "n",
"N", "v", and "V" or else use the ">" and "<" modifiers
described immediately below. See also perlport.
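A runtime probe along the lines described above might look like
this. It is a sketch, not the only way to do it: it compares the
native "L" layout against the portable "N" and "V" layouts.

```perl
use strict;
use warnings;

my $value = 0x12345678;

my $big    = pack("N", $value);   # always 12 34 56 78
my $little = pack("V", $value);   # always 78 56 34 12
my $native = pack("L", $value);   # depends on the machine

my $order = $native eq $big    ? "big-endian"
          : $native eq $little ? "little-endian"
          :                      "mixed-endian";
print "native 32-bit integers are $order\n";

# Data written with "N" or "V" reads back the same everywhere.
my ($from_big)    = unpack("N", $big);
my ($from_little) = unpack("V", $little);
```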
* Floating-point numbers also have endianness. Usually (but
not always) this agrees with the integer endianness. Even
though most platforms these days use the IEEE 754 binary
format, there are differences, especially where long
doubles are involved. See the "Config" variables
"doublekind" and "longdblkind" (also "doublesize" and
"longdblsize"): the "kind" values are enums, unlike
"byteorder".
Portability-wise, the best option is probably to stick to
IEEE 754 64-bit doubles of an agreed-upon endianness.
Another possibility is the "%a" format of "printf".
* Starting with Perl 5.10.0, integer and floating-point
formats, along with the "p" and "P" formats and "()" groups,
may all be followed by the ">" or "<" endianness modifiers
to respectively enforce big- or little-endian byte-order.
These modifiers are especially useful given how "n", "N",
"v", and "V" don't cover signed integers, 64-bit integers,
or floating-point values.
Here are some concerns to keep in mind when using an
endianness modifier:
* Exchanging signed integers between different platforms
works only when all platforms store them in the same
format. Most platforms store signed integers in
two's-complement notation, so usually this is not an
issue.
* The ">" or "<" modifiers can only be used on
floating-point formats on big- or little-endian
machines. Otherwise, attempting to use them raises an
exception.
* Forcing big- or little-endian byte-order on
floating-point values for data exchange can work only if
all platforms use the same binary representation such as
IEEE floating-point. Even if all platforms are using
IEEE, there may still be subtle differences. Being able
to use ">" or "<" on floating-point values can be
useful, but also dangerous if you don't know exactly
what you're doing. It is not a general way to portably
store floating-point values.
* When using ">" or "<" on a "()" group, this affects all
types inside the group that accept byte-order modifiers,
including all subgroups. It is silently ignored for all
other types. You are not allowed to override the
byte-order within a group that already has a byte-order
modifier suffix.
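The following sketch shows the endianness modifiers on signed
integers and on a group. It assumes Perl 5.10.0 or later and a
two's-complement platform, as discussed above.

```perl
use strict;
use warnings;

# Signed 32-bit, forced big-endian: -2 is FF FF FF FE.
my $be = pack("l>", -2);

# Signed 16-bit, forced little-endian: -2 is FE FF.
my $le = pack("s<", -2);

# A modifier on a group applies to every type inside it.
my $group = pack("(s l)>", 1, 2);

# The same templates read the values back.
my ($be_back) = unpack("l>", $be);
my ($le_back) = unpack("s<", $le);
```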
* Real numbers (floats and doubles) are in native machine
format only. Due to the multiplicity of floating-point
formats and the lack of a standard "network" representation
for them, no facility for interchange has been made. This
means that packed floating-point data written on one machine
may not be readable on another, even if both use IEEE
floating-point arithmetic (because the endianness of the
memory representation is not part of the IEEE spec). See
also perlport.
If you know *exactly* what you're doing, you can use the ">"
or "<" modifiers to force big- or little-endian byte-order
on floating-point values.
Because Perl uses doubles (or long doubles, if configured)
internally for all numeric calculation, converting from
double into float and thence to double again loses
precision, so "unpack("f", pack("f", $foo))" will not in
general equal $foo.
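The precision loss is easy to observe. This sketch uses 0.1,
which has no exact binary representation, and compares a round
trip through "f" with one through "F" (Perl's own NV type).

```perl
use strict;
use warnings;

my $foo = 0.1;

# Through a single-precision float: low-order bits are lost.
my ($via_float) = unpack("f", pack("f", $foo));

# Through Perl's native NV: nothing is lost.
my ($via_nv) = unpack("F", pack("F", $foo));

my $float_differs = $via_float != $foo ? 1 : 0;
my $nv_matches    = $via_nv == $foo    ? 1 : 0;

printf "original:  %.17g\n", $foo;
printf "via float: %.17g\n", $via_float;
```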
* Pack and unpack can operate in two modes: character mode
("C0" mode) where the packed string is processed per
character, and UTF-8 byte mode ("U0" mode) where the packed
string is processed in its UTF-8-encoded Unicode form on a
byte-by-byte basis. Character mode is the default unless the
format string starts with "U". You can always switch mode
mid-format with an explicit "C0" or "U0" in the format. This
mode remains in effect until the next mode change, or until
the end of the "()" group it (directly) applies to.
Using "C0" to get Unicode characters while using "U0" to get
*non*-Unicode bytes is not necessarily obvious. Probably
only the first of these is what you want:
$ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
perl -CS -ne 'printf "%v04X\n", $_ for unpack("C0A*", $_)'
03B1.03C9
$ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
perl -CS -ne 'printf "%v02X\n", $_ for unpack("U0A*", $_)'
CE.B1.CF.89
$ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
perl -C0 -ne 'printf "%v02X\n", $_ for unpack("C0A*", $_)'
CE.B1.CF.89
$ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
perl -C0 -ne 'printf "%v02X\n", $_ for unpack("U0A*", $_)'
C3.8E.C2.B1.C3.8F.C2.89
Those examples also illustrate that you should not try to
use "pack"/"unpack" as a substitute for the Encode module.
* You must do any alignment or padding yourself by inserting,
for example, enough "x"es while packing. There is no way for
"pack" and "unpack" to know where characters are going to or
coming from, so they handle their output and input as flat
sequences of characters.
* A "()" group is a sub-TEMPLATE enclosed in parentheses. A
group may take a repeat count either as postfix, or for
"unpack", also via the "/" template character. Within each
repetition of a group, positioning with "@" starts over at
0. Therefore, the result of
pack('@1A((@2A)@3A)', qw[X Y Z])
is the string "\0X\0\0YZ".
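A sketch of both group behaviours described above: a postfix
repeat count, and "@" positions that restart within each group.
Templates containing "@" are single-quoted here so that Perl
does not try to interpolate them as arrays.

```perl
use strict;
use warnings;

# A repeat count on a group repeats the whole sub-template.
my $records = pack("(C n)2", 1, 2, 3, 4);
my @fields  = unpack("(C n)2", $records);

# "@" counts from the start of the innermost group, so "@2" in
# the inner group and "@3" in the outer group are both relative
# to offset 2 of the whole string.
my $positions = pack('@1A((@2A)@3A)', qw[X Y Z]);
```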
* "x" and "X" accept the "!" modifier to act as alignment
commands: they jump forward or back to the closest position
aligned at a multiple of "count" characters. For example, to
"pack" or "unpack" a C structure like
    struct {
        char   c;    /* one signed, 8-bit character */
        double d;
        char   cc[2];
    }
one may need to use the template "c x![d] d c[2]". This
assumes that doubles must be aligned to the size of double.
For alignment commands, a "count" of 0 is equivalent to a
"count" of 1; both are no-ops.
* "n", "N", "v" and "V" accept the "!" modifier to represent
signed 16-/32-bit integers in big-/little-endian order. This
is portable only when all platforms sharing packed data use
the same binary representation for signed integers; for
example, when all platforms use two's-complement
representation.
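The two uses of "!" described in the last two items can be
sketched together. The struct layout assumes, as the text does,
that doubles are aligned to their own size; the values packed
are arbitrary.

```perl
use strict;
use warnings;
use Config;

# Alignment: "x!4" pads with nulls up to the next multiple of 4.
my $aligned = pack("C x!4 C", 1, 2);

# The struct template from above: one char, padding up to a
# multiple of the size of a double, the double, then two chars.
my $struct   = pack("c x![d] d c[2]", 1, 2.5, 3, 4);
my $expected = 2 * $Config{doublesize} + 2;

# Signed network order: "n!" treats the 16 bits as signed.
my $packed     = pack("n!", -2);
my ($signed)   = unpack("n!", $packed);
my ($unsigned) = unpack("n",  $packed);
```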
* Comments can be embedded in a TEMPLATE using "#" through the
end of line. White space can separate pack codes from each
other, but modifiers and repeat counts must follow
immediately. Breaking complex templates into individual
line-by-line components, suitably annotated, can do as much
to improve legibility and maintainability of pack/unpack
formats as "/x" can for complicated pattern matches.
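For instance, a three-field record might be laid out like this.
The field names in the comments are hypothetical and serve only
to show the annotation style.

```perl
use strict;
use warnings;

# One pack code per line, each with its own comment.
my $template = q{
    n     # 16-bit big-endian record id
    C     # 8-bit flags
    a4    # 4-byte tag, null padded
};

my $record = pack($template, 258, 7, "ABCD");
my @fields = unpack($template, $record);
print join(",", @fields), "\n";
```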
* If TEMPLATE requires more arguments than "pack" is given,
"pack" assumes additional "" arguments. If TEMPLATE requires
fewer arguments than given, extra arguments are ignored.
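A brief sketch of both cases, using string formats so the
effect of the implied "" argument is visible:

```perl
use strict;
use warnings;

# Too few arguments: the second "a2" packs an empty string,
# which is then null padded to two bytes.
my $short_list = pack("a2 a2", "ab");

# Too many arguments: "cd" is silently ignored.
my $long_list = pack("a2", "ab", "cd");
```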
* Attempting to pack the special floating point values "Inf"
and "NaN" (infinity, also in negative, and not-a-number)
into packed integer values (like "L") is a fatal error. The
reason for this is that there simply isn't any sensible
mapping for these special values into integers.
Examples:
$foo = pack("WWWW",65,66,67,68);
# foo eq "ABCD"
$foo = pack("W4",65,66,67,68);
# same thing
$foo = pack("W4",0x24b6,0x24b7,0x24b8,0x24b9);
# same thing with Unicode circled letters.
$foo = pack("U4",0x24b6,0x24b7,0x24b8,0x24b9);
# same thing with Unicode circled letters. You don't get the
# UTF-8 bytes because the U at the start of the format caused
# a switch to U0-mode, so the UTF-8 bytes get joined into
# characters
$foo = pack("C0U4",0x24b6,0x24b7,0x24b8,0x24b9);
# foo eq "\xe2\x92\xb6\xe2\x92\xb7\xe2\x92\xb8\xe2\x92\xb9"
# This is the UTF-8 encoding of the string in the
# previous example
$foo = pack("ccxxcc",65,66,67,68);
# foo eq "AB\0\0CD"
# NOTE: The examples above featuring "W" and "c" are true
# only on ASCII and ASCII-derived systems such as ISO Latin 1
# and UTF-8. On EBCDIC systems, the first example would be
# $foo = pack("WWWW",193,194,195,196);
$foo = pack("s2",1,2);
# "\001\000\002\000" on little-endian
# "\000\001\000\002" on big-endian
$foo = pack("a4","abcd","x","y","z");
# "abcd"
$foo = pack("aaaa","abcd","x","y","z");
# "axyz"
$foo = pack("a14","abcdefg");
# "abcdefg\0\0\0\0\0\0\0"
$foo = pack("i9pl", gmtime);
# a real struct tm (on my system anyway)
$utmp_template = "Z8 Z8 Z16 L";
$utmp = pack($utmp_template, @utmp1);
# a struct utmp (BSDish)
@utmp2 = unpack($utmp_template, $utmp);
# "@utmp1" eq "@utmp2"
sub bintodec {
    unpack("N", pack("B32", substr("0" x 32 . shift, -32)));
}
$foo = pack('sx2l', 12, 34);
# short 12, two zero bytes padding, long 34
$bar = pack('s@4l', 12, 34);
# short 12, zero fill to position 4, long 34
# $foo eq $bar
$baz = pack('s.l', 12, 4, 34);
# short 12, zero fill to position 4, long 34
$foo = pack('nN', 42, 4711);
# pack big-endian 16- and 32-bit unsigned integers
$foo = pack('S>L>', 42, 4711);
# exactly the same
$foo = pack('s<l<', -42, 4711);
# pack little-endian 16- and 32-bit signed integers
$foo = pack('(sl)<', -42, 4711);
# exactly the same
The same template may generally also be used in "unpack".