perldoc/perlfun_pack.txt

    pack TEMPLATE,LIST
            Takes a LIST of values and converts it into a string using the
            rules given by the TEMPLATE. The resulting string is the
            concatenation of the converted values. Typically, each converted
            value looks like its machine-level representation. For example,
            on 32-bit machines an integer may be represented by a sequence
            of 4 bytes, which will in Perl be presented as a string that's 4
            characters long.

            See perlpacktut for an introduction to this function.

            The TEMPLATE is a sequence of characters that give the order and
            type of values, as follows:

                a  A string with arbitrary binary data, will be null padded.
                A  A text (ASCII) string, will be space padded.
                Z  A null-terminated (ASCIZ) string, will be null padded.

                b  A bit string (ascending bit order inside each byte,
                   like vec()).
                B  A bit string (descending bit order inside each byte).
                h  A hex string (low nybble first).
                H  A hex string (high nybble first).

                c  A signed char (8-bit) value.
                C  An unsigned char (octet) value.
                W  An unsigned char value (can be greater than 255).

                s  A signed short (16-bit) value.
                S  An unsigned short value.

                l  A signed long (32-bit) value.
                L  An unsigned long value.

                q  A signed quad (64-bit) value.
                Q  An unsigned quad value.
                     (Quads are available only if your system supports 64-bit
                      integer values _and_ if Perl has been compiled to support
                      those.  Raises an exception otherwise.)

                i  A signed integer value.
                I  An unsigned integer value.
                     (This 'integer' is _at_least_ 32 bits wide.  Its exact
                      size depends on what a local C compiler calls 'int'.)

                n  An unsigned short (16-bit) in "network" (big-endian) order.
                N  An unsigned long (32-bit) in "network" (big-endian) order.
                v  An unsigned short (16-bit) in "VAX" (little-endian) order.
                V  An unsigned long (32-bit) in "VAX" (little-endian) order.

                j  A Perl internal signed integer value (IV).
                J  A Perl internal unsigned integer value (UV).

                f  A single-precision float in native format.
                d  A double-precision float in native format.

                F  A Perl internal floating-point value (NV) in native format.
                D  A float of long-double precision in native format.
                     (Long doubles are available only if your system supports
                      long double values _and_ if Perl has been compiled to
                      support those.  Raises an exception otherwise.
                      Note that there are different long double formats.)

                p  A pointer to a null-terminated string.
                P  A pointer to a structure (fixed-length string).

                u  A uuencoded string.
                U  A Unicode character number.  Encodes to a character in
                   character mode and UTF-8 (or UTF-EBCDIC on EBCDIC
                   platforms) in byte mode.

                w  A BER compressed integer (not an ASN.1 BER, see perlpacktut
                   for details).  Its bytes represent an unsigned integer in
                   base 128, most significant digit first, with as few digits
                   as possible.  Bit eight (the high bit) is set on each byte
                   except the last.

                x  A null byte (a.k.a. ASCII NUL, "\000", chr(0)).
                X  Back up a byte.
                @  Null-fill or truncate to absolute position, counted from the
                   start of the innermost ()-group.
                .  Null-fill or truncate to absolute position specified by
                   the value.
                (  Start of a ()-group.

            One or more modifiers below may optionally follow certain
            letters in the TEMPLATE (the second column lists letters for
            which the modifier is valid):

                !   sSlLiI     Forces native (short, long, int) sizes instead
                               of fixed (16-/32-bit) sizes.

                !   xX         Make x and X act as alignment commands.

                !   nNvV       Treat integers as signed instead of unsigned.

                !   @.         Specify position as byte offset in the internal
                               representation of the packed string.  Efficient
                               but dangerous.

                >   sSiIlLqQ   Force big-endian byte-order on the type.
                    jJfFdDpP   (The "big end" touches the construct.)

                <   sSiIlLqQ   Force little-endian byte-order on the type.
                    jJfFdDpP   (The "little end" touches the construct.)

            The ">" and "<" modifiers can also be used on "()" groups to
            force a particular byte-order on all components in that group,
            including all its subgroups.

            The following rules apply:

            *   Each letter may optionally be followed by a number
                indicating the repeat count. A numeric repeat count may
                optionally be enclosed in brackets, as in "pack("C[80]",
                @arr)". The repeat count gobbles that many values from the
                LIST when used with all format types other than "a", "A",
                "Z", "b", "B", "h", "H", "@", ".", "x", "X", and "P", where
                it means something else, described below. Supplying a "*"
                for the repeat count instead of a number means to use
                however many items are left, except for:

                *   "@", "x", and "X", where it is equivalent to 0.

                *   ".", where it means relative to the start of the string.

                *   "u", where it is equivalent to 1 (or 45, which here is
                    equivalent).

                One can replace a numeric repeat count with a template
                letter enclosed in brackets to use the packed byte length of
                the bracketed template for the repeat count.

                For example, the template "x[L]" skips as many bytes as in a
                packed long, and the template "$t X[$t] $t" unpacks twice
                whatever $t (when variable-expanded) unpacks. If the
                template in brackets contains alignment commands (such as
                "x![d]"), its packed length is calculated as if the start of
                the template had the maximal possible alignment.

                When used with "Z", a "*" as the repeat count is guaranteed
                to add a trailing null byte, so the resulting string is
                always one byte longer than the byte length of the item
                itself.

                When used with "@", the repeat count represents an offset
                from the start of the innermost "()" group.

                When used with ".", the repeat count determines the starting
                position to calculate the value offset as follows:

                *   If the repeat count is 0, it's relative to the current
                    position.

                *   If the repeat count is "*", the offset is relative to
                    the start of the packed string.

                *   And if it's an integer *n*, the offset is relative to
                    the start of the *n*th innermost "( )" group, or to the
                    start of the string if *n* is bigger than the group
                    level.

                The repeat count for "u" is interpreted as the maximal
                number of bytes to encode per line of output, with 0, 1 and
                2 replaced by 45. The repeat count should not be more than
                65.
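
                As a rough illustration of the repeat-count rules above, the
                following sketch exercises a plain count, a bracketed count,
                "*", a bracketed template letter, and "Z*". The variable
                names are illustrative only, and the byte values assume an
                ASCII platform.

```perl
use strict;
use warnings;

# A numeric count, bracketed or not, gobbles that many LIST values.
my $plain     = pack("C3",   65, 66, 67);
my $bracketed = pack("C[3]", 65, 66, 67);    # same result as "C3"

# "*" uses however many items are left.
my $rest = pack("C*", 65 .. 70);

# "x[L]" emits as many null bytes as a packed "L" occupies.
my $skip = pack("x[L]");

# "Z*" always adds a trailing null byte.
my $zstar = pack("Z*", "abc");

printf "%s %s %d %d\n", $plain, $rest, length($skip), length($zstar);
# prints "ABC ABCDEF 4 4"
```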

            *   The "a", "A", and "Z" types gobble just one value, but pack
                it as a string of length count, padding with nulls or spaces
                as needed. When unpacking, "A" strips trailing whitespace
                and nulls, "Z" strips everything after the first null, and
                "a" returns data with no stripping at all.

                If the value to pack is too long, the result is truncated.
                If it's too long and an explicit count is provided, "Z"
                packs only "$count-1" bytes, followed by a null byte. Thus
                "Z" always packs a trailing null, except when the count is
                0.
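
                The padding, truncation, and stripping rules for the three
                string types can be compared side by side. This is only a
                sketch with made-up variable names:

```perl
use strict;
use warnings;

# Short values are padded to the requested length.
my $pad_a = pack("a5", "hi");       # "hi\0\0\0"
my $pad_A = pack("A5", "hi");       # "hi   "
my $pad_Z = pack("Z5", "hi");       # "hi\0\0\0"

# Long values are truncated; "Z" keeps room for its trailing null.
my $cut_a = pack("a3", "hello");    # "hel"
my $cut_Z = pack("Z3", "hello");    # "he\0"

# Unpacking: "A" strips trailing spaces and nulls, "Z" stops at the
# first null, and "a" returns everything untouched.
my ($un_A) = unpack("A6", "hi \0  ");    # "hi"
my ($un_Z) = unpack("Z6", "hi\0xyz");    # "hi"
my ($un_a) = unpack("a6", "hi\0xyz");    # "hi\0xyz"
```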

            *   Likewise, the "b" and "B" formats pack a string that is
                count bits long. Each character of the input string
                generates 1 bit of the result. These formats are typically
                followed by a repeat count like "B8" or "B64".

                Each result bit is based on the least-significant bit of the
                corresponding input character, i.e., on "ord($char)%2". In
                particular, characters "0" and "1" generate bits 0 and 1, as
                do characters "\000" and "\001".

                Starting from the beginning of the input string, each
                8-tuple of characters is converted to 1 character of output.
                With format "b", the first character of the 8-tuple
                determines the least-significant bit of a character; with
                format "B", it determines the most-significant bit of a
                character.

                If the length of the input string is not evenly divisible by
                8, the remainder is packed as if the input string were
                padded by null characters at the end. Similarly during
                unpacking, "extra" bits are ignored.

                If the input string is longer than needed, remaining
                characters are ignored.

                A "*" for the repeat count uses all characters of the input
                field. On unpacking, bits are converted to a string of 0s
                and 1s.
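
                A small sketch of the bit-order difference between "b" and
                "B" follows. The names are illustrative, and the unpack
                lines assume an ASCII platform, where "A" is 0x41.

```perl
use strict;
use warnings;

# The first input character fills the lowest bit with "b" and the
# highest bit with "B".
my $low_first  = ord(pack("b8", "10000000"));    # 1
my $high_first = ord(pack("B8", "10000000"));    # 128

# A short input is padded with zero bits: "1111" becomes 11110000.
my $padded = ord(pack("B4", "1111"));            # 240 (0xF0)

# Unpacking yields a string of 0s and 1s.  0x41 is 01000001 in binary.
my ($descending) = unpack("B8", "A");            # "01000001"
my ($ascending)  = unpack("b8", "A");            # "10000010"
```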

            *   The "h" and "H" formats pack a string that is count nybbles
                (4-bit groups, representable as hexadecimal digits,
                "0".."9" and "a".."f") long.

                For each input character, "pack" generates 4 bits of result.
                With non-alphabetical characters, the result is based on the
                4 least-significant bits of the input character, i.e., on
                "ord($char)%16". In particular, characters "0" and "1"
                generate nybbles 0 and 1, as do bytes "\000" and "\001". For
                characters "a".."f" and "A".."F", the result is compatible
                with the usual hexadecimal digits, so that "a" and "A" both
                generate the nybble "0xA==10". Use only these specific hex
                characters with this format.

                Starting from the beginning of the input string, each pair
                of characters is converted to 1 character of output.
                With format "h", the first character of the pair determines
                the least-significant nybble of the output character; with
                format "H", it determines the most-significant nybble.

                If the length of the input string is not even, it behaves as
                if padded by a null character at the end. Similarly, "extra"
                nybbles are ignored during unpacking.

                If the input string is longer than needed, extra characters
                are ignored.

                A "*" for the repeat count uses all characters of the input
                field. For "unpack", nybbles are converted to a string of
                hexadecimal digits.
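
                The nybble ordering is perhaps easiest to see with a few
                concrete values. The following is a sketch, not a complete
                treatment, and the variable names are illustrative:

```perl
use strict;
use warnings;

# "H" puts the first digit of each pair in the high nybble,
# "h" puts it in the low nybble.
my $high = pack("H4", "1234");      # "\x12\x34"
my $low  = pack("h4", "1234");      # "\x21\x43"

# An odd number of nybbles is padded with a zero nybble.
my $odd = pack("H3", "abc");        # "\xab\xc0"

# Unpacking produces a string of hexadecimal digits.
my ($hex_high) = unpack("H*", "\xde\xad\xbe\xef");    # "deadbeef"
my ($hex_low)  = unpack("h*", "\xde\xad");            # "edda"
```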

            *   The "p" format packs a pointer to a null-terminated string.
                You are responsible for ensuring that the string is not a
                temporary value, as that could potentially get deallocated
                before you got around to using the packed result. The "P"
                format packs a pointer to a structure of the size indicated
                by the length. A null pointer is created if the
                corresponding value for "p" or "P" is "undef"; similarly
                with "unpack", where a null pointer unpacks into "undef".

                If your system has a strange pointer size--meaning a pointer
                is neither as big as an int nor as big as a long--it may not
                be possible to pack or unpack pointers in big- or
                little-endian byte order. Attempting to do so raises an
                exception.

            *   The "/" template character allows packing and unpacking of a
                sequence of items where the packed structure contains a
                packed item count followed by the packed items themselves.
                This is useful when the structure you're unpacking has
                encoded the sizes or repeat counts for some of its fields
                within the structure itself as separate fields.

                For "pack", you write *length-item*"/"*sequence-item*, and
                the *length-item* describes how the length value is packed.
                Formats likely to be of most use are integer-packing ones
                like "n" for Java strings, "w" for ASN.1 or SNMP, and "N"
                for Sun XDR.

                For "pack", *sequence-item* may have a repeat count, in
                which case the minimum of that and the number of available
                items is used as the argument for *length-item*. If it has
                no repeat count or uses a '*', the number of available items
                is used.

                For "unpack", an internal stack of integer arguments
                unpacked so far is used. You write "/"*sequence-item* and
                the repeat count is obtained by popping off the last element
                from the stack. The *sequence-item* must not have a repeat
                count.

                If *sequence-item* refers to a string type ("A", "a", or
                "Z"), the *length-item* is the string length, not the number
                of strings. With an explicit repeat count for pack, the
                packed string is adjusted to that length. For example:

                 This code:                             gives this result:

                 unpack("W/a", "\004Gurusamy")          ("Guru")
                 unpack("a3/A A*", "007 Bond  J ")      (" Bond", "J")
                 unpack("a3 x2 /A A*", "007: Bond, J.") ("Bond, J", ".")

                 pack("n/a* w/a","hello,","world")     "\000\006hello,\005world"
                 pack("a/W2", ord("a") .. ord("z"))    "2ab"

                The *length-item* is not returned explicitly from "unpack".

                Supplying a count to the *length-item* format letter is only
                useful with "A", "a", or "Z". Packing with a *length-item*
                of "a" or "Z" may introduce "\000" characters, which Perl
                does not regard as legal in numeric strings.
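
                A round trip shows both uses of "/": with a string
                *sequence-item* the stored value is the string length, and
                with any other item it is the number of items. This is a
                minimal sketch; the variable names are made up.

```perl
use strict;
use warnings;

# String items: the length-item stores the string length.
my $strings = pack("n/a* n/a*", "hello", "world!");
# $strings is "\0\5hello\0\6world!"
my @words = unpack("n/a n/a", $strings);       # ("hello", "world!")

# Non-string items: the length-item stores the number of items.
my $numbers = pack("C/N*", 10, 20, 30);        # 1 count byte + 3 * 4 bytes
my @values  = unpack("C/N", $numbers);         # (10, 20, 30)
```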

            *   The integer types "s", "S", "l", and "L" may be followed by
                a "!" modifier to specify native shorts or longs. As the
                format list above shows, a bare "l" means exactly 32 bits,
                although the native "long" as seen by the local C compiler
                may be larger. This is mainly an issue on 64-bit platforms.
                You can see whether using "!" makes any difference this way:

                    printf "format s is %d, s! is %d\n",
                        length pack("s"), length pack("s!");

                    printf "format l is %d, l! is %d\n",
                        length pack("l"), length pack("l!");

                "i!" and "I!" are also allowed, but only for completeness'
                sake: they are identical to "i" and "I".

                The actual sizes (in bytes) of native shorts, ints, longs,
                and long longs on the platform where Perl was built are also
                available from the command line:

                    $ perl -V:{short,int,long{,long}}size
                    shortsize='2';
                    intsize='4';
                    longsize='4';
                    longlongsize='8';

                or programmatically via the "Config" module:

                       use Config;
                       print $Config{shortsize},    "\n";
                       print $Config{intsize},      "\n";
                       print $Config{longsize},     "\n";
                       print $Config{longlongsize}, "\n";

                $Config{longlongsize} is undefined on systems without long
                long support.
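
                One way to cross-check the fixed and native sizes against
                "Config" is sketched below. The %size hash is illustrative
                only, and the printed numbers depend on how your Perl was
                built.

```perl
use strict;
use warnings;
use Config;

# Measure the packed length of each format by packing a single zero.
my %size;
$size{$_} = length pack($_, 0) for qw(s s! i l l!);

# Bare "s" and "l" are always 2 and 4 bytes; the "!" forms and "i"
# follow the C compiler's short, long, and int.
printf "s=%d s!=%d (shortsize=%s)\n", $size{'s'}, $size{'s!'}, $Config{shortsize};
printf "l=%d l!=%d (longsize=%s)\n",  $size{'l'}, $size{'l!'}, $Config{longsize};
printf "i=%d (intsize=%s)\n",         $size{'i'}, $Config{intsize};
```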

            *   The integer formats "s", "S", "i", "I", "l", "L", "j", and
                "J" are inherently non-portable between processors and
                operating systems because they obey native byteorder and
                endianness. For example, a 4-byte integer 0x12345678
                (305419896 decimal) would be ordered natively (arranged in
                and handled by the CPU registers) into bytes as

                    0x12 0x34 0x56 0x78  # big-endian
                    0x78 0x56 0x34 0x12  # little-endian

                Basically, Intel and VAX CPUs are little-endian, while
                everybody else, including Motorola m68k/88k, PPC, Sparc, HP
                PA, Power, and Cray, are big-endian. Alpha and MIPS can be
                either: Digital/Compaq uses (well, used) them in
                little-endian mode, but SGI/Cray uses them in big-endian
                mode.

                The names *big-endian* and *little-endian* are comic
                references to the egg-eating habits of the little-endian
                Lilliputians and the big-endian Blefuscudians from the
                classic Jonathan Swift satire, *Gulliver's Travels*. This
                entered computer lingo via the paper "On Holy Wars and a
                Plea for Peace" by Danny Cohen, USC/ISI IEN 137, April 1,
                1980.

                Some systems may have even weirder byte orders such as

                   0x56 0x78 0x12 0x34
                   0x34 0x12 0x78 0x56

                These are called mid-endian, middle-endian, mixed-endian, or
                just weird.

                You can determine your system endianness with this
                incantation:

                   printf("%#02x ", $_) for unpack("W*", pack L=>0x12345678);

                The byteorder on the platform where Perl was built is also
                available via Config:

                    use Config;
                    print "$Config{byteorder}\n";

                or from the command line:

                    $ perl -V:byteorder

                Byteorders "1234" and "12345678" are little-endian; "4321"
                and "87654321" are big-endian. Systems with
                multiarchitecture binaries will have "ffff", signifying that
                static information doesn't work and that one must use
                runtime probing.

                For portably packed integers, either use the formats "n",
                "N", "v", and "V" or else use the ">" and "<" modifiers
                described immediately below. See also perlport.
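
                A simple runtime probe, together with the portable formats,
                might look like the sketch below. The probe only
                distinguishes the two common byte orders (it looks at a
                16-bit value, so it cannot detect mixed-endian layouts), and
                the variable names are illustrative.

```perl
use strict;
use warnings;

# Runtime probe: on a little-endian machine the low byte comes first.
my $first_byte = unpack("C", pack("S", 1));
my $native     = $first_byte == 1 ? "little-endian" : "big-endian";

# "N" and "V" give the same bytes on every platform.
my $network = pack("N", 0x12345678);
my $vax     = pack("V", 0x12345678);

printf "native order looks %s\n", $native;
printf "N: %s  V: %s\n", unpack("H*", $network), unpack("H*", $vax);
# second line prints "N: 12345678  V: 78563412"
```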

            *   Floating-point numbers also have endianness. Usually (but
                not always) this agrees with the integer endianness. Even
                though most platforms these days use the IEEE 754 binary
                format, there are differences, especially where long
                doubles are involved. See the "Config" variables
                "doublekind" and "longdblkind" (also "doublesize" and
                "longdblsize"): the "kind" values are enums, unlike
                "byteorder".

                Portability-wise, the best option is probably to stick to
                IEEE 754 64-bit doubles with an agreed-upon endianness.
                Another possibility is the "%a" format of "printf".
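
                Assuming a platform whose C double is an IEEE 754 64-bit
                value (an assumption, not something "pack" checks for you),
                a double with an agreed-upon byte order could be written
                roughly like this, using the ">" and "<" modifiers available
                since Perl 5.10.0:

```perl
use strict;
use warnings;

# Pack 1.0 as a double in both byte orders.
my $big    = pack("d>", 1.0);
my $little = pack("d<", 1.0);

# The two encodings are byte-for-byte mirror images of each other.
my $mirrored = scalar reverse $big;

# On an IEEE 754 platform this is "3ff0000000000000".
my $hex = unpack("H*", $big);

# Reading it back with the same modifier recovers the value.
my $back = unpack("d>", $big);
print "$hex -> $back\n";
```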

            *   Starting with Perl 5.10.0, integer and floating-point
                formats, along with the "p" and "P" formats and "()" groups,
                may all be followed by the ">" or "<" endianness modifiers
                to respectively enforce big- or little-endian byte-order.
                These modifiers are especially useful given how "n", "N",
                "v", and "V" don't cover signed integers, 64-bit integers,
                or floating-point values.

                Here are some concerns to keep in mind when using an
                endianness modifier:

                *   Exchanging signed integers between different platforms
                    works only when all platforms store them in the same
                    format. Most platforms store signed integers in
                    two's-complement notation, so usually this is not an
                    issue.

                *   The ">" or "<" modifiers can only be used on
                    floating-point formats on big- or little-endian
                    machines. Otherwise, attempting to use them raises an
                    exception.

                *   Forcing big- or little-endian byte-order on
                    floating-point values for data exchange can work only if
                    all platforms use the same binary representation such as
                    IEEE floating-point. Even if all platforms are using
                    IEEE, there may still be subtle differences. Being able
                    to use ">" or "<" on floating-point values can be
                    useful, but also dangerous if you don't know exactly
                    what you're doing. It is not a general way to portably
                    store floating-point values.

                *   When using ">" or "<" on a "()" group, this affects all
                    types inside the group that accept byte-order modifiers,
                    including all subgroups. It is silently ignored for all
                    other types. You are not allowed to override the
                    byte-order within a group that already has a byte-order
                    modifier suffix.
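
                The effect of the modifiers on single items and on a group
                can be sketched as follows. The byte values shown assume
                two's-complement integers; the variable names are
                illustrative.

```perl
use strict;
use warnings;

# A signed 16-bit value in each byte order.
my $big16    = pack("s>", -2);      # "\xff\xfe"
my $little16 = pack("s<", -2);      # "\xfe\xff"

# A modifier on a group applies to every item inside it.
my $grouped    = pack("(s l)>", -2, 1);
my $individual = pack("s> l>",  -2, 1);
# both are "\xff\xfe\x00\x00\x00\x01"

my @back = unpack("(s l)>", $grouped);    # (-2, 1)
```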

            *   Real numbers (floats and doubles) are in native machine
                format only. Due to the multiplicity of floating-point
                formats and the lack of a standard "network" representation
                for them, no facility for interchange has been made. This
                means that packed floating-point data written on one machine
                may not be readable on another, even if both use IEEE
                floating-point arithmetic (because the endianness of the
                memory representation is not part of the IEEE spec). See
                also perlport.

                If you know *exactly* what you're doing, you can use the ">"
                or "<" modifiers to force big- or little-endian byte-order
                on floating-point values.

                Because Perl uses doubles (or long doubles, if configured)
                internally for all numeric calculation, converting from
                double into float and thence to double again loses
                precision, so "unpack("f", pack("f", $foo))" will not in
                general equal $foo.
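
                The precision loss is easy to observe with a value such as
                0.1, which has no exact binary representation. In this
                sketch "F" (Perl's own NV type) is used for the lossless
                comparison so that it also holds on long-double builds:

```perl
use strict;
use warnings;

my $value = 0.1;

# Squeezing the value through a single-precision float changes it.
my $via_float = unpack("f", pack("f", $value));

# Packing as an NV keeps every bit Perl has.
my $via_nv = unpack("F", pack("F", $value));

printf "original:  %.17g\n", $value;
printf "via float: %.17g\n", $via_float;
print  "float round trip differs\n" if $via_float != $value;
print  "NV round trip is exact\n"   if $via_nv    == $value;
```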

            *   Pack and unpack can operate in two modes: character mode
                ("C0" mode) where the packed string is processed per
                character, and UTF-8 byte mode ("U0" mode) where the packed
                string is processed in its UTF-8-encoded Unicode form on a
                byte-by-byte basis. Character mode is the default unless the
                format string starts with "U". You can always switch mode
                mid-format with an explicit "C0" or "U0" in the format. This
                mode remains in effect until the next mode change, or until
                the end of the "()" group it (directly) applies to.

                Using "C0" to get Unicode characters while using "U0" to get
                *non*-Unicode bytes is not necessarily obvious. Probably
                only the first of these is what you want:

                    $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
                      perl -CS -ne 'printf "%v04X\n", $_ for unpack("C0A*", $_)'
                    03B1.03C9
                    $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
                      perl -CS -ne 'printf "%v02X\n", $_ for unpack("U0A*", $_)'
                    CE.B1.CF.89
                    $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
                      perl -C0 -ne 'printf "%v02X\n", $_ for unpack("C0A*", $_)'
                    CE.B1.CF.89
                    $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
                      perl -C0 -ne 'printf "%v02X\n", $_ for unpack("U0A*", $_)'
                    C3.8E.C2.B1.C3.8F.C2.89

                Those examples also illustrate that you should not try to
                use "pack"/"unpack" as a substitute for the Encode module.

            *   You must yourself do any alignment or padding by inserting,
                for example, enough "x"es while packing. There is no way for
                "pack" and "unpack" to know where characters are going to or
                coming from, so they handle their output and input as flat
                sequences of characters.

            *   A "()" group is a sub-TEMPLATE enclosed in parentheses. A
                group may take a repeat count either as postfix, or for
                "unpack", also via the "/" template character. Within each
                repetition of a group, positioning with "@" starts over at
                0. Therefore, the result of

                    pack('@1A((@2A)@3A)', qw[X Y Z])

                is the string "\0X\0\0YZ".
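
                Groups are most often used to repeat a small record layout.
                A sketch, with illustrative names and single-quoted
                templates so that "@" is never mistaken for an array:

```perl
use strict;
use warnings;

# Two (tag, number) records packed with a postfix repeat count.
my $records = pack('(A2 n)2', "ab", 1, "cd", 2);
# $records is "ab\0\1cd\0\2"

# "*" on the group unpacks as many records as the data holds.
my @fields = unpack('(A2 n)*', $records);      # ("ab", 1, "cd", 2)

# "@" restarts at 0 inside each group, as in the example above.
my $nested = pack('@1A((@2A)@3A)', qw[X Y Z]); # "\0X\0\0YZ"
```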

            *   "x" and "X" accept the "!" modifier to act as alignment
                commands: they jump forward or back to the closest position
                aligned at a multiple of "count" characters. For example, to
                "pack" or "unpack" a C structure like

                    struct {
                        char   c;    /* one signed, 8-bit character */
                        double d;
                        char   cc[2];
                    }

                one may need to use the template "c x![d] d c[2]". This
                assumes that doubles must be aligned to the size of double.
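
                The alignment behaviour can be checked by measuring the
                packed result. This sketch makes no assumption about the
                size of a double beyond what "pack" itself reports; the
                variable names are illustrative.

```perl
use strict;
use warnings;

# "x!4" pads with nulls up to the next multiple of 4.
my $aligned = pack("C x!4 N", 1, 2);
# $aligned is "\1\0\0\0" followed by "\0\0\0\2", 8 bytes in all

# The struct template from the text; "x![d]" aligns to the packed
# size of a double, whatever that is on this platform.
my $template = "c x![d] d c[2]";
my $struct   = pack($template, 1, 2.5, 3, 4);
my @members  = unpack($template, $struct);      # (1, 2.5, 3, 4)

my $double_size = length pack("d", 0);
printf "struct is %d bytes, double is %d bytes\n",
    length($struct), $double_size;
```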

                For alignment commands, a "count" of 0 is equivalent to a
                "count" of 1; both are no-ops.

            *   "n", "N", "v" and "V" accept the "!" modifier to represent
                signed 16-/32-bit integers in big-/little-endian order. This
                is portable only when all platforms sharing packed data use
                the same binary representation for signed integers; for
                example, when all platforms use two's-complement
                representation.
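
                For instance, assuming two's-complement integers, the same
                two bytes read back differently with and without "!". The
                variable names are illustrative.

```perl
use strict;
use warnings;

# Pack -2 as a signed 16-bit big-endian integer.
my $packed = pack("n!", -2);                   # "\xff\xfe"

# "n!" reads the bytes back as signed, plain "n" as unsigned.
my ($as_signed)   = unpack("n!", $packed);     # -2
my ($as_unsigned) = unpack("n",  $packed);     # 65534

# The 32-bit little-endian variant works the same way.
my ($minus_one) = unpack("V!", pack("V!", -1));    # -1
```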

            *   Comments can be embedded in a TEMPLATE using "#" through the
                end of line. White space can separate pack codes from each
                other, but modifiers and repeat counts must follow
                immediately. Breaking complex templates into individual
                line-by-line components, suitably annotated, can do as much
                to improve legibility and maintainability of pack/unpack
                formats as "/x" can for complicated pattern matches.
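
                An annotated template might be laid out as below. The record
                layout and field names are invented for the sake of the
                example:

```perl
use strict;
use warnings;

# One pack code per line, each with its own comment.
my $template = <<'END_TEMPLATE';
    A4    # record tag, space padded to 4 characters
    n     # payload length, 16-bit big-endian
    C2    # two flag bytes
END_TEMPLATE

my $record = pack($template, "HDR", 512, 1, 0);
# $record is "HDR " . "\x02\x00" . "\x01\x00"

my @fields = unpack($template, $record);    # ("HDR", 512, 1, 0)
```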

            *   If TEMPLATE requires more arguments than "pack" is given,
                "pack" assumes additional "" arguments. If TEMPLATE requires
                fewer arguments than given, extra arguments are ignored.
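
                Both cases are sketched below with illustrative names. The
                byte values assume an ASCII platform.

```perl
use strict;
use warnings;

# Too few arguments: the missing ones are treated as "".
my $short_str = pack("a2 a2", "x");     # "x\0" followed by "\0\0"
my $short_num = pack("C2", 65);         # "A\0"

# Too many arguments: the extras are ignored.
my $extra = pack("C", 65, 66, 67);      # "A"
```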

            *   Attempting to pack the special floating point values "Inf"
                and "NaN" (infinity, also in negative, and not-a-number)
                into packed integer values (like "L") is a fatal error. The
                reason for this is that there simply isn't any sensible
                mapping for these special values into integers.

            Examples:

                $foo = pack("WWWW",65,66,67,68);
                # foo eq "ABCD"
                $foo = pack("W4",65,66,67,68);
                # same thing
                $foo = pack("W4",0x24b6,0x24b7,0x24b8,0x24b9);
                # same thing with Unicode circled letters.
                $foo = pack("U4",0x24b6,0x24b7,0x24b8,0x24b9);
                # same thing with Unicode circled letters.  You don't get the
                # UTF-8 bytes because the U at the start of the format caused
                # a switch to U0-mode, so the UTF-8 bytes get joined into
                # characters
                $foo = pack("C0U4",0x24b6,0x24b7,0x24b8,0x24b9);
                # foo eq "\xe2\x92\xb6\xe2\x92\xb7\xe2\x92\xb8\xe2\x92\xb9"
                # This is the UTF-8 encoding of the string in the
                # previous example

                $foo = pack("ccxxcc",65,66,67,68);
                # foo eq "AB\0\0CD"

                # NOTE: The examples above featuring "W" and "c" are true
                # only on ASCII and ASCII-derived systems such as ISO Latin 1
                # and UTF-8.  On EBCDIC systems, the first example would be
                #      $foo = pack("WWWW",193,194,195,196);

                $foo = pack("s2",1,2);
                # "\001\000\002\000" on little-endian
                # "\000\001\000\002" on big-endian

                $foo = pack("a4","abcd","x","y","z");
                # "abcd"

                $foo = pack("aaaa","abcd","x","y","z");
                # "axyz"

                $foo = pack("a14","abcdefg");
                # "abcdefg\0\0\0\0\0\0\0"

                $foo = pack("i9pl", gmtime);
                # a real struct tm (on my system anyway)

                $utmp_template = "Z8 Z8 Z16 L";
                $utmp = pack($utmp_template, @utmp1);
                # a struct utmp (BSDish)

                @utmp2 = unpack($utmp_template, $utmp);
                # "@utmp1" eq "@utmp2"

                sub bintodec {
                    unpack("N", pack("B32", substr("0" x 32 . shift, -32)));
                }

                $foo = pack('sx2l', 12, 34);
                # short 12, two zero bytes padding, long 34
                $bar = pack('s@4l', 12, 34);
                # short 12, zero fill to position 4, long 34
                # $foo eq $bar
                $baz = pack('s.l', 12, 4, 34);
                # short 12, zero fill to position 4, long 34

                $foo = pack('nN', 42, 4711);
                # pack big-endian 16- and 32-bit unsigned integers
                $foo = pack('S>L>', 42, 4711);
                # exactly the same
                $foo = pack('s<l<', -42, 4711);
                # pack little-endian 16- and 32-bit signed integers
                $foo = pack('(sl)<', -42, 4711);
                # exactly the same

            The same template may generally also be used in "unpack".