Use faster Python script to amalgamate #3005

Merged
merged 10 commits into from
Jan 22, 2022
4 changes: 2 additions & 2 deletions build/single_file_libs/README.md
@@ -12,7 +12,7 @@ This is the most common use case. The decompression library is small, adding, fo
Create `zstddeclib.c` from the Zstd source using:
```
cd zstd/build/single_file_libs
./combine.sh -r ../../lib -o zstddeclib.c zstddeclib-in.c
python3 combine.py -r ../../lib -x legacy/zstd_legacy.h -o zstddeclib.c zstddeclib-in.c
```
Then add the resulting file to your project (see the [example files](examples)).

@@ -26,7 +26,7 @@ The same tool can amalgamate the entire Zstd library for ease of adding both com
Create `zstd.c` from the Zstd source using:
```
cd zstd/build/single_file_libs
./combine.sh -r ../../lib -o zstd.c zstd-in.c
python3 combine.py -r ../../lib -x legacy/zstd_legacy.h -k zstd.h -o zstd.c zstd-in.c
```
It's possible to create a compressor-only library but since the decompressor is so small in comparison this doesn't bring much of a gain (but for the curious, simply remove the files in the _decompress_ section at the end of `zstd-in.c`).
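
For readers wondering what the `-x` and `-k` options in the commands above do to each quoted `#include`, here is a minimal Python sketch based on the description in `combine.py`'s header comment. The function `rewrite_include` and its parameters are hypothetical and simplified: names are compared as plain strings rather than resolved paths, and actual inlining is represented only by its start/end marker comments. It is not the script's real code.

```python
from typing import List, Set

def rewrite_include(name: str, line: str, excludes: Set[str],
                    keeps: Set[str], seen: Set[str]) -> List[str]:
    """Return the output lines for one quoted #include (hypothetical helper)."""
    if name in excludes:
        # -x: every occurrence is replaced with a compile-time error
        return [f'#error Using excluded file: {name} (re-amalgamate source to fix)']
    if name in seen:
        # Already handled once, so later occurrences are dropped
        return [f'/**** skipping file: {name} ****/']
    seen.add(name)
    if name in keeps:
        # -k: the first occurrence keeps the original directive as-is
        return [f'/**** *NOT* inlining {name} ****/', line]
    # Default: the file's contents would be inlined between these markers
    return [f'/**** start inlining {name} ****/',
            f'/**** ended inlining {name} ****/']

seen: Set[str] = set()
first = rewrite_include('zstd.h', '#include "zstd.h"', set(), {'zstd.h'}, seen)
second = rewrite_include('zstd.h', '#include "zstd.h"', set(), {'zstd.h'}, seen)
legacy = rewrite_include('legacy/zstd_legacy.h', '#include "legacy/zstd_legacy.h"',
                         {'legacy/zstd_legacy.h'}, set(), seen)
for out in (first, second, legacy):
    print('\n'.join(out))
```

In this sketch a kept header survives once as a normal `#include`, while an excluded header fails the build wherever it would have been pulled in, which is why the `-x legacy/zstd_legacy.h` option has to be dropped if `ZSTD_LEGACY_SUPPORT` is enabled.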

2 changes: 1 addition & 1 deletion build/single_file_libs/build_library_test.sh
@@ -69,7 +69,7 @@ fi
echo "Single file library creation script: PASSED"

# Copy the header to here (for the tests)
cp "$ZSTD_SRC_ROOT/zstd.h" zstd.h
cp "$ZSTD_SRC_ROOT/zstd.h" examples/zstd.h

# Compile the generated output
cc -Wall -Wextra -Werror -Wshadow -pthread -I. -Os -g0 -o $OUT_FILE zstd.c examples/roundtrip.c
234 changes: 234 additions & 0 deletions build/single_file_libs/combine.py
@@ -0,0 +1,234 @@
#!/usr/bin/env python3

# Tool to bundle multiple C/C++ source files, inlining any includes.
#
# Note: there are two types of exclusion options: the '-x' flag, which besides
# excluding a file also adds an #error directive in place of the #include, and
# the '-k' flag, which keeps the #include and doesn't inline the file. The
# intended use cases are: '-x' for files that would normally be #if'd out, so
# features that 100% won't be used in the amalgamated file, for which every
# occurrence adds the error, and '-k' for headers that we wish to manually
# include, such as a project's public API, for which occurrences after the first
# are removed.
#
# Todo: the error handling could be better; it currently throws and halts
# (which is functional, just not very friendly).
#
# Author: Carl Woffenden, Numfum GmbH (this script is released under a CC0 license/Public Domain)

import argparse, re, sys

from pathlib import Path
from typing import Any, List, Optional, Pattern, Set, TextIO

# Set of file roots when searching (equivalent to -I paths for the compiler).
roots: Set[Path] = set()

# Set of (canonical) file Path objects to exclude from inlining (and not only
# exclude but to add a compiler error directive when they're encountered).
excludes: Set[Path] = set()

# Set of (canonical) file Path objects to keep as include directives.
keeps: Set[Path] = set()

# Whether to keep the #pragma once directives (unlikely, since this will result
# in a warning, but the option is there).
keep_pragma: bool = False

# Destination file object (or stdout if no output file was supplied).
destn: TextIO = sys.stdout

# Set of file Path objects previously inlined (and to ignore if reencountering).
found: Set[Path] = set()

# Compiled regex Pattern to handle "#pragma once" in various formats:
#
#   #pragma once
#     #pragma once
#   #  pragma once
#   #pragma   once
#   #pragma once // comment
#
# Ignoring commented versions, same as include_regex.
#
pragma_regex: Pattern = re.compile(r'^\s*#\s*pragma\s*once\s*')

# Compiled regex Pattern to handle the following type of file includes:
#
#   #include "file"
#     #include "file"
#   #  include "file"
#   #include   "file"
#   #include "file" // comment
#   #include "file" // comment with quote "
#
# And all combinations of these, while ignoring the following:
#
#   #include <file>
#   //#include "file"
#   /*#include "file"*/
#
# We don't try to catch errors since the compiler will do this (and the code is
# expected to be valid before processing) and we don't care what follows the
# file (whether it's a valid comment or not, since anything after the quoted
# string is ignored).
#
include_regex: Pattern = re.compile(r'^\s*#\s*include\s*"(.+?)"')

# Simple tests to prove include_regex's cases.
#
def test_match_include() -> bool:
    if (include_regex.match('#include "file"') and
        include_regex.match(' #include "file"') and
        include_regex.match('# include "file"') and
        include_regex.match('#include   "file"') and
        include_regex.match('#include "file" // comment')):
        if (not include_regex.match('#include <file>') and
            not include_regex.match('//#include "file"') and
            not include_regex.match('/*#include "file"*/')):
            found = include_regex.match('#include "file" // "')
            if (found and found.group(1) == 'file'):
                print('#include match valid')
                return True
    return False

# Simple tests to prove pragma_regex's cases.
#
def test_match_pragma() -> bool:
    if (pragma_regex.match('#pragma once') and
        pragma_regex.match(' #pragma once') and
        pragma_regex.match('# pragma once') and
        pragma_regex.match('#pragma   once') and
        pragma_regex.match('#pragma once // comment')):
        if (not pragma_regex.match('//#pragma once') and
            not pragma_regex.match('/*#pragma once*/')):
            print('#pragma once match valid')
            return True
    return False

# Finds 'file'. First the list of 'root' paths is searched, followed by the
# currently processing file's 'parent' path, returning a valid Path in
# canonical form. If no match is found None is returned.
#
def resolve_include(file: str, parent: Optional[Path] = None) -> Optional[Path]:
    for root in roots:
        found = root.joinpath(file).resolve()
        if (found.is_file()):
            return found
    if (parent):
        found = parent.joinpath(file).resolve()
    else:
        found = Path(file)
    if (found.is_file()):
        return found
    return None

# Helper to resolve lists of files. 'file_list' is passed in from the arguments
# and each entry resolved to its canonical path (like any include entry, either
# from the list of root paths or the owning file's 'parent', which in this case
# is the input file). The results are stored in 'resolved'.
#
def resolve_excluded_files(file_list: Optional[List[str]], resolved: Set[Path], parent: Optional[Path] = None) -> None:
    if (file_list):
        for filename in file_list:
            found = resolve_include(filename, parent)
            if (found):
                resolved.add(found)
            else:
                error_line(f'Warning: excluded file not found: {filename}')

# Writes 'line' to the open 'destn' (or stdout).
#
def write_line(line: str) -> None:
    print(line, file=destn)

# Logs 'line' to stderr. This is also used for general notifications that we
# don't want to go to stdout (so the source can be piped).
#
def error_line(line: Any) -> None:
    print(line, file=sys.stderr)

# Inline the contents of 'file' (with any of its includes also inlined, etc.).
#
# Note: text encoding errors are ignored and replaced with ? when reading the
# input files. This isn't ideal, but such errors are more likely in comments
# than in code and a) the text editor has probably also failed to read the same
# content, and b) the compiler probably did too.
#
def add_file(file: Path, file_name: Optional[str] = None) -> None:
    if (file.is_file()):
        if (not file_name):
            file_name = file.name
        error_line(f'Processing: {file_name}')
        with file.open('r', errors='replace') as opened:
            for line in opened:
                line = line.rstrip('\n')
                match_include = include_regex.match(line)
                if (match_include):
                    # We have a quoted include directive so grab the file
                    inc_name = match_include.group(1)
                    resolved = resolve_include(inc_name, file.parent)
                    if (resolved):
                        if (resolved in excludes):
                            # The file was excluded so error if the compiler uses it
                            write_line(f'#error Using excluded file: {inc_name} (re-amalgamate source to fix)')
                            error_line(f'Excluding: {inc_name}')
                        else:
                            if (resolved not in found):
                                # The file was not previously encountered
                                found.add(resolved)
                                if (resolved in keeps):
                                    # But the include was flagged to keep as included
                                    write_line(f'/**** *NOT* inlining {inc_name} ****/')
                                    write_line(line)
                                    error_line(f'Not inlining: {inc_name}')
                                else:
                                    # The file was neither excluded nor seen before so inline it
                                    write_line(f'/**** start inlining {inc_name} ****/')
                                    add_file(resolved, inc_name)
                                    write_line(f'/**** ended inlining {inc_name} ****/')
                            else:
                                write_line(f'/**** skipping file: {inc_name} ****/')
                    else:
                        # The include file didn't resolve to a file
                        write_line(f'#error Unable to find: {inc_name}')
                        error_line(f'Error: Unable to find: {inc_name}')
                else:
                    # Skip any 'pragma once' directives, otherwise write the source line
                    if (keep_pragma or not pragma_regex.match(line)):
                        write_line(line)
    else:
        error_line(f'Error: Invalid file: {file}')

# Start here
parser = argparse.ArgumentParser(description='Amalgamate Tool', epilog=f'example: {sys.argv[0]} -r ../my/path -r ../other/path -o out.c in.c')
parser.add_argument('-r', '--root', action='append', type=Path, help='file root search path')
parser.add_argument('-x', '--exclude', action='append', help='file to completely exclude from inlining')
parser.add_argument('-k', '--keep', action='append', help='file to exclude from inlining but keep the include directive')
parser.add_argument('-p', '--pragma', action='store_true', default=False, help='keep any "#pragma once" directives (removed by default)')
parser.add_argument('-o', '--output', type=argparse.FileType('w'), help='output file (otherwise stdout)')
parser.add_argument('input', type=Path, help='input file')
args = parser.parse_args()

# Fail early on an invalid input (and store it so we don't recurse)
args.input = args.input.resolve(strict=True)
found.add(args.input)

# Resolve all of the root paths upfront (we'll halt here on invalid roots)
if (args.root):
    for path in args.root:
        roots.add(path.resolve(strict=True))

# The remaining params: resolve the excluded and kept files, and store the
# '#pragma once' option
resolve_excluded_files(args.exclude, excludes, args.input.parent)
resolve_excluded_files(args.keep, keeps, args.input.parent)
keep_pragma = args.pragma

# Then recursively process the input file
try:
    if (args.output):
        destn = args.output
    add_file(args.input)
finally:
    if (destn):
        destn.close()
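
The whitespace and comment cases listed for `include_regex` in `combine.py` can be checked in isolation. The following is a small sketch that copies the pattern from the script above; the helper `quoted_include` and the sample lines are illustrative additions and are not part of this PR.

```python
import re
from typing import Optional

# Same pattern as include_regex in combine.py
include_regex = re.compile(r'^\s*#\s*include\s*"(.+?)"')

def quoted_include(line: str) -> Optional[str]:
    """Return the quoted file name, or None if the line isn't a quoted include
    (hypothetical helper, for illustration only)."""
    match = include_regex.match(line)
    return match.group(1) if match else None

samples = [
    '#include "file"',
    '  #  include   "common/mem.h"',
    '#include "file" // comment with quote "',
    '#include <file>',        # angle-bracket includes are left alone
    '//#include "file"',      # commented-out includes are ignored
    '/*#include "file"*/',
]
for sample in samples:
    print(repr(sample), '->', quoted_include(sample))
```

Because the capture group is non-greedy, a stray quote in a trailing comment does not extend the captured name, and because the pattern is anchored at the start of the line, only directives preceded by nothing but whitespace are matched.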
3 changes: 2 additions & 1 deletion build/single_file_libs/combine.sh
@@ -130,7 +130,7 @@ add_file() {
local res_inc="$(resolve_include "$srcdir" "$inc")"
if list_has_item "$XINCS" "$inc"; then
# The file was excluded so error if the source attempts to use it
write_line "#error Using excluded file: $inc"
write_line "#error Using excluded file: $inc (re-amalgamate source to fix)"
log_line "Excluding: $inc"
else
if ! list_has_item "$FOUND" "$res_inc"; then
@@ -200,6 +200,7 @@ if [ -n "$1" ]; then
printf "" > "$DESTN"
fi
test_deps
log_line "Processing using the slower shell script; this might take a while"
add_file "$1"
else
echo "Input file not found: \"$1\""
9 changes: 7 additions & 2 deletions build/single_file_libs/create_single_file_decoder.sh
@@ -4,8 +4,13 @@
ZSTD_SRC_ROOT="../../lib"

# Amalgamate the sources
echo "Amalgamating files... this can take a while"
./combine.sh -r "$ZSTD_SRC_ROOT" -o zstddeclib.c zstddeclib-in.c
echo "Amalgamating files..."
# Using the faster Python script if we have 3.8 or higher
if python3 -c 'import sys; assert sys.version_info >= (3,8)' 2>/dev/null; then
./combine.py -r "$ZSTD_SRC_ROOT" -x legacy/zstd_legacy.h -o zstddeclib.c zstddeclib-in.c
else
./combine.sh -r "$ZSTD_SRC_ROOT" -x legacy/zstd_legacy.h -o zstddeclib.c zstddeclib-in.c
fi
# Did combining work?
if [ $? -ne 0 ]; then
echo "Combine script: FAILED"
9 changes: 7 additions & 2 deletions build/single_file_libs/create_single_file_library.sh
@@ -4,8 +4,13 @@
ZSTD_SRC_ROOT="../../lib"

# Amalgamate the sources
echo "Amalgamating files... this can take a while"
./combine.sh -r "$ZSTD_SRC_ROOT" -o zstd.c zstd-in.c
echo "Amalgamating files..."
# Using the faster Python script if we have 3.8 or higher
if python3 -c 'import sys; assert sys.version_info >= (3,8)' 2>/dev/null; then
./combine.py -r "$ZSTD_SRC_ROOT" -x legacy/zstd_legacy.h -o zstd.c zstd-in.c
else
./combine.sh -r "$ZSTD_SRC_ROOT" -x legacy/zstd_legacy.h -o zstd.c zstd-in.c
fi
# Did combining work?
if [ $? -ne 0 ]; then
echo "Combine script: FAILED"
8 changes: 6 additions & 2 deletions build/single_file_libs/zstd-in.c
@@ -4,11 +4,11 @@
*
* Generate using:
* \code
* combine.sh -r ../../lib -o zstd.c zstd-in.c
* python combine.py -r ../../lib -x legacy/zstd_legacy.h -o zstd.c zstd-in.c
* \endcode
*/
/*
* Copyright (c) 2016-2021, Yann Collet, Facebook, Inc.
* Copyright (c) 2016-present, Yann Collet, Facebook, Inc.
* All rights reserved.
*
* This source code is licensed under both the BSD-style license (found in the
@@ -28,6 +28,10 @@
* Note: the undefs for xxHash allow Zstd's implementation to coincide with
* standalone xxHash usage (with global defines).
*
* Note: if you enable ZSTD_LEGACY_SUPPORT the combine.py script will need
* re-running without the "-x legacy/zstd_legacy.h" option (it excludes the
* legacy support at the source level).
*
* Note: multithreading is enabled for all platforms apart from Emscripten.
*/
#define DEBUGLEVEL 0
8 changes: 6 additions & 2 deletions build/single_file_libs/zstddeclib-in.c
@@ -4,11 +4,11 @@
*
* Generate using:
* \code
* combine.sh -r ../../lib -o zstddeclib.c zstddeclib-in.c
* python combine.py -r ../../lib -x legacy/zstd_legacy.h -o zstddeclib.c zstddeclib-in.c
* \endcode
*/
/*
* Copyright (c) 2016-2021, Yann Collet, Facebook, Inc.
* Copyright (c) 2016-present, Yann Collet, Facebook, Inc.
* All rights reserved.
*
* This source code is licensed under both the BSD-style license (found in the
@@ -27,6 +27,10 @@
*
* Note: the undefs for xxHash allow Zstd's implementation to coincide with
* standalone xxHash usage (with global defines).
*
* Note: if you enable ZSTD_LEGACY_SUPPORT the combine.py script will need
* re-running without the "-x legacy/zstd_legacy.h" option (it excludes the
* legacy support at the source level).
*/
#define DEBUGLEVEL 0
#define MEM_MODULE