[INFRA] add support for building PDF versions of the spec #431

Merged (2 commits) on Mar 11, 2020
17 changes: 17 additions & 0 deletions .circleci/config.yml
@@ -43,6 +43,22 @@ jobs:
# failures for local file:/// -- yoh found no better way,
linkchecker -t 1 --check-extern --ignore-url 'file:///.*' --ignore-url https://fonts.gstatic.com ~/build/site/*html ~/build/site/*/*.html

  build_docs_pdf:
    working_directory: ~/bids-specification/pdf_build_src
    docker:
      - image: danteev/texlive:TL2017
    steps:
      - checkout:
          path: ~/bids-specification
      - run:
          name: generate pdf version docs
          command: sh build_pdf.sh
      - store_artifacts:
          path: bids-spec.pdf
      - run:
          name: remove pdf version from repo
          command: rm bids-spec.pdf

  # Auto changelog collector
  github-changelog-generator:
    working_directory: ~/build
@@ -144,6 +160,7 @@ workflows:
  search_build:
    jobs:
      - build_docs
      - build_docs_pdf
      - linkchecker:
          requires:
            - build_docs
5 changes: 4 additions & 1 deletion .gitignore
@@ -1 +1,4 @@
site/
site/
.DS_Store
src/.DS_Store
src/04-modality-specific-files/.DS_Store
41 changes: 41 additions & 0 deletions pdf_build_src/README.md
@@ -0,0 +1,41 @@
# pdf-version of BIDS specification

The `pdf_build_src` directory contains the scripts and tex files required to build a pdf document of the BIDS specification from multiple markdown files using the pandoc library.

Pandoc is a command-line tool, also available as a Haskell library, that converts files from one markup format to another. More here: https://pandoc.org/index.html

## Requirements

For the pdf build to be successful, the following need to be installed:

- Python 3.x
- pandoc
- Latest version of LaTeX: By default, Pandoc creates PDFs using LaTeX. Because a full MacTeX installation uses four gigabytes of disk space, pandoc recommends BasicTeX or TinyTeX and using the tlmgr tool to install additional packages as needed.

Installation instructions for both pandoc and LaTeX: https://pandoc.org/installing.html

## Building pdf document

From the `pdf_build_src` directory, run `sh build_pdf.sh` in a terminal.

The build prints a list of warnings about characters, such as emojis, that cannot be rendered when converting from markdown to pdf. Apart from those characters being absent from the final document, the formatting and contents are unaffected, so the warnings can be ignored.

## Technical Overview

Pandoc comes with a plethora of options to format the resulting document. To build a pdf from multiple markdown files, a consolidated intermediate tex file is built first and then converted to a pdf document. To achieve the desired formatting in the final pdf, additional tex files are passed to pandoc through the options described below.
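
As a rough illustration of how the inputs fit together (the index page first, the remaining markdown files in name order, then the formatting flags), the sketch below assembles a pandoc command as an argument list. The helper name `pandoc_args` and the sample file names are hypothetical; the flags are a subset of those used in `pandoc_script.py`. The sketch only builds the list and does not run pandoc.

```python
def pandoc_args(index_page, markdown_files, output):
    """Return a pandoc argument list (hypothetical helper, illustration only)."""
    # index page first, remaining files ordered by their numbered file names
    inputs = [index_page] + sorted(markdown_files)
    flags = [
        "-f", "markdown_github",
        "--include-before-body", "cover.tex",
        "--toc",
        "-V", "documentclass=report",
        "--listings",
        "-H", "listings_setup.tex",
        "-H", "header.tex",
        "--pdf-engine=xelatex",
        "-o", output,
    ]
    return ["pandoc"] + inputs + flags


# sample file names, assumed for the example
args = pandoc_args(
    "./index.md",
    ["./02-common-principles.md", "./01-introduction.md"],
    "bids-spec.pdf",
)
print(" ".join(args))
```

Keeping the command as a list means it could be handed to `subprocess.run(args)` directly, which avoids splitting file paths that contain spaces.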

### Formatting files

`listings_setup.tex` - Listings is a LaTeX package used for typesetting programming code in TeX. This file sets up the listings package to suit our needs and is used with the `--listings` option.

`cover.tex` - BIDS Logo is used as a cover page for the document. `cover.tex` is used with the option `--include-before-body`

`header.tex` - Header tex file that's updated with the latest version number and date when `build_pdf.sh` is run. Used with the `-H` header option.
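
A minimal sketch of that header update, assuming the header line is identified by its `\chead{` prefix rather than by its position in the file (the script in this directory replaces the second-to-last line instead). The helper name `update_chead` and the sample values are made up for illustration.

```python
def update_chead(lines, title, version, date):
    """Return a copy of lines with the chead line rewritten (illustrative helper)."""
    new_line = "\\chead{" + " ".join([title, version, date]) + "}\n"
    # replace by content match rather than by position in the file
    return [new_line if line.startswith("\\chead{") else line for line in lines]


# stand-in for the contents of header.tex
header = [
    "% header file\n",
    "\\usepackage{fancyhdr}\n",
    "\\pagestyle{fancy}\n",
    "\\fancyhf{}\n",
    "\\chead{old header text}\n",
    "\\fancyfoot[LE,RO]{\\thepage}\n",
]
updated = update_chead(header, "Brain Imaging Data Structure", "v1.2.1", "2019-08-14")
print(updated[4], end="")
```

Matching on the prefix keeps the update working if lines are later added to or removed from `header.tex`.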

### Scripts

`process_markdowns.py` - Script that duplicates the `src` directory and modifies the copied markdown files for the needs of the pdf.

`pandoc_script.py` - Prepares and runs the final pandoc command. It is invoked by the `build_pdf.sh` script.

`build_pdf.sh` - Shell script that organizes the directory structure and runs the above two python scripts.
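
One of the modifications made to the copied markdown files is replacing internal links with their plain link text. The sketch below shows the idea with a simplified pattern; it is an assumption for illustration, not the regex used in `process_markdowns.py`, and the helper name `strip_internal_links` is hypothetical. It treats any non-image link whose target does not start with `http://` or `https://` as internal.

```python
import re

# Simplified pattern (an assumption, not the regex used in process_markdowns.py):
# [text](target) where target is not an http(s) URL and the link is not an image.
INTERNAL_LINK = re.compile(r'(?<!!)\[([^\]]+)\]\((?!https?://)[^)]*\)')


def strip_internal_links(line):
    """Replace each internal markdown link in line with its link text."""
    # the backreference \1 keeps the text of whichever link was matched
    return INTERNAL_LINK.sub(r'\1', line)


sample = "See [Common principles](02-common-principles.md#units) and [pandoc](https://pandoc.org)."
print(strip_internal_links(sample))
```

Using a backreference in the substitution handles several links on one line independently, since each match is replaced by its own text.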
16 changes: 16 additions & 0 deletions pdf_build_src/build_pdf.sh
@@ -0,0 +1,16 @@
# Shell script that runs process_markdowns.py and pandoc_script.py in sequence to build the pdf document

# prepare the copied src directory
python3 process_markdowns.py

# copy pandoc_script into the temp src_copy directory
cp pandoc_script.py header.tex cover.tex listings_setup.tex src_copy/src

# run pandoc_script from src_copy directory
cd src_copy/src
python3 pandoc_script.py
mv bids-spec.pdf ../..
cd ../..

# delete the duplicated src directory
rm -rf src_copy
27 changes: 27 additions & 0 deletions pdf_build_src/cover.tex
@@ -0,0 +1,27 @@
% adds the bids logo as the cover page of the pdf
\begin{titlepage}

\newcommand{\HRule}{\rule{\linewidth}{0.5mm}} % Defines a new command for the horizontal lines, change thickness here

\center % Center everything on the page



%----------------------------------------------------------------------------------------
% LOGO SECTION
%----------------------------------------------------------------------------------------

\includegraphics[width=0.6\textwidth]{images/BIDS_logo.jpg}\\[1cm]

%----------------------------------------------------------------------------------------
% TITLE SECTION
%----------------------------------------------------------------------------------------

\HRule \\[0.4cm]
{ \huge \bfseries Brain Imaging Data Structure Specification}\\[0.4cm] % Title of your document
\HRule \\[1.5cm]

% \textsc{\large v1.2.1}\\[0.5cm]{\large 2019-08-14}\\[2cm]

% \vfill % Fill the rest of the page with whitespace
\textsc{\large v1.2.1}\\[0.5cm]{\large 2019-08-14}\\[2cm]\vfill\end{titlepage}
6 changes: 6 additions & 0 deletions pdf_build_src/header.tex
@@ -0,0 +1,6 @@
% header file
\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhf{}
\chead{Brain Imaging Data Structure v1.2.1 2019-08-14}
\fancyfoot[LE,RO]{\thepage}
25 changes: 25 additions & 0 deletions pdf_build_src/listings_setup.tex
@@ -0,0 +1,25 @@
% Contents of listings-setup.tex
\usepackage{xcolor}
\usepackage{graphicx}

\lstset{
basicstyle=\ttfamily,
numbers=left,
keywordstyle=\color[rgb]{0.13,0.29,0.53}\bfseries,
stringstyle=\color[rgb]{0.31,0.60,0.02},
commentstyle=\color[rgb]{0.56,0.35,0.01}\itshape,
numberstyle=\footnotesize,
stepnumber=1,
numbersep=5pt,
backgroundcolor=\color[RGB]{248,248,248},
showspaces=false,
showstringspaces=false,
showtabs=false,
tabsize=2,
captionpos=b,
breaklines=true,
breakautoindent=true,
escapeinside={\%*}{*)},
linewidth=\textwidth,
basewidth=0.5em
}
43 changes: 43 additions & 0 deletions pdf_build_src/pandoc_script.py
@@ -0,0 +1,43 @@
"""Use the pandoc library as a final step to build the pdf.

This is done once the duplicate src directory is processed.
"""
import os
import subprocess


def build_pdf(filename):
    """Construct command with required pandoc flags and run using subprocess.

    Parameters
    ----------
    filename : str
        Name of the output file.

    """
    markdown_list = []
    for root, dirs, files in os.walk('.'):
        for file in files:
            if file.endswith(".md") and file != 'index.md':
                markdown_list.append(os.path.join(root, file))
            elif file == 'index.md':
                index_page = os.path.join(root, file)

    default_pandoc_cmd = "pandoc "

    # creates string of file paths in the order we'd like them to appear
    # ordering is taken care of by the inherent file naming
    files_string = index_page + " " + " ".join(sorted(markdown_list))

    flags = (" -f markdown_github --include-before-body cover.tex --toc "
             "-V documentclass=report --listings -H listings_setup.tex "
             "-H header.tex -V linkcolor:blue -V geometry:a4paper "
             "-V geometry:margin=2cm --pdf-engine=xelatex -o ")
    output_filename = filename

    cmd = default_pandoc_cmd + files_string + flags + output_filename
    subprocess.run(cmd.split())


if __name__ == "__main__":
    build_pdf('bids-spec.pdf')
191 changes: 191 additions & 0 deletions pdf_build_src/process_markdowns.py
@@ -0,0 +1,191 @@
"""Process the markdown files.
The purpose of the script is to create a duplicate src directory within which
all of the markdown files are processed to match the specifications of building
a pdf from multiple markdown files using the pandoc library (***add link to
pandoc library documentation***) with pdf specific text rendering in mind as
well.
"""

import os
import subprocess
import re
from datetime import datetime


def run_shell_cmd(command):
    """Run shell/bash commands passed as a string using subprocess module."""
    process = subprocess.Popen(command.split(), stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    output = process.stdout.read()

    return output.decode('utf-8')


def copy_src():
    """Duplicate src directory to a new but temp directory named 'src_copy'."""
    # source and target directories
    src_path = "../src/"
    target_path = "src_copy"

    # make new directory
    mkdir_cmd = "mkdir "+target_path
    run_shell_cmd(mkdir_cmd)

    # copy contents of src directory
    copy_cmd = "cp -R "+src_path+" "+target_path
    run_shell_cmd(copy_cmd)


def copy_bids_logo():
    """Copy BIDS_logo.jpg from the BIDS_logo dir in the root of the repo."""
    run_shell_cmd("cp ../BIDS_logo/BIDS_logo.jpg src_copy/src/images/")


def copy_images(root_path):
    """Copy images.
    Will be done from images directory of subdirectories to images directory
    in the src directory
    """
    subdir_list = []

    # walk through the src directory to find subdirectories named 'images'
    # and copy contents to the 'images' directory in the duplicate src
    # directory
    for root, dirs, files in os.walk(root_path):
        if 'images' in dirs:
            subdir_list.append(root)

    for each in subdir_list:
        if each != root_path:
            run_shell_cmd("cp -R "+each+"/images"+" "+root_path+"/images/")


def extract_header_string():
    """Extract the title and version number from the first line of mkdocs.yml.

    The build date is set to today's date.
    """
    released_versions = []
    run_shell_cmd("cp ../mkdocs.yml src_copy/")

    with open(os.path.join(os.path.dirname(__file__), 'src_copy/mkdocs.yml'), 'r') as file:
        data = file.readlines()

    header_string = data[0].split(": ")[1]

    title = " ".join(header_string.split()[0:4])
    version_number = header_string.split()[-1]
    build_date = datetime.today().strftime('%Y-%m-%d')

    return title, version_number, build_date


def add_header():
    """Add the header string extracted from mkdocs.yml to header.tex file."""
    title, version_number, build_date = extract_header_string()
    header = " ".join([title, version_number, build_date])

    # creating a header string with latest version number and date
    header_string = ("\\chead{ " + header + " }")

    with open('header.tex', 'r') as file:
        data = file.readlines()

    # now change the second-to-last line, note that you have to add a newline
    data[-2] = header_string+'\n'

    # re-write header.tex file with new header string
    with open('header.tex', 'w') as file:
        file.writelines(data)


def remove_internal_links(root_path, link_type):
    """Find and replace all cross and same markdown internal links.
    The links will be replaced with plain text associated with it.
    """
    if link_type == 'cross':
        # regex that matches cross markdown links within a file
        # TODO: add more documentation explaining regex
        primary_pattern = re.compile(r'\[((?!http).[\w\s.\(\)`*/–]+)\]\(((?!http).+(\.md|\.yml|\.md#[\w\-\w]+))\)')  # noqa: E501
    elif link_type == 'same':
        # regex that matches references sections within the same markdown
        primary_pattern = re.compile(r'\[([\w\s.\(\)`*/–]+)\]\(([#\w\-._\w]+)\)')

    for root, dirs, files in os.walk(root_path):
        for file in files:
            if file.endswith(".md"):
                with open(os.path.join(root, file), 'r') as markdown:
                    data = markdown.readlines()

                for ind, line in enumerate(data):
                    match = primary_pattern.search(line)

                    if match:
                        line = re.sub(primary_pattern,
                                      match.group().split('](')[0][1:], line)

                    data[ind] = line

                with open(os.path.join(root, file), 'w') as markdown:
                    markdown.writelines(data)


def modify_changelog():
    """Change first line of the changelog to markdown Heading 1.
    This modification makes sure that in the pdf build, changelog is a new
    chapter.
    """
    with open('src_copy/src/CHANGES.md', 'r') as file:
        data = file.readlines()

    data[0] = "# Changelog"

    with open('src_copy/src/CHANGES.md', 'w') as file:
        file.writelines(data)


def edit_titlepage():
    """Add version number and build date of the specification to the titlepage."""
    title, version_number, build_date = extract_header_string()

    with open('cover.tex', 'r') as file:
        data = file.readlines()

    # backslashes are escaped explicitly to avoid invalid escape sequences
    data[-1] = ("\\textsc{\\large "+version_number+"}" +
                "\\\\[0.5cm]" +
                "{\\large " +
                build_date +
                "}" +
                "\\\\[2cm]" +
                "\\vfill" +
                "\\end{titlepage}")

    with open('cover.tex', 'w') as file:
        file.writelines(data)


if __name__ == '__main__':

    duplicated_src_dir_path = 'src_copy/src'

    # Step 1: make a copy of the src directory in the current directory
    copy_src()

    # Step 2: copy BIDS_logo to images directory of the src_copy directory
    copy_bids_logo()

    # Step 3: copy images from subdirectories of src_copy directory
    copy_images(duplicated_src_dir_path)
    subprocess.call("mv src_copy/src/images/images/* src_copy/src/images/",
                    shell=True)

    # Step 4: extract the latest version number, date and title
    extract_header_string()
    add_header()

    edit_titlepage()

    # Step 5: modify changelog to be a level 1 heading to facilitate section
    # separation
    modify_changelog()

    # Step 6: remove all internal links
    remove_internal_links(duplicated_src_dir_path, 'cross')
    remove_internal_links(duplicated_src_dir_path, 'same')