Pandoc Github



Over the past few years, I have used several dedicated note-taking applications to manage my notes. None of the tools I tried was satisfactory: they were either slow or cumbersome when I wanted to search my notes. In the end, I decided to take my notes in Markdown and convert them to PDF with Pandoc for reading. In this post, I summarize how I do it.


A Chinese version of this post is available here.

Taking our notes in Markdown has several advantages:

  • We can edit the Markdown files with our favorite editor, for example, Sublime Text, which means more efficient editing and a more pleasant writing experience.

  • Since a Markdown file is a plain-text file, we can search it with powerful search tools such as grep or ripgrep.

  • We can convert the Markdown files to various formats such as PDF, HTML, epub, mobi, etc., for a better reading experience, with the help of Pandoc.

  • The notes are all text files and are small in size, which means easier and faster syncing or backup between your local PC and the cloud service you use.

In this post, I would like to share how to generate beautiful PDF files from Markdown and give solutions to the issues I have encountered during the process.

Before we begin, make sure that you have installed the following tools:

  • First, Pandoc. After installation, add the path of the Pandoc executable to the system PATH variable.

  • A TeX distribution. Make sure that TeX is installed on your system. You can use TeX Live, MiKTeX, or MacTeX, depending on your platform. You may need to set up the PATH variable [1].

  • A powerful text editor. One of my favorites is Sublime Text. You can also choose VSCode or even Neovim.

Background

For those who are not familiar with it, Pandoc is a powerful tool for converting documents between different formats. It is often called the Swiss Army knife of document converters. Converting a Markdown file to PDF actually involves two steps:

  1. The Markdown file is converted to a LaTeX source file.
  2. Pandoc invokes pdflatex, xelatex, or another TeX command to convert the .tex source file to the final PDF file.

Because I often use non-ASCII characters in my files, and my Markdown files contain quotations, tables, and other complex formatting, I have met a few problems during the conversion process. In the following sections, I explain how to solve these issues.

How to Handle Languages other than English

By default, Pandoc uses the pdflatex command to generate PDF files, which cannot handle Unicode characters well. You will encounter errors when you try to convert Markdown files containing Unicode characters to PDF.

To handle Unicode characters, we need to use the xelatex command instead. For the CJK languages, you also need to use the CJKmainfont option to specify a font that supports the language you are using [2]. In this post, I will use Chinese as an example.

On Windows systems, for Pandoc version 2.0 and above, you can use the following command to generate the PDF file:
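As a minimal sketch of such an invocation, the Python snippet below builds the argument list for a Pandoc 2.0+ XeLaTeX build with a CJK font. The file names test.md and test.pdf and the helper name cjk_pdf_command are placeholders chosen for illustration; only --pdf-engine, -V CJKmainfont, and -o are actual Pandoc options.

```python
import shlex


def cjk_pdf_command(source, output, cjk_font):
    """Return the argument list for a Pandoc >= 2.0 XeLaTeX build."""
    return [
        "pandoc",
        "--pdf-engine=xelatex",           # XeLaTeX handles Unicode input
        "-V", f"CJKmainfont={cjk_font}",  # font used for CJK characters
        source,
        "-o", output,
    ]


# Hypothetical file names; KaiTi is the font discussed in the text.
command = cjk_pdf_command("test.md", "test.pdf", "KaiTi")
print(shlex.join(command))  # the equivalent shell command line
```

Passing the list to subprocess.run(command, check=True) would run the conversion, provided Pandoc and XeLaTeX are on the PATH. Because the arguments are passed as a list, font names containing spaces need no extra quoting.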

In the above command, KaiTi is the name of a font that supports Chinese characters. How do we find a font that supports a particular language? First, you need to know the language code for the language you are using. For example, the language code for Chinese is zh. Then, use the fc-list command to look up the fonts that support this language [3]:

The output of the command looks like the following:

The font name is the string after the font location. Since font names may contain spaces, you need to quote the font name when you use a particular font, e.g., -V CJKmainfont='Source Han Serif CN'.
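The lookup can be sketched as follows, assuming fc-list prints its default "path: family:style=..." format. The sample lines in the usage note are illustrative rather than captured output, and the helper names are hypothetical.

```python
def fc_list_command(language_code):
    """Argument list asking fontconfig for fonts that cover a language."""
    return ["fc-list", f":lang={language_code}"]


def font_family(fc_list_line):
    """Extract the first family name from one line of fc-list output.

    Assumes the default layout 'path: Family[,Alias...]:style=...'.
    """
    _path, _sep, rest = fc_list_line.partition(": ")
    families = rest.split(":style=")[0]
    return families.split(",")[0].strip()


print(fc_list_command("zh"))
```

For an illustrative line such as "/usr/share/fonts/opentype/SourceHanSerifCN-Regular.otf: Source Han Serif CN:style=Regular", font_family returns the name to pass to CJKmainfont.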

In Pandoc version 2.0, the --pdf-engine option replaced the old --latex-engine option. On Linux systems where the installed Pandoc version may be old, the above command will not work. You need to use the following command instead [4]:
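One way to cope with both generations of Pandoc is to pick the flag from the version string, as in this small sketch. The helper name is hypothetical, and the version string would come from the first line of pandoc --version.

```python
def pdf_engine_option(pandoc_version, engine="xelatex"):
    """Pick the engine flag that matches the installed Pandoc version.

    Pandoc 2.0 and later use --pdf-engine; older releases use --latex-engine.
    """
    major = int(pandoc_version.split(".")[0])
    flag = "--pdf-engine" if major >= 2 else "--latex-engine"
    return f"{flag}={engine}"


# 1.12.3.1 is the old version mentioned in the footnote.
legacy_command = ["pandoc", pdf_engine_option("1.12.3.1"),
                  "-V", "CJKmainfont=KaiTi", "test.md", "-o", "test.pdf"]
print(legacy_command)
```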

On Linux systems, the way to find a font that supports your language is the same as on Windows.

Block quote, table and list are not correctly rendered

The reason is that Pandoc requires an empty line before a block quote, list, or table. If the lines in a block quote are not broken correctly, i.e., all the lines are merged into one paragraph, you can add two spaces at the end of each line to force a line break.

Add highlight to block code

Pandoc supports syntax highlighting of code blocks for many languages and offers several highlight themes. To list the highlight themes that Pandoc provides, use the following command:

To list all the languages that Pandoc supports, use the following command:

To use syntax highlighting for a particular language, you need to specify the language on the fenced code block and use the --highlight-style option, e.g.:
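The three commands in this section can be sketched as argument lists. The file names and the helper name are placeholders; the sample Markdown uses a tilde fence, which Pandoc accepts in the same way as a backtick fence.

```python
# Commands that list the available themes and languages.
LIST_STYLES = ["pandoc", "--list-highlight-styles"]
LIST_LANGUAGES = ["pandoc", "--list-highlight-languages"]

# A fenced code block that names its language, as it would appear in test.md.
SAMPLE_MARKDOWN = "~~~python\nprint('hello')\n~~~\n"


def highlighted_pdf_command(source, output, style="zenburn"):
    """Argument list for a PDF build with a chosen highlight theme."""
    return ["pandoc", "--pdf-engine=xelatex",
            "--highlight-style", style, source, "-o", output]


print(highlighted_pdf_command("test.md", "test.pdf"))
```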

In the above command, we use the zenburn theme. I also recommend the tango and breezedark themes.

Use numbered section and add TOC

By default, the generated PDF has no table of contents (TOC) and the headers are not numbered [5]. To add a TOC, use the --toc option; to add section numbers, use the -N option. A complete example is as follows:
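A complete invocation might look like the sketch below, which combines the options introduced so far. The file names and the font are placeholders.

```python
# Every flag here is discussed in the text; the file names are hypothetical.
complete_command = [
    "pandoc",
    "--pdf-engine=xelatex",
    "-V", "CJKmainfont=KaiTi",
    "--highlight-style", "zenburn",
    "--toc",  # add a table of contents
    "-N",     # number the section headers
    "test.md",
    "-o", "test.pdf",
]
print(" ".join(complete_command))
```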

Add colors to links

According to the Pandoc user guide, we can add colors to different links via the colorlinks option to distinguish the links from normal text:

colorlinks: add color to link text; automatically enabled if any of linkcolor, filecolor, citecolor, urlcolor, or toccolor are set

To customize the color of different types of links, Pandoc offers several options:

linkcolor, filecolor, citecolor, urlcolor, toccolor: color for internal links, external links, citation links, linked URLs, and links in table of contents, respectively: uses options allowed by xcolor, including the dvipsnames, svgnames, and x11names lists

For example, to set the URL color to NavyBlue and the TOC color to Red, we can use the following command:
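The color variables are passed with -V, which the following sketch wraps in a small hypothetical helper. The file names are placeholders.

```python
def link_color_args(**colors):
    """Turn keyword arguments such as urlcolor='NavyBlue' into -V options."""
    args = ["-V", "colorlinks"]
    for option, color in colors.items():
        args += ["-V", f"{option}={color}"]
    return args


color_command = (["pandoc", "--pdf-engine=xelatex", "--toc"]
                 + link_color_args(urlcolor="NavyBlue", toccolor="Red")
                 + ["test.md", "-o", "test.pdf"])
print(color_command)
```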

Note that the urlcolor option will not color raw URLs in the text. To color those raw links, enclose them in <>, e.g., <www.google.com>.

Change the PDF margin


The default margin of the generated PDF is too large. According to the Pandoc FAQ, you can use the following option to change the margin:

The complete command is:
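As a sketch, the margin can be set through the geometry variable, which Pandoc's LaTeX template hands to the geometry package. The 2cm value, the file names, and the helper name are placeholders.

```python
def margin_args(margin):
    """Options that set all four page margins via the geometry variable."""
    return ["-V", f"geometry:margin={margin}"]


margin_command = (["pandoc", "--pdf-engine=xelatex"]
                  + margin_args("2cm")
                  + ["test.md", "-o", "test.pdf"])
print(" ".join(margin_command))
```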

Error when using backslash inside Markdown

In ordinary Markdown, it is fine to use backslash characters in a file. Pandoc, however, interprets a backslash and the string after it as a LaTeX command by default. As a result, you may encounter weird errors when compiling Markdown files that contain backslashes. Based on the discussions here and here, the solution is to make Pandoc treat the file as plain Markdown and not interpret LaTeX commands. You need to use the following flag:
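In Pandoc this is done by disabling the raw_tex extension on the input format. The sketch below assumes the default markdown reader; the file names and the helper name are placeholders.

```python
def plain_markdown_command(source, output):
    """Argument list that reads Markdown with the raw_tex extension disabled."""
    return ["pandoc", "--pdf-engine=xelatex",
            "-f", "markdown-raw_tex",  # do not parse backslash commands as LaTeX
            source, "-o", output]


print(plain_markdown_command("test.md", "test.pdf"))
```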

Alternatively, you can write two backslashes (\\) to represent a literal backslash. If you want to show a LaTeX command literally, enclose it in inline code, like this: \texttt{}.

Add background color to inline code

When translating a Markdown source file to TeX, Pandoc uses the \texttt command to represent inline code, so inline code has no background color in the generated PDF. To make inline code more readable, we can redefine the \texttt command to add a background color to the text.

First, we need to create a file named head.tex and add the following settings to it:

When converting Markdown files, use the -H option to include the head.tex file, e.g.:
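The sketch below shows one possible head.tex for this purpose and the matching command. The color name bgcolor, its value E0E0E0, and the helper names are assumptions for illustration. One known limitation of \colorbox is that it does not break across lines, so long inline code can run past the margin.

```python
from pathlib import Path

# Assumed head.tex contents: wrap \texttt in a grey \colorbox.
HEAD_TEX = r"""\usepackage{xcolor}
\definecolor{bgcolor}{HTML}{E0E0E0}
\let\oldtexttt\texttt
\renewcommand{\texttt}[1]{\colorbox{bgcolor}{\oldtexttt{#1}}}
"""


def write_head_tex(path="head.tex"):
    """Write the header snippet to disk and return its path."""
    target = Path(path)
    target.write_text(HEAD_TEX, encoding="utf-8")
    return target


def header_command(source, output, header="head.tex"):
    """Argument list that includes the header file via -H."""
    return ["pandoc", "--pdf-engine=xelatex", "-H", header, source, "-o", output]


print(header_command("test.md", "test.pdf"))
```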

In the generated PDF, inline code will have a grey background. You can change the background color as you wish.

Change the default style of block quote

By default, when converting Markdown to PDF, Pandoc uses the quote environment for Markdown block quotes. The text inside a quotation is only indented, which makes the quotation hard to recognize.

We can create a custom environment to add a background color and indentation to the quotation environment. Add the following settings to head.tex:

To convert a Markdown file to PDF, you can then use the following command:

The produced PDF is like the following:

References

  • Change background color of quotation.
  • Redefining existing environment in LaTeX.

Put the settings in head.tex

You may have noticed how clumsy it becomes to customize many settings. When converting Markdown to PDF, we often need several settings, and specifying all of them on the command line is time-consuming and cumbersome to edit. A good way to ease the issue is to put some of the settings in the head.tex file and refer to this file during conversion.

For example, we can put the settings related to margins, inline code highlighting, and link colors in head.tex:


Nested list level exceed the limit

One reader, Karl Liu, mentioned that if the nested list level exceeds 6, you will encounter the following error when trying to generate the PDF file:

! LaTeX Error: Too deeply nested.

A more detailed discussion can be found here. The proposed solution is to add the following settings to head.tex:


Add the -H head.tex option when compiling PDF files.

Add anchors in Markdown

I tried to use anchors in Markdown following the discussion here. Unfortunately, the anchor does not work in the generated PDF: when I click the link text, there is no jump to the destination page.

Instead, we should use an attribute to give an id to the location we want to jump to, and then refer to that id elsewhere. Here is an example:
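A sketch of the syntax, generated from Python so that the pieces are explicit: Pandoc's Markdown accepts an {#id} attribute on headers and on bracketed spans, and an ordinary link to #id jumps there. The helper names and the identifiers are made up for illustration.

```python
def anchored_heading(title, identifier, level=1):
    """A header carrying an explicit identifier, e.g. '## Setup {#setup}'."""
    return f"{'#' * level} {title} {{#{identifier}}}"


def anchored_span(text, identifier):
    """Inline text carrying an identifier, e.g. '[note]{#my-note}'."""
    return f"[{text}]{{#{identifier}}}"


def link_to(text, identifier):
    """An internal link pointing at an identifier."""
    return f"[{text}](#{identifier})"


document = "\n\n".join([
    anchored_heading("Setup", "setup", level=2),
    "Some text with an " + anchored_span("important remark", "remark") + ".",
    "See " + link_to("the setup section", "setup") + " and "
    + link_to("the remark", "remark") + ".",
])
print(document)
```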

How to resize image

We can also resize images using attributes. You can specify the width or height in absolute pixel values or as a percentage relative to the page or column width. For example:
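The attribute goes in braces right after the image, as this sketch shows. The helper name, the caption, and the path are placeholders.

```python
def sized_image(caption, path, **attributes):
    """Markdown for an image with Pandoc attributes such as width or height."""
    attrs = " ".join(f"{key}={value}" for key, value in attributes.items())
    return f"![{caption}]({path}){{ {attrs} }}"


print(sized_image("Logo", "logo.png", width="50%"))
print(sized_image("Logo", "logo.png", width="300px", height="200px"))
```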

How to start a new page for each section

By default, when you generate a PDF from Markdown files, a section started by a level 1 header does not start on a new page: it continues from where the previous section ends. If you want each new section to start on a new page, you need to add the following settings to head.tex, according to this:

But when I tried to produce a PDF with the updated head.tex file, I got an error:

According to the discussion here, this is because Pandoc's default LaTeX template redefines the \paragraph command, and we have to disable this behaviour. We need to use -V subparagraph when invoking the pandoc command:
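One common way to get a page break before every section is the titlesec package's \sectionbreak hook. The sketch below assumes that approach; the helper name and file names are placeholders.

```python
# Assumed head.tex addition: titlesec calls \sectionbreak before each \section.
SECTION_BREAK_TEX = r"""\usepackage{titlesec}
\newcommand{\sectionbreak}{\clearpage}
"""


def section_break_command(source, output, header="head.tex"):
    """Argument list with -V subparagraph so the template works with titlesec."""
    return ["pandoc", "--pdf-engine=xelatex",
            "-H", header,
            "-V", "subparagraph",  # keep the template from redefining \paragraph
            source, "-o", output]


print(section_break_command("test.md", "test.pdf"))
```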

Start a new page only after TOC

What if we only want to add a new page after the table of contents? An easy way is to hack the \tableofcontents command. Add the following to head.tex to redefine the \tableofcontents command:
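A sketch of such a redefinition, stored as a Python string that can be appended to head.tex. The name \oldtoc and the helper name are arbitrary choices for illustration.

```python
# Assumed head.tex addition: save the original command, then wrap it.
TOC_NEWPAGE_TEX = r"""\let\oldtoc\tableofcontents
\renewcommand{\tableofcontents}{\oldtoc\newpage}
"""


def append_to_header(existing_header, snippet=TOC_NEWPAGE_TEX):
    """Return header text with the snippet appended on its own lines."""
    if existing_header and not existing_header.endswith("\n"):
        existing_header += "\n"
    return existing_header + snippet


print(append_to_header(r"\usepackage{xcolor}"))
```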

In the above command, we first save the old command and then redefine it, which avoids a recursive call.

Line breaks

In Markdown, you can create a hard linebreak by appending two spaces after a line:

Using spaces at the end of a line for formatting is annoying, since it triggers trailing whitespace warnings. The space characters are also not visible.

Pandoc also provides the escaped_line_breaks extension. With it, a backslash at the end of a line, followed by a newline character, represents a hard line break:
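Both styles can be sketched side by side. The helper below is hypothetical and simply joins lines with either marker.

```python
def with_hard_breaks(lines, style="backslash"):
    """Join lines so that each line ends with a Markdown hard line break."""
    markers = {
        "spaces": "  ",     # two trailing spaces (invisible in most editors)
        "backslash": "\\",  # escaped_line_breaks: backslash before the newline
    }
    return (markers[style] + "\n").join(lines)


verse = ["Roses are red", "Violets are blue"]
print(with_hard_breaks(verse, "backslash"))
print(repr(with_hard_breaks(verse, "spaces")))
```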

Images references

Pandoc supports LaTeX commands inside Markdown. To refer to an image, you can use the LaTeX syntax:

It is cumbersome to switch to the terminal, run Pandoc to generate the PDF file, and preview it each time I finish editing a Markdown file. To simplify the process, I use the Sublime Text build system to build and preview the PDF file. For previewing, I use the lightweight Sumatra PDF reader.


An example build system is shown below:


You can download the build system and the head.tex file here.

Pandoc is not recognized on Windows systems

For reasons unknown to me, when I used the above build system to compile Markdown files, I encountered the following error:

‘pandoc’ is not recognized as an internal or external command, operable program or batch file.

After looking up the Sublime Text documentation, I found that we can add a path in the build system, so I adjusted the build system above:


After that, everything goes well.

In this post, I have given a complete summary of how to generate beautiful PDF files from Markdown and shared solutions to the issues I have encountered. I hope that you can now generate beautiful PDFs from your own Markdown files.

  • Anchors in Pandoc
  • Pandoc hard line break
  1. Make sure that you can use the latex command on the command line.

  2. For other languages, you need to set the mainfont variable with -V instead.

  3. On Windows, the fc-list command is available after installing the full edition of TeX Live. On Linux systems, this command is usually pre-installed.

  4. Tested on Pandoc version 1.12.3.1.

  5. Only the font size varies between header levels.

Markdown has become the de-facto standard for writing software documentation. This post documents my experience using Pandoc to convert Word documents (docx) to markdown.

To follow along, install Pandoc, if you haven’t done so already. Word documents need to be in the docx format. Legacy binary doc files are not supported.

Pandoc supports several flavors of markdown such as the popular GitHub flavored Markdown (GFM). To produce a standalone GFM document from docx, run
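A sketch of the conversion command as a Python argument list follows. The helper name and report.docx are placeholders; -s, -t gfm, --extract-media, and -o are actual Pandoc options.

```python
from pathlib import Path


def docx_to_gfm_command(docx_path, media_dir="."):
    """Argument list that converts a .docx file to GitHub Flavored Markdown."""
    docx = Path(docx_path)
    if docx.suffix.lower() != ".docx":
        raise ValueError("Pandoc reads .docx files, not legacy .doc files")
    return [
        "pandoc", "-s", str(docx),
        "-t", "gfm",                      # GitHub Flavored Markdown output
        f"--extract-media={media_dir}",   # images are written under media_dir
        "-o", str(docx.with_suffix(".md")),
    ]


print(docx_to_gfm_command("report.docx"))
```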

The --extract-media option tells Pandoc to extract media to a ./media folder.


Creating a PDF

To create a PDF, run
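As a sketch, the command can be built with the two optional flags as toggles. The helper name and file names are placeholders.

```python
def markdown_to_pdf_command(markdown_path, pdf_path, toc=True, number_sections=True):
    """Argument list for a PDF build with optional TOC and section numbers."""
    command = ["pandoc", markdown_path]
    if number_sections:
        command.append("-N")     # number sections automatically
    if toc:
        command.append("--toc")  # generate a table of contents
    return command + ["-o", pdf_path]


print(markdown_to_pdf_command("file.md", "file.pdf"))
```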

Pandoc requires LaTeX to produce the PDF. Remove the --toc option if you don’t want Pandoc to create a table of contents (TOC). Remove the -N option if you don’t want it to number sections automatically.

Markdown Editor

You’ll need a text editor to edit a markdown file. I use vscode. It has built-in support for editing and previewing markdown files. I use a few additional plugins to make editing markdown files more productive.

HTML in Markdown

GFM allows HTML blocks in markdown. These get rendered when previewed in vscode, GitHub, or GitLab. Pandoc suppresses raw HTML output to PDF format and hence HTML blocks get rendered as plain text. For example, <sup>1</sup> gets rendered as (1) instead of (^1). You can use ^text^ in Pandoc’s markdown syntax to render superscript.

You can use HTML character entities to write out characters and symbols not available on the keyboard.

Tables

Pandoc converts docx tables whose cells each contain a single line of text to the pipe table syntax. Column text alignment is not preserved, but you can add it back using colons. Relative column widths can be specified using dashes. Pipe table cells with long text or images may stretch beyond the page.

Tables in docx that have complex data in cells, such as lists and multiple lines, are converted to HTML table syntax. That is highly unfortunate because Pandoc renders HTML tables to PDF as plain text.

It is not unusual for docx tables with complex layouts, such as merged cells, to lose columns or rows during conversion. I suggest simplifying such tables in the original docx before conversion.

Review all tables very carefully!

I’ve obtained nice results with Pandoc’s grid table syntax, but these tables cannot be previewed in vscode, GitHub, or GitLab.

Table of Contents

Pandoc converts a TOC in docx to a sequence of lines, where each line corresponds to a topic or section. Section headings are generated without numbering. I suggest deleting the TOC and using the command line options discussed earlier to number sections and to render the TOC.

If you have cross-references in docx that use section numbers, you can generate a hyperlinked TOC using the Markdown TOC plugin of vscode. The plugin can also add, update, or remove section numbers.


I suggest avoiding section numbers for cross-referencing and using hyperlinked section references instead.

Images

Images are exported to their native format and size. They are rendered in GFM using the ![[caption]](path) syntax. Image sizes cannot be customized in GFM syntax, but Pandoc’s markdown syntax allows setting image attributes such as width using the ![[caption]](path){key1=value1 key2=value2} syntax.

Figures


Pandoc does not convert vector diagrams created using Word’s figures and shapes. You’ll need to screen grab, or copy and paste, the image rendered by Word.

You can use mermaid.js syntax to recreate diagrams such as flowcharts and message sequence charts. mermaid.js syntax can be embedded in markdown and converted using mermaid-filter.

GitHub doesn’t yet allow you to preview mermaid.js diagrams, but GitLab does. vscode is able to preview them using the Markdown Preview Mermaid Support plugin.

Captions

Pandoc converts captions in the docx as plain text positioned after an image or table. I suggest using Pandoc’s native markdown syntax for captions.

Cross-references

GFM does not natively support linking to figures and tables, and HTML anchors are not a viable option with Pandoc. Link to the section containing a figure or table when referencing it from other parts of the document.

Figure and table numbers in docx may sometimes go missing from cross-references.

I suggest reviewing captions and cross-references very carefully!

Large Documents

Pandoc can handle large documents that have hundreds of pages. You may want to maintain large documents in separate markdown files. This makes concurrent editing productive and allows for reuse. It also allows for faster previews on GitHub or GitLab. In fact, previewing may entirely fail to work for complex documents. You may want to pre-render such documents to HTML using Pandoc.


Pandoc can convert multiple markdown files in one run: the input files are concatenated in the order given before conversion.

Regular Expressions

Using regular expressions significantly speeds up searching and replacing text. Some examples follow:

  • Empty heading

    ^#+\s*$

  • Line with trailing spaces

    \s+$

  • Repeated whitespace between words

    \b\s\s+\b

  • Whitespace before , ; or .

    \s+[,;.]

  • Paragraph starts with a lowercase letter

    \n\n[a-z]

  • Word figure not followed by a number

    figure\s+(?!([\d]){1,})

  • Word section not followed by a number

    section\s+(?!(\d+.*\d*?){1,})
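To try some of these patterns outside an editor, here is a minimal sketch using Python's re module. It checks one line at a time, so the paragraph-level pattern is left out. The labels and the check_line helper are made up for illustration, and the patterns assume the backslash escapes (\s, \b, \d) that editors such as vscode use.

```python
import re

# Line-level patterns from the list above, keyed by a descriptive label.
PATTERNS = {
    "empty_heading": re.compile(r"^#+\s*$"),
    "trailing_space": re.compile(r"\s+$"),
    "repeated_whitespace": re.compile(r"\b\s\s+\b"),
    "space_before_punct": re.compile(r"\s+[,;.]"),
    "figure_without_number": re.compile(r"figure\s+(?!([\d]){1,})"),
}


def check_line(line):
    """Return the labels of all patterns that match a single line of text."""
    return [label for label, pattern in PATTERNS.items() if pattern.search(line)]


for sample in ["## ", "see  figure 2 , ok", "as in figure above"]:
    print(repr(sample), check_line(sample))
```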