Add FAQ about special characters (#829)

* Add FAQ for extracting special characters
* Update CHANGELOG.md
* Update faq.rst
pdfminer · Nov 5, 2022 · ebf7bcd
1 parent 3688911
commit ebf7bcd

Showing 2 changed files with 28 additions and 0 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
 - Output converter for the hOCR format ([#651](https://github.com/pdfminer/pdfminer.six/pull/651))
 - Font name aliases for Arial, Courier New and Times New Roman ([#790](https://github.com/pdfminer/pdfminer.six/pull/790))
+- Documentation on why special characters sometimes cannot be extracted ([#829](https://github.com/pdfminer/pdfminer.six/pull/829))
 
 ### Fixed
 

diff --git a/docs/source/faq.rst b/docs/source/faq.rst
@@ -39,3 +39,30 @@ improves pdfminer.
 Since 2020, the original pdfminer is `dormant
 <https://github.com/euske/pdfminer#pdfminer>`_, and pdfminer.six is the fork
 which Euske recommends if you need an actively maintained version of pdfminer.
+
+Why are there `(cid:x)` values in the textual output?
+=====================================================
+
+One of the most common issues with pdfminer.six is that the textual output
+contains raw character IDs of the form `(cid:x)`. This is often confusing
+because a PDF viewer displays the text correctly and other text from the
+same PDF is extracted properly.
+
+The underlying problem is that a PDF has two different representations
+of each character. Each character is mapped to a glyph that determines
+how the character is shown in a PDF viewer. Each character is also
+mapped to a Unicode value that is used when copy-pasting the character.
+Some PDFs have incomplete Unicode mappings, which makes it impossible
+to convert those characters to Unicode. In these cases pdfminer.six falls
+back to showing the raw character ID `(cid:x)`.
+
+A quick test to see if pdfminer.six should be able to do better is to
+copy-paste the text from a PDF viewer to a text editor. If the result
+is proper text, pdfminer.six should also be able to extract proper text.
+If the result is gibberish, pdfminer.six will not be able to convert
+the characters to Unicode either.
+
+References:
+
+#. `Chapter 5: Text, PDF Reference 1.7 <https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference>`_
+#. `Text: PDF, Wikipedia <https://en.wikipedia.org/wiki/PDF#Text>`_
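
The FAQ entry added in this commit describes `(cid:x)` placeholders appearing in extracted text. As a rough illustration that is not part of the commit, the sketch below counts such placeholders in a string that has already been extracted, for example the return value of `pdfminer.high_level.extract_text`. The helper name `find_cids` and the sample string are hypothetical, and the pattern assumes the placeholder is always written as `(cid:` followed by decimal digits and `)`.

```python
import re
from collections import Counter

# Assumed placeholder shape: "(cid:" + decimal digits + ")".
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_cids(text: str) -> Counter:
    """Count how often each raw character ID placeholder occurs in text.

    Hypothetical helper for illustration; it only inspects the string and
    does not attempt to recover the missing Unicode mapping.
    """
    return Counter(int(cid) for cid in CID_PATTERN.findall(text))


if __name__ == "__main__":
    # Made-up sample standing in for real extraction output.
    sample = "Total: 42 (cid:3)(cid:17) units (cid:3)"
    for cid, count in sorted(find_cids(sample).items()):
        print(f"cid {cid}: {count} occurrence(s)")
```

A non-empty result suggests the PDF's fonts lack complete Unicode mappings for those character IDs, which is the situation where the copy-paste test from the FAQ entry is worth running.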