You copy text from a PDF and paste it somewhere else, and the result looks wrong. Characters are out of order, ligatures like "fi" turn into a single ligature character ("ﬁ") or disappear, words run together without spaces, or special characters turn into question marks. This is a PDF text encoding problem, and its specific causes explain both why it happens and what can be done about it.

How PDF Stores Text — and Why It Goes Wrong
PDF was designed primarily as a visual format — it describes exactly how a page looks, not what the text means. The internal text encoding in a PDF can be quite different from standard Unicode. Some PDFs use custom glyph mappings where the character codes stored internally don't correspond to standard letter codes — so when you copy, the clipboard receives the internal codes rather than the characters you see.
A well-constructed PDF includes a ToUnicode mapping table that tells the viewer how to translate internal codes to standard Unicode characters. When this table is missing, incomplete, or incorrect, copy-paste produces garbled results even though the text displays perfectly on screen. The display and the copyable text come from different systems — display uses the visual glyph, copy-paste uses the text data.
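As a rough illustration of the role the ToUnicode table plays, the sketch below models a font with a custom encoding as a Python dictionary. The character codes, the table contents, and the function name extract_text are invented for the example; a real PDF stores this mapping as a CMap stream, and real viewers use their own fallback rules when an entry is missing.

```python
# Hypothetical font whose internal character codes are arbitrary glyph indices.
# In a real PDF the mapping lives in a ToUnicode CMap; a dict stands in for it.
TO_UNICODE = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}


def extract_text(codes, to_unicode):
    """Translate internal codes to Unicode, roughly as a viewer does on copy."""
    out = []
    for code in codes:
        # A missing entry is where copy-paste goes wrong. This sketch emits
        # U+FFFD (the replacement character) when no mapping exists.
        out.append(to_unicode.get(code, "\ufffd"))
    return "".join(out)


codes = [0x01, 0x02, 0x03, 0x03, 0x04]  # drawn on the page as "Hello"

# Same codes, once with the full table and once with the entry for "l" removed.
complete = extract_text(codes, TO_UNICODE)
incomplete = extract_text(codes, {k: v for k, v in TO_UNICODE.items() if k != 0x03})

print(complete)
print(incomplete)
```

Both versions would draw identically on screen, because drawing uses the glyphs. Only the copied text differs.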
Ligatures and Special Characters
Ligatures are typographic combinations such as "fi", "fl", "ff", and "ffi", where two or three characters are joined into a single glyph for aesthetic reasons. In a poorly encoded PDF, the ligature glyph has no ToUnicode mapping to the individual characters it represents. When copied, the ligature becomes a single special character ("ﬁ", U+FB01, instead of the two letters "fi"), becomes nothing, or becomes a placeholder symbol.
This is why copying from some professionally typeset PDFs produces text with missing letters — words like "office" become "o ce" because the "ffi" ligature had no usable Unicode mapping. The word looked correct on screen; the underlying text data was broken.
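When the ligature survives the paste as a single ligature character, it can usually be expanded back into plain letters after the fact. A minimal sketch using Python's standard unicodedata module follows; the function name expand_ligatures and the sample sentence are made up for illustration. It cannot help when the ligature was dropped entirely, since there is then nothing left to expand.

```python
import unicodedata


def expand_ligatures(text):
    """Replace compatibility characters such as U+FB01 with plain letters."""
    # NFKC normalization decomposes the Latin ligature presentation forms
    # (U+FB00 to U+FB06) into their component letters. It also rewrites other
    # compatibility characters (superscript digits, for example), so apply it
    # only where that side effect is acceptable.
    return unicodedata.normalize("NFKC", text)


# Pasted text containing the ffi, fi, and fl ligature characters.
pasted = "The o\ufb03ce \ufb01le was \ufb02agged."
print(expand_ligatures(pasted))
```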
Missing Spaces Between Words
Some PDFs represent spaces not as actual space characters in the text stream but as positional offsets — the viewer renders a gap between words by moving the cursor position, not by inserting a space character. When copying, the positional offset isn't translated to a space character, so words run together: "theword" instead of "the word".
This is common in PDFs exported from design applications like InDesign or Illustrator when text spacing is controlled at the design level rather than through standard text encoding.
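Text extraction tools commonly work around this by inferring spaces from the gaps between glyph positions. The sketch below shows the idea under simplified assumptions: the Glyph record, the coordinates, and the 1.5-point threshold are all hypothetical, and a real extractor would scale the threshold to the font size.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float      # left edge, in points
    width: float  # advance width, in points


def rebuild_spaces(glyphs, gap_threshold=1.5):
    """Join glyphs on one line, inserting a space wherever the gap is wide."""
    out = []
    prev_end = None
    for g in glyphs:
        # Compare this glyph's left edge with the previous glyph's right edge.
        if prev_end is not None and g.x - prev_end > gap_threshold:
            out.append(" ")
        out.append(g.char)
        prev_end = g.x + g.width
    return "".join(out)


# "the" ends at x=12 and "word" starts at x=15: a 3-point positional offset
# with no space character stored in the text stream.
line = [
    Glyph("t", 0, 3), Glyph("h", 3, 5), Glyph("e", 8, 4),
    Glyph("w", 15, 7), Glyph("o", 22, 5), Glyph("r", 27, 3), Glyph("d", 30, 5),
]
print(rebuild_spaces(line))
```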
Column and Reading Order Issues
In a multi-column PDF, the visual reading order (down column one, then down column two) may not match the internal text order (left to right across the full page width). Copying text from a two-column layout often produces text that alternates between columns line by line, making it appear scrambled even though each individual word is correct.
This isn't an encoding problem — it's a reading order problem. The text is correctly encoded; it's just stored in an order that doesn't match how a human would read it. The fix is to copy text from one column at a time rather than selecting across both columns.
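If the text fragments and their positions are available, the same column-by-column idea can be applied programmatically. The following is only a sketch: the (x, y, text) tuples, the split_x boundary, and the assumption that y grows downward from the top of the page are all choices made for this example (native PDF coordinates start at the bottom of the page).

```python
def reorder_two_columns(fragments, split_x):
    """Sort (x, y, text) fragments: left column top to bottom, then right."""
    def key(frag):
        x, y, _ = frag
        # Fragments left of the boundary belong to column 0, the rest to 1.
        column = 0 if x < split_x else 1
        return (column, y)

    return [text for _, _, text in sorted(fragments, key=key)]


# Stored order alternates between the columns, line by line.
stored = [
    (50, 100, "Left line 1"), (320, 100, "Right line 1"),
    (50, 115, "Left line 2"), (320, 115, "Right line 2"),
]
print(reorder_two_columns(stored, split_x=300))
```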
What to Do When Copied Text Is Garbled
- Try a different PDF viewer: different viewers handle ToUnicode mapping differently. If Chrome's copy produces garbled text, try copying from Adobe Reader — it often produces cleaner results for the same PDF.
- Convert to Word first: a PDF to Word converter reprocesses the text encoding during conversion. The resulting Word document often produces clean copy-paste even when the original PDF didn't.
- Run OCR on a copy: OCR tools re-read the visible text from page images and create fresh, correctly encoded text. The OCR PDF result may produce better copy-paste than the original encoding, particularly for poorly encoded professional typesetting.
- Use Find & Replace for common errors: if the same ligature or character consistently pastes incorrectly, paste the text into Word and use Find & Replace to fix the recurring error throughout.
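For long documents, the Find & Replace step can also be scripted. The sketch below assumes a hypothetical PDF in which curly quotes consistently paste as accented capital letters; the FIXES table and the function name apply_fixes are illustrative, and the table should be built from the errors you actually observe in your own pasted text.

```python
# Hypothetical table of recurring paste errors and their corrections.
# A blind replacement like this would also change legitimate uses of these
# letters, so it only suits text where they are known to be errors.
FIXES = {
    "\u00d2": "\u201c",  # capital O with grave  -> left double quotation mark
    "\u00d3": "\u201d",  # capital O with acute  -> right double quotation mark
    "\u00d5": "\u2019",  # capital O with tilde  -> right single quotation mark
}


def apply_fixes(text, fixes=FIXES):
    """Apply every replacement in the table across the whole text."""
    for wrong, right in fixes.items():
        text = text.replace(wrong, right)
    return text


pasted = "\u00d2don\u00d5t panic\u00d3"
print(apply_fixes(pasted))
```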
Preventing the Problem at the Source
If you're creating PDFs and want to ensure clean copy-paste behavior for recipients, use applications that generate correct ToUnicode mappings. Microsoft Word exports with proper Unicode mapping by default. Adobe InDesign can export with or without proper text encoding depending on settings: in the Export PDF dialog, ensure "Use document structure for tab order" and text accessibility options are enabled. Before distributing an exported PDF, test copy-paste from it so that encoding problems are caught before they reach recipients.
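Part of that copy-paste test can be automated by scanning the pasted text for characters that tend to indicate encoding trouble. The checker below is a heuristic sketch, not a complete validator: the function name find_suspect_characters and the three character ranges it flags are choices made for this example.

```python
def find_suspect_characters(text):
    """Return (index, code point, reason) for characters that hint at trouble."""
    findings = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp == 0xFFFD:
            reason = "replacement character"
        elif 0xE000 <= cp <= 0xF8FF:
            reason = "private use area"
        elif 0xFB00 <= cp <= 0xFB06:
            reason = "ligature presentation form"
        else:
            continue
        findings.append((i, f"U+{cp:04X}", reason))
    return findings


# Text pasted from a test export: one ffi ligature and one unmapped character.
sample = "The o\ufb03ce report\ufffd"
for finding in find_suspect_characters(sample):
    print(finding)
```

An empty result does not prove the export is clean, since missing spaces and reading order problems leave no unusual characters behind, so a quick visual check of the pasted text is still worthwhile.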
