Sometimes, for example when sending a review of a paper, I do not want the pdf file to contain any metadata. Ideally, the editorial process should take care of this, but I do not want to take any chances. This is how I strip all metadata from my pdf files.
First, let's see what metadata is generated by a simple ConTeXt file. Opening the file in Adobe Reader and going to File -> Properties shows the document properties, including the application that produced the file.
So, I am giving away that the file is produced by ConTeXt. There is more metadata that Adobe Reader does not show by default. To see that, I use pdftk.
$ pdftk test.pdf dump_data
InfoKey: Producer
InfoValue: LuaTeX-0.61.0
InfoKey: Creator
InfoValue: ConTeXt - 2010.08.17 13:48
InfoKey: ModDate
InfoValue: D:20100818201404+20'00'
InfoKey: ConTeXt.Time
InfoValue: 2010.08.18 20:14
InfoKey: ConTeXt.Jobname
InfoValue: test
InfoKey: PTEX.Fullbanner
InfoValue: This is LuaTeX, Version beta-0.61.0-2010072816 (Web2C 2010/dev) kpathsea version 6.0.0dev
InfoKey: ConTeXt.Url
InfoValue: www.pragma-ade.com
InfoKey: ConTeXt.Version
InfoValue: 2010.08.17 13:48
InfoKey: ID
InfoValue: test.2010-08-18T20:14:04+20:00
InfoKey: Title
InfoValue: test
InfoKey: CreationDate
InfoValue: D:20100818201404+20'00'
PdfID0: 8d83b9ce1114e6d36afbc553ece4b72
PdfID1: 8d83b9ce1114e6d36afbc553ece4b72
NumberOfPages: 1
PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelNumStyle: DecimalArabicNumerals
The file literally contains a “Made by ConTeXt” badge. Given the small number of ConTeXt users, this might be more than enough to identify me in my research community. I do not want this information in the pdf file.
Fortunately, stripping this information is easy. I use the following function in my .zshrc file:
# Strip metadata in pdf
strip-metadata() {
    # Dump the metadata, blank every InfoValue, and feed the result back in
    pdftk "$1" dump_data | \
    sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | \
    pdftk "$1" update_info - output "clean-$1"
}
This function first dumps the file's metadata, then blanks the value of every InfoValue field, and finally writes the result back as the new metadata. I can then use it as
$ strip-metadata test.pdf
which produces the file clean-test.pdf.
$ pdftk clean-test.pdf dump_data
PdfID0: 8d83b9ce1114e6d36afbc553ece4b72
PdfID1: 8d83b9ce1114e6d36afbc553ece4b72
NumberOfPages: 1
PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelNumStyle: DecimalArabicNumerals
Ah! Hardly any hints to give away. Now, if only obfuscating the font names were so easy.
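The sed step of the function can be tried on its own, without a PDF or pdftk at hand. The following is a minimal sketch under a few assumptions: the name blank_info_values is hypothetical, the sample text is a shortened copy of the dump shown earlier, and the POSIX class [[:space:]] stands in for the GNU-specific \s.

```shell
# blank_info_values (hypothetical helper): read dump_data-style text on
# stdin and blank every InfoValue, leaving all other lines untouched.
blank_info_values() {
  sed -e 's/^\(InfoValue:\)[[:space:]].*/\1 /'
}

# Sample input, shortened from the pdftk dump shown earlier.
printf '%s\n' \
  'InfoKey: Producer' \
  'InfoValue: LuaTeX-0.61.0' \
  'InfoKey: Creator' \
  'InfoValue: ConTeXt - 2010.08.17 13:48' \
  'NumberOfPages: 1' | blank_info_values
```

The InfoKey and NumberOfPages lines pass through unchanged, while each InfoValue line is reduced to the bare field name followed by a space. That blanked text is what gets handed to update_info.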

Hello,
first, I would like to thank you for the effort,
but it does not seem to work 100% for me the way you described.
For example, I still have the InfoValue entries for time and date in my PDF.
See the following …
Before:
InfoKey: Creator
InfoValue: Writer
InfoKey: Producer
InfoValue: OpenOffice.org 3.2
InfoKey: CreationDate
InfoValue: D:20111114100839+01'00'
PdfID0: a47c319656211821c4ad5bae32c480
PdfID1: a47c319656211821c4ad5bae32c480
After:
InfoKey: Producer
InfoValue: iText 2.1.7 by 1T3XT
InfoKey: ModDate
InfoValue: D:20111114105020+01'00'
PdfID0: 1edbaa630f281906d5871a0c5514cd
PdfID1: 7225af44dda1efbfdac64645b7a35932
NumberOfPages: 1
Do you have any idea what causes this? I’m using Ubuntu 10.04 with pdftk version 1.41+dfsg-7. Thanks.
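One way to see at a glance which fields survive, as in the "After" listing above, is to filter the dump for InfoValue lines that are still non-empty. This is only a sketch: remaining_info is a hypothetical name, and it assumes the InfoKey/InfoValue line layout shown in the dumps on this page. The sample input is piped in so that the snippet runs without pdftk.

```shell
# remaining_info (hypothetical helper): read dump_data-style text on stdin
# and print key=value for every InfoValue that is not blank.
remaining_info() {
  awk '
    /^InfoKey: /   { key = substr($0, 10) }   # text after "InfoKey: "
    /^InfoValue: / {
      val = substr($0, 12)                    # text after "InfoValue: "
      if (val != "") print key "=" val
    }
  '
}

# Sample input modelled on the listing above; Title is already blank.
printf '%s\n' \
  'InfoKey: Producer' \
  'InfoValue: iText 2.1.7 by 1T3XT' \
  'InfoKey: Title' \
  'InfoValue: ' \
  'NumberOfPages: 1' | remaining_info
```

With a real file, the input would come from pdftk clean-test.pdf dump_data instead of printf. No output means no InfoValue is left to leak.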
I was able to use sed to expunge the remaining InfoValue strings that I didn’t want in the output pdf. Running sed directly on the pdf, as seen below, is far from ideal, but the pdf still renders. I think the PdfID lines are some sort of hash, because they’re not stored in the PDF, so I manually redacted them in this paste.
$ pdftk Output.pdf dump_data
InfoKey: Producer
InfoValue: iText 2.1.7 by 1T3XT
InfoKey: ModDate
InfoValue: D:20111115170615Z
PdfID0: 0123456789012345678901234567890
PdfID1: 0123456789012345678901234567
NumberOfPages: 75
$ sed -i 's/iText\ 2\.1\.7\ by\ 1T3XT//;s/D:20111115170615Z//' Output.pdf
$ pdftk Output.pdf dump_data
PdfID0: 0123456789012345678901234567890
PdfID1: 0123456789012345678901234567
NumberOfPages: 75
It would be nice if pdftk let you actually clear these fields, but I assume pdftk leans on the iText library, whose developers do not permit you to violate the spec. I don’t care.
I don’t understand the pdf spec well enough to know if the resultant pdf is still valid. A slightly better option would be to first uncompress the pdf (with pdftk's uncompress output option), remove the offending fields, and then compress the pdf again.
Upon compressing, it adds the Producer and ModDate back, which defeats the point.
Here’s my finished script that redacts as much as it can with pdftk, then gets out the spiked club.
#!/bin/bash
echo "Your original headers (which would have been leaked):"
pdftk "${1}" dump_data
# Reduce likelihood of the timezone leaking
export TZ="GMT"
# Filter out InfoValues with pdftk first
pdftk "${1}" dump_data | \
sed -e 's/\(InfoValue:\)\s.*/\1\ /g;s/\(PdfID.:\)\s.*/\1\ /g' | \
pdftk "${1}" update_info - output "${1}.tmp"
# Enumerate the ones pdftk failed to remove.
StringsToRedact="$(
pdftk "${1}.tmp" dump_data | \
grep -e '^InfoValue:\ ' -e '^PdfID.:\ ' | sed 's/^.*:\ //'
)"
# Remove them, the hard way…
while IFS= read -r StringToRedact
do
# Skip empty lines; an empty pattern would make sed fail.
[ -n "${StringToRedact}" ] || continue
# Edit the temporary copy in place (GNU sed reads the script from stdin).
sed -i -f - "${1}.tmp" <<<"s/${StringToRedact}//"
done <<<"${StringsToRedact}"
# Eviscerate the original file
shred "${1}"
mv "${1}.tmp" "${1}"
echo "Examining resultant file…"
pdftk "${1}" dump_data
evince "${1}"
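The script above pastes each leftover string straight into a sed expression, so a value containing a slash would break the s/…// command, and a dot matches any character rather than a literal dot. A possible refinement, sketched here with a hypothetical helper named escape_sed_pattern, is to backslash-escape those characters first. This is an assumption about how the script could be hardened, not something the script itself does.

```shell
# escape_sed_pattern (hypothetical helper): escape characters that are
# special in a basic regular expression, plus the "/" delimiter.
# The helper's own sed command uses "|" as its delimiter so that "/"
# can sit inside the bracket expression.
escape_sed_pattern() {
  printf '%s\n' "$1" | sed -e 's|[][\/.*^$]|\\&|g'
}

pattern="$(escape_sed_pattern 'iText 2.1.7 by 1T3XT')"

# Show the escaped pattern, then use it to delete the literal string.
printf '%s\n' "$pattern"
printf 'Producer (iText 2.1.7 by 1T3XT)\n' | sed "s/${pattern}//"
```

The first line printed is the pattern with its dots escaped, and the second is the sample line with only the exact Producer string removed. In the loop, the escaped value would take the place of the raw string inside the s/…// expression.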