
Sometimes, for example when sending a review of a paper, I do not want the pdf file to contain any metadata. Ideally, the editorial process should take care of this, but I do not want to take any chances. This is how I strip all metadata from my pdf files.

First, let's see what metadata is generated by a simple ConTeXt file. Opening the file in Adobe Reader and going to File -> Properties shows the document properties, including the application that produced the file.

So, I am giving away that the file is produced by ConTeXt. There is more metadata that Adobe Reader does not show by default. To see that, I use pdftk.

$ pdftk test.pdf dump_data

InfoKey: Producer
InfoValue: LuaTeX-0.61.0
InfoKey: Creator
InfoValue: ConTeXt - 2010.08.17 13:48
InfoKey: ModDate
InfoValue: D:20100818201404+20'00'
InfoKey: ConTeXt.Time
InfoValue: 2010.08.18 20:14
InfoKey: ConTeXt.Jobname
InfoValue: test
InfoKey: PTEX.Fullbanner
InfoValue: This is LuaTeX, Version beta-0.61.0-2010072816 (Web2C 2010/dev) kpathsea version 6.0.0dev
InfoKey: ConTeXt.Url
InfoValue: www.pragma-ade.com
InfoKey: ConTeXt.Version
InfoValue: 2010.08.17 13:48
InfoKey: ID
InfoValue: test.2010-08-18T20:14:04+20:00
InfoKey: Title
InfoValue: test
InfoKey: CreationDate
InfoValue: D:20100818201404+20'00'
PdfID0: 8d83b9ce1114e6d36afbc553ece4b72
PdfID1: 8d83b9ce1114e6d36afbc553ece4b72
NumberOfPages: 1
PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelNumStyle: DecimalArabicNumerals

The file literally contains a “Made by ConTeXt” badge. Given the small number of ConTeXt users, this might be more than enough to identify me in my research community. I do not want this information in the pdf file.

Fortunately, stripping this information is easy. I use the following function in my .zshrc file:

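# Strip metadata in pdf
strip-metadata() {
   # Dump the metadata, blank every InfoValue, and write the result back.
   # "$1" is quoted so that file names with spaces survive.
   pdftk "$1" dump_data | \
   sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | \
   pdftk "$1" update_info - output "clean-$1"
}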

This function first dumps the file metadata, then blanks the value of every InfoValue entry, and writes the result back as the new metadata. I can then use this as

$ strip-metadata test.pdf

which produces the file clean-test.pdf.
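The blanking step in the middle of the function can be tried on its own, without pdftk, by feeding it a few lines copied from the dump above. The following is only a sketch: blank_info_values is a hypothetical helper name, not part of my .zshrc, and it uses a slightly more portable pattern than the one in the function, since `\s` is a GNU sed extension and may not be available in other sed implementations.

```shell
# Hypothetical helper (an assumption for illustration): read pdftk-style
# dump_data text on stdin and blank every InfoValue line, leaving all
# other lines untouched.
blank_info_values() {
    sed -e 's/^InfoValue:.*/InfoValue: /'
}

# Sample lines taken from the dump shown earlier.
printf 'InfoKey: Producer\nInfoValue: LuaTeX-0.61.0\nNumberOfPages: 1\n' |
    blank_info_values
```

The InfoKey and NumberOfPages lines pass through unchanged, while the InfoValue line loses its value. This is the text that pdftk then reads back through update_info.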

$ pdftk clean-test.pdf dump_data

PdfID0: 8d83b9ce1114e6d36afbc553ece4b72
PdfID1: 8d83b9ce1114e6d36afbc553ece4b72
NumberOfPages: 1
PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelNumStyle: DecimalArabicNumerals

Ah! Hardly any hints to give away. Now, if only obfuscating the font names were so easy.
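Reading the dump by eye works for one file, but a small check is handy when cleaning several. The sketch below assumes the dump format shown above; metadata_is_clean is a hypothetical helper name, and it only looks for InfoValue lines that still carry a value, so it says nothing about the PdfID entries or the fonts.

```shell
# Hypothetical helper (an assumption for illustration): read pdftk-style
# dump_data text on stdin and succeed only if no InfoValue line still
# has a non-blank value.
metadata_is_clean() {
    ! grep -q '^InfoValue: *[^ ]'
}

# Self-contained demonstration on sample text; with a real file this
# would be: pdftk clean-test.pdf dump_data | metadata_is_clean
if printf 'PdfID0: 8d83b9ce\nNumberOfPages: 1\n' | metadata_is_clean; then
    echo "no metadata values left"
fi
```

The function returns a status rather than printing anything, so it can be chained after strip-metadata with && in a script.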