13 thoughts on “Stripping metadata from pdf files”

  1. Hello,
    first, I would like to thank you for the effort,
    but it does not seem to work 100% for me the way you described.
    For example, I still have the InfoValue for the time and date in my PDF.

    See the following …

    Before:
    InfoKey: Creator
    InfoValue: Writer
    InfoKey: Producer
    InfoValue: OpenOffice.org 3.2
    InfoKey: CreationDate
    InfoValue: D:20111114100839+01'00'
    PdfID0: a47c319656211821c4ad5bae32c480
    PdfID1: a47c319656211821c4ad5bae32c480

    After:
    InfoKey: Producer
    InfoValue: iText 2.1.7 by 1T3XT
    InfoKey: ModDate
    InfoValue: D:20111114105020+01'00'
    PdfID0: 1edbaa630f281906d5871a0c5514cd
    PdfID1: 7225af44dda1efbfdac64645b7a35932
    NumberOfPages: 1

    Have you got any idea on this? I’m using Ubuntu 10.04 with pdftk Version: 1.41+dfsg-7. thx

  2. I was able to use sed to expunge the remaining InfoValue strings that I didn’t want in the output PDF, as seen below. This is far from ideal, but the PDF renders. I think the PdfID lines are some sort of hash, because they’re not stored in the PDF, so I manually redacted them in this paste.

    $ pdftk Output.pdf dump_data
    InfoKey: Producer
    InfoValue: iText 2.1.7 by 1T3XT
    InfoKey: ModDate
    InfoValue: D:20111115170615Z
    PdfID0: 0123456789012345678901234567890
    PdfID1: 0123456789012345678901234567
    NumberOfPages: 75
    $ sed -i 's/iText\ 2\.1\.7\ by\ 1T3XT//;s/D:20111115170615Z//' Output.pdf
    $ pdftk Output.pdf dump_data
    PdfID0: 0123456789012345678901234567890
    PdfID1: 0123456789012345678901234567
    NumberOfPages: 75

    It would be nice if pdftk let you actually clear these fields, but I assume pdftk leans on the iText library, whose developers do not permit you to violate the spec. I don’t care.

    • I don’t understand the PDF spec well enough to know whether the resulting PDF is still valid. A slightly better option would be to first uncompress the PDF (with pdftk’s uncompress output option), remove the offending fields, and then compress the PDF again.

  3. Upon compressing, pdftk adds the Producer and ModDate back, which defeats the point.

    Here’s my finished script that redacts as much as it can with pdftk, then gets out the spiked club.

    #!/bin/bash
    echo "Your original headers (which would have been leaked):"
    pdftk "${1}" dump_data

    # Reduce likelihood of timezone leaking
    export TZ="GMT"

    # Filter out InfoValues with pdftk first
    pdftk "${1}" dump_data | \
    sed -e 's/\(InfoValue:\)\s.*/\1 /g;s/\(PdfID.:\)\s.*/\1 /g' | \
    pdftk "${1}" update_info - output "${1}.tmp"

    # Enumerate the ones pdftk failed to remove.
    StringsToRedact="$(
    pdftk "${1}.tmp" dump_data | \
    grep -e '^InfoValue: ' -e '^PdfID.: ' | sed 's/^.*: //'
    )"

    # Remove them, the hard way…
    while IFS= read -r StringToRedact
    do
        # Skip blank entries; an empty pattern would make sed fail
        [ -n "${StringToRedact}" ] || continue
        # Edit the temporary copy made above, not a hard-coded file name
        sed -i -f - "${1}.tmp" <<<"s/${StringToRedact}//"
    done <<<"${StringsToRedact}"

    # Eviscerate the original file
    shred "${1}"

    mv "${1}.tmp" "${1}"

    echo "Examining resultant file…"
    pdftk "${1}" dump_data
    evince "${1}"
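
The redaction loop above drops each leftover string straight into a sed expression, so a value containing characters such as ".", "*" or "/" would be read as regex syntax rather than literal text. Below is a minimal sketch of escaping the string first. The helper names escape_sed_pattern and redact_literal are hypothetical and not part of the script above, and the sketch works on text from stdin rather than on a PDF.

```shell
#!/bin/bash
# Hypothetical helper: backslash-escape the characters that are special in a
# basic regular expression, plus the "/" delimiter, so the string can be used
# literally on the left-hand side of a sed s/// command.
escape_sed_pattern() {
    # "|" is used as the delimiter here so that "/" can sit in the bracket list.
    printf '%s' "$1" | sed -e 's|[][\\/.*^$]|\\&|g'
}

# Hypothetical helper: delete every literal occurrence of $1 from stdin.
redact_literal() {
    local pattern
    pattern="$(escape_sed_pattern "$1")"
    sed -e "s/${pattern}//g"
}

# Usage sketch with the Producer string reported earlier in this thread.
printf 'Producer (iText 2.1.7 by 1T3XT)\n' | redact_literal 'iText 2.1.7 by 1T3XT'
```

With the escaping in place, "2.1.7" only matches those exact characters, not something like "2a1b7".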

  4. Inspired by this blog post:

    #!/bin/bash

    # This script file has to be executable,
    # which can be arranged with: chmod +x metadata

    # $1 is the directory, $2 the file name without the .pdf extension
    pdftk "$1"/"$2".pdf dump_data | \
    sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | \
    pdftk "$1"/"$2".pdf update_info - output "$1"/1-"$2".pdf

    # Usable by running ./metadata "" "" from home;
    # pdftk is required (sudo apt-get install pdftk).

    – However, I have just found out (through http://www.nsa.gov/ia/_files/app/pdf_risks.pdf) that removing the basic PDF v1.0 metadata (that is, the info dictionary) is not enough. To remove other types of metadata, another program has to be used (such as Acrobat Pro), at least until libre software for this appears on the Linux side.

    Cheers!
    twipley

  5. Just as a follow-up: files created with LibreOffice or similar software seem fine to run through pdftk for metadata removal since, at the time of writing, no v1.4 metadata stream is likely to be embedded. For files coming from other sources, though, a more thorough examination is quite justified.
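
One rough way to start such an examination is to search the file's bytes for an XMP packet, which is how a v1.4 metadata stream is normally stored. The sketch below assumes the stream is not compressed, so a miss does not prove the file is clean. The helper name has_xmp_packet is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the file contains the usual markers of an
# XMP packet (the "<?xpacket" wrapper or the "<x:xmpmeta" element).
# -a makes grep treat the binary PDF as text; -q suppresses output.
has_xmp_packet() {
    grep -aq -e '<?xpacket' -e '<x:xmpmeta' -- "$1"
}

# Usage sketch: pass a PDF path as the first argument.
if [ -n "${1:-}" ]; then
    if has_xmp_packet "$1"; then
        echo "$1: XMP metadata packet found"
    else
        echo "$1: no uncompressed XMP packet found"
    fi
fi
```

A hit means pdftk's update_info alone has probably left metadata behind; a miss only means that nothing was stored in plain form.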

  6. This is a simpler method to remove all, absolutely all, the metadata (as well as the page numbers and bookmarks):

    qpdf –empty outfile.pdf –pages infile.pdf

    (manual of qpdf)

    Indeed, it is all gone:

    pdftk outfile.pdf dump_data
    Warning: no info dictionary found
    PdfID0: some hash
    PdfID1: the same hash
    NumberOfPages: the number of pages

  7. Indeed, this method seems faster and simpler, and it works, at least for files output by LibreOffice (as previously noted, Acrobat often detects more metadata than pdftk does).

    The command would therefore be (with the hyphens doubled, and assuming the source file is named “infile.pdf”): qpdf --empty --pages infile.pdf 1-z -- outfile.pdf
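
For files with other names, the qpdf command quoted above can be wrapped in a small function. This is only a sketch: the name strip_with_qpdf and its argument checks are assumptions, while the qpdf options are the ones already given in this thread.

```shell
#!/bin/bash
# Hypothetical wrapper: copy every page of INPUT into a fresh, empty PDF so
# that the info dictionary of the original is left behind.
strip_with_qpdf() {
    if [ "$#" -ne 2 ]; then
        echo "usage: strip_with_qpdf INPUT.pdf OUTPUT.pdf" >&2
        return 2
    fi
    if ! command -v qpdf >/dev/null 2>&1; then
        echo "qpdf is not installed" >&2
        return 127
    fi
    # 1-z selects all pages; the bare "--" ends the --pages option.
    qpdf --empty --pages "$1" 1-z -- "$2"
}

# Usage sketch:
#   strip_with_qpdf report.pdf report-clean.pdf
#   pdftk report-clean.pdf dump_data
```

Checking the result with pdftk's dump_data afterwards, as in comment 6, is still worthwhile.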
