Metadata in ODF Files

A Simple ODF Metadata Reader

Listing 1 shows a script called odfmetareader.sh that follows the Unix philosophy of small tools that each do just one thing but can be connected in a pipeline. It prints, one per line, all the explicit and hidden metadata it finds in the single ODF file passed to it as an argument. Analysis of the output, or its insertion into some database or spreadsheet, is delegated to other tools. You can use this script inside a loop to work on as many files as you like, as shown later in the tutorial. Of course, you also can, and should, change the script to format its output to best suit your needs.

Listing 1

odfmetareader.sh

01 #! /bin/bash
02
03 rm -rf /tmp/odfmetareader
04 mkdir  /tmp/odfmetareader
05 cp $1  /tmp/odfmetareader/odf.zip
06 cd     /tmp/odfmetareader
07
08 unzip odf.zip >& /dev/null
09
10 echo "## METADATA DOC START     for document $1;"
11 echo "## METADATA ODF START     for document $1;"
12
13 # extract explicit ODF metadata
14
15 cat meta.xml | perl -e 'while (<>) {s/document-statistic//g; s/<(meta|dc):([^>]+)>/\n$2=/g; s/user-defined /user-defined-/g; s/<\/(meta|dc).*//g; s/ meta:value-type=/ value-type/g; s/ meta:/\n/g; s/\/=//g; s/<\/office:[^>]+>//g; print} print "\n"' | grep -v '<office:document' | grep -v '^<?xml version' | grep -v '^generator=' | grep '='
16
17 echo "## METADATA ODF END       for document $1;"
18 echo
19
20 # extract metadata about macros
21 if [ -d "Basic" ]
22 then
23   echo "## METADATA MACRO START   for document $1;"
24
25   MACRONUM=`find Basic -type f -name "*xml" | grep -v /script- | wc -l`
26
27   echo "macronumber=$MACRONUM"
28   for M in `find Basic -type f -name "*xml" | grep -v /script-`
29   do
30   echo macrofile:$M
31   grep 'sub ' $M
32   done
33   echo "## METADATA MACRO END     for document $1;"
34   echo
35 fi
36
37 # extract metadata from images
38
39 if [ -d "Pictures" ]
40 then
41   for P in `find Pictures -type f`
42   do
43   N=`basename $P`
44   echo "## METADATA PICTURE START for document $1 / Picture $N;"
45   echo picturename: $N
46   exiftool $P | egrep '^(Artist|GPS)'
47   echo "## METADATA PICTURE END   for document $1 / Picture $N;"
48   done
49 fi
50 # final cleanup
51
52 echo
53 echo "## METADATA DOC END       for document $1;"
54 echo
55 #rm -rf /tmp/odfmetareader
56
57 exit

The overall flow is very simple: The script makes a copy of the given file and unzips it in the temporary folder /tmp/odfmetareader (lines 3-8). The final command on line 55 removes that folder, but I recommend leaving it commented out until you have figured out (by looking into that same folder) the internal structure of ODF files.
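Listing 1 always works in /tmp/odfmetareader, so two copies of the script running at the same time would overwrite each other's files. If that ever becomes a problem, one possible variation is to let mktemp pick a fresh folder for every run. The following is only a sketch of that idea, not part of the original script; with_workdir is a name made up for this example:

```shell
#!/bin/bash
# Hypothetical helper (not in Listing 1): run a command inside a private
# scratch directory and delete that directory afterwards.
with_workdir() {
  local workdir status
  workdir=$(mktemp -d) || return 1   # unique folder, e.g. under /tmp
  ( cd "$workdir" && "$@" )          # subshell: the caller's directory is untouched
  status=$?
  rm -rf "$workdir"
  return "$status"
}

# Demo: the command sees the scratch directory as its current directory.
with_workdir pwd
```

While you are still exploring the structure of ODF files, you would skip the rm -rf step, for the same reason line 55 is commented out.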

The central part of Listing 1 prints out the variables in the meta.xml file, followed by two lists: one of macros and one of pictures, each with its own embedded metadata.

The echo commands containing the ## METADATA string (e.g., lines 10 and 11) all have the same purpose: They separate the output sections, which (one hopes) makes them more readable and easier for other scripts to parse.
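To give an idea of what "easier to parse" can mean in practice, here is a minimal sketch that pulls one section out of the reader's output. It assumes the marker layout printed by Listing 1; metadata_section is a hypothetical name, and the sample input is copied from Listing 2:

```shell
#!/bin/bash
# Hypothetical helper: print only the lines between
# "## METADATA <NAME> START" and "## METADATA <NAME> END" read from stdin.
metadata_section() {
  awk -v name="$1" '
    index($0, "## METADATA " name " END") == 1   { inside = 0 }
    inside                                        { print }
    index($0, "## METADATA " name " START") == 1 { inside = 1 }
  '
}

# Sample input taken from Listing 2.
metadata_section MACRO <<'EOF'
## METADATA ODF START     for document odf-sample-text.odt;
title=Just A Sample ODF Text Document
## METADATA ODF END       for document odf-sample-text.odt;

## METADATA MACRO START   for document odf-sample-text.odt;
macronumber=1
macrofile:Basic/Standard/samplemodule.xml
sub Main
## METADATA MACRO END     for document odf-sample-text.odt;
EOF
```

With this input, only the three lines of the MACRO section are printed; the marker lines themselves are left out.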

Line 15 extracts all the metadata from the meta.xml file. It does seem like ancient Martian, but it is less obscure than it may seem at first sight. It is a concatenation of one long command in Perl and four invocations of the grep utility.

The Perl part is, basically, a series of regular expressions separated by semicolons that remove all the XML markup you don't need to see in the output. For example, this part

s/<\/(meta|dc).*//g;

replaces, with an empty string, every string that begins with </meta or </dc, plus all the characters that follow it until the end of the current line (that is what the .* part means). The four grep commands just remove header and footer lines in the XML file that don't contain any metadata. The best way to understand what line 15 actually does, and how to customize it for your needs, is to run the script on any ODF file and compare its output with the original content of the meta.xml file.
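If Perl is not your thing, you can try the same regular expression in isolation with sed. The sketch below is just an experiment on one made-up line, not a replacement for line 15; strip_from_closing_tag is a hypothetical name:

```shell
#!/bin/bash
# Same pattern as the Perl substitution, applied with sed: delete everything
# from the first closing </meta or </dc tag to the end of the line.
strip_from_closing_tag() {
  sed -E 's/<\/(meta|dc).*//'
}

# A line as it might look after the earlier substitutions in line 15.
printf '%s\n' 'title=Just A Sample ODF Text Document</dc:title><dc:subject>' |
  strip_from_closing_tag
# prints: title=Just A Sample ODF Text Document
```

Note that everything after the first closing tag disappears. This works in Listing 1 only because an earlier substitution in the same Perl command has already put a newline in front of each opening tag, and .* stops at a newline.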

Native macros in ODF files are stored, if present, inside the Basic folder of the ZIP archive, and line 21 checks if this folder exists. If it does, the script counts all the macro files inside the folder and prints the result, stored in the variable MACRONUM (lines 25-27). The loop in lines 28 to 32 finds and prints all the lines in the macro files that contain macro names.

The last loop of the script, in lines 39 to 49, checks if a Pictures folder exists. If the answer is yes, it scans all the pictures inside it (line 41), prints their names (lines 43-45), and then runs the exiftool command on them (line 46). exiftool is free software capable of reading and writing all the metadata stored inside today's digital photographs that use Exif and other similar standards.

When given a file name, as in line 46, exiftool just prints all the metadata in that file, one per line. The egrep command in line 46 discards all lines, except those that begin with either Artist or GPS, probably the most sensitive data.
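To experiment with that filter without a photograph at hand, you can feed it a few lines of text in the same "tag name : value" layout. In the sketch below, filter_sensitive_tags is a hypothetical name and the input is illustrative sample data, not real exiftool output; grep -E is used as the equivalent of egrep:

```shell
#!/bin/bash
# Keep only the lines whose tag name starts with Artist or GPS,
# as the egrep call in line 46 of Listing 1 does.
filter_sensitive_tags() {
  grep -E '^(Artist|GPS)'
}

# Illustrative input only; the values are made up for this example.
filter_sensitive_tags <<'EOF'
File Name                       : sample-picture.jpg
Image Width                     : 800
Artist                          : Marco Fioretti
GPS Latitude                    : 47 deg 30' 20.53" N
GPS Longitude                   : 19 deg 2' 43.75" E
EOF
```

To report other tags, you would only need to extend the list of alternatives inside the parentheses.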

Listing 2 shows a small excerpt, heavily edited for clarity, of the odfmetareader.sh output from the sample ODF document shown in Figure 5, which contains one macro and one photograph.

Listing 2

odfmetareader Results

01 ## METADATA ODF START     for document odf-sample-text.odt;
02 initial-creator=Marco Fioretti
03 creation-date=2018-07-22T17
04 date=2018-07-22T18:07
05 creator=Marco Fioretti
06 editing-duration=PT33M32S
07 editing-cycles=9
08 description=Let's see where all these metadata end up...
09 keyword=ODF
10 keyword=Metadata
11 keyword=text processing
12 keyword=text mining
13 subject=showing the way in which ODF format stores metadata
14 title=Just A Sample ODF Text Document
15 image-count="1"
16 word-count="81"
17 character-count="468"
18 user-defined-meta:name="Approved" value-type"boolean"=false
19 user-defined-meta:name="Status"=Confidential
20
21 ## METADATA MACRO START   for document odf-sample-text.odt;
22 macronumber=1
23 macrofile:Basic/Standard/samplemodule.xml
24 sub Main
25 ## METADATA MACRO END     for document odf-sample-text.odt;
26
27 ## METADATA PICTURE START for document odf-sample-text.odt / Picture sample-picture.jpg;
28 picturename: sample-picture.jpg
29 Artist                          : Marco Fioretti
30 GPS Latitude                    : 47 deg 30' 20.53" N
31 GPS Longitude                   : 19 deg 2' 43.75" E

Figure 5: Basic macros in ODF documents can be organized in groups, which correspond to subfolders in the Basic folder of an ODF file. The macro in this figure will be saved in the file Basic/Standard/sample.xml.

Publishing ODF files online (or office files in general, probably) without "cleaning" them first may mean letting everybody know where, and by whom, each photograph contained in the file was taken (as shown in Listing 2, starting in line 27). Sometimes this is OK; sometimes it is not.

The macro section (lines 21-25) lists the number, location, and names of all the macros inside the document. The initial section (lines 1 to 19) is just a plain text version of the metadata shown in Figures 1 to 4. It is easy to imagine how many of the lines above, from editing cycles and duration to word count and keywords, may be filtered or fed to some other script to answer any kind of question.
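As one possible way to do that kind of filtering, the sketch below prints every value of a given key found in the reader's key=value lines. odf_field_values is a hypothetical helper, and the sample input is copied from Listing 2:

```shell
#!/bin/bash
# Hypothetical helper: print the value of every "KEY=value" line on stdin.
# Only lines that start with KEY= match, so "creator" does not pick up
# "initial-creator".
odf_field_values() {
  awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2) }'
}

# Sample lines taken from Listing 2.
odf_field_values keyword <<'EOF'
initial-creator=Marco Fioretti
creator=Marco Fioretti
editing-cycles=9
keyword=ODF
keyword=Metadata
keyword=text processing
keyword=text mining
EOF
```

With this input, the function prints the four keywords, one per line, ready to be counted, sorted, or loaded into a spreadsheet.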

As an example, the following lines show how you may discover which ODF files in a whole directory tree have Linux Magazine as the creator:

for F in `find . -type f | egrep '(odt|ods|odp)$'`
  do
  FOUND=`odfmetareader $F | grep -i ^creator | grep -i -c 'Linux Magazine'`
  if [ $FOUND -gt 0 ]
    then # = "there was at least one line with that string"
    echo found $F
  fi
done
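The loop above splits the output of find on whitespace, so it stumbles on file names that contain spaces. A possible workaround, sketched here under the assumption that your find supports -print0 (both the GNU and BSD versions do), is to pass NUL-terminated names to a while loop. list_odf_files is a hypothetical name, and the demo only prints what it would scan instead of calling the reader:

```shell
#!/bin/bash
# Hypothetical helper: list ODF text, spreadsheet, and presentation files
# below a directory, each path terminated by a NUL character.
list_odf_files() {
  find "$1" -type f \( -name '*.odt' -o -name '*.ods' -o -name '*.odp' \) -print0
}

# Demo on a throwaway directory tree with spaces in the names.
demo=$(mktemp -d)
mkdir -p "$demo/reports 2018"
touch "$demo/reports 2018/first draft.odt" "$demo/budget.ods" "$demo/notes.txt"

list_odf_files "$demo" | while IFS= read -r -d '' F
do
  # A real run would call the reader here instead of echo.
  echo "would scan: ${F#"$demo"/}"
done
rm -rf "$demo"
```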

Writing ODF Metadata

Extracting metadata from ODF files is great. Being able to erase or modify it is even better. You can learn how to do so by playing with the odfmetawriter script in Listing 3, which was written specifically for teaching purposes. For simplicity, it performs only one operation per run, always in the same way: Extract the file(s) that must be changed, process them, and then put them back in a copy of the zipped ODF file. To give you an idea of how you might alter both explicit and "hidden" ODF metadata, the script supports the operations listed after Listing 3.

Listing 3

odfmetawriter.sh

01 #! /bin/bash
02
03 if [ ! -e "$1" ]
04 then
05   echo "script launched on non-existing file: $1; aborting"
06   exit
07 fi
08
09 STARTINGDIR=`pwd`
10
11 rm -rf /tmp/odfmetawriter
12 mkdir /tmp/odfmetawriter
13 cp $1 /tmp/odfmetawriter/odf.zip
14 cp $1 /tmp/odfmetawriter/new-$1
15 cd    /tmp/odfmetawriter
16
17 unzip odf.zip >& /dev/null
18 cp meta.xml meta.orig.xml
19
20 case "$2" in
21   creator|title|description)
22   echo "Changing $2 to: $3"
23   sed -i -- "s/<dc:$2>.*<\/dc:$2>/<dc:$2>$3<\/dc:$2>/" meta.xml
24   zip -f new-$1 meta.xml
25   ;;
26
27   addkeyword)
28   sed -i -- "s/<meta:keyword>/<meta:keyword>$3<\/meta:keyword><meta:keyword>/" meta.xml
29   zip -f new-$1 meta.xml
30   ;;
31
32   addcustom)
33   sed -i -- "s/<meta:user-defined/<meta:user-defined meta:name=\"$3\">$4<\/meta:user-defined><meta:user-defined/" meta.xml
34   zip -f new-$1 meta.xml
35   ;;
36
37   renamefromtitle)
38   EXT="${1##*.}"
39   TITLE=`perl -e  'while (<>) {next unless m/.*<dc:title>(.*)<\/dc:title>/; $T = $1;} $T =~ s/\W+/-/g; print $T' meta.xml`
40   mv -i new-$1 $STARTINGDIR/$TITLE.$EXT
41   exit
42   ;;
43
44   watermark)
45   if [ -d "Pictures" ]
46   then
47     for P in `find Pictures -type f`
48     do
49     convert $P  -font Arial -pointsize 60 -draw "gravity center   fill yellow  text 1,11 '$3' " temp-watermarked
50     mv temp-watermarked $P
51     zip -f new-$1 $P
52     done
53   else
54     echo "No Pictures in this ODF Document!"
55     exit
56   fi
57   ;;
58
59   removepicsdata)
60   if [ -d "Pictures" ]
61   then
62     for P in `find Pictures -type f`
63     do
64     exiftool -all= $P
65     zip -f new-$1 $P
66     done
67   else
68     echo "No Pictures in this ODF Document!"
69     exit
70   fi
71   ;;
72
73   *)
74   echo "unknown or unsupported option, please retry: $2;"
75   rm -rf /tmp/odfmetawriter
76   exit
77   ;;
78 esac
79
80 mv -i new-$1 $STARTINGDIR/
81
82 #rm -rf /tmp/odfmetawriter
83
84 exit

  • Rewrite title, creator, or description
  • Add an extra keyword
  • Add a custom field
  • Rename the file to match the document title
  • Insert a textual watermark in all pictures
  • Remove Exif data from pictures

The script must always be launched in the same way:

#> odfmetawriter <ODF-file-name> <operation> <options>

The beginning and end are almost the same as in odfmetareader: Create a temporary folder, work inside it, and remove it when done. Pay attention to line 14, though, which makes a copy of the file passed as an argument with the new- prefix: It is this file that will be "filled" with the new metadata and eventually (line 80) moved to the directory from which the script was launched.

The core of the script is the case statement (lines 20-78). It has seven branches: one for each of the operations listed above and a final one (lines 73-77) that exits with an error message in all other cases.

Lines 21 to 35 all do the same thing: They update or add a variable in the meta.xml file.

If the variable passed as a second argument ($2) is creator, title, or description, the first branch (lines 21-25) of the case statement finds the corresponding variable and, using the sed command, replaces its value with the string passed as the third argument.

The two other branches add a keyword (with a value equal to $3) or a custom field (named $3, with a value equal to $4) when $2 is equal to addkeyword or addcustom, respectively. They work in almost the same way as the first one; the only difference is that they prepend the XML markup defining the new variable to the existing variables of the same kind.

In all cases, after the meta.xml file has been "updated," it is put back in the copy of the ODF file (lines 24, 29, and 34).
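You can try the substitution from line 23 on a fragment of XML before letting it loose on a real meta.xml file. The sketch below works on standard input instead of editing a file in place, and it adds one step that Listing 3 does not have: escaping the characters that are special in a sed replacement, so that a new value containing a slash or an ampersand does not break the command. set_dc_field is a hypothetical name, the fragment is made up, and the sketch does not escape characters that are special in XML:

```shell
#!/bin/bash
# Hypothetical helper: replace the text of a <dc:FIELD> element read from stdin.
set_dc_field() {
  local field=$1 value=$2 escaped
  # Escape the characters that are special in a sed replacement: \ / &
  escaped=$(printf '%s' "$value" | sed 's|[\\/&]|\\&|g')
  # Same substitution as line 23 of Listing 3, without the -i option.
  sed "s/<dc:$field>.*<\/dc:$field>/<dc:$field>$escaped<\/dc:$field>/"
}

# Made-up fragment in the style of a meta.xml file.
fragment='<office:meta><dc:title>Old title</dc:title><dc:creator>Someone</dc:creator></office:meta>'
printf '%s\n' "$fragment" | set_dc_field title 'Reading/Writing ODF'
```

Only the content of the dc:title element changes; the rest of the fragment passes through untouched.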

The fourth supported operation does not change anything in the file. When the $2 parameter is equal to renamefromtitle, the script:

  • Takes note of the original file extension (EXT, line 38)
  • Uses Perl to extract the title string from meta.xml, replace all of its non-alphanumeric characters with single dashes (line 39), and save the result in the TITLE variable
  • Makes a copy of the original file, with the name TITLE.EXT, in the original directory
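The first two steps can also be tried on their own, without touching any file. The following sketch assumes that sed's [^[:alnum:]_] class is a close enough stand-in for Perl's \W with plain ASCII titles; name_from_title is a hypothetical name:

```shell
#!/bin/bash
# Hypothetical helper: build "Title-With-Dashes.ext" from a file name and a title.
name_from_title() {
  local file=$1 title=$2 ext slug
  ext="${file##*.}"   # text after the last dot, as in line 38 of Listing 3
  # Turn every run of characters other than letters, digits, and "_"
  # into a single dash (a sed counterpart of Perl's s/\W+/-/g).
  slug=$(printf '%s\n' "$title" | sed -E 's/[^[:alnum:]_]+/-/g')
  printf '%s.%s\n' "$slug" "$ext"
}

name_from_title odf-sample.odt 'New title for Linux Magazine'
# prints: New-title-for-Linux-Magazine.odt
```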

The last two operations supported by odfmetawriter are insertion of the textual watermark passed as the third parameter inside all the pictures (lines 44-57) and removal of all Exif metadata from the same pictures (lines 59-71).

The watermark is inserted with ImageMagick's convert tool. The code in line 49 is copied almost verbatim from the relevant ImageMagick documentation [1]. Line 64, instead, tells exiftool to delete all the metadata tags in the current picture [2]. As before, the modified pictures ($P) are zipped back in the right place, in the copy of the original document (lines 51 and 65).

Running the following commands, in sequence, on the sample ODF document shown in Figure 6

#> odfmetawriter odf-sample.odt title 'New title for Linux Magazine'
#> odfmetawriter odf-sample.odt description 'Here is an ODT file with its metadata changed by a script'
#> odfmetawriter odf-sample.odt addkeyword 'ODF metadata processing'
#> odfmetawriter odf-sample.odt renamefromtitle
#> odfmetawriter New-title-for-Linux-Magazine.odt watermark 'Watermarked for Linux Magazine'

produces the results shown in Figure 7. (For simplicity, the renaming commands after each operation have been omitted.) As you can see for yourself, the metadata has the new values, and the picture is properly watermarked. Isn't ODF great to hack?

Figure 6: A sample ODF text file, with metadata and pictures inserted manually.

Figure 7: The same ODF text file, after the odfmetawriter script has automatically updated some metadata and watermarked the picture.

Code Limits

I already said this, but let me repeat it: The two scripts above do work, but they are not perfect or robust. At a minimum, they would need extra checks to refuse input files not in ODF format and to properly handle non-alphabetic languages or strings with quotes inside them. In odfmetawriter, for example, addcustom will fail if there isn't already at least one custom field present. Also, odfmetawriter does not change the initial-creator of an ODF file. Another issue is dates: It is trivial to alter dates in the meta.xml file, but unless you do it right, you will end up with inconsistent documents (e.g., ODF files with last-modified timestamps that are earlier than some of the revisions they contain). Finally, neither script is optimized for performance.
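As an example of the simplest possible input check, the sketch below tests whether a file starts with the two letters PK, which open the signature of every ZIP archive. This is only a first filter, not real ODF validation: Any ZIP file passes it, and a stricter check would also have to look inside the archive. is_zip_container is a hypothetical name:

```shell
#!/bin/bash
# Hypothetical helper: succeed if the file starts with "PK", the first two
# bytes of the ZIP signature. This does not prove that the file is ODF.
is_zip_container() {
  [ -f "$1" ] && [ "$(head -c 2 "$1" 2>/dev/null)" = "PK" ]
}

# Demo with two throwaway files: a fake ZIP header and plain text.
tmp=$(mktemp)
printf 'PK\003\004' > "$tmp"
if is_zip_container "$tmp"; then echo "ZIP signature found"; fi
printf 'plain text' > "$tmp"
if ! is_zip_container "$tmp"; then echo "no ZIP signature"; fi
rm -f "$tmp"
```

Both scripts could call such a function right at the beginning and abort when the check fails.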

Still, look at the result in Figure 7: A quick and dirty mix of a few standard Linux commands and utilities is all you need to automatically analyze or produce any number of perfectly valid documents with just the metadata you want (or don't want). Is this cool, or what?
