Thursday, April 6, 2017

Linux shell script to fix LibreOffice 5.1's docx "unknown error" word/document.xml issues

I investigated some issues that caused LibreOffice version 5.1.6.2 to error out when opening certain docx files created with Microsoft Office.

Here's the error:



File format error found at # SAXParseException: '[word/document.xml line 2]: unknown error', Stream 'word/document.xml', Line 2, Column 928831(row,col).

After a deep debugging session, it turns out this is caused by some values of the relativeHeight attributes in the word/document.xml file of the docx.
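
To see what actually sits at the reported position, one option is to print a window of characters around the column from the error message. This is only a sketch: the helper name show_xml_context and the file name broken.docx are made up for illustration, and cut counts bytes in some implementations, so the window may be slightly off when the XML contains multi-byte characters.

```shell
# Hypothetical helper: print columns $2..$3 of line $1 from XML read on stdin.
show_xml_context() {
    sed -n "${1}p" | cut -c"${2}-${3}"
}

# Example (broken.docx is a placeholder name): show about 50 characters on
# either side of column 928831 on line 2 of word/document.xml.
# unzip -p broken.docx word/document.xml | show_xml_context 2 928781 928881
```

unzip -p writes the archive member to standard output, so nothing needs to be extracted to disk just to look at it.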

I made a script to work around the relativeHeight issue by setting all relativeHeight attributes to zero, which, according to the docx specification, means infinite.
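
Before zeroing everything, it can be useful to see which relativeHeight values a document actually contains. The following is a small sketch, not part of the script; the helper name list_relative_heights and the file name broken.docx are assumptions for illustration.

```shell
# Hypothetical helper: list the distinct relativeHeight attributes found in
# XML read on stdin, one per line.
list_relative_heights() {
    grep -o 'relativeHeight="[^"]*"' | sort -u
}

# Example (broken.docx is a placeholder name):
# unzip -p broken.docx word/document.xml | list_relative_heights
```

Since document.xml is often one very long line, grep -o is used so that only the matching attributes are printed instead of the whole line.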

After fixing this, I ran into another problem where LibreOffice would sometimes duplicate the w:themeColor attribute upon saving in docx format, thereby invalidating the XML. That is also checked and fixed by the code below.
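
The script below only handles the duplicated value "text1", which is the one I ran into. If other theme colors were duplicated in the same way (an assumption, I have not seen it happen), a backreference could collapse any immediately repeated w:themeColor attribute. The helper name dedupe_theme_color is made up for this sketch.

```shell
# Hypothetical helper: collapse a w:themeColor attribute that is immediately
# repeated with the same value, for any value, in XML read on stdin.
dedupe_theme_color() {
    sed 's/\(w:themeColor="[^"]*"\) \1/\1/g'
}

# Example:
# echo '<w:color w:themeColor="accent1" w:themeColor="accent1"/>' | dedupe_theme_color
```

Like the original fix, this only catches duplicates that are separated by a single space and carry identical values.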

I figured other people might find this useful, so here's my script:

#!/bin/sh
# Workaround for LibreOffice 5 docx issues
# Copyleft 2017 (c) Tom Van Braeckel <tomvanbraeckel@gmail.com>

# This fixes these errors I've been getting:

# File format error found at 
# SAXParseException: '[word/document.xml line 2]: unknown error', Stream 'word/document.xml', Line 2, Column 928831(row,col).
#
# Problematic LibreOffice version:
# --------------------------------
# Version: 5.1.6.2
# Build ID: 1:5.1.6~rc2-0ubuntu1~xenial1
# CPU Threads: 4; OS Version: Linux 4.11; UI Render: default; 
# Locale: en-US (en_US.UTF-8); Calc: group

tofix="$1"
if [ -z "$tofix" ]; then
    echo "Usage: $0 <filetofix>"
    echo "Example: $0 bla.docx"
    exit 1
fi
cwd=$(pwd)
tofixreal=$(readlink -f "$tofix")

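# Unpack the docx (a zip archive) into a temporary directory; stop on failure
tempdir=$(mktemp -d) || exit 1
cd "$tempdir" || exit 1

unzip "$tofixreal" || exit 1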

# Fix relativeHeight issue
sed -i "s/relativeHeight=\"[^\"]\+\"/relativeHeight=\"0\"/g" word/document.xml
# and then after saving in LibreOffice 5.2 docx format, we sometimes need this fix:
sed -i 's/w:themeColor="text1" w:themeColor="text1"/w:themeColor="text1"/g' word/document.xml

zip -r "$tofixreal" *

# Return to the original directory and remove the temporary files
cd "$cwd"
rm -rf "$tempdir"

echo "Done! The file $tofixreal has been cleaned of relativeHeight and themeColor issues."

To use this script, make sure it is executable and run:

./fixdocx.sh filename.docx
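
Because the script writes the fixed files back into the original docx, it may be wise to keep a copy first. A minimal sketch, assuming the script is saved as ./fixdocx.sh in the current directory; the helper name fix_with_backup and the .bak suffix are arbitrary choices, not part of the script.

```shell
# Hypothetical wrapper: copy the docx to <name>.bak, then run the fix on the
# original. The fix is skipped if the backup could not be created.
fix_with_backup() {
    cp -p -- "$1" "$1.bak" && ./fixdocx.sh "$1"
}

# Example:
# chmod +x fixdocx.sh
# fix_with_backup filename.docx
```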