OOXML to Formex 4 converter¶
Description¶
The convert_ooxml2formex() converter is a function that converts the tables of an Office Open XML (OOXML) document (one that respects the schema defined in Office Open XML File Formats) into the Formex 4 format.
The conversion is done within the source XML document: each OOXML table is replaced by its Formex equivalent. In other words, the general structure of the source XML document is retained, except for the tables.
The Ooxml2FormexConverter converter is composed of:
an OoxmlParser parser, which parses tables in the OOXML format. The tutorial OOXML tables (Word) parser describes the usage of this parser and gives some examples.
a FormexBuilder builder, which builds tables in the Formex format. The tutorial Formex 4 tables builder describes the usage of this builder and gives some examples.
Conversion options¶
Table parsing and building can be configured with the options described below:
Common parsing options:
encoding (default: “utf-8”): XML encoding of the destination file.
OOXML parser options:
styles_path (default: None): Path to the stylesheet to use to resolve table styles. In an uncompressed .docx tree structure, the stylesheet path is word/styles.xml.
Formex 4 builder options:
use_cals (default: False): Generate additional CALS-like elements and attributes to simplify the layout of Formex documents in typesetting systems.
cals_ns (default: “https://lib.benker.com/schemas/cals.xsd”): Namespace to use for CALS-like elements and attributes (requires: use_cals). Set it to None (or “”) if you don’t want to use a namespace.
cals_prefix (default: “cals”): Namespace prefix to use for CALS-like elements and attributes (requires: use_cals).
width_unit (default: “mm”): Unit to use for column widths (requires: use_cals). Possible values are: ‘cm’, ‘dm’, ‘ft’, ‘in’, ‘m’, ‘mm’, ‘pc’, ‘pt’, ‘px’.
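The options above are passed to the converter as keyword arguments. As a rough sketch, the dictionary could be assembled and checked before the call. The helper name make_formex_options is hypothetical and not part of the library; it only encodes the defaults and the "requires: use_cals" notes listed above.

```python
# Hypothetical helper (not part of the library): builds the keyword
# arguments described in the option list above and checks width_unit.
ALLOWED_WIDTH_UNITS = ("cm", "dm", "ft", "in", "m", "mm", "pc", "pt", "px")


def make_formex_options(
    styles_path=None,
    encoding="utf-8",
    use_cals=False,
    cals_ns="https://lib.benker.com/schemas/cals.xsd",
    cals_prefix="cals",
    width_unit="mm",
):
    """Return an options dict; CALS settings are added only if use_cals is set."""
    if width_unit not in ALLOWED_WIDTH_UNITS:
        raise ValueError("unsupported width unit: {0!r}".format(width_unit))
    options = {
        "encoding": encoding,
        "styles_path": styles_path,
        "use_cals": use_cals,
    }
    if use_cals:
        # These three options only make sense together with use_cals.
        options.update(
            cals_ns=cals_ns,
            cals_prefix=cals_prefix,
            width_unit=width_unit,
        )
    return options
```

The resulting dictionary can then be unpacked in the call, for example convert_ooxml2formex(src_xml, dst_xml, **make_formex_options(use_cals=True)).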
Examples of conversions¶
Converting a .docx document¶
You can use the convert_ooxml2formex() converter to convert a Word document. The example below uses a sample Word document that contains an annex with a table.
To convert a .docx file, you first need to decompress it into a temporary directory in order to access the “word/document.xml” and “word/styles.xml” files stored in the .docx package.
To decompress the .docx package and convert the tables, you can do:
>>> import os
>>> import tempfile
>>> import zipfile
>>> from benker.converters.ooxml2formex import convert_ooxml2formex
>>> # Temporary directory that receives the uncompressed package
>>> tmp_dir = tempfile.mkdtemp()
>>> src_zip = "docs/_static/converters.ooxml2formex.sample1.docx"
>>> with zipfile.ZipFile(src_zip) as zf:
...     zf.extractall(tmp_dir)
>>> src_xml = os.path.join(tmp_dir, "word/document.xml")
>>> styles_xml = os.path.join(tmp_dir, "word/styles.xml")
>>> dst_xml = os.path.join(tmp_dir, "converters.ooxml2formex.sample1.xml")
>>> options = {
...     'encoding': 'utf-8',
...     'styles_path': styles_xml,
... }
>>> convert_ooxml2formex(src_xml, dst_xml, **options)
The result is the “word/document.xml” document, but with tables replaced by the Formex TBL elements.
Here is a sample of the result XML:
<?xml version='1.0' encoding='UTF-8'?>
<w:document xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
mc:Ignorable="w14 w15 w16se w16cid wp14">
<w:body>
<w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
<w:pPr>
<w:pStyle w:val="Titre1"/>
<w:jc w:val="center"/>
</w:pPr>
<w:r><w:t>ANNEX</w:t></w:r>
</w:p>
<w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
<w:pPr>
<w:pStyle w:val="Titre2"/>
<w:jc w:val="center"/>
</w:pPr>
<w:r><w:t>Annex 1</w:t></w:r>
<w:r><w:br/><w:t>Concessions granted by Switzerland</w:t></w:r>
</w:p>
<w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
<w:pPr>
<w:pStyle w:val="Corpsdetexte"/>
</w:pPr>
<w:r><w:t>The tariff concessions set out below are granted by Switzerland
for the following products originating in the European Union and are,
where applicable, subject to an annual quantity:</w:t></w:r>
</w:p>
<TBL NO.SEQ="0001" COLS="4">
<CORPUS>
<ROW TYPE="HEADER">
<CELL COL="1">
<w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
<w:pPr>
<w:pStyle w:val="Corpsdetexte"/>
<w:keepNext/>
<w:jc w:val="center"/>
<w:rPr><w:b/><w:bCs/></w:rPr>
</w:pPr>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Swiss tariff</w:t></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>heading</w:t></w:r>
</w:p>
</CELL>
<CELL COL="2">
<w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
<w:pPr>
<w:pStyle w:val="Corpsdetexte"/>
<w:jc w:val="center"/>
<w:rPr><w:b/><w:bCs/></w:rPr>
</w:pPr>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Description</w:t></w:r>
</w:p>
</CELL>
<CELL COL="3">
<w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
<w:pPr>
<w:pStyle w:val="Corpsdetexte"/>
<w:jc w:val="center"/>
<w:rPr><w:b/><w:bCs/></w:rPr>
</w:pPr>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Customs duty</w:t></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>applicable</w:t></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>(CHF/100 kg</w:t></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>gross weight)</w:t></w:r>
</w:p>
</CELL>
<CELL COL="4">
<w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
<w:pPr>
<w:pStyle w:val="Corpsdetexte"/>
<w:jc w:val="center"/>
<w:rPr><w:b/><w:bCs/></w:rPr>
</w:pPr>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Annual</w:t></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>quantity</w:t></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>(tonnes net</w:t></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>weight)</w:t></w:r>
</w:p>
</CELL>
</ROW>
<ROW>
<CELL COL="1">
<w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
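The decompression step described in this section can also be wrapped in a small helper that extracts only the two parts the converter needs. This is a minimal sketch using the standard zipfile module; the function name extract_docx_parts is an assumption made for illustration and does not come from the library.

```python
import os
import zipfile


def extract_docx_parts(src_zip, dest_dir):
    """Extract word/document.xml and, if present, word/styles.xml.

    Hypothetical helper: returns (document_path, styles_path), where
    styles_path is None when the package has no stylesheet.
    """
    with zipfile.ZipFile(src_zip) as zf:
        names = set(zf.namelist())
        if "word/document.xml" not in names:
            raise ValueError("not a Word package: {0}".format(src_zip))
        # extract() recreates the "word/" sub-directory under dest_dir.
        zf.extract("word/document.xml", dest_dir)
        has_styles = "word/styles.xml" in names
        if has_styles:
            zf.extract("word/styles.xml", dest_dir)
    document_path = os.path.join(dest_dir, "word", "document.xml")
    styles_path = os.path.join(dest_dir, "word", "styles.xml")
    return document_path, (styles_path if has_styles else None)
```

The two returned paths correspond to the src_xml argument and the styles_path option of the conversion call. Extracting only these members avoids writing the rest of the package (media, settings, and so on) to disk.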
Using CALS-like attributes and elements¶
The Formex table format is well suited to structuring tables. Its logical structure is similar to the one used for HTML tables, but without CSS.
Some difficulties appear when you want to lay out Formex tables in typesetting systems, because Formex tables don’t carry much layout information:
no borders,
no horizontal or vertical alignment of the text,
no background color,
no indication of the column width,
etc.
To solve that, it is possible to generate CALS-like attributes and elements in the Formex document. A namespace and a namespace prefix can be used for these CALS attributes and elements.
To convert the tables using CALS, you can do:
>>> # src_xml, styles_xml and tmp_dir are the ones defined in the previous example
>>> dst_xml = os.path.join(tmp_dir, "converters.ooxml2formex.sample2.xml")
>>> options = {
...     'encoding': 'utf-8',
...     'styles_path': styles_xml,
...     'use_cals': True,
...     'cals_ns': "http://cals",
...     'cals_prefix': "cals",
... }
>>> convert_ooxml2formex(src_xml, dst_xml, **options)
The result is the “word/document.xml” document, but with tables replaced by the Formex TBL elements, which now carry CALS-like attributes and elements.
Here is a sample of the result XML:
<TBL xmlns:cals="http://cals" NO.SEQ="0001" COLS="4">
<CORPUS cals:frame="none" cals:colsep="0" cals:rowsep="0" cals:pgwide="1">
<cals:colspec cals:colname="c1" cals:colwidth="24.04mm"/>
<cals:colspec cals:colname="c2" cals:colwidth="89.09mm"/>
<cals:colspec cals:colname="c3" cals:colwidth="31.96mm"/>
<cals:colspec cals:colname="c4" cals:colwidth="24.91mm"/>
<ROW TYPE="HEADER">
<CELL COL="1" cals:rowsep="1" cals:align="center">
<w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
<w:pPr>
<w:pStyle w:val="Corpsdetexte"/>
<w:keepNext/>
<w:jc w:val="center"/>
<w:rPr><w:b/><w:bCs/></w:rPr>
</w:pPr>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Swiss tariff</w:t></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>heading</w:t></w:r>
</w:p>
</CELL>
<CELL COL="2" cals:rowsep="1" cals:align="center">
<w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
<w:pPr>
<w:pStyle w:val="Corpsdetexte"/>
<w:jc w:val="center"/>
<w:rPr><w:b/><w:bCs/></w:rPr>
</w:pPr>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Description</w:t></w:r>
</w:p>
</CELL>
<CELL COL="3" cals:rowsep="1" cals:align="center">
<w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
<w:pPr>
<w:pStyle w:val="Corpsdetexte"/>
<w:jc w:val="center"/>
<w:rPr><w:b/><w:bCs/></w:rPr>
</w:pPr>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Customs duty</w:t></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>applicable</w:t></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>(CHF/100 kg</w:t></w:r>
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>gross weight)</w:t></w:r>
</w:p>
</CELL>
<CELL COL="4" cals:rowsep="1" cals:align="center">
<w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
<w:pPr>
<w:pStyle w:val="Corpsdetexte"/>
In the result, we can notice:
the presence of the namespace xmlns:cals="http://cals",
the additional attributes, like cals:frame="none", cals:colsep="0", cals:rowsep="0"…
the additional colspec elements: <cals:colspec cals:colname="c1" cals:colwidth="24.04mm"/>.
This kind of information is preserved if you use a Formex to CALS conversion (see the Formex 4 to CALS converter tutorial).
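To illustrate how a downstream tool might read this layout information, here is a small sketch that collects the column widths from the colspec elements with the standard xml.etree.ElementTree module. The SAMPLE string is an abbreviated, well-formed excerpt of the result shown in this section (cell content removed), and the column_widths function is a hypothetical helper, not part of the library.

```python
import xml.etree.ElementTree as ET

# Namespace chosen with the 'cals_ns' option in the example of this section.
CALS_NS = "http://cals"

# Abbreviated excerpt of the converted table (cell content removed).
SAMPLE = """\
<TBL xmlns:cals="http://cals" NO.SEQ="0001" COLS="4">
  <CORPUS cals:frame="none" cals:colsep="0" cals:rowsep="0" cals:pgwide="1">
    <cals:colspec cals:colname="c1" cals:colwidth="24.04mm"/>
    <cals:colspec cals:colname="c2" cals:colwidth="89.09mm"/>
    <cals:colspec cals:colname="c3" cals:colwidth="31.96mm"/>
    <cals:colspec cals:colname="c4" cals:colwidth="24.91mm"/>
    <ROW TYPE="HEADER">
      <CELL COL="1" cals:rowsep="1" cals:align="center"/>
    </ROW>
  </CORPUS>
</TBL>
"""


def column_widths(tbl, cals_ns=CALS_NS):
    """Map each column name to its width string, e.g. {'c1': '24.04mm'}."""
    qname = "{%s}" % cals_ns  # ElementTree spells namespaced names as {uri}local
    return {
        colspec.get(qname + "colname"): colspec.get(qname + "colwidth")
        for colspec in tbl.iter(qname + "colspec")
    }


widths = column_widths(ET.fromstring(SAMPLE))
# Strip the 'mm' suffix to get the total table width as a number.
total_mm = sum(float(width[:-2]) for width in widths.values())
```

Note that both the colspec elements and their attributes are namespace-qualified in this output, so the namespace URI has to be used for the element lookup as well as for the attribute access.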