OOXML to Formex 4 converter

Description

The convert_ooxml2formex() function converts the tables of an Office Open XML (OOXML) document (one that conforms to the schema defined in Office Open XML File Formats) to the Formex 4 format.

The conversion is performed on the source XML document by replacing each OOXML table with its Formex equivalent. In other words, the general structure of the source XML document is retained; only the tables change.

The Ooxml2FormexConverter converter is composed of:

Conversion options

Table parsing and building can be configured using the options described below:

Common parsing options:

encoding (default: “utf-8”):

XML encoding of the destination file.

OOXML parser options:

styles_path (default: None):

Path to the stylesheet to use to resolve table styles. In an uncompressed .docx tree structure, the stylesheet path is word/styles.xml.

Formex 4 builder options:

use_cals (default: False):

Generate additional CALS-like elements and attributes to simplify the layout of Formex documents in typesetting systems.

cals_ns (default: “https://lib.benker.com/schemas/cals.xsd”):

Namespace to use for CALS-like elements and attributes (requires: use_cals). Set it to None (or “”) if you don’t want to use a namespace.

cals_prefix (default: “cals”):

Namespace prefix to use for CALS-like elements and attributes (requires: use_cals).

width_unit (default: “mm”):

Unit to use for column widths (requires: use_cals). Possible values are: ‘cm’, ‘dm’, ‘ft’, ‘in’, ‘m’, ‘mm’, ‘pc’, ‘pt’, ‘px’.
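As an illustration, the options above can be gathered into a single dictionary before the conversion. The sketch below is only a convenience wrapper: build_options() and WIDTH_UNITS are hypothetical names that are not part of the library. The defaults and the list of units are copied from the descriptions above.

```python
# Hypothetical helper (not part of the library): assemble and sanity-check
# the keyword options described above before passing them to the converter.
WIDTH_UNITS = {"cm", "dm", "ft", "in", "m", "mm", "pc", "pt", "px"}


def build_options(styles_path=None, use_cals=False, width_unit="mm",
                  cals_ns="https://lib.benker.com/schemas/cals.xsd",
                  cals_prefix="cals", encoding="utf-8"):
    """Return a dict of converter options, using the documented defaults."""
    if width_unit not in WIDTH_UNITS:
        raise ValueError("unsupported width unit: {!r}".format(width_unit))
    options = {"encoding": encoding, "styles_path": styles_path}
    if use_cals:
        # These options are documented as requiring use_cals.
        options.update(use_cals=True, cals_ns=cals_ns,
                       cals_prefix=cals_prefix, width_unit=width_unit)
    return options


options = build_options(styles_path="word/styles.xml", use_cals=True, width_unit="pt")
```

The resulting dictionary can then be unpacked into the converter call with **options, in the same way as in the examples of this page.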

Examples of conversions

Converting a .docx document

You can use the convert_ooxml2formex() converter to convert a Word document. For instance, consider the following annex:

[Figure: sample annex, converters.ooxml2formex.sample1.jpeg]

To convert a .docx file, you first need to decompress it into a temporary directory in order to access the “word/document.xml” and “word/styles.xml” files stored in the .docx package.

To decompress the .docx package and convert the tables, you can do:

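>>> import os
>>> import tempfile
>>> import zipfile

>>> from benker.converters.ooxml2formex import convert_ooxml2formex

>>> # Create the temporary directory that receives the uncompressed package
>>> tmp_dir = tempfile.mkdtemp()

>>> src_zip = "docs/_static/converters.ooxml2formex.sample1.docx"
>>> with zipfile.ZipFile(src_zip) as zf:
...     zf.extractall(tmp_dir)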

>>> src_xml = os.path.join(tmp_dir, "word/document.xml")
>>> styles_xml = os.path.join(tmp_dir, "word/styles.xml")

>>> dst_xml = os.path.join(tmp_dir, "converters.ooxml2formex.sample1.xml")
>>> options = {
...     'encoding': 'utf-8',
...     'styles_path': styles_xml,
... }
>>> convert_ooxml2formex(src_xml, dst_xml, **options)
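If you prefer not to unpack the whole package, the decompression step can be limited to the two files the converter needs. The following is a minimal sketch using only the standard library; extract_docx_parts() is a hypothetical helper, not a library function, and the tiny package built at the end is placeholder content so that the sketch runs without a real Word file.

```python
import os
import tempfile
import zipfile


def extract_docx_parts(docx_path, dest_dir):
    """Extract only the document and stylesheet parts of a .docx package.

    Returns the paths of the extracted "word/document.xml" and
    "word/styles.xml" files (the second is None if the package has
    no stylesheet).
    """
    with zipfile.ZipFile(docx_path) as zf:
        names = set(zf.namelist())
        src_xml = zf.extract("word/document.xml", dest_dir)
        styles_xml = None
        if "word/styles.xml" in names:
            styles_xml = zf.extract("word/styles.xml", dest_dir)
    return src_xml, styles_xml


# Demo with a tiny stand-in package (placeholder content, not a real Word file).
with tempfile.TemporaryDirectory() as tmp_dir:
    package = os.path.join(tmp_dir, "sample.docx")
    with zipfile.ZipFile(package, "w") as zf:
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("word/styles.xml", "<styles/>")
        zf.writestr("docProps/app.xml", "<Properties/>")

    src_xml, styles_xml = extract_docx_parts(package, os.path.join(tmp_dir, "parts"))
    extracted = sorted(os.listdir(os.path.dirname(src_xml)))
    # The conversion would take place here, while the directory still exists.
```

With tempfile.TemporaryDirectory(), the extracted files are deleted when the with block ends, so the destination file of the conversion should be written outside that directory if it has to be kept.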

The result is the “word/document.xml” document, but with tables replaced by the Formex TBL elements.

Here is a sample of the resulting XML:

<?xml version='1.0' encoding='UTF-8'?>
<w:document xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
            xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            mc:Ignorable="w14 w15 w16se w16cid wp14">
  <w:body>
    <w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
      <w:pPr>
        <w:pStyle w:val="Titre1"/>
        <w:jc w:val="center"/>
      </w:pPr>
      <w:r><w:t>ANNEX</w:t></w:r>
    </w:p>
    <w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
      <w:pPr>
        <w:pStyle w:val="Titre2"/>
        <w:jc w:val="center"/>
      </w:pPr>
      <w:r><w:t>Annex 1</w:t></w:r>
      <w:r><w:br/><w:t>Concessions granted by Switzerland</w:t></w:r>
    </w:p>
    <w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
      <w:pPr>
        <w:pStyle w:val="Corpsdetexte"/>
      </w:pPr>
      <w:r><w:t>The tariff concessions set out below are granted by Switzerland
        for the following products originating in the European Union and are,
        where applicable, subject to an annual quantity:</w:t></w:r>
    </w:p>
    <TBL NO.SEQ="0001" COLS="4">
      <CORPUS>
        <ROW TYPE="HEADER">
          <CELL COL="1">
            <w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
              <w:pPr>
                <w:pStyle w:val="Corpsdetexte"/>
                <w:keepNext/>
                <w:jc w:val="center"/>
                <w:rPr><w:b/><w:bCs/></w:rPr>
              </w:pPr>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Swiss tariff</w:t></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>heading</w:t></w:r>
            </w:p>
          </CELL>
          <CELL COL="2">
            <w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
              <w:pPr>
                <w:pStyle w:val="Corpsdetexte"/>
                <w:jc w:val="center"/>
                <w:rPr><w:b/><w:bCs/></w:rPr>
              </w:pPr>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Description</w:t></w:r>
            </w:p>
          </CELL>
          <CELL COL="3">
            <w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
              <w:pPr>
                <w:pStyle w:val="Corpsdetexte"/>
                <w:jc w:val="center"/>
                <w:rPr><w:b/><w:bCs/></w:rPr>
              </w:pPr>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Customs duty</w:t></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>applicable</w:t></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>(CHF/100 kg</w:t></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>gross weight)</w:t></w:r>
            </w:p>
          </CELL>
          <CELL COL="4">
            <w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
              <w:pPr>
                <w:pStyle w:val="Corpsdetexte"/>
                <w:jc w:val="center"/>
                <w:rPr><w:b/><w:bCs/></w:rPr>
              </w:pPr>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Annual</w:t></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>quantity</w:t></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>(tonnes net</w:t></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>weight)</w:t></w:r>
            </w:p>
          </CELL>
        </ROW>
        <ROW>
          <CELL COL="1">
            <w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
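A quick way to check such a result is to count the Formex TBL elements and to make sure that no OOXML w:tbl element remains. The sketch below uses the standard xml.etree.ElementTree module on a shortened stand-in for a converted document (RESULT is illustrative content, not actual converter output); it assumes that the TBL elements are in no namespace, as in the sample above.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Shortened stand-in for a converted document (illustrative content only).
RESULT = """\
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>ANNEX</w:t></w:r></w:p>
    <TBL NO.SEQ="0001" COLS="4">
      <CORPUS>
        <ROW TYPE="HEADER">
          <CELL COL="1"><w:p><w:r><w:t>Description</w:t></w:r></w:p></CELL>
        </ROW>
      </CORPUS>
    </TBL>
  </w:body>
</w:document>
"""

root = ET.fromstring(RESULT)

# Formex elements carry no namespace here; OOXML tables use the "w" namespace.
formex_tables = root.findall(".//TBL")
ooxml_tables = root.findall(".//{%s}tbl" % W_NS)

# Paragraphs inside the cells are left in OOXML markup.
cell_paragraphs = root.findall(".//CELL/{%s}p" % W_NS)
```

For a file on disk, ET.parse(dst_xml).getroot() can replace ET.fromstring(RESULT).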

Using CALS-like attributes and elements

The Formex table format is well suited to structuring tables. Its logical structure is similar to that of HTML tables, but without CSS.

Difficulties appear when you want to lay out Formex tables in typesetting systems, because Formex tables carry little layout information:

  • no borders,

  • no horizontal or vertical alignment of the text,

  • no background color,

  • no indication of the column width,

  • etc.

To solve that, it is possible to generate CALS-like attributes and elements in the Formex output. These attributes and elements can be placed in a dedicated namespace with its own prefix.

To convert the tables using CALS, you can do:

>>> dst_xml = os.path.join(tmp_dir, "converters.ooxml2formex.sample2.xml")
>>> options = {
...     'encoding': 'utf-8',
...     'styles_path': styles_xml,
...     'use_cals': True,
...     'cals_ns': "http://cals",
...     'cals_prefix': "cals",
... }
>>> convert_ooxml2formex(src_xml, dst_xml, **options)

The result is the “word/document.xml” document, with tables replaced by Formex TBL elements that also carry the CALS-like attributes and elements.

Here is a sample of the resulting XML:

    <TBL xmlns:cals="http://cals" NO.SEQ="0001" COLS="4">
      <CORPUS cals:frame="none" cals:colsep="0" cals:rowsep="0" cals:pgwide="1">
        <cals:colspec cals:colname="c1" cals:colwidth="24.04mm"/>
        <cals:colspec cals:colname="c2" cals:colwidth="89.09mm"/>
        <cals:colspec cals:colname="c3" cals:colwidth="31.96mm"/>
        <cals:colspec cals:colname="c4" cals:colwidth="24.91mm"/>
        <ROW TYPE="HEADER">
          <CELL COL="1" cals:rowsep="1" cals:align="center">
            <w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
              <w:pPr>
                <w:pStyle w:val="Corpsdetexte"/>
                <w:keepNext/>
                <w:jc w:val="center"/>
                <w:rPr><w:b/><w:bCs/></w:rPr>
              </w:pPr>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Swiss tariff</w:t></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>heading</w:t></w:r>
            </w:p>
          </CELL>
          <CELL COL="2" cals:rowsep="1" cals:align="center">
            <w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
              <w:pPr>
                <w:pStyle w:val="Corpsdetexte"/>
                <w:jc w:val="center"/>
                <w:rPr><w:b/><w:bCs/></w:rPr>
              </w:pPr>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Description</w:t></w:r>
            </w:p>
          </CELL>
          <CELL COL="3" cals:rowsep="1" cals:align="center">
            <w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
              <w:pPr>
                <w:pStyle w:val="Corpsdetexte"/>
                <w:jc w:val="center"/>
                <w:rPr><w:b/><w:bCs/></w:rPr>
              </w:pPr>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Customs duty</w:t></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>applicable</w:t></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>(CHF/100 kg</w:t></w:r>
              <w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:br/><w:t>gross weight)</w:t></w:r>
            </w:p>
          </CELL>
          <CELL COL="4" cals:rowsep="1" cals:align="center">
            <w:p w:rsidR="001E0B3A" w:rsidRDefault="00883CDC">
              <w:pPr>
                <w:pStyle w:val="Corpsdetexte"/>

In the result, we can notice:

  • the presence of the namespace declaration xmlns:cals="http://cals",

  • the additional attributes, like cals:frame="none", cals:colsep="0", cals:rowsep="0",

  • the additional colspec elements: <cals:colspec cals:colname="c1" cals:colwidth="24.04mm"/>.

This information will be preserved if you then use a Formex to CALS conversion (see the Formex 4 to CALS converter tutorial).
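To show how a downstream tool might use the CALS-like information, the sketch below reads the column widths back from the colspec elements with the standard xml.etree.ElementTree module. SAMPLE is a trimmed copy of the excerpt above (rows omitted), and the namespace value "http://cals" is the one passed in the cals_ns option of this example; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

CALS_NS = "http://cals"  # same value as the 'cals_ns' option used above

# Trimmed copy of the excerpt above: only the table frame and column specs.
SAMPLE = """\
<TBL xmlns:cals="http://cals" NO.SEQ="0001" COLS="4">
  <CORPUS cals:frame="none" cals:colsep="0" cals:rowsep="0" cals:pgwide="1">
    <cals:colspec cals:colname="c1" cals:colwidth="24.04mm"/>
    <cals:colspec cals:colname="c2" cals:colwidth="89.09mm"/>
    <cals:colspec cals:colname="c3" cals:colwidth="31.96mm"/>
    <cals:colspec cals:colname="c4" cals:colwidth="24.91mm"/>
  </CORPUS>
</TBL>
"""

root = ET.fromstring(SAMPLE)

# Both the elements and their attributes are namespace-qualified.
widths = {}
for colspec in root.iter("{%s}colspec" % CALS_NS):
    name = colspec.get("{%s}colname" % CALS_NS)
    widths[name] = colspec.get("{%s}colwidth" % CALS_NS)

# Strip the two-character "mm" unit suffix to get the total table width.
total_mm = sum(float(width[:-2]) for width in widths.values())
```

Note that the attributes must be looked up with their qualified names: colspec.get("colwidth") would return None because the attribute belongs to the CALS namespace.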