Formex 4 tables parser

Description

A FormexParser parses the tables of a Formex 4 document (strictly speaking, it parses CORPUS elements) and generates Table instances (in-memory representations of tables). A Table instance can then be serialized to another XML format, such as CALS.

To use this class, you need to inherit from the BaseBuilder class and pass an instance of your subclass to the FormexParser parser.

For the sake of this demonstration, we can use an instance of the BaseBuilder class itself, without implementing the generate_table_tree() method.

>>> from lxml import etree
>>> from benker.builders.base_builder import BaseBuilder
>>> from benker.parsers.formex import FormexParser

>>> builder = BaseBuilder()
>>> parser = FormexParser(builder)

For example, you can parse the following Formex 4 table:

<TBL COLS="9" NO.SEQ="0001" PAGE.SIZE="SINGLE.LANDSCAPE">
  <CORPUS>
    <ROW TYPE="HEADER">
      <CELL COL="1"><IE/></CELL>
      <CELL COL="2" COLSPAN="4">Identification des substances</CELL>
      <CELL COL="6" COLSPAN="3">Conditions</CELL>
      <CELL COL="9"><IE/></CELL>
    </ROW>
    <ROW TYPE="HEADER">
      <CELL COL="1">Numéro d’ordre</CELL>
      <CELL COL="2">Nom chimique/DCI/XAN</CELL>
      <CELL COL="3">Dénomination commune du glossaire des ingrédients</CELL>
      <CELL COL="4">Numéro CAS</CELL>
      <CELL COL="5">Numéro CE</CELL>
      <CELL COL="6">Type de produit, parties du corps</CELL>
      <CELL COL="7">Concentration maximale dans les préparations prêtes
        à l’emploi</CELL>
      <CELL COL="8">Autres</CELL>
      <CELL COL="9">Libellé des conditions d’emploi et des avertissements</CELL>
    </ROW>
    <ROW TYPE="ALIAS">
      <CELL COL="1">a</CELL>
      <CELL COL="2">b</CELL>
      <CELL COL="3">c</CELL>
      <CELL COL="4">d</CELL>
      <CELL COL="5">e</CELL>
      <CELL COL="6">f</CELL>
      <CELL COL="7">g</CELL>
      <CELL COL="8">h</CELL>
      <CELL COL="9">i</CELL>
    </ROW>
    <ROW>
      <CELL COL="1">31</CELL>
      <CELL COL="2">3,3'-(1,4-phénylène)bis(5,6-diphényl-1,2,4-triazine)</CELL>
      <CELL COL="3">phénylène bis-diphényltriazine</CELL>
      <CELL COL="4">55514-22-2</CELL>
      <CELL COL="5">700-823-1</CELL>
      <CELL COL="6"><IE/></CELL>
      <CELL COL="7">5 %</CELL>
      <CELL COL="8">Ne pas utiliser dans des applications pouvant conduire à
        l’exposition des poumons de l’utilisateur final par inhalation.</CELL>
      <CELL COL="9"><IE/></CELL>
    </ROW>
  </CORPUS>
</TBL>

And generate a Table instance:

>>> tree = etree.parse("docs/_static/parsers.formex.sample1.xml")
>>> fmx_table = tree.getroot()
>>> table = parser.parse_table(fmx_table)
>>> print(table)
+-----------+-----------------------------------------------+-----------------------------------+-----------+
|           |             Identific                         |             Condition             |           |
+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| Numéro d’ | Nom chimi | Dénominat | Numéro CA | Numéro CE | Type de p | Concentra |  Autres   | Libellé d |
+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|     a     |     b     |     c     |     d     |     e     |     f     |     g     |     h     |     i     |
+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|    31     | 3,3'-(1,4 | phénylène | 55514-22- | 700-823-1 |           |    5 %    | Ne pas ut |           |
+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
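
The merged header cells in the output above follow from the COL and COLSPAN attributes of the first ROW. As a rough illustration only (this is not how FormexParser is implemented, and the column_ranges helper is a hypothetical name), the column range covered by each CELL can be computed with the standard library, assuming a cell without COLSPAN spans a single column:

```python
import xml.etree.ElementTree as ET

# First header row of the sample above, reduced to the relevant attributes.
SAMPLE = """\
<TBL COLS="9">
  <CORPUS>
    <ROW TYPE="HEADER">
      <CELL COL="1"><IE/></CELL>
      <CELL COL="2" COLSPAN="4">Identification des substances</CELL>
      <CELL COL="6" COLSPAN="3">Conditions</CELL>
      <CELL COL="9"><IE/></CELL>
    </ROW>
  </CORPUS>
</TBL>
"""


def column_ranges(row):
    """Return (first_col, last_col, text) for each CELL of a ROW element.

    Hypothetical helper: assumes COL is always present and that a
    missing COLSPAN means the cell spans one column.
    """
    ranges = []
    for cell in row.findall("CELL"):
        first = int(cell.get("COL"))
        span = int(cell.get("COLSPAN", "1"))
        text = "".join(cell.itertext()).strip()
        ranges.append((first, first + span - 1, text))
    return ranges


root = ET.fromstring(SAMPLE)
header = root.find("CORPUS/ROW")  # first ROW of the CORPUS
spans = column_ranges(header)
for first, last, text in spans:
    print(first, last, repr(text))
```

The second cell covers columns 2 to 5 and the third covers columns 6 to 8, which matches the two wide boxes in the first row of the printed table.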

Options

The FormexParser parser accepts the following options:

  • formex_ns: namespace to use for Formex elements and attributes. Usually, a Formex document has no namespace, but in some cases it uses “http://opoce”.

    For instance, if you have:

    <TBL COLS="2" xmlns="http://opoce">
      <CORPUS>
        <ROW TYPE="HEADER">
          <CELL COL="1">Région</CELL>
          <CELL COL="2">Vin</CELL>
        </ROW>
        <ROW>
          <CELL COL="1">Alsace</CELL>
          <CELL COL="2">Gewurztraminer</CELL>
        </ROW>
        <ROW>
          <CELL COL="1">Beaujolais</CELL>
          <CELL COL="2">Brouilly</CELL>
        </ROW>
      </CORPUS>
    </TBL>
    

    To parse this XML document, you can create a parser using the formex_ns option:

    >>> parser = FormexParser(builder, formex_ns="http://opoce")
    >>> tree = etree.parse("docs/_static/parsers.formex.sample2.xml")
    >>> fmx_table = tree.getroot()
    >>> table = parser.parse_table(fmx_table)
    >>> print(table)
    +-----------+-----------+
    |  Région   |    Vin    |
    +-----------+-----------+
    |  Alsace   | Gewurztra |
    +-----------+-----------+
    | Beaujolai | Brouilly  |
    +-----------+-----------+
    
  • cals_ns: namespace to use for CALS-like elements and attributes. To improve typesetting, a Formex document may contain CALS-like elements and attributes. These elements and attributes may use a different namespace. In order to parse them, you can use the cals_ns option.

    For instance, if you have:

    <TBL COLS="2" xmlns:cals="http://my.cals.ns">
      <CORPUS cals:colsep="1" cals:frame="all" cals:pgwide="1" cals:rowsep="1">
        <cals:colspec cals:colname="c1" cals:colwidth="80mm" cals:align="left"/>
        <cals:colspec cals:colname="c2" cals:colwidth="60mm" cals:align="center"/>
        <ROW TYPE="HEADER">
          <CELL TYPE="HEADER" COL="1">Header 1</CELL>
          <CELL TYPE="HEADER" COL="2">Header 2</CELL>
        </ROW>
        <ROW cals:rowsep="0" cals:valign="middle">
          <CELL COL="1">Cell A1</CELL>
          <CELL COL="2">Cell B1</CELL>
        </ROW>
        <ROW>
          <CELL COL="1" COLSPAN="2" cals:nameend="c2" cals:namest="c1">Cell A2-B2</CELL>
        </ROW>
        <ROW>
          <CELL COL="1" ROWSPAN="2" cals:morerows="1">Cell A3-A4</CELL>
          <CELL COL="2">Cell B3</CELL>
        </ROW>
        <ROW>
          <CELL COL="2">Cell B4</CELL>
        </ROW>
      </CORPUS>
    </TBL>
    

    To parse this XML document, you can create a parser using the cals_ns option:

    >>> parser = FormexParser(builder, cals_ns="http://my.cals.ns")
    >>> tree = etree.parse("docs/_static/parsers.formex.sample3.xml")
    >>> fmx_table = tree.getroot()
    >>> table = parser.parse_table(fmx_table)
    >>> print(table)
    +-----------+-----------+
    | Header 1  | Header 2  |
    +-----------+-----------+
    |  Cell A1  |  Cell B1  |
    +-----------------------+
    | Cell A2-B             |
    +-----------+-----------+
    | Cell A3-A |  Cell B3  |
    |           +-----------+
    |           |  Cell B4  |
    +-----------+-----------+
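
Both options matter because XML libraries such as lxml and the standard xml.etree.ElementTree identify namespaced elements and attributes by a qualified name of the form {namespace}name. The sketch below uses only the standard library to show the effect on lookups. It is not taken from the FormexParser source, and the qname helper is a hypothetical name introduced for illustration:

```python
import xml.etree.ElementTree as ET

FORMEX_NS = "http://opoce"
CALS_NS = "http://my.cals.ns"


def qname(ns, name):
    """Build a {namespace}name string; an empty namespace leaves the name as-is."""
    return "{%s}%s" % (ns, name) if ns else name


# Formex elements in a default namespace (what formex_ns is meant for).
doc1 = ET.fromstring(
    '<TBL COLS="2" xmlns="http://opoce"><CORPUS><ROW>'
    '<CELL COL="1">Alsace</CELL></ROW></CORPUS></TBL>'
)
missing = doc1.find("CORPUS")                  # unqualified name: no match
found = doc1.find(qname(FORMEX_NS, "CORPUS"))  # qualified name: match

# CALS-like attributes in their own namespace (what cals_ns is meant for).
doc2 = ET.fromstring(
    '<TBL COLS="2" xmlns:cals="http://my.cals.ns">'
    '<CORPUS cals:frame="all" cals:colsep="1"/></TBL>'
)
corpus = doc2.find("CORPUS")
frame = corpus.get(qname(CALS_NS, "frame"))    # qualified attribute name

print(missing is None, found.tag, frame)
```

In the first document, the unqualified lookup returns None while the qualified one finds the element. In the second, the Formex elements have no namespace, but the cals:frame attribute is only reachable through its qualified name.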
    

Supported values

The FormexParser parser can handle the following values: Formex styles.