Build journal rule
==================

* `Previous requirements`_
* `Build main pattern`_
* `Add rule to xml file`_
* `Create pattern to parse inner values`_

Previous requirements
---------------------

* Knowledge of regular expressions.
* Knowledge of the structure of the rules definition file.
* Any text editor you prefer (SublimeText, Notepad++, etc.).
* RegexBuddy is recommended. It is a paid tool, but it lets you build and test regex patterns quickly and easily.
* A sample of references from the journal for which the rule will be created.

Build main pattern
------------------

**NOTE:** This manual was written using the journal **Revista mexicana de biodiversidad** (ISSN: **1870-3453**) as the example.

#. Open RegexBuddy and check ``^$ match at line breaks``. Choose the ``Test`` tab and check ``Highlight``. Then, in the ``List All`` submenu, check ``Update Automatically`` and choose ``Line by line``, like this:
.. image:: ./images/create_pattern_1.jpg
#. Paste a sample of references into the ``Test`` tab. The larger the sample, the better the results.
.. image:: ./images/create_pattern_2.jpg
#. Now it's time to use and extend your knowledge of regex. First, identify the patterns in the references, such as field separators (``.``, ``;``, ``:``, etc.). In this case, from left to right, you can identify: (1) authors, (2) space, (3) date.

.. list-table::
   :header-rows: 1

   * - authors
     - space
     - date
   * - ``Avise, J. C.``
     - ``|_|``
     - ``2000``
   * - ``Balduzzi, A. P., P. De Luca y S. Sabato.``
     - ``|_|``
     - ``1982``
   * - ``Bogler, D. J. y J. Francisco-Ortega.``
     - ``|_|``
     - ``2004``
   * - ``Caputo, P., S. Cozzolino, P. De Luca, A. Moretti y D. W. Stevenson.``
     - ``|_|``
     - ``2004``
   * - ``Chaw, S. M., T. W. Walters, C. C. Chang, S. H. Hu y S. H. Chen.``
     - ``|_|``
     - ``2005``
   * - ``DeSalle, R., M. G. Egan y M. Siddall.``
     - ``|_|``
     - ``2005``
   * - ``DeSalle, R.``
     - ``|_|``
     - ``2006``
   * - ``DeSalle, R.``
     - ``|_|``
     - ``2007``

This can be represented by the regex ``(^.+?)(\s*)((?:1[0-9]|20)[0-9]{2})``, whose groups are:

* ``(^.+?)`` .- ``^`` matches the beginning of the line, ``.`` matches any single character, ``+`` repeats it between one and unlimited times, and ``?`` makes the repetition match as few times as possible. This group identifies the authors.
* ``(\s*)`` .- ``\s`` matches a single whitespace character (space, tab or line break) and ``*`` repeats it between zero and unlimited times.
* ``((?:1[0-9]|20)[0-9]{2})`` .- Matches either the character ``1`` followed by one digit in ``[0-9]``, or ``20``, and then two digits in ``[0-9]``. This group identifies dates between 1000 and 2099.

Now put the regex ``(^.+?)(\s*)((?:1[0-9]|20)[0-9]{2})`` in RegexBuddy and see the automatically highlighted results.

.. image:: ./images/create_pattern_3.jpg
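
The same check can be reproduced outside RegexBuddy. The following is a minimal sketch in Python, which is an assumption of this example (the manual itself only uses RegexBuddy). The sample lines are rebuilt from the authors and dates listed in the table above, and ``re.MULTILINE`` plays the role of the ``^$ match at line breaks`` option.

.. code-block:: python

    import re

    # First part of the pattern from this manual: authors, space, date.
    PREFIX = re.compile(r"(^.+?)(\s*)((?:1[0-9]|20)[0-9]{2})", re.MULTILINE)

    # Sample lines rebuilt from the authors/date table, one per line.
    SAMPLE = "\n".join([
        "Avise, J. C. 2000",
        "Balduzzi, A. P., P. De Luca y S. Sabato. 1982",
        "Bogler, D. J. y J. Francisco-Ortega. 2004",
    ])

    # One tuple (authors, space, date) per matched line.
    matches = [m.groups() for m in PREFIX.finditer(SAMPLE)]
    for authors, space, date in matches:
        print(repr(authors), repr(space), repr(date))

Because group 1 is lazy, it stops right before the whitespace that precedes the first four-digit year, so the space ends up in group 2 and not in the authors.
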
The next part of the pattern is: dot and space, title, dot and space, publisher name, space, pages, and a final dot. It can be represented by the regex ``(\.\s)(.+?)(\.\s)(.+?)(\s)([0-9:,-\s]+?)(\.$)``, with these groups:

* ``(\.\s)`` .- Matches ``\.`` a dot and ``\s`` a space.
* ``(.+?)`` .- Matches ``.`` any single character, ``+`` between one and unlimited times, ``?`` as few times as possible.
* ``(\.\s)`` .- Matches ``\.`` a dot and ``\s`` a space.
* ``(.+?)`` .- Matches ``.`` any single character, ``+`` between one and unlimited times, ``?`` as few times as possible.
* ``(\s)`` .- Matches ``\s`` a space.
* ``([0-9:,-\s]+?)`` .- Matches any character in the class ``[0-9:,-\s]``, ``+`` between one and unlimited times, ``?`` as few times as possible.
* ``(\.$)`` .- Matches ``\.`` a dot and ``$`` the end of the line.

The complete pattern is ``(^.+?)(\s*)((?:1[0-9]|20)[0-9]{2})(\.\s)(.+?)(\.\s)(.+?)(\s)([0-9:,-\s]+?)(\.$)``. When you put the complete pattern in RegexBuddy, it shows the matches in the references. This pattern matched 900 references.

.. image:: ./images/create_pattern_4.jpg

In conclusion, we have the expression ``(^.+?)(\s*)((?:1[0-9]|20)[0-9]{2})(\.\s)(.+?)(\.\s)(.+?)(\s)([0-9:,-\s]+?)(\.$)``, which corresponds to a contribution in a serial, with these groups:

1. ``(^.+?)`` .- Authors
2. ``(\s*)`` .- Space
3. ``((?:1[0-9]|20)[0-9]{2})`` .- Date
4. ``(\.\s)`` .- Dot and space
5. ``(.+?)`` .- Title
6. ``(\.\s)`` .- Dot and space
7. ``(.+?)`` .- Publisher name
8. ``(\s)`` .- Space
9. ``([0-9:,-\s]+?)`` .- Pages
10. ``(\.$)`` .- Dot and end of line
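
The group list above can be turned into a small lookup. This is a minimal sketch in Python, not part of RegexMarkup. The function ``parse_reference`` and the sample reference are hypothetical and exist only to exercise the pattern. One adaptation is needed: Python's ``re`` module rejects ``,-\s`` as a character range, so the hyphen is moved to the end of the pages class, where it is a literal.

.. code-block:: python

    import re

    # Complete pattern from this manual. Only change: in the pages class the
    # hyphen is moved to the end, because Python's re rejects the range ",-\s".
    REFERENCE = re.compile(
        r"(^.+?)(\s*)((?:1[0-9]|20)[0-9]{2})(\.\s)(.+?)(\.\s)(.+?)(\s)"
        r"([0-9:,\s-]+?)(\.$)",
        re.MULTILINE,
    )

    # Group numbers of the fields that carry data (the others are separators).
    FIELDS = {1: "authors", 3: "date", 5: "title", 7: "publisher name", 9: "pages"}

    def parse_reference(line):
        """Return a dict of field name to text, or None if the line does not match."""
        match = REFERENCE.search(line)
        if match is None:
            return None
        return {name: match.group(number) for number, name in FIELDS.items()}

    # Hypothetical reference, invented only to exercise the pattern.
    example = "Avise, J. C. 2000. Sample title. Sample Journal 12:34-56."
    print(parse_reference(example))

Keeping the group numbers in one dictionary makes it easy to see which groups hold data and which ones are only separators.
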

Add rule to xml file
--------------------

* **NOTE**: the ``value``, ``prevalue`` and ``postvalue`` elements contain backreferences to groups in the form ``${backreference}``. For example, the backreference to group 1 is ``${1}``.
* First, add the main node of the journal if it does not exist.
.. code-block:: xml
Revista mexicana de biodiversidadother(....+)${1}
In all cases, the recommended main regex is ``(....+)``, together with the main tag of the reference rule (in this case **ocitat**), so that each pattern that identifies references can go in its own **option** inside a **multiple** element.
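
To illustrate what a ``${n}`` backreference stands for, here is a minimal sketch in Python. How RegexMarkup expands backreferences internally is not described in this manual, so the helper ``expand_backreferences`` is hypothetical and only mimics the idea: each ``${n}`` is replaced by the text captured by group ``n``. The sample text is also invented.

.. code-block:: python

    import re

    # Matches a "${n}" placeholder, where n is a group number.
    BACKREFERENCE = re.compile(r"\$\{(\d+)\}")

    def expand_backreferences(template, match):
        """Replace every ${n} in template with the text of group n of match."""
        def group_text(placeholder):
            # A group that did not take part in the match expands to "".
            return match.group(int(placeholder.group(1))) or ""
        return BACKREFERENCE.sub(group_text, template)

    # Main regex from this manual applied to a hypothetical reference.
    main = re.match(r"(....+)", "Avise, J. C. 2000. Sample title.")
    print(expand_backreferences("${1}", main))

With the main regex ``(....+)``, group 1 is the whole reference, so ``${1}`` passes the complete text on to the inner patterns.
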
* Add your **regex** pattern to an **option** in the **multiple** element node.
.. code-block:: xml
* Create the **struct** of the regex pattern and add the **tagname** elements.
.. code-block:: xml
The first **tagname** element is **ocontrib**, which contains **authors**. This **tagname** is only a container for each **author**, so the attribute **tag="true"** is not used on it. We will see later how to parse this element.
.. code-block:: xml
The second **tagname** in **ocontrib** is **date**, with the default attribute **dateiso** (when it has no value, the date is parsed from **value** if the format is YYYY).
.. code-block:: xml
The third **tagname** element is **title**, with the default attribute **language** and the default attribute value **en**.
.. code-block:: xml
The next **tagname** element is **oiserial**. Its inner **tagname** is **sertitle**, with the default attribute **language** and the default attribute value **en**.
.. code-block:: xml
The last **tagname** element is **pages**.

Create pattern to parse inner values
------------------------------------

The advantage of **RegexMarkup** is that it can process inner elements regardless of their nesting level. In this example, we parse each **author** inside **authors**.

All authors appear in group 1 of the previous pattern:

.. list-table::
   :header-rows: 1

   * - authors
   * - ``Avise, J. C.``
   * - ``Balduzzi, A. P., P. De Luca y S. Sabato.``
   * - ``Bogler, D. J. y J. Francisco-Ortega.``
   * - ``Caputo, P., S. Cozzolino, P. De Luca, A. Moretti y D. W. Stevenson.``
   * - ``Chaw, S. M., T. W. Walters, C. C. Chang, S. H. Hu y S. H. Chen.``
   * - ``DeSalle, R., M. G. Egan y M. Siddall.``
   * - ``DeSalle, R.``
   * - ``DeSalle, R.``

.. code-block:: xml
Now create a new regex pattern to parse each author in group 1. The easy way is:

* Choose ``List All`` → ``List all matches of group 1`` on the ``Test`` tab.
.. image :: ./images/create_pattern_5.jpg
* Copy the text from text area 3, paste it into text area 2, and delete the regex pattern from text area 1.
.. image :: ./images/create_pattern_6.jpg
* Now everything is ready to build the pattern that parses each author. In this case there are only two groups: group 1 is the author and group 2 is the field separator. The regex pattern is: ``(.+?,\s[A-Z.\s]+?|.+?)(\sand\s|\sy\s|,\s|\s\(.+?\)\.?|\.\s$|$)``

  ``(.+?,\s[A-Z.\s]+?|.+?)`` .- Matches ``.`` any single character, ``+`` between one and unlimited times, ``?`` as few times as possible, then ``,`` a comma and ``\s`` a space, then any character in the class ``[A-Z.\s]``, ``+`` between one and unlimited times, ``?`` as few times as possible; **or** matches ``.`` any single character, ``+`` between one and unlimited times, ``?`` as few times as possible. The first option matches the first author, which always appears like this: the surname (``Balduzzi``), a comma and a space (``,|_|``), and the first-name initials (``A. P.``), made up of capital letters, dots and spaces. The second option matches the remaining authors.

  ``(\sand\s|\sy\s|,\s|\s\(.+?\)\.?|\.\s$|$)`` .- Matches one of the following, each of which is a possible field separator:

  * ``\sand\s`` .- a space, the word "and", a space;
  * ``\sy\s`` .- a space, the character "y", a space;
  * ``,\s`` .- a comma and a space;
  * ``\s\(.+?\)\.?`` .- a space, ``\(`` an opening parenthesis, any single character between one and unlimited times (as few times as possible), ``\)`` a closing parenthesis, and ``\.`` a dot that ``?`` may appear zero or one time;
  * ``\.\s$`` .- a dot, a space and ``$`` the end of the line;
  * ``$`` .- the end of the line.

Test the pattern in **RegexBuddy**
.. image:: ./images/create_pattern_7.jpg
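
The author pattern can also be tried in a few lines of Python. This is only a sketch under the assumption that each authors string is processed on its own. The function name ``split_authors`` is hypothetical, and the input strings come from the authors table above.

.. code-block:: python

    import re

    # Author pattern from this manual: group 1 is one author, group 2 the separator.
    AUTHOR = re.compile(
        r"(.+?,\s[A-Z.\s]+?|.+?)(\sand\s|\sy\s|,\s|\s\(.+?\)\.?|\.\s$|$)"
    )

    def split_authors(authors):
        """Return the text of group 1 for every match, in order."""
        return [m.group(1) for m in AUTHOR.finditer(authors)]

    print(split_authors("Balduzzi, A. P., P. De Luca y S. Sabato."))
    print(split_authors("Avise, J. C."))

In this sketch the last author keeps the final dot (``S. Sabato.``), because the only separator that can match there is ``$``. The ``\.\s$`` alternative applies only when the dot is followed by a space at the end of the line.
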
* When the regex pattern is ready, continue completing the rule in the XML file.
.. code-block:: xml
* Add a new **regex** to the **authors** **tagname**.
* Add a **struct** to the **regex**.
* The content of the **struct** is **oauthor**, with the default attribute **role** and the default value **nd**.

The second example parses **surname** and **fname** in **oauthor**.

* Create a new regex pattern to parse **surname** and **fname** in **oauthor**.
* Again, choose ``List All`` → ``List all matches of group 1`` on the ``Test`` tab.
.. image :: ./images/create_pattern_8.jpg
* Copy the text from text area 3, paste it into text area 2, and delete the regex pattern from text area 1.
.. image :: ./images/create_pattern_9.jpg
* In this case we have two options.
* The first option is when the author appears like this: ``Caputo`` (**surname**), ``,|_|`` (comma and space), ``P.`` (**fname**). It can be parsed with the regex pattern ``(.+?)(,\s|,?\s|,\s?)([A-Zh.\s]+$)``

  ``(.+?)`` .- Matches ``.`` any single character, ``+`` between one and unlimited times, ``?`` as few times as possible.

  ``(,\s|,?\s|,\s?)`` .- Matches ``,\s`` a comma and a space, **or** ``,`` a comma that ``?`` appears zero or one time followed by ``\s`` a space, **or** ``,`` a comma followed by ``\s`` a space that ``?`` appears zero or one time.

  ``([A-Zh.\s]+$)`` .- Matches any character in the class ``[A-Zh.\s]``, ``+`` between one and unlimited times (as many times as possible), up to ``$`` the end of the line.

Test the pattern in **RegexBuddy**
.. image:: ./images/create_pattern_10.jpg
* The second option is when the author appears like this: ``D. L.`` (**fname**), ``|_|`` (space), ``Erickson`` (**surname**). It can be parsed with the regex pattern ``([A-Zh.]+?\s[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?|.+?)(\s)(.+$)``

  ``([A-Zh.]+?\s[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?|.+?)`` .- Matches one of the following, where each ``[A-Zh.]+?`` is any character in the class ``[A-Zh.]``, ``+`` between one and unlimited times, ``?`` as few times as possible:

  * ``[A-Zh.]+?\s[A-Zh.]+?\s[A-Zh.]+?`` .- three runs of class characters separated by ``\s`` a space;
  * ``[A-Zh.]+?\s[A-Zh.]+?`` .- two runs of class characters separated by ``\s`` a space;
  * ``[A-Zh.]+?`` .- a single run of class characters;
  * ``.+?`` .- any single character, between one and unlimited times, as few times as possible.

  ``(\s)`` .- Matches ``\s`` a space.

  ``(.+$)`` .- Matches ``.`` any single character, ``+`` between one and unlimited times (as many times as possible), up to ``$`` the end of the line.

Test the pattern in **RegexBuddy**
.. image:: ./images/create_pattern_11.jpg
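
Both name patterns can be checked against the examples used above. The following Python lines are a minimal sketch. The constant names ``SURNAME_FIRST`` and ``FNAME_FIRST`` are hypothetical labels for the two options.

.. code-block:: python

    import re

    # First option: surname, separator, first-name initials.
    SURNAME_FIRST = re.compile(r"(.+?)(,\s|,?\s|,\s?)([A-Zh.\s]+$)")

    # Second option: first-name initials, space, surname.
    FNAME_FIRST = re.compile(
        r"([A-Zh.]+?\s[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?|.+?)"
        r"(\s)(.+$)"
    )

    first = SURNAME_FIRST.search("Caputo, P.")
    print(first.groups())   # surname in group 1, fname in group 3

    second = FNAME_FIRST.search("D. L. Erickson")
    print(second.groups())  # fname in group 1, surname in group 3

Note that the position of **surname** and **fname** is swapped between the two options: group 1 and group 3 change roles.
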
* When the regex pattern is ready, continue completing the rule in the XML file.
.. code-block:: xml
A **multiple** element with two options is added:

* The first **option** contains the **regex** pattern ``(.+?)(,\s|,?\s|,\s?)([A-Zh.\s]+$)``, with a **struct** for **surname** (group 1) and **fname** (group 3).
* The second **option** contains the **regex** pattern ``([A-Zh.]+?\s[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?|.+?)(\s)(.+$)``, with a **struct** for **fname** (group 1) and **surname** (group 3).
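
The idea of a **multiple** element with several options can be sketched as a list of patterns that are tried in order. This Python example assumes that the first option that matches wins. The manual does not state how RegexMarkup chooses between options, so treat that rule, and the function name ``split_name``, as assumptions.

.. code-block:: python

    import re

    # Each entry: (pattern, group number of surname, group number of fname).
    OPTIONS = [
        (re.compile(r"(.+?)(,\s|,?\s|,\s?)([A-Zh.\s]+$)"), 1, 3),
        (re.compile(
            r"([A-Zh.]+?\s[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?|.+?)"
            r"(\s)(.+$)"), 3, 1),
    ]

    def split_name(author):
        """Try each option in order and return the first result, or None."""
        for pattern, surname_group, fname_group in OPTIONS:
            match = pattern.search(author)
            if match:
                return {
                    "surname": match.group(surname_group),
                    "fname": match.group(fname_group),
                }
        return None

    for author in ["Caputo, P.", "D. L. Erickson", "P. De Luca"]:
        print(author, "->", split_name(author))

The first option fails on names such as ``P. De Luca`` because its last group only accepts capital letters, ``h``, dots and spaces up to the end of the line, so the second option takes over.
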