Build journal rule ================== * `Previous requirements`_ * `Build main pattern`_ * `Add rule to xml file`_ * `Create pattern to parse inner values`_ Previous requirements --------------------- * Knowledge about regular expressions * Knowledge of the structure of rules definition file * Any text editor of your preference. (`SublimeText `_, `Notepad++ `_, etc...) * Recommend use `RegexBuddy `_ you need pay, but can generate quickly and easy, patterns of regex. * Reference sample of the journal which will create the rule. Build main pattern ------------------ **NOTE:** To write this manual used a journal **Revista mexicana de biodiversidad** ISSN: **1870-3453** #. Open RegexBuddy and check ``^$ match at line breaks``, chosse tab ``Test``, check ``Highlight``, in ``List All`` submenu check ``Update Automatically`` and choose ``Line by line`` like this: .. image:: ./images/create_pattern_1.jpg #. Paste sample of references in ``Test`` tab. The larger the sample will be obtain better results. .. image:: ./images/create_pattern_2.jpg #. Now it's time to use and increment your knowledge of regex. First you need identify patterns in references, like a field separators (``.``, ``;``, ``:``, etc). In this case can identify from left to right, 1 authors, 2 space, 3 date. +-----------------------------------------------------------------------+-------+--------+ | authors | space | date | +=======================================================================+=======+========+ |``Avise, J. C.`` |``|_|``|``2000``| +-----------------------------------------------------------------------+-------+--------+ |``Balduzzi, A. P., P. De Luca y S. Sabato.`` |``|_|``|``1982``| +-----------------------------------------------------------------------+-------+--------+ |``Bogler, D. J. y J. Francisco-Ortega.`` |``|_|``|``2004``| +-----------------------------------------------------------------------+-------+--------+ |``Caputo, P., S. Cozzolino, P. De Luca, A. Moretti y D. W. Stevenson.``|``|_|``|``2004``| +-----------------------------------------------------------------------+-------+--------+ |``Chaw, S. M., T. W. Walters, C. C. Chang, S. H. Hu y S. H. Chen.`` |``|_|``|``2005``| +-----------------------------------------------------------------------+-------+--------+ |``DeSalle, R., M. G. Egan y M. Siddall.`` |``|_|``|``2005``| +-----------------------------------------------------------------------+-------+--------+ |``DeSalle, R.`` |``|_|``|``2006``| +-----------------------------------------------------------------------+-------+--------+ |``DeSalle, R.`` |``|_|``|``2007``| +-----------------------------------------------------------------------+-------+--------+ And you can represent this by regex ``(^.+?)(\s*)((?:1[0-9]|20)[0-9]{2})``, groups is: ``(^.+?)`` .- ``^`` Begin of line match ``.`` any single caracter ``+`` betwen one and ulimited times ``?``, as few times is posible. This group identifies authors. ``(\s)`` .- Match a ``\s`` sigle white space character (space, tabs, and line breaks). ``((?:1[0-9]|20)[0-9]{2})`` .- Match the character ``1`` and numbers between ``[0-9]`` or ``20``, match two characters between ``[0-9]``. This group identifies dates between 1000 and 2099. Now put the regex ``(^.+?)(\s*)((?:1[0-9]|20)[0-9]{2})`` in RegexBuddy and see the automatic highlight maresults. .. image:: ./images/create_pattern_3.jpg The next pattern is, dot and space, title, dot and space, publisher name, space, pages, and end with dot. And can represented by next regex ``(\.\s)(.+?)(\.\s)(.+?)(\s)([0-9:,-\s]+?)(\.$)`` with groups: ``(\.\s)`` .- Match ``\.`` dot and ``\s`` space. ``(.+?)`` .- Match ``.`` any single caracter ``+`` betwen one and ulimited times ``?``, as few times is posible. ``(\.\s)`` .- Match ``\.`` dot and ``\s`` space. ``(.+?)`` .- Match ``.`` any single caracter ``+`` betwen one and ulimited times ``?``, as few times is posible. ``(\s)`` .- Match ``\s`` space. ``(\s[0-9:,-\s]+?)`` .- Match any character in class ``[0-9:,-\s]``, ``+`` betwen one and ulimited times ``?``, as few times is posible. ``(\.$)`` .- Match ``\.`` and ``$`` end of line. And the complete pattern is ``(^.+?)(\s*)((?:1[0-9]|20)[0-9]{2})(\.\s)(.+?)(\.\s)(.+?)(\s)([0-9:,-\s]+?)(\.$)`` when put the complete pattern in RegexBuddy show matches in references. Whit this pattern matche with 900 references .. image:: ./images/create_pattern_4.jpg In conclusion we have the following expression ``(^.+?)(\s*)((?:1[0-9]|20)[0-9]{2})(\.\s)(.+?)(\.\s)(.+?)(\s)([0-9:,-\s]+?)(\.$)`` and correspond to a contribution in serial with groups: 1. ``(^.+?)`` .- Authors 2. ``(\s*)`` .- Space 3. ``((?:1[0-9]|20)[0-9]{2})`` .- Date 4. ``(\.\s)`` .- Dot and space 5. ``(.+?)`` .- Title 6. ``(\.\s)`` .- Dot and space 7. ``(.+?)`` .- Publisher name 8. ``(\s)`` .- Space 9. ``([0-9:,-\s]+?)`` .- Pages 10. ``(\.$)`` .- Dot and end line Add rule to xml file -------------------- * **NOTE**: ``value``, ``prevalue`` and ``postvalue`` elements contains backreference(s) group(s) in this form ``${backreference}`` by example to backreference group 1 is ``${1}`` and complete sintaxis is ``${1}`` * First need add if not exist, main node of journal. .. code-block:: xml :linenos: ... Revista mexicana de biodiversidad other (....+) ${1} ... ... In all cases recommend use the main regex is ``(....+)`` and main tag of reference rule in this case **ocitat**, to use **multiple** and **option**'s in each pattern to identify references. * Add your **regex** pattern to a **option**, in **multiple** element node. .. code-block:: xml :linenos: ... ${1} * Create the **struct** of regex pattern and add **tagname** elements .. code-block:: xml :linenos: ... ... The first **tagname** element is **ocontrib** that contain **authors** and this **tagname** is only container of each **author** so attributte **tag="true"** in tag name not used. We will see later how to parse this element. .. code-block:: xml :linenos: ... ... Second **tagname** on **ocontrib** is **date** with default attribute **dateiso** (```` without value parse date from **value** if format is YYYY) .. code-block:: xml :linenos: ... ... Third **tagname** element is **title** with default attribute is **language** and default attribute value **en** .. code-block:: xml :linenos: ... ... Next **tagname** element is **oiserial** and inner **tagname** is **sertitle** with default attribute is **language** and default attribute value **en** .. code-block:: xml :linenos: ... ... And last **tagname** element is **pages** Create pattern to parse inner values ------------------------------------ The advantage of **RegexMarkup** is can processing inner elements regardless level of these. In this example can parse each **author** in **authors**. All authors appear in previous group 1: +-----------------------------------------------------------------------+ | authors | +=======================================================================+ |``Avise, J. C.`` | +-----------------------------------------------------------------------+ |``Balduzzi, A. P., P. De Luca y S. Sabato.`` | +-----------------------------------------------------------------------+ |``Bogler, D. J. y J. Francisco-Ortega.`` | +-----------------------------------------------------------------------+ |``Caputo, P., S. Cozzolino, P. De Luca, A. Moretti y D. W. Stevenson.``| +-----------------------------------------------------------------------+ |``Chaw, S. M., T. W. Walters, C. C. Chang, S. H. Hu y S. H. Chen.`` | +-----------------------------------------------------------------------+ |``DeSalle, R., M. G. Egan y M. Siddall.`` | +-----------------------------------------------------------------------+ |``DeSalle, R.`` | +-----------------------------------------------------------------------+ |``DeSalle, R.`` | +-----------------------------------------------------------------------+ .. code-block:: xml :linenos: ${1} ${2} Now need create a new regex pattern to parse each author in group 1, the easy way is: * Choose ``List All`` → ``List all matches of group 1`` on ``Test`` tab. .. image :: ./images/create_pattern_5.jpg * Copy and paste from text area 3, to text area 2 and delete the regex pattern from text area 1 .. image :: ./images/create_pattern_6.jpg * Now all is ready to construct pattern to parse each author, in this case only have two groups, group 1 is author and group 2 is the field separator and regex pattern is: ``(.+?,\s[A-Z.\s]+?|.+?)(\sand\s|\sy\s|,\s|\s\(.+?\)\.?|\.\s$|$)`` ``(.+?,\s[A-Z.\s]+?|.+?)`` .- Match ``.`` any single caracter ``+`` betwen one and ulimited times ``?`` as few times is posible, ``,`` coma ``\s``, any character in class ``[A-Z.\s]``, ``+`` betwen one and ulimited times ``?``, as few times is posible **or** match ``.`` any single caracter ``+`` betwen one and ulimited times ``?`` as few times is posible. The first option is to match with first author, that always appear like this: ``Balduzzi`` first name ``,|_|`` coma and space ``A. P.`` and surname composed by capital letters, dots and space. Second option is to match with remaining authors. ``(\sand\s|\sy\s|,\s|\s\(.+?\)\.?|\.\s$|$)`` .- Match ``\s`` space ``and`` the word "and" ``\s`` space, **or** ``\s`` space ``y`` the character "y" ``\s`` space, **or** ``,`` coma ``\s`` space , **or** ``\s`` space ``\(`` open parenthesis ``.`` any single caracter ``+`` betwen one and ulimited times ``?`` as few times is posible ``\(`` close parenthesis ``\.`` dot can be appear ``?`` zero or one time, **or** ``\.`` dot ``\s`` can be appear ``?`` zero or one time, **or** ``\.`` dot ``\s`` space ``$`` end of line, **or** ``$`` end of line. In this case each option is one posible field separathor. Test pattern in **RegexBuddy** .. image:: ./images/create_pattern_7.jpg * When regex pattern is ready, continue completing the rule in XML file. .. code-block:: xml :linenos: ${1} (.+?,\s[A-Z.\s]+?|.+?)(\sand\s|\sy\s|,\s|\s\(.+?\)\.?|\.\s$|$) nd ${1} ${2} ${2} * Add new **regex** into **authors** **tagname** * Added **struct** to **regex** * And the content of **struct** is **oauthor** with default attribute **role** and default value **nd** Second example is parse **surname** and **fname** in **oauthor** * Create new new regex pattern to parse **surname** and **fname** in **oauthor** * Again choose ``List All`` → ``List all matches of group 1`` on ``Test`` tab. .. image :: ./images/create_pattern_8.jpg * Copy and paste from text area 3, to text area 2 and delete the regex pattern from text area 1 .. image :: ./images/create_pattern_9.jpg * In this can we have two options. * The first option is when autor appear like this: ``Caputo`` **surname** ``,|_|`` coma and space ``P.`` and **fname** and can parsed with regex pattern ``(.+?)(,\s|,?\s|,\s?)([A-Zh.\s]+$)`` ``(.+?)`` .- Match ``.`` any single caracter ``+`` betwen one and ulimited times ``?``, as few times is posible. ``(,\s|,?\s|,\s?)`` .- Match ``,\s`` coma and space, **or** ``,`` coma ``?`` apear zero or one time ``\s`` space **or** ``,`` coma ``\s`` space ``?`` apear zero or one time. ``([A-Zh.\s]+$)`` Match any character in class ``[A-Zh.\s]``, ``+`` betwen one and ulimited times ``?`` at ``$`` end of line. Test pattern in **RegexBuddy** .. image:: ./images/create_pattern_10.jpg * Second option is when autor appear like this: ``D. L.`` **fname** ``|_|`` space ``Erickson`` and **surname** and can parsed with regex pattern ``([A-Zh.]+?\s[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?|.+?)(\s)(.+$)`` ``([A-Zh.]+?\s[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?|.+?)`` .- Match any character in class ``[A-Zh.\s]``, ``+`` betwen one and ulimited times ``?`` as few times is posible ``\s`` space, any character in class ``[A-Zh.\s]``, ``+`` betwen one and ulimited times ``?`` as few times is posible ``\s`` space, any character in class ``[A-Zh.\s]``, ``+`` betwen one and ulimited times ``?`` as few times is posible, **or** any character in class ``[A-Zh.\s]``, ``+`` betwen one and ulimited times ``?`` as few times is posible ``\s`` space, any character in class ``[A-Zh.\s]``, ``+`` betwen one and ulimited times ``?`` as few times is posible, **or** any character in class ``[A-Zh.\s]``, ``+`` betwen one and ulimited times ``?`` as few times is posible, **or** ``.`` any single caracter ``+`` betwen one and ulimited times ``?``, as few times is posible. ``(\s)`` .- Match ``\s`` space. ``(.+$)`` .- Match ``.`` any single caracter ``+`` betwen one and ulimited times ``?``, as few times is posible. Test pattern in **RegexBuddy** .. image:: ./images/create_pattern_11.jpg * When regex pattern is ready, continue completing the rule in XML file. .. code-block:: xml :linenos: ... nd ${1} ${2} ... Added **multiple** with two options * First **option** contains **regex** pattern ``(.+?)(,\s|,?\s|,\s?)([A-Zh.\s]+$)`` with **struct** **fname** and **surname** * Second **option** contains **regex** pattern ``([A-Zh.]+?\s[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?\s[A-Zh.]+?|[A-Zh.]+?|.+?)(\s)(.+$)`` with **struct** **surname** and **fname**