SFM Lexicons in Toolbox

This page is about The Linguist's Toolbox: http://lingtransoft.info/apps/toolbox .

Some Toolbox background

Toolbox is a non-relational database program that stores each database as a text file (a flat-file database; unlike XML, the data itself is stored non-hierarchically, although clean data can be interpreted according to a separately defined hierarchy). Users are expected to fully understand and manually edit the syntax of this file format and the hierarchical structure of whatever standard they are trying to follow. This transparency gives users the flexibility to arrange things in custom ways and to directly rearrange structures as needed. Because it is so customizable, Toolbox can also be used to manage a collection of book reviews, a small library catalog, and so on. But it also creates a steep learning curve and lets users shoot themselves in the foot by entering their data in inconsistent or non-standard ways. It also means that leveraging the data requires manual configuration: each link field must be configured with a jump path in order to work, and the interlinear process must be manually configured too.

The general text file format that is used is SFM (standard format markers). An SFM file consists of many fields, one after another, where each field consists of a backslash code at the beginning of a line, followed by a space, followed by any amount of data up until the next line-initial backslash code is found. One field marker, known as the "record marker", is considered special, and Toolbox will always understand it to begin a new record. For example, most dictionaries use \lx as their record marker.
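The field and record rules just described can be sketched in a few lines of Python. This is only an illustrative reading of the format, not how Toolbox itself parses files; the function names `parse_fields` and `split_records` are made up for this example, and the record marker is assumed to be lx.

```python
def parse_fields(text):
    """Return a list of (marker, data) pairs in file order."""
    fields = []
    for line in text.splitlines():
        if line.startswith("\\"):
            # A line-initial backslash code starts a new field.
            marker, _, data = line[1:].partition(" ")
            fields.append([marker, data])
        elif fields:
            # Any other line continues the data of the current field.
            fields[-1][1] += "\n" + line
    return [(marker, data.strip()) for marker, data in fields]


def split_records(fields, record_marker="lx"):
    """Group fields into records; each record marker begins a new record."""
    records, current = [], []
    for marker, data in fields:
        if marker == record_marker and current:
            records.append(current)
            current = []
        current.append((marker, data))
    if current:
        records.append(current)
    return records
```

Note that any header fields appearing before the first record marker end up in a leading "record" of their own in this sketch.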

That's pretty much it for the specification of SFM itself, so 'standard format' is something of a misnomer. Actually, almost every SFM dictionary file is unique and non-standard, differing from others in terms of which markers are used and in what hierarchical structure. Note: although SFM data is interpreted according to some external hierarchy, there are no closing tags or indentation indicating that hierarchy in the data itself (unlike XML). Keeping the right .TYP file in the same location as the SFM file can help, assuming that the user has followed its structure, but even this is just an approximation. (E.g. MDF allows nt to occur under lx, se, sn, or rf. The mdf.typ file approximates this by saying nt occurs under sn.)

We will be mostly concerned with one specific type of SFM file, the MDF (standard hierarchy) dictionary file, since this is the best SFM format to choose for publishing a dictionary or importing it into FLEx or WeSay. It is also the best SFM format for archiving a dictionary, although using a newer format such as LIFT XML would be more future-friendly than SFM. (Archiving both a final version of the SFM file and a corresponding new LIFT file is a good idea.)

Note: The steps in this document are fairly high-level. For specific details about doing SFM cleanup, please follow up by reading the main page on the topic, Preparing legacy SFM data. Also, as it mentions there, it's a good idea to keep a log of what you've done (and plan to do), in order.

Editing more safely in Toolbox

Toolbox does not provide tools for enforcing good, consistent structure on a file, but if you are unable to switch over to a safer editing environment yet (FLEx or WeSay), there are some things you can do to enforce good habits on yourself.

Show hierarchy (after verifying that it's defined correctly)

Probably the most important thing to do is to turn on View, Marker Hierarchy. This gives you a visual cue as to where you are in the hierarchy. What it displays is based on the structure defined in your .TYP file; to see and edit these settings for a given field, right-click the field marker and look at the “Under what in the Hierarchy” setting. (If these settings are incorrect, then showing the visual hierarchy will be misleading.)


Don't omit structural markers (e.g. sn)

(This section needs to be expanded.)

Omitting non-structural markers is fine, and even recommended for all but the most common fields. But it's helpful to always explicitly mark the beginning of every sense with an sn field, even though MDF allows sn to be inferred from the existence of one of its children (such as ge). Also, although one ps is allowed to contain multiple sn, the data will be easier to migrate if ps and sn are always one-to-one (and always right next to each other). So, this explicit record…

\lx bank
\ps n
\sn 1
\ge riverbank
\ps n
\sn 2
\ge financial_institution
\ps v
\sn 3
\ge tilt

is much better than this equivalent record:

\lx bank
\ps n
\ge riverbank
\ge financial_institution
\ps v
\ge tilt 

Note that you can make this transition without having to manually add the sn fields into each old entry. Solid is able to help automate that part of the process, although you need to first make sure that everything is otherwise very clean.
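Before automating anything, it can help to know which entries still rely on inferred senses. The following Python sketch lists the lx values of records in which some ge field has no sn field of its own. It is a rough check, not a replacement for Solid: the function name is made up, it assumes one ge per sense, and it only looks at the markers lx, se, sn, and ge.

```python
def entries_with_implicit_senses(sfm_text):
    """Return the lx values of records where some ge lacks its own sn."""
    flagged = []
    headword, sn_open = None, False
    for line in sfm_text.splitlines():
        if not line.startswith("\\"):
            continue  # continuation of the previous field's data
        marker, _, data = line[1:].partition(" ")
        if marker == "lx":
            headword, sn_open = data.strip(), False
        elif marker == "se":
            sn_open = False  # a subentry starts its own set of senses
        elif marker == "sn":
            sn_open = True
        elif marker == "ge":
            if not sn_open and headword not in flagged:
                flagged.append(headword)
            sn_open = False  # the next ge needs a new sn
    return flagged
```

Because the check does not care whether sn comes before or after ps, it works for either ordering of those two fields.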

The following screenshot shows how viewing hierarchy can visually clue you in to the need to rearrange specific fields so that they no longer violate the hierarchy.


Show marker names

(This section needs to be expanded.)

De-wrap all hard-wrapped lines

It's fine to have text of a field (e.g. a very long definition) autowrap on-screen, but once the file is saved, it's not a good idea to have actual hard newlines saved into that field. Most text editors can wrap text on-screen without saving extra newlines, and as of v1.5.5, Toolbox can too. Everyone should probably be turning on this feature; here's how. Close Toolbox, open your .prj file (not the actual dictionary file), and add this line, \SaveWithoutNewlines, anywhere between the first and last lines.

From http://www.sil.org/computing/toolbox/versions.htm

(In previous versions of Toolbox, one alternative was to turn off autowrap (Database menu), make the window really wide, set the wrap margin, and then Reshape Entire File. But this still wouldn't get the really long lines.)

Alternatively, you can use an external utility such as joinlines, when Toolbox is not running. Or, here's a regular expression that handles Windows' newlines. (For me, this works a little better than Toolbox, giving just one space even when both trailing and leading spaces had existed.)

{CODE} \\r\\n {CODE}


With this (note the leading space):

Warning: If you had been deliberately adding newlines into fields and wanted those to be preserved, de-wrapping will blow those away too. But inserting those is generally a bad practice. (Still, if you really need newlines within a field, you can use the Unicode character U+2028, Line Separator, in just those cases. That character is far more likely to be preserved by software.)
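For those who prefer a script, here is a minimal Python sketch of the same de-wrapping idea. The pattern is written for this example and is not necessarily the expression originally given on this page; the function name `dewrap` is also made up. It joins a line break (Windows or Unix style) to the next line with a single space whenever that next line is non-blank and does not start with a backslash marker.

```python
import re

# A line break, plus any spaces or tabs around it, that is followed by
# ordinary text (not a backslash marker, not a blank line, not end of file).
HARD_WRAP = re.compile(r"[ \t]*\r?\n[ \t]*(?=[^\\\s])")


def dewrap(sfm_text):
    """Join hard-wrapped continuation lines onto their field with one space."""
    return HARD_WRAP.sub(" ", sfm_text)
```

Run it on a copy of the file while Toolbox is closed, and read and write the file without newline translation (for example, in binary mode with explicit decoding) so that the existing line endings are kept. The warning above applies here as well: deliberate newlines inside a field are also joined.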

Use Unicode, and avoid mixed encodings

(This section needs to be expanded.)

Have Toolbox enforce all range sets (after defining clean lists for these)

(This section needs to be expanded.)

Avoid using redundant fields (e.g. pn in addition to ps)

Maintaining both an English part of speech (ps) and a national part of speech (pn) is generally not a good practice, since the two are (or at least ought to be) completely redundant. Toolbox doesn't have a central list-managing feature like FLEx or WeSay does, but you can simulate this using only the English part of speech field. You either need the dictionary editor(s) to use only English in ps (“adj”), or only some other meaningful label (“kata sifat”), or a combination (“adj - ks”). You can then use a global replace later on if you want to publish those differently. (You can even split those back into a ps and a pn easily at publication time, in a copy of the database.)

One safe way to get rid of pn is to go through each of the most frequent parts of speech first and blow away pn where it matches ps. For example:

Replacing this:
^\\ps adj\r\n\\pn ks$
With this:
\\ps adj

You can then go through the remaining pn fields one by one and deal with them. Wherever pn exists and ps does not, you can manually convert the pn into a ps with the corresponding English value.
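The first pass can also be scripted. The Python sketch below applies the same kind of replacement for a whole table of equivalent values at once. The function name and the `equivalents` table are assumptions for illustration; only the adj/ks pair comes from the example above, and you would fill in the pairs that actually occur in your data.

```python
import re


def drop_redundant_pn(sfm_text, equivalents):
    """Delete each pn field that directly follows a ps field it duplicates.

    equivalents maps an English ps value to its matching pn value,
    e.g. {"adj": "ks"}.
    """
    for ps_value, pn_value in equivalents.items():
        pattern = re.compile(
            r"^(\\ps " + re.escape(ps_value) + r")[ \t]*\r?\n"
            r"\\pn " + re.escape(pn_value) + r"[ \t]*(?=\r?\n|\Z)",
            re.MULTILINE,
        )
        # Keep only the ps line; the pn line is dropped.
        sfm_text = pattern.sub(lambda m: m.group(1), sfm_text)
    return sfm_text
```

A pn whose value does not match the table is left untouched, so the remaining cases still show up for manual review.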

Handle variants properly, and also distinguish them from subentries

This is tedious to fix or rearrange later because the information is scattered around the lexicon. So, this section includes some heads-ups about future data migration issues (FLEx import issues) you may want to be aware of. For more information, search for the terms “variants” and “minor entries” on the Preparing legacy SFM data page.

In a nutshell…

  •     It's bad to have peer variants (two lx's, each with a va pointing to the other lx).
  •     If you create minor entries for variants, always include an mn field. Don't put a va in the minor entry. Including sense data is fine, but maybe not subentries.
  •     For subentries, consider not creating minor entries, or else using a distinct field (such as mnse). Avoid including sense data.

In a little more detail…

Variants: In MDF, the va field is reserved for use in the main entry, and the mn field is only for use if you also create a corresponding minor entry. When two words are variants of each other, you're supposed to choose one as main and optionally create a minor entry for the other. Creating peer variants (va's under lx pointing to each other) is technically not valid MDF. (FLEx will import them but will create a blank minor entry stub for each va, rather than importing the two full entries and arbitrarily picking one to be main and the other minor.)

If you have a pair of main entries set up as “peer variants”, but one in the pair has little or no information, consider changing its va field to mn.
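To find such pairs in the first place, a short script can help. This Python sketch reports every pair of lx entries whose va fields point at each other. The function name is made up, and the sketch is deliberately simple: it treats any va in a record as belonging to the lx, and it ignores homonym numbers, so check the results by hand.

```python
def find_peer_variants(sfm_text):
    """Return sorted (a, b) pairs of headwords that list each other in va."""
    variants = {}  # headword -> set of va values found in that record
    headword = None
    for line in sfm_text.splitlines():
        if not line.startswith("\\"):
            continue
        marker, _, data = line[1:].partition(" ")
        data = data.strip()
        if marker == "lx":
            headword = data
            variants.setdefault(headword, set())
        elif marker == "va" and headword is not None:
            variants[headword].add(data)

    pairs = set()
    for head, vas in variants.items():
        for va in vas:
            if head in variants.get(va, ()):
                pairs.add(tuple(sorted((head, va))))
    return sorted(pairs)
```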

Subentries: Minor entries for variants are common. It's less common to create a minor entry whose mn field points to an se field in the main entry. But if you really want to do this, consider using a distinct field such as mnse, which you can later search for to verify that you don't have sense data or subentries in this kind of minor entry. (The sense data would get imported incorrectly, as a new sense on the main entry. The subentry would import as a subentry of a subentry, which a few people need but most should avoid.)

Importing variants of main entries (lx) or senses (sn) into FLEx: For best results, make sure that you've already done at least one of the following for each variant (doing both is fine).

(a) Don't create a minor entry. Putting a va field under the lx, se, or sn field is sufficient.

(b) Create a minor entry and be sure to include an mn field. (Putting a va field in the main entry is usually optional. It's only required if it's a variant of a specific subentry or sense.)

Use N.N for subsenses

If your lexicon has subsenses (senses nested inside other senses), it's best to number these as 1.1, 1.2, etc. rather than as 1a, 1b, etc. That's the format FLEx will expect. (MDF doesn't care because the senses are stored flat; any hierarchy is in the mind of the reader, based on the manually added sense numbers.)
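If the existing sense numbers follow the 1a, 1b pattern consistently, the renumbering can be sketched in Python as below. The function name is made up, and the sketch assumes that subsenses use a single lowercase letter after the number (a = 1, b = 2, and so on); any other sn value is left alone.

```python
import re

# An sn field whose value is digits followed by one lowercase letter.
SN_LETTERED = re.compile(r"^(\\sn )(\d+)([a-z])[ \t]*(?=\r?\n|\Z)", re.MULTILINE)


def renumber_subsenses(sfm_text):
    """Rewrite sn values such as 1a, 1b as 1.1, 1.2."""
    def to_dotted(match):
        letter_index = ord(match.group(3)) - ord("a") + 1
        return f"{match.group(1)}{match.group(2)}.{letter_index}"
    return SN_LETTERED.sub(to_dotted, sfm_text)
```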

Run Solid periodically

Once Solid has been set up and is able to validate an SFM file as having “no errors”, it's easy for the end user to periodically launch Solid (after having closed Toolbox) and have it identify any new structural errors that have cropped up since the last check. This can help the user to stop bad habits earlier rather than later.

Caveat: There are currently some significant bugs and quirks that make it difficult to safely edit in Solid. To learn how to work around those, see the section about Solid quirks under Preparing legacy SFM data.

Sharing or backing up a squeaky clean version of the project

Check it with Solid

(This section needs to be expanded.)

Consider using standard English abbreviations in the part of speech field (see FLEx's abbreviations)

(This section needs to be expanded.)

Using the same abbreviations that FLEx provides by default (when you add a standard category to the Categories list) makes the data more likely to be properly understood in the future, whether by FLEx or by humans.

Include all the project files and zip them up for safety

(This section needs to be expanded.)

Also include a document explaining how you've used and structured each marker

(This section needs to be expanded.)

Specific techniques for specific needs

  •     Working with Philippine dictionaries that use PLB SFM. There is a tentative Solid template (with draft documentation here) that can help you to validate a lexicon that is in this format, which is a very unusual one and is difficult to map to FLEx (or to MDF, for that matter). The Migrating PLB SFM to FieldWorks FLEx document (attached) describes how to go about doing that.
  •     Inserting DDP domains into SFM based on Louw and Nida word codes


Contributors to this page: admin and languist.
Page last modified on Monday December 16, 2013 11:21:16 GMT-0000 by admin.


Creative Commons License
All content on this LingTranSoft wiki by SIL International is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.