Preparing Legacy Data

You are welcome to participate in the discussion area dedicated to the topic of data conversion.

Cleaning up (Toolbox) SFM dictionaries for publication, for FLEx, or for WeSay

Usually, the bulk of the work (by far) when doing any of these things is to clean up the SFM so that it is consistent and complies with a standard (MDF, ideally). You can learn a bit more about the MDF standard as used in Toolbox on this page: SFM Lexicons in Toolbox

For a more general intro to the clean-up tool we will be using, Solid, see the Solid page. For other tools that can help with SFM cleanup, see SFM scripts and CC tables. For details on importing fields that have more than one possible parent field they can occur under, see Importing note fields and subentry fields into FLEx.
Warning: Migrating SFM dictionaries is usually a very complicated task full of quirks and pitfalls. It's much better to let an SFM migration specialist do it (or supervise it) than to tackle it alone as an 'amateur' with real data. Once the data has been converted incorrectly, it is usually very difficult to undo the mistakes that were made during the migration. It is usually necessary to redo the migration based on the original project, provided the converted project has not been edited much yet.

A sample email

Dear CUSTOMER

Yes, I would be happy to help you with your lexicon. Here's the process I typically follow:
- You right-click your Toolbox project folder and Send To a zip file, then email it to me. (Another option is something like megashares.com.) The main thing it needs to include is the lexicon, but these kinds of files are useful too, so that I can upgrade your whole project: .PRJ, .TYP, .LNG. Question: is all of your data in one lexicon file? Do you have interlinear texts as well? (Only the baseline and free translations will be preserved through the import.)
- You stop editing until I send it back.
- I do the vast majority of my work on the SFM project itself, cleaning it up and standardizing it. (Toolbox itself can't do this, so I'll use a tool called Solid for this.) A consistent lexicon file is easier to publish as is, or to import into FLEx and publish from there. I send you an updated copy of the zipped project, and you replace your copy with it. I also send you a log file explaining what I've done, and probably asking questions.
- You answer questions I'll likely have along the way. We might need to phone or Skype if anything is tricky.
- I do an initial import into FLEx and send you that project backup too.
- You download the latest version of FieldWorks (8.0.5 beta  or newer) and restore that project into it. (Ideally, you would also watch some of the tutorial videos.) You review how the data looks in FLEx and verify whether this is ok as the new "master copy". If not, we do another round of edits and another import. Worst case, we fall back to the SFM project and Toolbox (at which point I'd probably introduce you to Solid), but to justify these efforts, I would ask that you agree up front to give FLEx a fair trial.
- Someone else takes over in helping you with the publication process (using Webonary with FLEx, or perhaps Lexique Pro with Toolbox).

Anyway, that's the approach I generally use--would that work for you? Please let me know what you think, and what your timeline looks like.

Question: do you have any special characters? If so, we'll almost definitely want to make sure they are all stored in unicode, and that you end up with a keyboarding solution that uses unicode.

Serving together,

ME

Quick Reference

For those already familiar with the details, here is a cheat sheet of the steps in typical order. Cleanup comes first:

  • Get info from customer: full SFM project, agreement to not edit, ISO language code, list of special characters and custom markers, preferred output (root- or stem-based) and any existing sample "printout".
  • Auto-analyze the file (SFMTools.py). Share relevant info with customer and gauge their interest in gory details (and even Skype screen sharing).
  • Use Solid to get to know the data file and look for problems. Consider tweaking the .solid settings various ways just to learn, then deleting the .solid file and starting fresh.
  • Start keeping backups and a log (your actions separated by backup numbers; questions for customers), and cleaning up obvious data errors.
  • Convert to unicode (now or later). If the user will likely use Toolbox for any further editing, update their project's language (WS) settings to reflect this. Avoid mixed encodings!
  • In a bogus "aaa" record at the top, document anything unusual, and all custom fields.
  • Remove trailing spaces.
  • Unwrap hard-wrapped lines.
  • Create an empty FLEx project with WSes and custom fields set up. Also create a plain "variant" type to map \va to.
  • Using the auto-report, start making \ps field values consistent, typically with the customer's help. Either make those values match the FLEx category abbr. or vice versa.
  • If acceptable, eliminate \pn from the file (or at least ignore it). It's usually redundant. Safest option: replace "\ps verb\n\pn verbo" with "\ps verb", etc., then manually review any that potentially disagree.
  • Back up the blank project, do a quick and dirty test import, and check for problems. (E.g. FLEx can't import \ln fields properly.) Do NOT invest time in the .map file (you should even avoid saving it).
  • Make Solid's settings very tight, then loosen them where the data needs it and you're sure it won't hinder publication/migration. Scrutinize any marker that you've told Solid to allow under multiple parents.
  • LOUDLY log any temporary loosening of settings, when hiding minor problems to deal with major ones first.
  • Start replacing non-standard markers with MDF markers, if acceptable to user.
  • Start checking links. (Run/review the auto-analysis script, or else dry-run-import into FLEx and check residue.) Ideally, fix these in the SFM now. Otherwise, put it off until totally committed to a *final* FLEx import.
  • Go back through the log, fixing any unfixed data or unreverted Solid settings.
  • Once there are "no errors" found in Solid, make all inferred markers real.

If migrating the data to FLEx (i.e. not just cleaning up for publication), do these things too:

  • Optional: get user to commit to try out switching over to FLEx.
  • Make FLEx-specific tweaks that may not be beautiful MDF. (Split semicolon-separated \re fields into multiple fields, etc.)
  • Finalize the complete set of markers, then start seriously setting up the mappings for FLEx import (.map file).

Getting Started

It is very important to get as much information as possible from the original lexicographer explaining what s/he meant by each of the fields and what its intended position(s) in the hierarchy should be. If you can't get a nice chart from that person, you can try creating one yourself. (Indicate each field's full name, and the target field and writing system to migrate it to. Ideally, you could enter this info directly into Solid; feature request: 1085. ) There may be clues from the database's type (.typ) file, which Toolbox can display, although there's no guarantee that the lexicographer was actually following that system (since Toolbox doesn't enforce or even check this). 

Looking through a lot of records and absorbing their patterns can go a long way–highly recommended! It's particularly important to note how variants, and perhaps complex forms (subentries), are being handled. Toolbox itself can be helpful at this stage, since it can display the .typ file's info and can search all fields, all fields of a given writing system, or just a specific field. (There is also a trick for removing hard-wrapping of lines when saving the file to disk--see the Toolbox page above.) And its Browse view lets you zero in on a particular field and see how it's used in various contexts. Solid can do this too. Without Toolbox or Solid, you can do some of this in Linux using a command such as:

egrep "^.lf " datafile.sfm | sort | uniq -c > lf.lst
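
If you're not on Linux, a short Python sketch can produce a similar list of a marker's distinct values with counts. This is just an illustration, not a standard tool; pass the file name and the marker as arguments, and adjust the encoding if the data isn't in unicode yet.

import collections, sys

# Sketch: list the distinct values of one marker (e.g. lf) with counts,
# similar to the egrep | sort | uniq -c pipeline above.
filename, marker = sys.argv[1], sys.argv[2]   # e.g. datafile.sfm lf
values = collections.Counter()
with open(filename, encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.rstrip("\r\n").split(" ", 1)
        if parts[0] == "\\" + marker:
            values[parts[1].strip() if len(parts) > 1 else ""] += 1
for value, count in sorted(values.items()):
    print(count, value)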

Things to check (e.g. using Solid or regular expressions) and ask about (focusing here on standard MDF):

  • What standard was the lexicographer trying to follow? (See the .typ file.) What did he or she understand the hierarchy of that standard to be? It can be important to understand this before going about “fixing” problems.
  • Are any structural markers ever omitted? (People tend to omit sn and rf, and sometimes ps.) Note that pn is not officially a valid stand-in for ps; it is intended to be a child of ps and not a parent of anything else.
  • Does sn consistently appear below ps, or always above it, or is it inconsistent?
  • Are multiple values ever included in a single field (typically separated using a semicolon)? These usually need to be split into multiple fields (e.g. multiple re or sy fields, etc.; multiple ge or de are more ambiguous).
  • Are any fields ambiguous due to omissions? For example, if the file ever includes multiple glosses (ge) for a single sense, and that same file ever omits the sn marker, then two ge fields in a row might represent either one sense or two senses.
  • Are any fields ambiguous as to which field is their parent? For example, fields that apply to the whole entry (“entry-level” fields) may occur toward the end of the record. This is fine with a field like dt, assuming it never applies to a specific se (“subentry-level”) or sn ("sense-level"), but a field like va could be ambiguous. Does it describe a variant of the root lx or of the last sense of the last subentry? Be cautious about using Solid's “move up” quick fix until you know the answer to this question. Note: the field with the most levels in MDF is nt: it can occur under any of these: lx se sn rf.
  • Are link fields used very extensively? Note that links to specific homographs or specific senses need to include one or two numbers. E.g. "\sy blah3 2" means "sense two of homograph 3 of blah is my synonym". (Omitting the numbers from sy will probably result in an extra new homograph, since the importer can't know which target was intended.)
  • Does hm ever occur more than once per lx? It shouldn't, but some people do this, in which case you'll need to copy lx down above each hm, resulting in additional new entries.
  • Does hm ever occur right after se (subentry)? It shouldn't; instead of an hm field, the format in that case is to directly number the se field: "\se blahing2".
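
Several of these checks can be scripted. Below is a rough Python sketch, not a standard tool, that flags just two of them: records with more than one hm, and link fields whose target doesn't match any lx or se in the file. The list of link markers is an assumption; adjust it (and the homograph/sense-number stripping) to your data.

import re, sys

# Sketch: flag records with more than one \hm, and link targets that match no \lx or \se.
link_markers = ("sy", "cf", "mn", "an")        # assumption: adjust to the link fields in your data
records, current = [], []
for line in open(sys.argv[1], encoding="utf-8", errors="replace"):
    if line.startswith("\\lx ") and current:   # a new record starts; store the previous one
        records.append(current)
        current = []
    if line.startswith("\\"):
        current.append(line.rstrip("\r\n"))
if current:
    records.append(current)

headwords = set()                              # every \lx and \se form is a possible link target
for rec in records:
    for field in rec:
        parts = field.split(" ", 1)
        if parts[0] in ("\\lx", "\\se") and len(parts) > 1:
            headwords.add(parts[1].strip())

link_re = re.compile(r"\\(" + "|".join(link_markers) + r") +(.+)")
for rec in records:
    lx = rec[0].split(" ", 1)[1].strip() if " " in rec[0] else rec[0]
    if sum(1 for field in rec if field.split(" ", 1)[0] == "\\hm") > 1:
        print("multiple \\hm in record:", lx)
    for field in rec:
        m = link_re.match(field)
        if m:
            # strip trailing homograph/sense numbers such as "blah3 2" before looking it up
            target = re.sub(r"\d*( \d+)?$", "", m.group(2)).strip()
            if target and target not in headwords:
                print("link target not found:", m.group(2), "(in entry", lx + ")")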

Note that checking for problems and fixing those problems sometimes require very different tools. For example, Solid is able to detect many structural errors, but only the more common ones can be fixed using the built-in Quick Fix tools. Also, some errors don't actually have to be fixed. For errors that do need to be fixed and are too numerous to fix manually from within Solid, one typical approach is to identify the pattern using Solid, save, use a script or a regular expression to do the actual fixing, then reload in Solid. Notepad++ is a light and free editor that supports regular expressions pretty well. For full regex support (e.g. DOTALL), something like Eclipse is even better. Also, here's a Python regex tool (runs from the command line or a batch file): Running a saved set of find/replace operations (CC or regex). Once an SFM file is consistent and compliant, tasks such as importing into FLEx become much, much simpler.
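
If you don't have that tool at hand, the basic idea is easy to sketch in a few lines of Python; the operations and file names below are just placeholders, not a real saved set.

import re

# Sketch: apply a saved list of regex find/replace operations to an SFM file.
operations = [                                   # placeholder (find, replace) pairs
    (r"[ \t]+(\r?\n)", r"\1"),                   # strip trailing spaces
    (r"\\ps adjective\b", r"\\ps adj"),          # example of standardizing a value
]
with open("dict.txt", encoding="utf-8", newline="") as f:    # back up the file first!
    text = f.read()
for find, replace in operations:
    text = re.sub(find, replace, text)
with open("dict-fixed.txt", "w", encoding="utf-8", newline="") as f:
    f.write(text)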

Keeping a log

This may sound like overkill, but with such a complex and potentially drawn-out process, it can be very helpful to keep track of what you did (and are planning to do), in order. Include lines indicating when a particular backup was made. This way, if you need to revert to a backup, you'll know exactly what needs to be redone afterwards.

The log is also a great place to put questions to ask the lexicographer, and as optional reading in case s/he wants to know exactly what was done along the way. (Since much of the process involves cleaning up the original SFM file, it could be important for the lexicographer to understand those steps and value the cleaned-up SFM file, especially if s/he ultimately decides to revert to the SFM project rather than using the migrated project.)

Decision: Migrate the SFM using Solid or FLEx?

If you are using Solid to check the SFM file anyway (recommended), then it may theoretically be less effort to export to LIFT XML directly from Solid, especially if you've needed to deal with various fields that aren't standard MDF. If you've gotten it down to “no errors” in Solid and have mapped all the fields to LIFT fields, then Solid understands the whole structure better than FLEx ever will and can, in theory, precisely generate an unambiguous LIFT XML file. This file could be directly used by WeSay, or imported into FLEx with almost no configuration effort (whereas the SFM importer requires lots of configuration).

Another reason to prefer this route is that the SFM importer for FLEx currently may incorrectly copy or move ps field values into an adjacent sense unless you provide a non-empty ps for every sense. (This problem does not occur for entries containing only one sense and no subentries.) Likewise, for some fields it can only import into a single level in the hierarchy (LT-10905), so a field like nt must apply only to senses or only to entries or only to subentries. (Typical workaround: manually, or with a script, split out nt_lx and nt_se, and maybe nt_rf. See: Importing notes and subentry fields into FLEx.) Also, if you have added GUIDs to your SFM file in hopes of being able to merge SFM data with the newer formats, your only hope (a slim one) is to use Solid to export it to LIFT; FLEx cannot import GUIDs from SFM.

In the current reality, however (Dec 2013), it is almost always better to just work around the above limitations and use the more-tested FLEx SFM importer, even if your ultimate goal is a LIFT file for WeSay. (It's trivially easy to export to LIFT from FLEx.) Here are some reasons:

  • Inline formatting. If you've used inline formatting to bold or italicize portions of your SFM fields, you can tell the FLEx importer how to interpret these.
  • Links to specific homographs and senses. If your link fields (e.g. sy) ever refer to something more specific than a non-homograph lx field, the FLEx importer will attempt to handle those. (But see LT-10733 for serious gotchas; basically, you need the lx and se (esp. se) homographs numbered in ascending order in exactly the order they appear in the file.) The format for referring to the second sense of the third homograph of “bank” is as follows: \sy bank3 2 (See LT-12151 for more info on link syntax.)
  • Subentries (bug 1083). Solid's LIFT export currently underspecifies complex forms when it creates them based on SFM subentries (se), which basically means that they export as implicit variants.
  • MDF/FLEx fields that Solid doesn't know about yet. Some fields such as lt and mn cannot yet be mapped to a standard field using Solid.
  • Other fields such as rf and hm. Certain fields that Solid does know about may still fail to export properly from Solid's LIFT export.
  • LIFT has mostly focused on the more basic fields thus far, and many other important fields aren't handled compatibly between Solid and FLEx. LT-13767

If none of those issues apply to you, you may be able to just use Solid's LIFT export feature. If WeSay happily displays the fields it understands, that may be enough. But if you don't first import it into FLEx to check all of the fields, you may be in for a bad surprise in the future (e.g. the subentry/variant problem above).

Update: It sounds like Lexique Pro may also be able to export SFM dictionaries to LIFT. If you have tried doing this with real data (especially data that includes subentries and variants), please share your results. -Jon Coombs

Cleaning up an SFM dictionary using Solid

Whichever option you choose for the actual migration, it's a very good idea to use Solid to do most of the checking work as you clean up the file.

In order to just do checking in Solid, this is really the only goal:

  • You have defined a realistic hierarchy and Solid has announced that the dictionary file matches that hierarchy with no structural errors. (It's also nice, but optional, to set all fields to “unicode” and get no “upper ASCII” errors.)

To use Solid to actually export to LIFT, you'll also need to make sure that:

  • Every field is assigned to the correct writing system and set to “unicode”.
  • There are no “upper ASCII” errors.
  • Every field is assigned to a corresponding LIFT field at a corresponding level in the hierarchies.

Again, you don't need to worry about these if you'll be using the FLEx SFM importer. However, you do still need to be sure that your structure isn't too loose and hasn't deviated too far from MDF and FLEx's own structure.

If using the FLEx importer, prepare the ps field carefully to occur once per sense

If you will be importing using FLEx's SFM importer, be aware that it expects the standard MDF relationship between ps and sn but cannot accurately handle multiple sn below ps. Below, the “num” part of speech will be interpreted as part of sense “two” (although it will then be "copied down" into sense "three" as well):

Standard MDF (but showing hierarchy as indentation)
\lx ababa
  \ps
    \sn
      \de one
    \sn
      \de two
  \ps num
    \sn
      \de three

Since FLEx thinks in terms of field bundles rather than field hierarchy, ps and sn are simply siblings, and then there's some hard-coding that copies ps down to senses that have no ps field (or an empty ps field):

FLEx's misinterpretation:
\lx ababa
  sense 1:
    \ps
    \sn
    \de one
  sense 2:
    \sn
    \de two
    \ps num
  sense 3:
    \ps num  <-- copied down
    \sn
    \de three

 

To reliably import ps, there must instead be exactly one non-empty ps for every sense. (LT-9353 LT-10739) Note: the exceptions are: an entry with only one sense; an entry with multiple senses that all have no ps (e.g. affixes aren't words and thus might not have a ps). To work around this limitation you will need to choose one of these approaches:

1a. Target standard MDF (ps above sn, recommended). Use the standard structure but use a script to redundantly copy ps so that each and every sn has a ps above it (leaving no sibling sn fields). Tell Solid that sn can only occur once per ps. (Tell the FLEx importer that only ps can begin a sense.)

Standard non-concise MDF (showing hierarchy as indentation)
\lx ababa
  \ps unknown
    \sn
      \de one
  \ps unknown
    \sn
      \de two
  \ps num
    \sn
      \de three

If Solid finds very many cases of multiple sn per ps, you'll probably need a script or a cc table to copy ps down above its additional children. (Simply allowing sn to infer its parent ps would work in the example above to produce empty ps fields, but it can't supply a ps with a value.)
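
Here's a rough Python sketch of such a copy-down script. It assumes standard MDF order (ps before its sn fields), resets at each lx and se, and doesn't try to be clever; run it on a backup and diff the result before trusting it.

import sys

# Sketch: make sure every \sn has a \ps directly above it by copying the most recent \ps down.
lines = open(sys.argv[1], encoding="utf-8", newline="").readlines()
out, last_ps, prev_marker = [], None, ""
for line in lines:
    marker = line.split(" ", 1)[0].strip() if line.startswith("\\") else ""
    if marker in ("\\lx", "\\se"):
        last_ps = None                      # don't carry \ps across entries or subentries
    if marker == "\\ps":
        last_ps = line
    if marker == "\\sn" and prev_marker != "\\ps" and last_ps is not None:
        out.append(last_ps)                 # copy the previous \ps down above this \sn
    out.append(line)
    if marker:
        prev_marker = marker
with open(sys.argv[1] + ".out", "w", encoding="utf-8", newline="") as f:
    f.write("".join(out))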

1b. Target “FLEx-structured MDF” (sn above ps, an inverted structure which is not recommended unless the data is already this way). Replace all empty ps with some bogus value (e.g. below, this prevents “num” from leaking into sense “two”). Tell the FLEx importer that only sn can begin a sense. (Be sure to back up the dictionary before inverting the order of all of its ps and sn fields, since this is a difficult process to reverse and produces non-standard MDF, which the MDFormatter might not understand.)

Modified MDF (showing hierarchy as indentation)
\lx ababa
  \sn
    \ps zz
    \de one
  \sn
    \ps zz
    \de two
  \sn
    \ps num
    \de three

Note that the relative order of ps and sn is the main difference between the “MDF Unicode” template in Solid and the “FLEx-Friendly MDF Unicode” template (which ended up being a misnomer with regard to ps, despite its mimicry of FLEx's structure). But the latter has also been tightened up considerably in other ways, to block things that are allowed in standard MDF but cannot be easily imported into FLEx.

2. Once you have one ps per sense, make an inventory of all the values that occur in ps, and standardize them. Ideally, make each one match the abbreviation of a corresponding category in FLEx (in the Grammar area). You'll probably need to add the categories in FLEx before running the import. Remember to back up the empty container before you do the "dry run" import!

Note: alternately, you could import ps into a custom field and use filtering plus bulk edit to fill in Category based on that. But that's more work, and be aware that if any occurrences of ps are empty, you should probably rename ps to something the importer can't possibly recognize, such as xpxs, to avoid the hard-coded "copy down" feature.

Check the structure loosely, then tightly

The initial focus when using Solid to check a file should be purely structural: make sure the data's hierarchy matches the standard MDF hierarchy (or perhaps the FLEx-Friendly one) as closely as possible.

Warning: Targeting the “MDF alternate hierarchy” is usually a bad idea. Very few MDF dictionaries actually use this structure, and it won't import properly into FLEx because FLEx's SFM importer currently cannot import subentries of senses.

It's nearly always best to target “MDF Unicode” initially. Once you have achieved “no errors”, you can try switching to the stricter “FLEx-friendly MDF Unicode”, especially if choosing structure #1b above.

Reported “errors” can be dealt with either by changing the data (recommended), or by relaxing the rules (often okay). To know whether it's “okay”, you need to know what will impact migration and what won't. For example, \bw will import at the Entry level even if it's been incorrectly interspersed among sense-level fields. However, even though in standard MDF the nt field can correctly be placed at various levels, FLEx can only import a particular note field at one particular level (LT-10905). See Importing notes and subentry fields into FLEx.

Make inferred markers real

Run the “make inferred markers real” Quick Fix to insert missing fields like sn, rf, etc. But first make sure no errors are showing. (Otherwise, Solid may infer things in the wrong places.) This is a wonderful feature in Solid which can make FLEx much less likely to misinterpret the data (more on this below).

More tips
  • Be strict! For each marker, allow as few “parent” markers as possible. Don't allow a field to repeat unless it needs to, and if it does, make those occur “together”. (Child fields are ignored, so two rf fields which each have an xv child are “together”. An intervening de would be strange at best.)
  • Avoid using “infer parent marker” except where absolutely necessary and safe. It can be quite useful for inferring the root of a sense (sn) or example (rf), but in most cases you will get better results if you don't infer. (Inferring one field temporarily is sometimes helpful for clearing out distractions while investigating other fields and their problems.)
  • It is almost always a bad idea to allow "multiple with intervening markers". E.g. if two reversal (re) fields are separated by something else (say, a definition or example), that suggests that a separate sense may have been intended. Solid can help you or the lexicographer wade through those and fix them manually.
  • Quick Fixes are risky! They can be very helpful for moving fields to a different position in the hierarchy, or for converting the implied markers into actual markers in the file. But be extremely careful with these, and try to only use them after everything else is validating with no structural errors. And always have a backup before you apply them.
  • If you want to review what a quick fix (or script) just did to the data file, you can use a diff tool (such as WinMerge Portable from portableapps.com) to compare it to the backup you made just before running the fix/script. Note that the very first time Solid saves your file, it may make some changes such as standardizing the newline character(s). You'll probably also want to remove trailing spaces and remove all hard-wrapped lines up front, and save this as your starting point, so that you won't have to wade through these trivial changes when reviewing other changes in the diff tool.
  • A field at the very end of a subentry is usually ambiguous, especially to FLEx's SFM importer. It might apply to the last sense of the last subentry, the whole last subentry, or like date (dt), it might apply to the whole entry. If necessary, you can use the Move Up quick fix to move a field up so that FLEx can't misinterpret it. (See comments on 454.)
  • There is no “Save As” feature in Solid yet, and there are two files involved (e.g. dict.txt and dict.solid; cf. 884) whenever you save. So, one convenient practice is to copy/paste those two files before each major edit and note the backup number in your migration log. E.g. files for “made backup 2” might look like this: “dict - Copy (2).txt”, “dict - Copy (2).solid”. With TeraCopy installed, the filenames are similar but nicer (dict_2.txt, dict_2.solid). The best solution is to use a version control tool such as Mercurial (TortoiseHg) or Git.
  • The top-left pane shows all field markers that are currently in use, and a count of how many times each is used (cf. 159 and 325). Sort by the Count column and consider eliminating the lowest-frequency fields. (It's likely that some of these are typos.) Fewer fields means fewer things to configure for publication/migration and less to check afterwards.
  • Note that FLEx's SFM importer also lists all fields and their counts. In Toolbox, you can see all defined markers for the database type by going to Database, Properties; any bolded marker is being used somewhere in the current dictionary file. In the Linux command line, you can type the following to get a list of markers as a file (sfm.lst):
cat datafile.sfm | sed -e "s/ .*$//" | sort | uniq -c > sfm.lst
Some bugs and quirks

Solid is an essential niche product for checking SFM structure during the cleanup process, but it's not highly polished since only a few techies use it much. Those techies have to know how to work around some issues:

  • 572 : Must use lx. If you're targeting a structure whose record marker is not lx you'll need to temporarily replace those all with lx, since Solid is hard-coded to that. You can change it back later.
  • 1082 : I don't recommend the “push ps down” quick fix, especially not if you have any subentries. (There might be other quirks with this quick fix too.) Consider the other options mentioned above concerning a required ps for every sn. This also means that the FLEx-Friendly template works better once you put ps above sn again.
  • The quick fixes are generally hard-coded to MDF markers, or else you have to manually tell them which fields to target. That is, if you have something else such as \ms mapped to Sense instead of \sn, some quick fixes won't work, and the ones that do will require you to specify ms.
  • 494 : Structural errors and non-unicode fields are both colored red (except for this issue: 256). This can make it hard to figure out the cause of a structural problem. TIP: if the indentation for a red field looks correct and doesn't mess up the indentation of fields below it, there's a good chance it's just marked red because of “upper ASCII” characters in it. Be aware that “contains bad unicode data” is a true encoding error, whereas “contains upper ascii” is just a warning that the data is not yet in unicode (cf 552).
  • 141 : Sharing a .solid file across computers is slightly incomplete. (The writing system labels come along, but without their settings.)

Migrating to LIFT using Solid (experimental)

  • Make sure the file does not contain any subentries. If it does, abort! Or, read this first to learn how to modify the exported file so FLEx will see these as complex forms rather than as variants (1083)
  • Make sure there is no data in \rf or \hm that you wish to keep, or else map them to custom fields. (Are any other fields silently dropped? \sn perhaps?)
  • Make sure there are no minor entries (entries containing an \mn field) in the file. If there are, I believe the solution for exporting them as variants is to remove them and instead insert a \va field into the main entry.
  • If the file contains standard fields that exist in MDF and FLEx but are not yet mappable in Solid to standard fields, consider importing using FLEx instead: \a (allomorph) \lt \st \sd. Or, map them to custom fields and use bulk edit later on in FLEx. (But note that semantic domains in FLEx cannot be bulk edited.)
  • Make sure there are no “upper ASCII” errors. (You may actually be able to export non-unicode data from Solid to LIFT, but that's probably a bad idea, since both WeSay and FLEx expect unicode.)
  • Make sure the whole lexicon validates with “no errors” and contains no non-unicode data.
  • Make sure every field is assigned to the correct writing system and set to “unicode”. TIP: If you started with one of the provided Solid templates, you'll probably need to replace the pseudo writing systems (vern, nat, reg) with actual writing systems. Use the Change Writing Systems button to do this.
  • Make sure every field is assigned to a corresponding LIFT field at a corresponding level in the hierarchies. You'll probably need to assign some fields to “Custom”–this is even true of some fields that are standard in MDF (and the FLEx importer) but which Solid doesn't really know about (lt, mn, etc)
  • Any fields you do not care to export still need to be mapped. Just map them to “Ignore”. This is a safety feature–you are deliberately choosing to lose data.
  • Press the Export button and hit Save.
  • Verify the results! It usually only really works to do this by importing that LIFT data into a blank FLEx project. (Verifying it with WeSay alone is usually insufficient, unless WeSay is able to fully display all of your fields properly. This is currently not the case with a large number of MDF markers including va, se, and most link fields.)

Migrating to FLEx

Before getting started, please read the document provided in the FLEx Help menu under Resources, “Technical Notes on SFM Database Import.doc”. This document explains a lot about the capabilities and limitations of the FLEx SFM importer.

Is it perfect MDF already?

Again, cleaning up the SFM file itself, targeting MDF, is the best starting point. That's basically step zero when importing, although it's also possible to target some other SFM structure and to then manually reconfigure the FLEx importer (or the Solid exporter) to understand that different structure.

Is the data all in Unicode already?

If not, you need to either convert it before importing, or provide the importer with the encoding conversion maps for any fields that are still in a legacy encoding. (These can be applied during import, but you need to have them on hand.)

Are there inline character formatting codes in this data?

Make sure you know what these mean (noting which fields they occur in can help) so you can tell the importer how to treat them. The normal use case for inline formatting is when embedding publishable data from a different writing system into a field. Cobuild-style definitions make heavy use of this approach, by embedding into the analysis-language definition the vernacular word/phrase that's being defined. 

Warning: when using regular expressions to work with inline formatting, be sure to do non-greedy matching. For example, if I had vernacular text and italicized text in these two formats…

I have v%a word@ and i%some thing@

…and wanted to switch to these formats…

I have %a word%* and @some thing@*

…then I would want to use the following regex to handle the v% cases. (The ? makes it non-greedy. And like many other situations, you also do *not* want to let the dot “match newlines” here. If there are hard-wrapped fields with formatting spanning line breaks, then those should first be de-wrapped.)

v%(.+?)@
%\1%\*

Without that question mark, it would greedily match all the way to the end, giving a garbled result instead:

I have %a word@ and i%some thing%*
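
For what it's worth, here is the same pair of substitutions as a throwaway Python sketch, just to make the greedy/non-greedy difference concrete:

import re

text = "I have v%a word@ and i%some thing@"
print(re.sub(r"v%(.+?)@", r"%\1%*", text))   # non-greedy: I have %a word%* and i%some thing@
print(re.sub(r"v%(.+)@", r"%\1%*", text))    # greedy: I have %a word@ and i%some thing%*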

TIPS:

  • For maximum easy flexibility, use XML-style tags that Lexique Pro automatically recognizes, such as bold, italic, and underline.
  • Create new styles in FLEx and map the inline codes to those styles. This way you can specifically format those without impacting everything else in FLEx's Dictionary view that might be using, say, the "Emphasized text" style.
Assess part of speech, and other fixed-content fields

Certain fields tend to have “list” content in them. As the Toolbox page mentions, these can be set up with range sets to ensure consistency, and you can use values that correspond verbatim to values in FLEx (e.g. the abbreviations used for parts of speech). This may be the best option if there are many values in the list. (Ideally, you just do this if the content is “clean” and you have prepared the matching list in FLEx to receive this data. Otherwise, the lists will be messy and need to be massaged later.)

Along with ps, some other fixed-content fields to check in this way include: pn, lf, sd. Keep an eye out for others in the data you are working with.

One useful find/replace trick (using a good text editor) is to temporarily use a code that wasn't already in the file, such as zzps or |ps as you deal with every list option. You could:

Replace all

\\ps adj.$

with

\\zzps adj

then replace all

\\ps adjective$

with

\\zzps adj

and so on until you run out of \ps fields, at which point you'd replace all

\\zzps\b

back to

\\ps

But be wary of glibly assuming that similar-looking categories are necessarily typos for the same thing. Verify with the linguist beforehand if possible, or at least keep a detailed log.
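
If the list of spellings is long, the same normalization can be scripted with an explicit mapping instead of one replace at a time. The sketch below is only an illustration; the mapping shown is invented, so build yours from the real inventory and review the unmapped values it reports.

import sys

# Sketch: normalize \ps values against an agreed mapping and report anything unmapped.
mapping = {"adj.": "adj", "adjective": "adj", "verb": "v"}   # invented example values
out, unmapped = [], set()
with open(sys.argv[1], encoding="utf-8", newline="") as f:
    for line in f:
        if line.startswith("\\ps "):
            value = line[4:].strip()
            ending = line[len(line.rstrip("\r\n")):]         # preserve the original line ending
            if value in mapping:
                line = "\\ps " + mapping[value] + ending
            elif value:
                unmapped.add(value)
        out.append(line)
with open(sys.argv[1] + ".out", "w", encoding="utf-8", newline="") as f:
    f.write("".join(out))
print("unmapped \\ps values:", sorted(unmapped))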

Eliminate or ignore redundant part of speech translations (pn)

See the Toolbox page to see how to do this. In a nutshell, pn is redundant with ps and more trouble than it's worth to maintain it in the editable SFM file. It's generally best to get rid of it and only generate it as needed for publications.

Decide how to handle citation forms (lc)

Ideally, you would import the lc field directly into Citation Form. If, however, you are primarily using the lx form when linking from one field to another, then you'll need to temporarily import lc into a custom field so that it won't block those links. (The importer assumes that lc is the form you've linked to, if it exists.) Then, after the import, you should bulk copy into Citation Form and then delete the custom field. Warning: just remapping may not be enough; to avoid hard-coding that recognizes "lc", you should also rename the field marker to something that doesn't begin with "lc"; e.g. "cit" should work well.

Handling reversals, and avoiding redundancy

In MDF, a non-empty \ge field is used for reversals when \re is missing or empty, and \re * is used to block the use of \ge in that way. Quote from the MDF manual:

“this field is used only if the gloss form in the \ge field is not suitable for a reversal… If an asterisk is placed in this field \re *, then the relevant entry, subentry, or sense will not be included in the reversed finderlist (e.g. good for limiting access to taboo words).”

In FLEx, when dealing with data that has not yet been through the publication process, it's usually best to only copy brief (perhaps only single-word) glosses whenever Reversals is blank.

If an MDF lexicon has been carefully prepared for publication in the past, then all non-blocked glosses should be valid reversals, so after migrating to FLEx you should be able to…

  • Confidently Bulk copy from Gloss into Reversal in ANY sense where the latter is blank. And then…
  • Bulk delete any asterisks found in Reversal, so FLEx won't try to publish a reversal entry for the asterisk "word".

However, you might consider preserving this “block reversals = true” information in a custom sense-level field, for filtering during future bulk copying, especially if the user's workflow will be to add/edit new glosses and then periodically bulk copy them into Reversals rather than manually keeping them in sync at all times. So, the most efficient option may be to replace all “\re *” with “\blockre *” in the SFM prior to import, bringing those asterisks into a custom sense-level field in FLEx. (Ditto for \blockrn.)

AVOIDING REDUNDANCY: Given that it's extra hassle to modify the same information identically in Gloss, Definition, and Reversal, consider filling in only the Gloss field whenever the three are identical, leaving the others blank (i.e. delaying all the bulk copying above) until publication time. (This simulates the convenience of MDF's gloss field in the safe editing environment of FLEx.) Gloss is the best field to not leave blank, because all Find dialogs, and the interlinear, rely heavily on it. (You may be able to similarly avoid a redundant copy in Word Gloss in interlinear, by avoiding approving parser-guessed analyses that are correct and don't need disambiguation. In fact, I hide and ignore the Word Gloss field altogether.)

See comments under https://jira.sil.org/browse/LT-14341.

Handling variants

See SFM Lexicons in Toolbox for additional information. It's best if the lexicographer has been doing things properly and also considering FLEx, since rearranging these with scripts is very hard to automate reliably. But in a nutshell:

MDF doesn't specify the type of a given variant (dialectal, free, etc.). So, when importing into FLEx, you may want to create a generic variant type (perhaps named “variant” and labeled ”??”) and import them all as that. The linguist can make things more specific later in FLEx.

  • It's bad to have peer variants (two lx's, each with a va pointing to the other lx). For example, this will prevent their subentries from being publishable.
  • If there are minor entries, they should always include an mn field (or similar) to clearly identify them, and should never have a va field.
  • Variants of subentries (or of senses) now import correctly (as of FW 7.2.7). Thus, it is no longer necessary to split out distinct vase and vasn fields (LT-13792), although that can still be helpful if you want to do more precise checking using Solid. The corollary to this fix is a new requirement: that any of those fields found after any sense or subentry fields will be linked accordingly. To force them to always link to the main entry you now have to move them up above the first sense of the main entry. (As with fields like et, you can first flag specific cases as vase and vasn, and then use the Move Up quick fix in Solid to move all plain va fields up directly under lx.)
  • Subentries of variants import correctly but their sense data cannot currently be published by FLEx in root-based mode (LT-14537).
  • Subentries which are themselves variants are actually ok; e.g. nombasa below. Just try to avoid letting their roots (basa) be variants (LT-14537), and avoid putting an mn field on the subentry (FLEx always applies mn to the whole entry). Here is an example. Notice that, below, cf is used twice (instead of one va and one mn), so basa will *not* import as a minor entry representing a variant of baca. Also, “se nombasa” doesn't have an mn below it.
\lx baca
\cf basa
\se nombaca
\va nombasa
\ps v
\sn
\ge read

\lx basa
\cf baca
\se nombasa
\ps v
\sn
\ge read
\rf
\xv Roaku nombasa sura, bara naria kareba belo.
\xe My friend is reading a letter; maybe there is good news.

Note that the “normal” way of handling this would be as follows, but it would publish differently.

\lx baca
\va basa
\se nombaca
\va nombasa
\ps v
\sn
\ge read

\lx basa
\mn baca

\lx nombasa
\mn nombaca
\ps v
\sn
\ge read
\rf
\xv Roaku nombasa sura, bara naria kareba belo.
\xe My friend is reading a letter; maybe there is good news.
Handling minor entries

A minor entry is a small entry that contains little or nothing beyond a link to the main entry it is pointing to; that link (an mn field) identifies that entry as being minor rather than main. In MDF, and in FLEx import, the only way to create a subentry is with a subentry field (se) under the root's lx entry, but you can additionally provide a minor entry to help find it. With variants, however, FLEx import allows you to create a variant in three ways: with a minor entry alone, with a va field alone (under the root's lx entry), or with both. Thus, minor entries are always assumed to be for variants unless a matching se is found.

FLEx tries to identify minor entries and to treat them as (usually redundant) extensions of their main entry. For variants, the main entry can create a whole variant entry by including a single va field, but then any data specific to that variant must be specified in its minor entry. For subentries, conversely, all significant data should be entered under se under the main entry, and the minor entry should be kept as bare as possible. The key with either kind of minor entry is that the main entry pointed to (via mn) must actually exist. The wordform in the mn field should match in one of two ways in order for FLEx to interpret the minor entry correctly.

  • (1) Its mn matches a main lx or se and its lx optionally matches a va field in that entry (under lx, se, or sn). The minor entry will become a variant entry in FLEx (linked to the appropriate entry/subentry/sense). Any sense info in the minor entry (“far apart from” below) is simply imported into the variant entry. Here's an example of a "navaraka" variant of a "naavaraka" subentry, defined correctly at both ends. (I would usually make the va and mn be explicitly vase and mnva.)
\lx ava
\ps
\sn
\ge
\se naavaraka
\va navaraka
\ps adj
\sn
\ge to be far apart from
\re apart, far

\lx navaraka
\mn naavaraka
\ps adj
\sn
\ge far apart from
  • (2) Its mn matches the lx field of the targeted main entry AND its lx matches an se (subentry) under that target entry. The minor entry will be discarded (and thus no spurious homograph will be created), and any of its sense info will be silently added as a new sense of the subentry. Here's an example that meshes properly with the example above. (I would usually make the mn be explicitly mnse).
\lx naavaraka
\mn ava

Warning: If something more generic, such as cf, is used to link back to the main entry, then there's no way for FLEx to distinguish that this is in fact a minor entry, so it will be seen as a main entry (a homograph). If minor entries are intended, then this format is incorrect:

\lx navaraka
\cf naavaraka
\ps adj
\sn
\ge far apart from

\lx naavaraka
\cf ava

Warning: In a root-based dictionary with variants of subentries (e.g. navaraka above), a minor entry's mn field will often point (a) to the root lx (e.g. mn ava, so the reader will know where to look alphabetically) rather than (b) to the specific subentry (e.g. mn naavaraka, which is what the FLEx importer needs). Any cases of (a) should be changed to (b) prior to FLEx import (otherwise they'll all be listed as variants of those lx roots). This will make the SFM more precise (though less 'friendly' on paper).

Since the two main behaviors of mn above are quite different, it's a good idea to differentiate those minor entries using fields such as mnva and mnse, and in the latter case to either (a) remove all their sense info, or (b) remove them from the file entirely (you might save them in a separate file just in case). This is tedious if there are more than a few, unless you can run a script to do it. (An 'ideal' or 'overkill' solution for mnse would be to first sync/merge the contents of any unique sense fields over to the main entry and then in the minor entry remove them, or replace them with import-blocked fields such as gex, dex etc. But ask first–the user may not mind if you simply delete those fields, or the whole minor entry, if you can verify that it only has a summary definition and maybe a part of speech. That is, it might be safe to assume that “far apart from” above is basically redundant, and just delete or block it.)

After import, each complex form's “Show Minor Entry” box will be ticked regardless of whether a minor entry existed in the SFM or not. So, it is generally safe to delete a minor entry prior to an import, unless it has sense info unique to it, or unless you want to use a more precise technique (below).

Minor entries with double links (root-based dictionaries only)

Once the data is properly in FLEx, some reversals and minor entries can be nicely configured to publish both references together, such as this reversal entry, “apart, far → naavaraka (ava)”. Hopefully that automatic-double-linking-with-only-one-link-to-manage will give us the best of both worlds here (though there currently are still limitations, see LT-14487, and these double-links are not yet available for cross-references, LT-14488).

See also: the workaround for \mdiff (below).

Precision import of minor entries (workaround)

It is not difficult to keep track of which variants/complex forms did and didn't have minor entries in the SFM. If you wish to do so, you just need to insert an extra field (e.g. \isminor yes) next to each mn field, and then map that field to a custom entry-level field in FLEx named hasminor. After import, you can use bulk edit to clear all the “Show Minor Entry” checkboxes in all entries, then filter for non-blank hasminor and only tick those entries' boxes. Here's a regex for inserting an isminor field just above each mn, mnse, mnva, etc., if one doesn't already exist:

\\(?!isminor)(.*\r\n)\\mn
\\\1\\isminor yes\r\n\\mn
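
If a CRLF-based regex isn't convenient in your editor, the same insertion can be sketched in a few lines of Python. The isminor marker and the CRLF line ending below just mirror the example above; adjust them as needed.

import sys

# Sketch: insert "\isminor yes" directly above each \mn (or \mnse, \mnva, etc.),
# unless an \isminor field is already there.
lines = open(sys.argv[1], encoding="utf-8", newline="").readlines()
out = []
for line in lines:
    marker = line.split(" ", 1)[0].strip()
    if marker.startswith("\\mn") and not (out and out[-1].startswith("\\isminor")):
        out.append("\\isminor yes\r\n")        # assumes CRLF line endings
    out.append(line)
with open(sys.argv[1] + ".out", "w", encoding="utf-8", newline="") as f:
    f.write("".join(out))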
Importing "lookup aids" as pseudo minor entries

Some lexicographers also provide a third kind of "minor entry"; these are lookup aids that may or may not represent real words. Perhaps many non-native speakers would incorrectly look for navaraka under V, even though a "varaka" root doesn't really exist. For that kind of little pseudo minor entry:

  • (3) Its mn matches the lx field of the main root entry and its own lx probably is a perfect match for nothing in the main entry. This does not correspond to a typical FLEx concept of “minor entry”.
\lx varaka
\mdiff ava

This third case does not have a wordform matching a subentry, nor should it be treated as a true variant. If we use the mn marker with the default mapping it will import as a variant of the root (ava), which isn't what we want. There are at least two possible solutions:

  • Recommended: Treat it as a full entry by replacing mn with something unique such as mdiff. Map mdiff to a custom cross-reference of an asymmetrical type (e.g. “Entry/Sense Pair - 2 relation names”), and only publish the link info on the 'minor' entry's end. This is quite straightforward to do, and since the “varaka” wordform doesn't match any other lx anyway, failing to ID it as a true FLEx “minor entry” won't result in a spurious homograph. You won't get a handy Shown Minor Entry checkbox, but deleting the little entry (or excluding its headword from publication) should be quite sufficient.
  • Treat it as a variant–even though it isn't–but import it as a custom variant type named “fake variant”, and block this custom type from being published in the list of variant forms on the main entry. Simply switching mn to something like mdiff or mnvanot in the minor entry won't do the trick–such fields can't be mapped to distinct variant types–so you'd need to use a distinct marker such as vanot over in the main entry itself. CONS: Managing the Dictionary view settings could become tedious. So could writing a script to insert vanot as needed. (BTW, such a script would need a starting point to look for in each case, so using something unique like mdiff or mnvanot over in the minor entry could still be helpful, more so than a generic cf field.)

Note that if LT-14487 / LT-14488 do get implemented (the "double linking" above), or if all publications will be purely electronic, then it would be better for these “lookup aids” to have the mdiff cross-reference point directly to the desired subentry, where applicable, rather than pointing vaguely to the root. WORKAROUND: Meanwhile, if the user wants this badly enough to maintain two cross-reference links (one to the complex form, one to the root), then a pair of fields could be imported instead, as follows. (I've done this for two different root-based projects recently.) Probably best to use two distinct custom relations, so the latter can be more easily deleted if it becomes superfluous. Note that this workaround breaks down where there are multiple links, so it wouldn't work well for things like lexical relations:

\lx varaka
\mdiff naavaraka
\cfroot ava
Using minor entries to create subentries of subentries

It is generally best to avoid this structure, as it can create a lot of white space and otherwise complicate publication layout. (It's hard enough to get single-level root-based publications outputting consistently from FLEx.) But if you want to do so, the key to creating a “grandchild entry” is to create two related main entries that each have one “child”, as in the first two lx entries below. (Notice the avoidance of redundant definitions; only one is needed per wordform.)

\lx establish
\de test one
\se establishment
\de test two

\lx establishment
\mn establish
\se establishmentarian
\de test three

\lx establishmentarian
\mn establishment
\se establishmentarianism
\de test four

The data above imports as follows.

establish  test one
  establishment  (der.) test two
    establishmentarian  (der.) test three
      establishmentarianism  (der.) test four 

NOTE: The same technique cannot be used for importing subentries of senses. That is currently not supported, though it can be approximated somewhat. You probably only need this if you have Alternate MDF data (very rare), or PLB SFM data (Philippines). See the document at the end of the Toolbox page.

Does the linguist want everything that is referenced (i.e. every "link target") to be created as a new entry, if it didn't already have an entry in the SFM lexicon? E.g. for cross-references and lexical relations (e.g., synonyms), should a new entry be created for a referenced word if it is not already there? (FLEx must have real targets for all real links. If the linguist doesn't want that, you can import those references into a custom field instead; that is, as plain text instead of as links. But that's a last resort, since then none of the 'good' links will work as actual links either, nor will their spellings be auto-maintained.)

Prepare FLEx to receive the data.
  • Create an empty project with a clear name.
  • Set up all the writing systems needed for this data.
  • If there is any variety among the types of variants and subentries in the SFM file, create a generic variant type (such as ”??” or “unknown”) and a generic complex form type (ditto; this is for subentries). When mapping va and se, assign them these types. This way, the lexicographer can assign these progressively later on. Or, split va/se into multiple fields that can be mapped distinctly (e.g. \seco might be for compound words).
  • If importing parts of speech directly (not into a custom field first), create the categories that are found in the data, and make sure their abbreviations in FLEx match verbatim the values in the SFM file.
  • Likewise, create new values for Status, Lexical Relations, etc., if needed. Make sure each name or abbreviation matches what is in the data.
  • Create any custom fields that will be needed.
  • Set up the Dictionary view's Config.
  • Add useful columns to the Lexicon Edit and Bulk Edit views.
  • BACK UP the empty project AFTER you have created all these things but BEFORE importing! Usually an import will need to be tried over and over until just the right combination of good settings and clean data align. It's much faster to prepare to redo an import by restoring from a pristine but configured backup than to re-create it all from scratch. (Using Bulk Delete is slower and doesn't really undo an import, as lots of junk can remain in the Lists area.) It is really helpful if that empty database has all these things already done to it—it is quite difficult (and error-prone) to have to do them over and over.
  • Open the SFM import wizard and go through each screen carefully. This will create a file such as dict-import-settings.map in the same location as your lexicon (e.g. dict.txt or dict.db). If you cancel the wizard, it asks whether it should remember your settings; the .map file is the place where it remembers them.
  • If you made all inferred markers real in Solid, you should be able to check one and only one checkbox per object when telling the wizard which field(s) can begin each object. (Checking multiple fields increases the likelihood of FLEx misinterpreting the data.)
Try the import, check the import preview, make any changes, and try it again.

There may be messages about invalid UTF-8 data. Double check the writing systems and the contents.

If the import preview reports entries that need checking, investigate what the problem is. It is usually a hierarchy problem:

  • Maybe the key markers need to be adjusted (in the import wizard).
  • Maybe the structure of the database still needs further adjusting.

Make a backup of the successful import. At this point, the "customer" will need to download the same version of FLEx (or newer), restore the backup, and verify whether or not the data in FLEx is acceptable as the new "master file". The old SFM project should no longer be used for editing once that decision is made.

Have the customer go through the import residue in FLEx, cleaning things up as needed.

 

License

All content on this LingTranSoft wiki is by SIL International and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
