So, you want to start a new GPC analysis on a language? Great! Depending on the complexity of the writing system, it can be a big job, but the work you put into it will pay dividends by making the words of the language accessible for computer-based literacy acquisition activities within SynPhony.
If the writing system is complex or opaque, you will benefit from computer-readable pronunciation information, in some kind of phonetic transcription, for each word (or for as many words as possible).
Where can you find this information? I don't have the answer to that. Try searching the internet, asking the education department of a university that teaches the language, or asking on an online linguistics forum. It might take a while to track down (if it exists at all), but it will be worth it.
Ideally, you should be a native speaker of the language you are analyzing. If you are not, you should at least have access to a native speaker: they have all the pronunciation rules built into their heads, so they can make appropriate decisions and spot inconsistencies much more quickly and accurately.
I use Toolbox to manage my GPC database. It is a flat-file database program designed specifically for developing dictionaries, but it lends itself to almost any kind of data because it does not dictate any specific field markers. I also use UltraEdit as my text editor, along with Consistent Changes, a flexible scripting program that is a good way to make systematic changes in your database.
Once you have a wordlist, you can create a Toolbox lexicon file. Toolbox reads standard text files in which markers indicate the fields. Each field marker must start on a new line and is separated from its data by a space. Field markers are usually 1-4 characters long, but could be longer. Each record in the database starts with the field marker that has been designated as the record marker. For dictionary files you can use \lx as the record marker, but you could use anything. So, for example, several records that contain only words would look like this:
\lx word1
\lx word2
\lx word3
However, we want to store more than just the words. To add part-of-speech data, pronunciation and our grapheme-phoneme analysis to this file, we can put them in additional fields like this:
\lx word1
\ps noun
\ph phonetic_form_in_ipa
\gpc this is where the gpc form of word1 would occur
\lx word2
\ps verb
\ph phonetic_form_in_ipa
\gpc this is where the gpc form of word2 would occur
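For readers who want to process such a file outside Toolbox, here is a minimal Python sketch of reading this layout. It is not part of Toolbox; the function name `parse_records` is hypothetical, and it assumes that every field starts with a backslash marker and that `\lx` is the record marker.

```python
def parse_records(text, record_marker="lx"):
    """Split marker-based text into records.

    Each record is a list of (marker, value) pairs, in file order.
    Assumes every field starts on its own line with a backslash.
    """
    records = []
    for line in text.splitlines():
        if not line.startswith("\\"):
            # Assumed: a line without a marker continues the previous field.
            if records and line.strip():
                marker, value = records[-1][-1]
                records[-1][-1] = (marker, (value + " " + line.strip()).strip())
            continue
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            records.append([])          # a record marker opens a new record
        if records:                     # fields before the first record are skipped
            records[-1].append((marker, value.strip()))
    return records


sample = r"""\lx word1
\ps noun
\ph phonetic_form_in_ipa
\lx word2
\ps verb
\ph phonetic_form_in_ipa
"""

records = parse_records(sample)
print([dict(fields)["lx"] for fields in records])  # ['word1', 'word2']
```

Keeping each record as an ordered list of pairs, rather than a dictionary, preserves field order and allows a marker to occur more than once in a record.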
You could use a plain text editor to edit a file like this, but Toolbox offers several features that make it the tool of choice for this kind of work. You can filter for data in any field, make changes such as search and replace that are restricted to a single field, and use its good export utility. Even so, I often edit my Toolbox files in a plain text editor outside of Toolbox when that is the easier or better tool for a particular edit (best if it has macro capability).
In addition to Toolbox, I use Consistent Changes, a scripting program that can make changes to my database. It has an easy syntax and is quite powerful. I can create scripts for you if you let me know what kind of changes you need. For English, I used this program to create a script that did a lot of the GPC analysis for me. It still left a lot of work to do manually, and it depended on a fairly good phonetic form being available in the same file; it would have been impossible if that information had been in another file. A sample record from my database looks like this:
\lx should've
\ps
\cmu SH UH1 D AH0 V
\cvc cvcvc
\str 10
\websyl 'shou.ld've
\ph ʃʊˈdǝ.v
\wid 27525
\nt insertion
\syll 2
\gpc sh_sh,book_ou,d_ld,',v_ve
\ss
\sd
\cpwd 00008
\et
\exp
\cob _
\mr 202 @+ve
As you can see, some fields are blank and some have data. This is a part of life in data management, and the blanks can be filled in over time. The kinds of fields and the data they contain are completely up to you. Each field contains one kind of data and you can decide:
1) which characters to use to name that field and
2) what kind of data you will put in there.
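Whatever notation you choose for a field, it helps if the data can be checked mechanically. As an illustration, the \gpc field in the sample record above appears to be a comma-separated list of units in which the text after the underscore is the grapheme, and a unit without an underscore (the apostrophe) is a literal character. That reading is inferred from the single sample, not from a documented rule, so treat the following Python sketch as an assumption. The function name `spelling_from_gpc` is hypothetical.

```python
def spelling_from_gpc(gpc):
    """Rebuild a word's spelling from its \\gpc field.

    Assumed notation (inferred from one sample record):
    units are separated by commas; in 'key_grapheme' the part after
    the first underscore is the grapheme; a unit with no underscore
    is a literal character such as an apostrophe.
    """
    parts = []
    for unit in gpc.split(","):
        _, sep, grapheme = unit.partition("_")
        parts.append(grapheme if sep else unit)
    return "".join(parts)


word = "should've"
gpc = "sh_sh,book_ou,d_ld,',v_ve"
print(spelling_from_gpc(gpc) == word)  # True
```

A check like this could flag records where the graphemes in the analysis no longer add up to the word in the record marker field.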
If the alphabet contains non-Roman characters, make sure that you use UTF-8 as your encoding. When you get a wordlist, you can start with a plain list of one word per line. Then, with a text editor, you can add your field markers using search and replace: search for every newline and replace it with a newline followed by the field marker and a space. The first word has no newline before it, so add its marker by hand.
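If you would rather script this step than do it in an editor, the same conversion can be sketched in Python. The function names `wordlist_to_sfm` and `convert_file` and the file names in the usage comment are hypothetical. The sketch assumes one word per line and reads and writes the files as UTF-8.

```python
from pathlib import Path


def wordlist_to_sfm(words, record_marker="lx"):
    """Turn an iterable of words into one record-marker line per word."""
    lines = []
    for word in words:
        word = word.strip()
        if word:                                  # skip blank lines
            lines.append(f"\\{record_marker} {word}")
    return "\n".join(lines) + "\n"


def convert_file(src, dst, record_marker="lx"):
    """Read a one-word-per-line file and write the marked-up version."""
    words = Path(src).read_text(encoding="utf-8").splitlines()
    Path(dst).write_text(wordlist_to_sfm(words, record_marker), encoding="utf-8")


print(wordlist_to_sfm(["word1", "word2", "word3"]), end="")
# Usage with hypothetical file names:
# convert_file("wordlist.txt", "lexicon.txt")
```

Unlike the search-and-replace approach, this handles the first word without a manual fix, because each word gets its marker directly.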
Then you can add extra fields to every record by searching for the record marker and replacing it with the fields you want to add. Next, with a plain text editor (preferably one with macro capability), you can copy the word from the record marker field into the \gpc field. Once it is there, you can start to manipulate it to do your GPC analysis.
You can also copy the word from the record field to the \gpc field using a Consistent Changes script. I use this program a lot and can write you special scripts if needed. As the need arises I will add useful Consistent Changes scripts linked to this page for starting a project.
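As a further alternative, adding the empty fields and copying the word into the \gpc field can be sketched in Python. This is not a Consistent Changes script. The function name `add_fields` is hypothetical, and the default marker list is only an example. It assumes the input contains one `\lx word` line per record.

```python
def add_fields(sfm_text, extra_markers=("ps", "ph", "syll", "str", "freq"),
               record_marker="lx"):
    """After each record-marker line, add empty fields and a \\gpc copy of the word."""
    prefix = "\\" + record_marker + " "
    out = []
    for line in sfm_text.splitlines():
        out.append(line)
        if line.startswith(prefix):
            word = line[len(prefix):].strip()
            out.extend("\\" + marker for marker in extra_markers)  # blank fields
            out.append("\\gpc " + word)       # starting point for the analysis
    return "\n".join(out) + "\n"


print(add_fields("\\lx word1\n\\lx word2\n"), end="")
```

Run it only once on a given file, since it does not check whether the extra fields are already present.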
I would suggest that the minimum set of field markers for such a project be as follows:
\lx word
\ps part_of_speech_info
\ph phonetic_form (if the writing system is not transparent)
\syll number_of_syllables
\str stress_pattern_of_the_word
\freq how_often_the_word_appears_in_print
\gpc grapheme_phoneme_representation_of_the_word
You may add additional fields if you wish.
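To confirm that every record in a file carries this minimum set, a small check along the following lines may help. It is a sketch in Python, not a Toolbox feature. The names `MINIMUM_MARKERS`, `split_records` and `missing_markers` are hypothetical, and `\lx` is assumed to be the record marker. Drop `ph` from the required list if the writing system is transparent.

```python
MINIMUM_MARKERS = ("lx", "ps", "ph", "syll", "str", "freq", "gpc")


def split_records(text, record_marker="lx"):
    """Group marker lines into records, each starting at the record marker."""
    start = "\\" + record_marker
    records, current = [], None
    for line in text.splitlines():
        if line == start or line.startswith(start + " "):
            current = []
            records.append(current)
        if current is not None and line.startswith("\\"):
            current.append(line)
    return records


def missing_markers(record, required=MINIMUM_MARKERS):
    """Return the required markers that do not occur in this record."""
    present = {line[1:].split(" ", 1)[0] for line in record}
    return [marker for marker in required if marker not in present]


text = "\\lx word1\n\\ps noun\n\\gpc\n"
for record in split_records(text):
    print(record[0], "is missing:", missing_markers(record))
```

A field counts as present even when it is blank, which matches the idea that blanks can be filled in over time.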