Title: | A Naive IPA Tokeniser |
---|---|
Description: | Provides functions to parse International Phonetic Alphabet (IPA) transcriptions into individual phones (tokenisation) based on default IPA symbols and optional user-specified multi-character phones. The tokenised transcriptions can be used for obtaining counts of phones or for searching for words matching phonetic patterns. |
Authors: | Stefano Coretta [aut, cre] |
Maintainer: | Stefano Coretta |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-03-07 10:34:01 UTC |
Source: | https://github.com/stefanocoretta/phonetisr |
This function counts occurrences of phones and includes basic phonetic features.
featurise(phlist)
phlist |
A list of phones or the output of |
A tibble.
ipa <- c("ada", "buba", "kiki", "sa\u0283a")
ip_ph <- phonetise(ipa)
featurise(ip_ph)
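The counting step can be illustrated without the package. The following is a minimal base R sketch, not the implementation of featurise(): it assumes the tokenised input is a list with one character vector of phones per word, and the `tokens` object is a hand-written stand-in for phonetise() output.

```r
# Hand-written stand-in for tokenised output (assumption: one character
# vector of phones per word).
tokens <- list(
  c("a", "d", "a"),
  c("b", "u", "b", "a"),
  c("k", "i", "k", "i"),
  c("s", "a", "\u0283", "a")
)

# Flatten all words into one vector of phones and tabulate them.
phone_counts <- table(unlist(tokens))

# Most frequent phones first.
phone_counts <- sort(phone_counts, decreasing = TRUE)
print(phone_counts)
```

Unlike featurise(), this sketch returns a plain table and attaches no phonetic features.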
Given a vector of characters, it returns those which are not part of the IPA.
get_no_ipa(chars)
chars |
A vector of characters. |
A vector.
get_no_ipa(c("a", "\u0283", ">"))
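The idea behind this check is a set difference against a symbol inventory. The sketch below uses base R only and a deliberately tiny, hypothetical inventory (`toy_ipa`); it does not reproduce the package's own symbol list or the internals of get_no_ipa().

```r
# Toy inventory for illustration only (assumption); the real inventory is
# far larger.
toy_ipa <- c("a", "b", "d", "i", "k", "s", "u", "\u0283")

chars <- c("a", "\u0283", ">")

# Keep the characters that are absent from the inventory.
not_ipa <- chars[!chars %in% toy_ipa]
print(not_ipa)
```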
List of IPA symbols
ipa_symbols
A data frame with 143 rows and 12 variables:
IPA symbol.
Unicode code.
Unicode name.
IPA name.
The phonetic type of the symbol.
General character type (consonant, vowel, diacritic).
Vowel openness.
Vowel height.
Vowel backness.
Vowel rounding.
Consonant voicing.
Consonant place of articulation.
Consonant manner of articulation.
Is the consonant lateral?
Is the phone sonorant?
The Swadesh list in Klingon.
kl_swadesh
A data frame with 195 rows and 4 variables:
Swadesh list item number.
English gloss.
Klingon transliteration.
IPA transcription.
Given a vector of phonetised strings, find phones.
ph_search(phlist, phonex)
phlist |
The output of |
phonex |
A phonetic expression. Supported shorthands are |
A list.
ipa <- c("p\u02B0a\u0303k\u02B0", "t\u02B0um\u0325", "\u025Bk\u02B0\u026F", "pun")
ph <- c("p\u02B0", "t\u02B0", "k\u02B0", "a\u0303", "m\u0325")
ipa_ph <- phonetise(ipa, multi = ph)

ph_search(ipa_ph, "#CV")

# partial matches are also returned
ph_search(ipa_ph, "p")

# use regular expressions
ph_search(ipa_ph, "p\u02B0?V")
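One way to think about a phonetic expression such as "#CV" is as a regular expression over space-separated phones. The sketch below is an assumption-laden illustration in base R, not how ph_search() is implemented: `toy_consonants`, `toy_vowels`, `phone_group()` and `expand_phonex()` are hypothetical names, the inventories cover only the phones in the example above, and "#", "C" and "V" are read as word start, any consonant and any vowel.

```r
# Toy inventories covering only the example words (assumption).
toy_consonants <- c("p", "p\u02B0", "t\u02B0", "k\u02B0", "n", "m\u0325")
toy_vowels <- c("a\u0303", "u", "\u025B", "\u026F")

# Regex group matching any one phone followed by its trailing space.
phone_group <- function(phones) {
  paste0("(?:", paste(phones, collapse = "|"), ") ")
}

# Hypothetical expansion of the shorthands into a regular expression.
expand_phonex <- function(phonex) {
  phonex <- gsub("#", "^", phonex, fixed = TRUE)
  phonex <- gsub("C", phone_group(toy_consonants), phonex, fixed = TRUE)
  gsub("V", phone_group(toy_vowels), phonex, fixed = TRUE)
}

# Hand-written tokenised words; each phone is followed by a space when joined.
tokens <- list(
  c("p\u02B0", "a\u0303", "k\u02B0"),
  c("t\u02B0", "u", "m\u0325"),
  c("\u025B", "k\u02B0", "\u026F"),
  c("p", "u", "n")
)
joined <- vapply(tokens, function(x) paste0(x, " ", collapse = ""), character(1))

# TRUE for words that start with a consonant followed by a vowel.
matches <- grepl(expand_phonex("#CV"), joined, perl = TRUE)
print(matches)
```

Joining phones with a separator keeps multi-character phones intact during matching, which is why the search operates on tokenised output rather than on raw strings.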
phonetise() tokenises strings of IPA symbols (like phonetic transcriptions of words) into individual "phones". The output is a list.
phonetise(
  strings,
  multi = NULL,
  regex = NULL,
  split = TRUE,
  sep = " ",
  sanitise = TRUE,
  ignore_stress = TRUE,
  ignore_tone = TRUE,
  diacritics = FALSE,
  affricates = FALSE,
  v_sequences = FALSE,
  prenasalised = FALSE,
  all_multi = FALSE,
  sanitize = sanitise
)

phonetize(
  strings,
  multi = NULL,
  regex = NULL,
  split = TRUE,
  sep = " ",
  sanitise = TRUE,
  ignore_stress = TRUE,
  ignore_tone = TRUE,
  diacritics = FALSE,
  affricates = FALSE,
  v_sequences = FALSE,
  prenasalised = FALSE,
  all_multi = FALSE,
  sanitize = sanitise
)
strings |
A character vector with a list of words in IPA. |
multi |
A character vector of one or more multi-character phones as strings. |
regex |
A string with a regular expression to match several multi-character phones. |
split |
If set to |
sep |
A character to be used as the separator of the phones if |
sanitise |
Whether to remove all non-IPA characters ( |
ignore_stress |
If |
ignore_tone |
If |
diacritics |
If set to |
affricates |
If set to |
v_sequences |
If set to |
prenasalised |
If set to |
all_multi |
If set to |
sanitize |
Alias of |
A list of phonetised strings.
# using unicode escapes for CRAN policy
ipa <- c("p\u02B0a\u0303k\u02B0", "t\u02B0um\u0325", "\u025Bk\u02B0\u026F")
ph <- c("p\u02B0", "t\u02B0", "k\u02B0", "a\u0303", "m\u0325")

phonetise(ipa, multi = ph)

ph_2 <- ph[4:5]

# Match any character followed by <\u02B0> with ".\u02B0".
phonetise(ipa, multi = ph_2, regex = ".\u02B0")

# Same result.
phonetise(ipa, regex = ".(\u0303|\u0325|\u02B0)")

# Don't split strings and use "." as separator
phonetise(ipa, multi = ph, split = FALSE, sep = ".")
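To show why multi-character phones need to be declared, here is a minimal base R sketch of naive tokenisation. It is not the algorithm used by phonetise(); `naive_tokenise()` is a hypothetical helper that builds a regular expression trying the listed multi-character phones first (longest first) and falling back to single characters.

```r
# Hypothetical helper, not part of phonetisr.
naive_tokenise <- function(strings, multi = NULL) {
  # Try longer phones first so they win over shorter alternatives.
  multi <- multi[order(nchar(multi), decreasing = TRUE)]
  # \Q...\E quotes each phone literally; "." matches any single character.
  alternatives <- c(sprintf("\\Q%s\\E", multi), ".")
  pattern <- paste(alternatives, collapse = "|")
  # Extract every match, giving one character vector per input string.
  regmatches(strings, gregexpr(pattern, strings, perl = TRUE))
}

ipa <- c("p\u02B0a\u0303k\u02B0", "pun")
ph <- c("p\u02B0", "k\u02B0", "a\u0303")

tokens <- naive_tokenise(ipa, multi = ph)
print(tokens)

# Without `multi`, each diacritic is split off as its own token.
untreated <- naive_tokenise(ipa[1])
print(lengths(untreated))
```

The second call illustrates the problem the `multi` and `regex` arguments address: a base letter and its diacritic are separate Unicode characters, so a character-by-character split separates them.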