Package 'phonetisr' reference manual

Title:	A Naive IPA Tokeniser
Description:	Provides functions to parse International Phonetic Alphabet (IPA) transcriptions into individual phones (tokenisation), based on the default IPA symbols and optional user-specified multi-character phones. The tokenised transcriptions can be used to obtain phone counts or to search for words matching phonetic patterns.
Authors:	Stefano Coretta [aut, cre]
Maintainer:	Stefano Coretta
License:	MIT + file LICENSE
Version:	0.1.0
Built:	2025-03-07 10:34:01 UTC
Source:	https://github.com/stefanocoretta/phonetisr

Add features to list of phones

Description

This function counts occurrences of phones and includes basic phonetic features.

Usage

featurise(phlist)

Arguments

phlist

A list of phones or the output of phonetise().

Value

A tibble.

Examples

ipa <- c("ada", "buba", "kiki", "sa\u0283a")
ip_ph <- phonetise(ipa)
featurise(ip_ph)

Get non-IPA characters

Description

Given a vector of characters, it returns those which are not part of the IPA.

Usage

get_no_ipa(chars)

Arguments

chars

A vector of characters.

Value

A vector.

Examples

get_no_ipa(c("a", "\u0283", ">"))

List of IPA symbols

Description

List of IPA symbols

Usage

ipa_symbols

Format

A data frame with 143 rows and 12 variables:

IPA: IPA symbol.
unicode: Unicode code point.
uni_name: Unicode name.
ipa_name: IPA name.
phon_type: The phonetic type of the symbol.
type: General character type (consonant, vowel, diacritic).
height_ipa: Vowel openness.
height: Vowel height.
backness: Vowel backness.
rounding: Vowel rounding.
voicing: Consonant voicing.
place: Consonant place of articulation.
manner: Consonant manner of articulation.
lateral: Is the consonant lateral?
sonorant: Is the phone sonorant?
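
The data set can be explored with base R. The following is a minimal sketch, assuming the package is installed and that the `type` column uses the label "vowel" (as suggested by the description of `type` above, but not confirmed here):

```r
library(phonetisr)

# Inspect the structure of the bundled data frame
str(ipa_symbols)

# Number of symbols per general character type
table(ipa_symbols$type)

# Vowel symbols with their height, backness and rounding
# (assumption: vowels are labelled "vowel" in the `type` column)
subset(ipa_symbols, type == "vowel", select = c(IPA, height, backness, rounding))
```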

Klingon Swadesh list

Description

The Swadesh list in Klingon.

Usage

kl_swadesh

Format

A data frame with 195 rows and 4 variables:

id: Swadesh list item number.
gloss: English gloss.
translit: Klingon transliteration.
ipa: IPA transcription.
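
As an illustration of how this data set might be combined with the functions documented in this manual, the sketch below tokenises the `ipa` column and summarises the resulting phones. It uses the default settings of phonetise(), so any multi-character phones in the transcriptions are not treated specially unless `multi` or `regex` is supplied:

```r
library(phonetisr)

# First rows of the Swadesh list
head(kl_swadesh)

# Tokenise the IPA transcriptions into phones (default settings)
kl_ph <- phonetise(kl_swadesh$ipa)

# Count the phones and attach basic phonetic features
featurise(kl_ph)
```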

Search phones

Description

Given the output of phonetise(), searches for phones that match a phonetic expression.

Usage

ph_search(phlist, phonex)

Arguments

`phlist`	The output of `phonetise()`.
`phonex`	A phonetic expression. Supported shorthands are `C` for consonant, `V` for vowel, and `#` for word boundary.

Value

A list.

Examples

ipa <- c("p\u02B0a\u0303k\u02B0", "t\u02B0um\u0325", "\u025Bk\u02B0\u026F", "pun")
ph <- c("p\u02B0", "t\u02B0", "k\u02B0", "a\u0303", "m\u0325")
ipa_ph <- phonetise(ipa, multi = ph)
ph_search(ipa_ph, "#CV")

# partial matches are also returned
ph_search(ipa_ph, "p")

# use regular expressions
ph_search(ipa_ph, "p\u02B0?V")

Tokenise IPA strings

Description

phonetise() tokenises strings of IPA symbols (like phonetic transcriptions of words) into individual "phones". The output is a list.

Usage

phonetise(
  strings,
  multi = NULL,
  regex = NULL,
  split = TRUE,
  sep = " ",
  sanitise = TRUE,
  ignore_stress = TRUE,
  ignore_tone = TRUE,
  diacritics = FALSE,
  affricates = FALSE,
  v_sequences = FALSE,
  prenasalised = FALSE,
  all_multi = FALSE,
  sanitize = sanitise
)

phonetize(
  strings,
  multi = NULL,
  regex = NULL,
  split = TRUE,
  sep = " ",
  sanitise = TRUE,
  ignore_stress = TRUE,
  ignore_tone = TRUE,
  diacritics = FALSE,
  affricates = FALSE,
  v_sequences = FALSE,
  prenasalised = FALSE,
  all_multi = FALSE,
  sanitize = sanitise
)

Arguments

`strings`	A character vector of words transcribed in IPA.
`multi`	A character vector of one or more multi-character phones as strings.
`regex`	A string with a regular expression to match several multi-character phones.
`split`	If set to `TRUE` (the default), the tokenised strings are split into phones (i.e. the output is a vector with one element per phone). If set to `FALSE`, the string is not split and the phones are separated with the character defined in `sep`.
`sep`	A character to be used as the separator of the phones if `split = FALSE` (default is `" "`, a space).
`sanitise`	Whether to remove all non-IPA characters (`TRUE` by default).
`ignore_stress`	If `TRUE` (the default), stress marks are not parsed.
`ignore_tone`	If `TRUE` (the default), tone marks and letters are not parsed.
`diacritics`	If set to `TRUE`, parses all valid diacritics as part of the previous character (`FALSE` by default).
`affricates`	If set to `TRUE`, parses homorganic stop + fricative sequences as affricates (`FALSE` by default).
`v_sequences`	If set to `TRUE`, collapses vowel sequences (`FALSE` by default).
`prenasalised`	If set to `TRUE`, parses prenasalised consonants as such (`FALSE` by default).
`all_multi`	If set to `TRUE`, `diacritics`, `affricates`, `v_sequences` and `prenasalised` are all set to `TRUE` (`FALSE` by default).
`sanitize`	Alias of `sanitise`.

Value

A list of phonetised strings.

Examples

# using unicode escapes for CRAN policy
ipa <- c("p\u02B0a\u0303k\u02B0", "t\u02B0um\u0325", "\u025Bk\u02B0\u026F")
ph <- c("p\u02B0", "t\u02B0", "k\u02B0", "a\u0303", "m\u0325")

phonetise(ipa, multi = ph)

ph_2 <- ph[4:5]

# Match any character followed by <\u02B0> with ".\u02B0".
phonetise(ipa, multi = ph_2, regex = ".\u02B0")

# Same result.
phonetise(ipa, regex = ".(\u0303|\u0325|\u02B0)")

# Don't split strings and use "." as separator
phonetise(ipa, multi = ph, split = FALSE, sep = ".")
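
The `diacritics`, `affricates` and `all_multi` arguments offer an alternative to listing multi-character phones by hand. The sketch below only shows the calls. The example strings are made up for illustration, and no output is shown because the result depends on which symbols the package classifies as diacritics and affricates:

```r
library(phonetisr)

# Hypothetical example words: <t> + <\u0283>, and an aspirated stop + nasalised vowel
ipa_2 <- c("t\u0283a", "p\u02B0a\u0303")

# Default settings, for comparison
phonetise(ipa_2)

# Attach diacritics to the previous character and parse
# homorganic stop + fricative sequences as affricates
phonetise(ipa_2, diacritics = TRUE, affricates = TRUE)

# Switch on diacritics, affricates, v_sequences and prenasalised at once
phonetise(ipa_2, all_multi = TRUE)
```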
