LanguageIdentification.jl Documentation

Adding LanguageIdentification.jl

julia> using Pkg
julia> Pkg.add("LanguageIdentification")
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed LanguageIdentification ─ v1.0.0
    Updating `~/work/LanguageIdentification.jl/LanguageIdentification.jl/docs/Project.toml`
  [35248bf2] ~ LanguageIdentification v1.0.1 `~/work/LanguageIdentification.jl/LanguageIdentification.jl` ⇒ v1.0.0
    Updating `~/work/LanguageIdentification.jl/LanguageIdentification.jl/docs/Manifest.toml`
  [35248bf2] ~ LanguageIdentification v1.0.1 `~/work/LanguageIdentification.jl/LanguageIdentification.jl` ⇒ v1.0.0
Precompiling project...
  ✓ LanguageIdentification
  1 dependency successfully precompiled in 1 seconds. 24 already precompiled.
  1 dependency precompiled but a different version is currently loaded. Restart julia to access the new version

Documentation

LanguageIdentification.initialize — Method

initialize(; languages=supported_languages(), ngram=1:4, cutoff=0.85, vocabulary=1000:5000)

Initialize the language detector with the given parameters. The parameters trade off accuracy, speed, and memory usage against one another.

Arguments

languages::Vector{String}: A list of languages to be used for language detection. If this argument is not provided, all the languages returned by the supported_languages function will be used.
ngram::Union{Int, AbstractVector}: The length of the UTF-8 byte n-grams used for language detection. Pass an integer to use a single n-gram size, or a range to use several sizes. The default value is 1:4, and the largest n-gram size allowed is 7.
cutoff::Float64: The cumulative-probability cutoff for the n-grams used in language detection. The default value is 0.85; the value must be between 0 and 1.
vocabulary::Union{Int, AbstractRange}: The size range of the vocabulary of each language. The default value is 1000:5000.
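
The ngram argument above refers to UTF-8 byte n-grams rather than character n-grams. As a minimal sketch of what that means, the following collects every byte n-gram of a string for each requested size. Note that byte_ngrams is a hypothetical helper written only for this illustration; it is not part of LanguageIdentification.jl, and the package's internal extraction may differ.

```julia
# Hypothetical helper (not part of LanguageIdentification.jl):
# collect every UTF-8 byte n-gram of `text` for each size in `sizes`.
function byte_ngrams(text::AbstractString, sizes=1:4)
    bytes = collect(codeunits(text))      # raw UTF-8 bytes as a Vector{UInt8}
    grams = Vector{Vector{UInt8}}()
    for n in sizes                        # `sizes` may be a single Int or a range
        for i in 1:(length(bytes) - n + 1)
            push!(grams, bytes[i:i+n-1])  # window of n consecutive bytes
        end
    end
    return grams
end
```

For example, byte_ngrams("\u00e9", 1:2) yields three n-grams, because é occupies two bytes in UTF-8: two 1-grams (0xc3 and 0xa9) and one 2-gram (0xc3, 0xa9). A non-ASCII character can therefore span several byte n-grams.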

LanguageIdentification.langid — Method

langid(text, languages::Vector{String}, profiles::Vector{Dict{Vector{UInt8}, Float32}}; ngram=NGRAM)

Return the language of the given text based on the provided language profiles.

Arguments

text: A string or a collection of strings to be analyzed for language identification.
languages::Vector{String}: The list of languages to choose from. Omitting this argument will use all supported languages.
profiles::Vector{Dict{Vector{UInt8}, Float32}}: The language profiles to use for identification. Omitting this argument will use the default profiles.
ngram::Union{Int, AbstractVector}: The length of UTF-8 byte n-grams to use for language detection. The default is the value set in initialize; a value passed here should not exceed it.

Returns

The language of the given text.
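
To make the shape of the profiles argument concrete, here is a toy sketch of profile-based identification. It assumes each profile maps a byte n-gram to a log-probability-like weight and that unseen n-grams receive a fixed penalty. The scoring rule, the toy_langid name, the floor parameter, and the weights are all assumptions made for illustration; this is not the package's actual algorithm or data.

```julia
# Toy illustration only: names, weights, and scoring rule are assumptions,
# not the implementation used by LanguageIdentification.jl.
Profile = Dict{Vector{UInt8}, Float32}

toy_languages = ["eng", "fra"]
toy_profiles = [
    Profile(Vector{UInt8}("th") => -1.0f0, Vector{UInt8}("e") => -2.0f0),  # "eng"
    Profile(Vector{UInt8}("le") => -1.0f0, Vector{UInt8}("e") => -1.5f0),  # "fra"
]

# Sum the weight of every byte n-gram under each profile and pick the best total.
function toy_langid(text, languages, profiles; ngram=1:2, floor=-10.0f0)
    bytes = collect(codeunits(text))
    scores = zeros(Float32, length(profiles))
    for n in ngram, i in 1:(length(bytes) - n + 1)
        gram = bytes[i:i+n-1]
        for (k, profile) in enumerate(profiles)
            # n-grams missing from a profile receive the `floor` penalty
            scores[k] += get(profile, gram, floor)
        end
    end
    return languages[argmax(scores)]
end
```

With these made-up weights, toy_langid("the", toy_languages, toy_profiles) selects "eng" and toy_langid("le", toy_languages, toy_profiles) selects "fra". Dictionary keys of type Vector{UInt8} are compared by content, which is what lets a freshly sliced n-gram match a stored one.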

LanguageIdentification.langprob — Method

langprob(text, languages::Vector{String}, profiles::Vector{Dict{Vector{UInt8}, Float32}}; topk=5, ngram=NGRAM)

Return the probability distribution over languages for the given text based on the provided language profiles.

Arguments

text: A string or a collection of strings to be analyzed for language identification.
languages::Vector{String}: A list of languages to choose from. If this argument is not provided, all the languages returned by the supported_languages function will be used.
profiles::Vector{Dict{Vector{UInt8}, Float32}}: The language profiles to use for identification. If this argument is not provided, the default profiles will be used.
topk::Int: The number of candidates to return. The default value is 5.
ngram::Union{Int, AbstractVector}: The length of UTF-8 byte n-grams to use for language detection. The default is the value set in initialize; a value passed here should not exceed it.

Returns

A list of the topk languages and their probabilities.
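
As a rough sketch of how a top-k list can be derived from per-language scores, the snippet below normalizes scores with a softmax and keeps the k most probable entries. The toy_topk function, the softmax normalization, and the use of language => probability pairs are assumptions made for illustration; the source does not specify how langprob computes or represents its result.

```julia
# Hypothetical helper (not part of LanguageIdentification.jl):
# convert raw scores to probabilities and return the `topk` most likely languages.
function toy_topk(languages, scores; topk=5)
    weights = exp.(scores .- maximum(scores))  # subtract the max for numerical stability
    probs = weights ./ sum(weights)            # normalize so the probabilities sum to 1
    order = sortperm(probs; rev=true)          # indices from most to least probable
    k = min(topk, length(order))               # never ask for more entries than exist
    return [languages[i] => probs[i] for i in order[1:k]]
end
```

Calling toy_topk(["eng", "fra", "deu"], [0.0, 1.0, -1.0]; topk=2) returns two pairs, with "fra" first and "eng" second. Because only the top k entries are returned, their probabilities generally sum to less than 1.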

LanguageIdentification.supported_languages — Method

supported_languages() -> Vector{String}

Return a vector containing all the languages (ISO 639-3 codes) that are supported by this package.

LanguageIdentification.vocabulary_sizes — Method

vocabulary_sizes()

Return the vocabulary size of each language that was loaded by the initialize function.

Index

LanguageIdentification.initialize
LanguageIdentification.langid
LanguageIdentification.langprob
LanguageIdentification.supported_languages
LanguageIdentification.vocabulary_sizes