LanguageIdentification.jl Documentation
Adding LanguageIdentification.jl
julia> using Pkg
julia> Pkg.add("LanguageIdentification")
Updating registry at `~/.julia/registries/General.toml`
Resolving package versions...
Installed LanguageIdentification ─ v1.0.0
Updating `~/work/LanguageIdentification.jl/LanguageIdentification.jl/docs/Project.toml`
[35248bf2] ~ LanguageIdentification v1.0.1 `~/work/LanguageIdentification.jl/LanguageIdentification.jl` ⇒ v1.0.0
Updating `~/work/LanguageIdentification.jl/LanguageIdentification.jl/docs/Manifest.toml`
[35248bf2] ~ LanguageIdentification v1.0.1 `~/work/LanguageIdentification.jl/LanguageIdentification.jl` ⇒ v1.0.0
Precompiling project...
✓ LanguageIdentification
1 dependency successfully precompiled in 1 seconds. 24 already precompiled.
1 dependency precompiled but a different version is currently loaded. Restart julia to access the new version
Documentation
LanguageIdentification.initialize — Method
initialize(; languages=supported_languages(), ngram=1:4, cutoff=0.85, vocabulary=1000:5000)
Initialize the language detector with the given parameters. The parameter values trade off accuracy, speed, and memory usage against one another.
Arguments
- languages::Vector{String}: A list of languages to be used for language detection. If this argument is not provided, all the languages returned by the supported_languages function will be used.
- ngram::Union{Int, AbstractVector}: The length of the UTF-8 byte n-grams to use for language detection. Provide an integer to use a single n-gram size, or a range to use multiple n-gram sizes. The default value is 1:4, and the maximum value allowed is 7.
- cutoff::Float64: The cutoff value of the cumulative probability of the n-grams to use for language detection. The default value is 0.85, and it must be between 0 and 1.
- vocabulary::Union{Int, AbstractRange}: The size range of the vocabulary of each language. The default value is 1000:5000.
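To make the ngram argument concrete, here is a minimal sketch of what "UTF-8 byte n-grams" of sizes 1:2 look like for a short string. The helper byte_ngrams is hypothetical and is not part of LanguageIdentification.jl. It only illustrates the idea of sliding a window over the raw bytes, and the package's internal implementation may differ.

```julia
# Hypothetical helper (not part of LanguageIdentification.jl):
# collect every UTF-8 byte n-gram of the requested size(s).
function byte_ngrams(text::AbstractString, sizes::Union{Int, AbstractVector{Int}} = 1:4)
    bytes = Vector{UInt8}(codeunits(text))   # raw UTF-8 bytes of the text
    grams = Vector{Vector{UInt8}}()
    for n in sizes                            # an Int iterates as a single value
        for i in 1:(length(bytes) - n + 1)    # empty range when n exceeds the byte count
            push!(grams, bytes[i:i+n-1])
        end
    end
    return grams
end

# "né" is three bytes in UTF-8 (0x6e, 0xc3, 0xa9), so it yields
# three 1-grams and two 2-grams.
grams = byte_ngrams("né", 1:2)
foreach(println, grams)
```

Because the windows run over bytes rather than characters, a single non-ASCII character can span several n-grams. Keys of this shape (Vector{UInt8}) match the key type of the profiles argument accepted by langid and langprob.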
LanguageIdentification.langid — Method
langid(text, languages::Vector{String}, profiles::Vector{Dict{Vector{UInt8}, Float32}}; ngram=NGRAM)
Return the language of the given text based on the provided language profiles.
Arguments
- text: A string or a collection of strings to be analyzed for language identification.
- languages::Vector{String}: The list of languages to choose from. Omitting this argument will use all supported languages.
- profiles::Vector{Dict{Vector{UInt8}, Float32}}: The language profiles to use for identification. Omitting this argument will use the default profiles.
- ngram::Union{Int, AbstractVector}: The length of UTF-8 byte n-grams to use for language detection. The default value is the value set in initialize, and should not exceed that value.
Returns
- The language of the given text.
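A minimal usage sketch, assuming the package is installed and that the detector should be initialized before the first call. Names are qualified with the module because this page does not state which functions are exported. The sample sentence is illustrative only.

```julia
using LanguageIdentification

# Assumption: initialize() is called once before detection; defaults are used here.
LanguageIdentification.initialize()

# Omitting `languages` and `profiles` falls back to the defaults described above.
text = "This is a test."
lang = LanguageIdentification.langid(text)
println("Detected language: ", lang)
```

The result is expected to be one of the ISO 639-3 codes returned by supported_languages().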
LanguageIdentification.langprob — Method
langprob(text, languages::Vector{String}, profiles::Vector{Dict{Vector{UInt8}, Float32}}; topk=5, ngram=NGRAM)
Return the probability distribution of the language of the given text based on the provided language profiles.
Arguments
- text: A string or a collection of strings to be analyzed for language identification.
- languages::Vector{String}: A list of languages to choose from. If this argument is not provided, all the languages returned by the supported_languages function will be used.
- profiles::Vector{Dict{Vector{UInt8}, Float32}}: The language profiles to use for identification. If this argument is not provided, the default profiles will be used.
- topk::Int: The number of candidates to return. The default value is 5.
- ngram::Union{Int, AbstractVector}: The length of UTF-8 byte n-grams to use for language detection. The default value is the value set in initialize, and should not exceed that value.
Returns
- A list of the topk languages and their probabilities.
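The following sketch requests the three most likely candidates for a sentence. It assumes the package is installed and initialized, and it does not assume a particular element type for the returned list, so each candidate is simply printed as-is. The sample sentence is illustrative only.

```julia
using LanguageIdentification

# Assumption: initialize() is called once before detection; defaults are used here.
LanguageIdentification.initialize()

# Ask for the three most probable languages instead of the default five.
candidates = LanguageIdentification.langprob("Ceci est un test."; topk = 3)

for candidate in candidates
    println(candidate)   # one language together with its probability
end
```

A small topk keeps the output readable when only the leading candidates matter, for example when checking how close the runner-up is to the top result.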
LanguageIdentification.supported_languages — Method
supported_languages() -> Vector{String}
Return a vector containing all the languages (ISO 639-3 codes) that are supported by this package.
LanguageIdentification.vocabulary_sizes — Method
The function vocabulary_sizes() returns the sizes of the vocabulary for each language that was loaded by the initialize function.
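As a hedged sketch of how initialize and vocabulary_sizes fit together, the example below restricts the detector to a few languages with a narrower vocabulary range and then inspects the loaded sizes. The codes "eng", "fra", and "deu" are assumed to be supported, so they are filtered against supported_languages() first. The return type of vocabulary_sizes() is not documented here, so the result is only printed.

```julia
using LanguageIdentification

# Assumption: these ISO 639-3 codes are among the supported languages;
# intersect() silently drops any that are not.
wanted = intersect(["eng", "fra", "deu"], LanguageIdentification.supported_languages())

# Narrower settings than the defaults: fewer languages, shorter n-grams,
# and a tighter vocabulary range.
LanguageIdentification.initialize(;
    languages = wanted,
    ngram = 1:3,
    vocabulary = 1000:2000,
)

sizes = LanguageIdentification.vocabulary_sizes()
println(sizes)
```

Comparing the printed sizes across different cutoff and vocabulary settings is one way to see the memory side of the trade-off that initialize describes.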