LanguageIdentification.jl Documentation
Adding LanguageIdentification.jl
julia> using Pkg

julia> Pkg.add("LanguageIdentification")
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed LanguageIdentification ─ v1.0.0
    Updating `~/work/LanguageIdentification.jl/LanguageIdentification.jl/docs/Project.toml`
  [35248bf2] ~ LanguageIdentification v1.0.1 `~/work/LanguageIdentification.jl/LanguageIdentification.jl` ⇒ v1.0.0
    Updating `~/work/LanguageIdentification.jl/LanguageIdentification.jl/docs/Manifest.toml`
  [35248bf2] ~ LanguageIdentification v1.0.1 `~/work/LanguageIdentification.jl/LanguageIdentification.jl` ⇒ v1.0.0
Precompiling project...
  ✓ LanguageIdentification
  1 dependency successfully precompiled in 1 seconds. 24 already precompiled.
  1 dependency precompiled but a different version is currently loaded. Restart julia to access the new version
Documentation
LanguageIdentification.initialize — Method
initialize(; languages=supported_languages(), ngram=1:4, cutoff=0.85, vocabulary=1000:5000)
Initialize the language detector with the given parameters. Different parameters have different balances among accuracy, speed, and memory usage.
Arguments
- languages::Vector{String}: A list of languages to be used for language detection. If this argument is not provided, all the languages returned by the supported_languages function will be used.
- ngram::Union{Int, AbstractVector}: Specifies the length of UTF-8 byte n-grams to be used for language detection. An integer value can be provided to use a single n-gram size, while a range can be provided to use multiple n-gram sizes. The default value is 1:4, and the maximum value allowed is 7.
- cutoff::Float64: The cutoff value of the cumulative probability of the n-grams to use for language detection. The default value is 0.85, and it must be between 0 and 1.
- vocabulary::Union{Int, AbstractRange}: The size range of the vocabulary of each language. The default value is 1000:5000.
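As an illustration of the keyword arguments described above, here is a minimal sketch of a smaller, faster configuration. The language codes are only examples and are assumed to be among those returned by supported_languages(); the specific ngram and vocabulary values are arbitrary choices, not recommendations from the package.

```julia
using LanguageIdentification

# Restrict detection to three languages (illustrative ISO 639-3 codes,
# assumed to be supported), use n-grams of 1 to 3 bytes, and cap the
# per-language vocabulary at 3000 entries.
LanguageIdentification.initialize(;
    languages = ["eng", "fra", "deu"],
    ngram = 1:3,
    cutoff = 0.85,
    vocabulary = 1000:3000,
)
```

Fewer languages, shorter n-grams, and smaller vocabularies should reduce memory use and loading time, presumably at some cost in accuracy.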
LanguageIdentification.langid — Method
langid(text, languages::Vector{String}, profiles::Vector{Dict{Vector{UInt8}, Float32}}; ngram=NGRAM)
Return the language of the given text based on the provided language profiles.
Arguments
- text: A string or a collection of strings to be analyzed for language identification.
- languages::Vector{String}: The list of languages to choose from. Omitting this argument will use all supported languages.
- profiles::Vector{Dict{Vector{UInt8}, Float32}}: The language profiles to use for identification. Omitting this argument will use the default profiles.
- ngram::Union{Int, AbstractVector}: The length of UTF-8 byte n-grams to use for language detection. The default value is the value set in initialize, and should not exceed that value.
Returns
- The language of the given text.
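A minimal usage sketch, assuming the detector is initialized with its defaults and that the one-argument form (languages and profiles omitted) is available as the argument descriptions suggest. The sample sentence is made up, and calls are qualified with the module name so the sketch does not depend on which names are exported.

```julia
using LanguageIdentification

# Load the default profiles before detecting anything.
LanguageIdentification.initialize()

text = "This is a short English sentence."

# Detect with the n-gram sizes set in initialize (1:4 by default).
lang = LanguageIdentification.langid(text)
println(lang)

# Use shorter n-grams for this call only; the value stays within
# the 1:4 range configured above, as the documentation requires.
lang_short = LanguageIdentification.langid(text; ngram = 1:2)
println(lang_short)
```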
LanguageIdentification.langprob — Method
langprob(text, languages::Vector{String}, profiles::Vector{Dict{Vector{UInt8}, Float32}}; topk=5, ngram=NGRAM)
Return the probability distribution of the language of the given text based on the provided language profiles.
Arguments
- text: A string or a collection of strings to be analyzed for language identification.
- languages::Vector{String}: A list of languages to choose from. If this argument is not provided, all the languages returned by the supported_languages function will be used.
- profiles::Vector{Dict{Vector{UInt8}, Float32}}: The language profiles to use for identification. If this argument is not provided, the default profiles will be used.
- topk::Int: The number of candidates to return. The default value is 5.
- ngram::Union{Int, AbstractVector}: The length of UTF-8 byte n-grams to use for language detection. The default value is the value set in initialize, and should not exceed that value.
Returns
- A list of the topk languages and their probabilities.
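A minimal sketch of requesting the top three candidates for one sentence, assuming default initialization and the one-argument form with languages and profiles omitted. The exact element type of the returned list is not stated here, so the sketch only iterates over it and prints each entry.

```julia
using LanguageIdentification

LanguageIdentification.initialize()

# Ask for the three most probable languages instead of the default five.
candidates = LanguageIdentification.langprob(
    "Ceci est une phrase en français.";
    topk = 3,
)

# Print each candidate together with its probability.
for candidate in candidates
    println(candidate)
end
```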
LanguageIdentification.supported_languages — Method
supported_languages() -> Vector{String}
Return a vector containing all the languages (ISO 639-3 codes) that are supported by this package.
LanguageIdentification.vocabulary_sizes — Method
The function vocabulary_sizes() returns the sizes of the vocabulary for each language that was loaded by the initialize function.
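The two inspection functions can be combined to check what a given configuration actually loaded. This is a sketch under default initialization; the container type returned by vocabulary_sizes() is not documented above, so the result is simply printed.

```julia
using LanguageIdentification

# List every ISO 639-3 code the package can detect.
langs = LanguageIdentification.supported_languages()
println(length(langs), " supported languages")

# Load the default profiles, then report the vocabulary size per language.
LanguageIdentification.initialize()
sizes = LanguageIdentification.vocabulary_sizes()
println(sizes)
```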