LanguageIdentification.jl Documentation

Adding LanguageIdentification.jl

julia> using Pkg
julia> Pkg.add("LanguageIdentification")
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed LanguageIdentification ─ v1.0.0
    Updating `~/work/LanguageIdentification.jl/LanguageIdentification.jl/docs/Project.toml`
  [35248bf2] ~ LanguageIdentification v1.0.1 `~/work/LanguageIdentification.jl/LanguageIdentification.jl` ⇒ v1.0.0
    Updating `~/work/LanguageIdentification.jl/LanguageIdentification.jl/docs/Manifest.toml`
  [35248bf2] ~ LanguageIdentification v1.0.1 `~/work/LanguageIdentification.jl/LanguageIdentification.jl` ⇒ v1.0.0
Precompiling project...
  ✓ LanguageIdentification
  1 dependency successfully precompiled in 1 seconds. 24 already precompiled.
  1 dependency precompiled but a different version is currently loaded. Restart julia to access the new version

Documentation

LanguageIdentification.initialize - Method
initialize(; languages=supported_languages(), ngram=1:4, cutoff=0.85, vocabulary=1000:5000)

Initialize the language detector with the given parameters. Different parameter settings trade off accuracy, speed, and memory usage differently.

Arguments

  • languages::Vector{String}: A list of languages to be used for language detection. If this argument is not provided, all the languages returned by the supported_languages function will be used.
  • ngram::Union{Int, AbstractVector}: The length, in bytes, of the UTF-8 byte n-grams used for language detection. Pass an integer to use a single n-gram size, or a range to use several sizes. The default value is 1:4, and the maximum size allowed is 7.
  • cutoff::Float64: The cumulative-probability cutoff applied to the n-grams used for language detection. The default value is 0.85, and the value must be between 0 and 1.
  • vocabulary::Union{Int, AbstractRange}: The size range of the vocabulary of each language. The default value is 1000:5000.
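To make the ngram argument concrete, the following is a minimal sketch of what "UTF-8 byte n-grams" of sizes 1:4 look like. It uses only Base Julia. The helper name byte_ngrams is hypothetical and is not part of LanguageIdentification.jl, and the package's internal extraction may differ.

```julia
# Hypothetical helper (not part of LanguageIdentification.jl):
# collect every UTF-8 byte n-gram of `text` for each size in `sizes`.
function byte_ngrams(text::AbstractString, sizes::Union{Int,AbstractVector{Int}}=1:4)
    bytes = Vector{UInt8}(codeunits(text))   # raw UTF-8 bytes of the string
    grams = Vector{Vector{UInt8}}()
    for n in sizes                           # an Int iterates as a single size
        for i in 1:(length(bytes) - n + 1)   # empty range when n > length(bytes)
            push!(grams, bytes[i:i+n-1])
        end
    end
    return grams
end

# "né" is two characters but three bytes, because 'é' takes two bytes in UTF-8.
grams = byte_ngrams("né", 1:2)
foreach(g -> println(g), grams)
```

Because n-grams are counted in bytes rather than characters, a size-2 n-gram may cover a single non-ASCII character or straddle two characters.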
LanguageIdentification.langid - Method
langid(text, languages::Vector{String}, profiles::Vector{Dict{Vector{UInt8}, Float32}}; ngram=NGRAM)

Return the language of the given text based on the provided language profiles.

Arguments

  • text: A string or a collection of strings to be analyzed for language identification.
  • languages::Vector{String}: The list of languages to choose from. Omitting this argument will use all supported languages.
  • profiles::Vector{Dict{Vector{UInt8}, Float32}}: The language profiles to use for identification. Omitting this argument will use the default profiles.
  • ngram::Union{Int, AbstractVector}: The length of UTF-8 byte n-grams to use for language detection. The default is the value set in initialize; a value passed here should not exceed it.

Returns

  • The language of the given text.
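A short usage sketch, assuming the package is installed and that languages and profiles may be omitted as described above. Calls are module-qualified so the sketch does not depend on which names are exported. The input sentence is arbitrary, and no particular output is claimed.

```julia
using LanguageIdentification

# Build the detector once; these are the documented defaults, written out explicitly.
LanguageIdentification.initialize(ngram=1:4, cutoff=0.85, vocabulary=1000:5000)

# Identify a single string using the default languages and profiles.
lang = LanguageIdentification.langid("This is a test.")
println(lang)

# Restrict this call to shorter n-grams; 1:2 stays within the 1:4 set in initialize.
lang_short = LanguageIdentification.langid("This is a test."; ngram=1:2)
println(lang_short)
```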
LanguageIdentification.langprob - Method
langprob(text, languages::Vector{String}, profiles::Vector{Dict{Vector{UInt8}, Float32}}; topk=5, ngram=NGRAM)

Return the probability distribution over languages for the given text based on the provided language profiles.

Arguments

  • text: A string or a collection of strings to be analyzed for language identification.
  • languages::Vector{String}: A list of languages to choose from. If this argument is not provided, all the languages returned by the supported_languages function will be used.
  • profiles::Vector{Dict{Vector{UInt8}, Float32}}: The language profiles to use for identification. If this argument is not provided, the default profiles will be used.
  • topk::Int: The number of candidates to return. The default value is 5.
  • ngram::Union{Int, AbstractVector}: The length of UTF-8 byte n-grams to use for language detection. The default is the value set in initialize; a value passed here should not exceed it.

Returns

  • A list of the topk languages and their probabilities.
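A short usage sketch of topk, assuming the package is installed and that languages and profiles may be omitted as described above. The element type of the returned list is not stated here, so the loop assumes each entry can be destructured into a language and a probability (for example a Pair or a 2-tuple).

```julia
using LanguageIdentification

# Initialize with the default settings.
LanguageIdentification.initialize()

# Ask for the three most probable languages instead of the default five.
candidates = LanguageIdentification.langprob("This is a test."; topk=3)

# Assumption: each entry pairs a language with its probability.
for (language, probability) in candidates
    println(language, " => ", probability)
end
```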
