Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis

arXiv:2507.16284v2 Announce Type: replace
Abstract: Language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. Non-AI-based approaches to the problem, however, have been overshadowed. This research explores a mathematical implementation of an algorithm for language determination that uses monogram and bigram frequency rankings derived from established linguistic research. The datasets comprise texts that vary in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy on longer texts. These results indicate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
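
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea: build a character-bigram frequency profile for the input text and pick the language whose reference profile is closest under a Minkowski norm of order p. The function names, the use of relative frequencies instead of the paper's frequency rankings, the default p = 2, and the two toy reference sentences are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def bigram_profile(text: str) -> dict[str, float]:
    """Relative frequency of each within-word letter bigram in `text`."""
    # Replace every non-letter with a space so bigrams never span word breaks.
    words = "".join(c if c.isalpha() else " " for c in text.lower()).split()
    counts = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def minkowski_distance(a: dict[str, float], b: dict[str, float], p: float = 2.0) -> float:
    """Minkowski distance of order p between two sparse frequency vectors."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) ** p for k in keys) ** (1.0 / p)


def detect_language(text: str, profiles: dict[str, dict[str, float]], p: float = 2.0) -> str:
    """Return the label of the reference profile nearest to the text's profile."""
    sample = bigram_profile(text)
    return min(profiles, key=lambda lang: minkowski_distance(sample, profiles[lang], p))


# Toy reference material (assumption): a real system would use frequency
# tables from published linguistic research, as the abstract describes.
REFERENCE_TEXTS = {
    "english": "the quick brown fox jumps over the lazy dog and then runs away",
    "german": "der schnelle braune fuchs springt über den faulen hund und läuft weg",
}
PROFILES = {lang: bigram_profile(txt) for lang, txt in REFERENCE_TEXTS.items()}

if __name__ == "__main__":
    print(detect_language("the dog runs over the fox", PROFILES))
```

Setting p = 1 gives the Manhattan distance and p = 2 the Euclidean distance; which order works best, and how monogram frequencies are combined with bigram ones, is left open here because the abstract does not say.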
