HinglishLM

Experimenting with Language Models on Hinglish data

Hinglish—an informal code-mixed blend of Hindi and English—is prevalent across social media, messaging apps, and online forums. The primary motivation for this project was to build a Hinglish chatbot that could be used for WhatsApp chats, but along the way Hinglish itself proved worth investigating as a somewhat distinct form of code-mixing. Hinglish-LM is a fun project focused on evaluating language model architectures for Hinglish through custom dataset curation and architectural ablations.


Why?

  • Frequent switching between Hindi and English, often mid-sentence
  • Spelling variations of the same word, e.g. “kya”, “kyaaa”, “kia”
  • No fixed grammar: sentence structure is very informal

Hinglish Dataset Curation

  • Compiled a Hinglish dataset from Reddit and other sources, incorporating real-world conversational data with diverse linguistic variations (an illustrative cleanup sketch follows this list).
  • Datasets combined:
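To give a flavor of what this curation step involves, here is a hypothetical normalization pass over scraped text; the elongation-collapsing rule and the function name are illustrative assumptions, not the project's actual pipeline:

```python
import re

def normalize_hinglish(text: str) -> str:
    """Illustrative cleanup for scraped Hinglish text (hypothetical rules)."""
    text = text.lower().strip()
    # Strip URLs and user mentions, common in Reddit/social data.
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    # Collapse character elongations: "kyaaa" -> "kyaa" (keep at most two
    # repeats so some of the informal emphasis survives).
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Squeeze whitespace left over from the removals above.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_hinglish("Kyaaa scene hai bhaiii @user https://x.co"))
# -> "kyaa scene hai bhaii"
```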

Custom Model Training

  • Pretrained and instruction-finetuned a decoder-only transformer from scratch, specifically optimized for Hinglish (a minimal architectural sketch follows).
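A minimal sketch of such a decoder-only model in PyTorch; the hyperparameters and layer choices below are placeholders, not the project's actual configuration:

```python
import torch
import torch.nn as nn

class DecoderOnlyLM(nn.Module):
    """GPT-style decoder: token + learnable positional embeddings,
    causally masked transformer blocks, and a weight-tied LM head."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learnable positions
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # tie input/output embeddings

    def forward(self, idx):
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        # Causal mask: each position may only attend to itself and the past.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device), 1)
        return self.lm_head(self.blocks(x, mask=mask))

model = DecoderOnlyLM(vocab_size=8000)
logits = model(torch.randint(0, 8000, (2, 16)))  # (batch, seq, vocab)
```

Instruction finetuning then continues training the same network on prompt/response pairs rather than raw text.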

Architectural Ablations

  • Conducted in-depth evaluations of positional encoding schemes (two are sketched after this list):
    • Sinusoidal vs. Learnable Positional Embeddings
    • Rotary Positional Embeddings (RoPE) vs. ALiBi (Attention with Linear Biases)
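For reference, two of the compared schemes can be sketched as follows; these are minimal illustrations, and the implementations in this repo may differ:

```python
import math
import torch

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sin/cos table from 'Attention Is All You Need'; the learnable
    alternative is simply an nn.Embedding over positions."""
    pos = torch.arange(max_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # (max_len, d_model), added to token embeddings

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """ALiBi uses no positional embedding at all; instead each head adds a
    linear distance penalty directly to its attention scores."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    dist = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]
    # Negative bias grows with distance into the past; future positions
    # (positive dist) are clamped to zero and handled by the causal mask.
    return slopes[:, None, None] * dist.clamp(max=0).float()  # (heads, T, T)

pe = sinusoidal_encoding(512, 256)
bias = alibi_bias(n_heads=4, seq_len=16)  # add to q @ k^T before softmax
```

RoPE, by contrast, rotates the query and key vectors by a position-dependent angle, so relative position enters attention through their dot products.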

Finetuning on WhatsApp-Style Chats

  • Used conversational data to fine-tune the language model (a data-preparation sketch follows this list)
  • Some of the conversations produced:
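A minimal sketch of how exported chats could be turned into (context, reply) training pairs; the line format below assumes a typical WhatsApp .txt export (which varies by locale), and the names are illustrative rather than the project's actual preprocessing:

```python
import re

# Matches lines like "12/05/23, 9:14 pm - Rahul: kya scene hai"
LINE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, .*? - (?P<name>[^:]+): (?P<msg>.+)$")

def to_pairs(export_text: str, me: str):
    """Extract (previous message, my reply) pairs from a raw chat export."""
    msgs = [(m["name"], m["msg"]) for line in export_text.splitlines()
            if (m := LINE.match(line))]
    return [(msgs[i - 1][1], msgs[i][1]) for i in range(1, len(msgs))
            if msgs[i][0] == me and msgs[i - 1][0] != me]

chat = """12/05/23, 9:14 pm - Rahul: kya scene hai
12/05/23, 9:15 pm - Me: kuch nahi yaar, tu bata"""
print(to_pairs(chat, me="Me"))  # [('kya scene hai', 'kuch nahi yaar, tu bata')]
```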

Tokenizer Comparisons

  • Assessed performance differences between (comparison sketched below):
    • the tiktoken tokenizer (optimized for GPT-like architectures)
    • a BPE tokenizer trained on the Hinglish corpus (capturing Hinglish subword units more effectively)
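A sketch of how such a comparison can be run, pitting a stock tiktoken encoding against a BPE tokenizer trained from scratch with the Hugging Face tokenizers library; the corpus and vocabulary size here are placeholders:

```python
import tiktoken
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = ["kya scene hai bhai", "kuch nahi yaar, tu bata",
          "kal milte hai phir"]  # stand-in for the real Hinglish corpus

# Off-the-shelf GPT-2 tokenizer, trained mostly on English text.
gpt = tiktoken.get_encoding("gpt2")

# BPE tokenizer trained only on the Hinglish data.
hing = Tokenizer(models.BPE(unk_token="[UNK]"))
hing.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=500, special_tokens=["[UNK]"])
hing.train_from_iterator(corpus, trainer)

s = "kya scene hai bhai"
print("tiktoken :", len(gpt.encode(s)), "tokens")
print("hing bpe :", len(hing.encode(s).tokens), "tokens")
# Fewer tokens per sentence usually means the vocabulary fits the data better.
```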

Limitations & Future Work

  • The model still hallucinates a lot; the dataset likely needs further refinement.
  • A combined embedding approach could help close the gap between varied surface forms of the same token (e.g. “kya” vs. “kyaaa”).