NeuralTranslate v1.0 · EN→HI
Transformer-Based English → हिन्दी Translation Engine
Custom Transformer ("Attention Is All You Need", Vaswani et al., 2017) · PyTorch · FastAPI · BLEU 23.64
Neural Network Architecture
Transformers are deep learning models that process sequential data using attention mechanisms, allowing them to weigh the importance of different parts of the input simultaneously, unlike older RNN/LSTM models that process data step by step.
Self-Attention Mechanism
The core innovation: each word in a sentence can directly "attend" to every other word. This lets the model understand context, like knowing "bank" means river bank or financial bank based on surrounding words.
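A minimal sketch of this idea, scaled dot-product self-attention for a single head with no masking (the projection matrices w_q, w_k, w_v are illustrative stand-ins, not the project's actual weights):

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) token embeddings for one sentence
    q = x @ w_q          # queries: what each token is looking for
    k = x @ w_k          # keys:    what each token offers
    v = x @ w_v          # values:  the content to be mixed together
    d_k = q.size(-1)
    # every token scores every other token, so "bank" can attend to "river"
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v                              # context-mixed representations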
Parallel Processing
Unlike sequential models, Transformers process all tokens at once, enabling massive parallelization on GPUs. This is why they scale to billions of parameters and massive datasets.
Step 1: Encoding
English text is split into tokens (words/subwords), converted to numerical vectors, and passed through encoder layers. Each layer builds increasingly rich representations of meaning and grammar.
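A rough sketch of that pipeline using PyTorch's built-in encoder blocks, with the dimensions from the parameter table further down; the token ids shown are hypothetical, and the real project uses its own custom modules rather than nn.TransformerEncoder:

import torch
import torch.nn as nn

D_MODEL, VOCAB = 256, 30_000                      # values from the parameter table below
embed = nn.Embedding(VOCAB, D_MODEL)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, dim_feedforward=2048,
                               dropout=0.1, batch_first=True),
    num_layers=6,
)

token_ids = torch.tensor([[5, 812, 94, 3]])       # hypothetical ids for an English sentence
x = embed(token_ids)                              # (1, seq_len, 256) numerical vectors
# sinusoidal positional encodings would be added here (see "Positional Encoding" below)
memory = encoder(x)                               # (1, seq_len, 256) contextual representations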
Step 2: Decoding
The decoder generates Hindi text one token at a time, attending to both the encoded English representation and previously generated Hindi words. Each step predicts the most probable next word.
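One decoding step might look like the sketch below. The decoder, target embedding, and output projection are hypothetical handles (built analogously to the encoder above); the point is that the step sees both the English memory and the Hindi tokens generated so far, and returns a probability for every word in the Hindi vocabulary:

import torch
import torch.nn as nn

def next_token_probs(decoder, output_proj, embed, memory, hindi_ids):
    # hindi_ids: Hindi tokens generated so far, shape (1, t)
    t = hindi_ids.size(1)
    tgt = embed(hindi_ids)                                    # embed the previous Hindi words
    causal = nn.Transformer.generate_square_subsequent_mask(t)  # no peeking at future positions
    h = decoder(tgt, memory, tgt_mask=causal)                 # attends to itself AND to the English memory
    logits = output_proj(h[:, -1])                            # scores over the whole Hindi vocabulary
    return torch.softmax(logits, dim=-1)                      # probability of each possible next word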
Vocabulary & Tokens
The model uses a fixed, shared BPE vocabulary (30,000 subword units covering both languages; see the parameter table below). Rare words are broken into subword units (e.g., "unhappiness" → "un" + "happiness"), enabling translation of unseen words.
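A toy illustration of subword splitting. Real BPE applies learned merge rules over the 30,000-piece vocabulary; the greedy longest-match segmenter and tiny vocabulary here are purely for intuition:

def subword_split(word, vocab):
    # greedy longest-match segmentation over a toy subword vocabulary
    # (illustrative only; the real tokenizer uses learned BPE merges)
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):          # try the longest remaining piece first
            piece = word[i:j]
            if piece in vocab or j == i + 1:       # fall back to single characters
                pieces.append(piece)
                i = j
                break
    return pieces

toy_vocab = {"un", "happiness", "happy", "ness"}
print(subword_split("unhappiness", toy_vocab))     # ['un', 'happiness']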
What is BLEU?
Bilingual Evaluation Understudy: an automated metric that compares machine translations to human reference translations. Scores range from 0 to 100, with higher being better.
How to Interpret Scores
20 to 30: Understandable translations with errors. 30 to 40: Good quality, minor issues. 40 to 50: Very good, near-human. 50+: Professional quality. Our score of 23.64 indicates functional translations with room for improvement.
How It's Calculated
BLEU measures n-gram precision (how many 1 to 4 word sequences match references), penalizes overly short translations, and geometrically averages scores. It's fast but imperfect; it doesn't measure fluency or meaning directly.
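For reference, the standard BLEU formula, where p_n is the modified n-gram precision, c the candidate length, and r the reference length:

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{4} w_n \log p_n\right), \qquad w_n = \tfrac{1}{4}

\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}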
Greedy Decoding
At each step, picks the single most probable next word. Fast and simple, but can miss better overall translations, like always choosing the locally best option without considering the global picture.
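A minimal greedy loop, reusing the hypothetical next_token_probs step sketched earlier; the BOS/EOS token ids are assumptions:

import torch

def greedy_decode(decoder, output_proj, embed, memory, bos_id, eos_id, max_len=100):
    ids = torch.tensor([[bos_id]])                            # start-of-sentence token
    for _ in range(max_len):
        probs = next_token_probs(decoder, output_proj, embed, memory, ids)
        next_id = probs.argmax(dim=-1, keepdim=True)          # locally best word, nothing else considered
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:                          # stop at end-of-sentence
            break
    return ids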
Beam Search
Maintains multiple candidate translations (controlled by "beam width k"). At each step, expands all candidates and keeps the top-k overall. Higher k = more thorough search but slower. Typically produces more fluent, accurate translations.
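A compact beam-search sketch under the same assumptions (length normalization and batching omitted for clarity):

import torch

def beam_search(decoder, output_proj, embed, memory, bos_id, eos_id, k=4, max_len=100):
    # each hypothesis is (token_ids, cumulative log-probability)
    beams = [(torch.tensor([[bos_id]]), 0.0)]
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            if ids[0, -1].item() == eos_id:                   # finished hypotheses carry over unchanged
                candidates.append((ids, score))
                continue
            probs = next_token_probs(decoder, output_proj, embed, memory, ids)
            top_p, top_i = probs[0].topk(k)                   # expand only the k best continuations
            for p, i in zip(top_p, top_i):
                new_ids = torch.cat([ids, i.view(1, 1)], dim=1)
                candidates.append((new_ids, score + torch.log(p).item()))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]   # keep the top-k overall
        if all(ids[0, -1].item() == eos_id for ids, _ in beams):
            break
    return beams[0][0]                                        # highest-scoring sequence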
Why Beam Search Wins
Translation isn't just picking words; it's finding the best sequence. Beam search explores multiple paths, avoiding dead-ends greedy decoding might fall into. Try both modes above to see the difference!
Multi-Head Attention
Instead of one attention mechanism, the model runs 8 parallel "heads," each learning different relationships (e.g., subject-verb agreement, pronoun resolution). Results are combined for richer understanding.
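A sketch of the head-splitting mechanics: with d_model = 256 and 8 heads, each head attends over its own 32-dimensional slice. The learned per-head projections and the final output projection are omitted here for brevity:

import torch
import torch.nn.functional as F

def multi_head_attention(q, k, v, n_heads=8):
    # q, k, v: (batch, seq_len, d_model); here d_model = 256, so each head gets 256 / 8 = 32 dims
    b, t, d = q.shape
    d_head = d // n_heads
    split = lambda x: x.view(b, t, n_heads, d_head).transpose(1, 2)   # (b, heads, t, d_head)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    weights = F.softmax(scores, dim=-1)                               # one attention map per head
    out = weights @ v
    # concatenate the heads back into a single 256-dim vector per token
    return out.transpose(1, 2).reshape(b, t, d)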
Positional Encoding
Since Transformers process all tokens simultaneously (not sequentially), they need explicit position information. Sinusoidal patterns are added to embeddings so the model knows word order: "dog bites man" vs "man bites dog."
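The standard sinusoidal encoding from the original paper, written out as a short sketch with this model's dimensions:

import torch

def sinusoidal_positions(max_len=512, d_model=256):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len).unsqueeze(1).float()          # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()                   # even dimension indices
    angles = pos / (10_000 ** (i / d_model))                  # (max_len, d_model / 2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe                                                 # added element-wise to the token embeddings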
Encoder-Decoder Stack
The model stacks 6 encoder layers and 6 decoder layers. Each layer has multi-head attention, feed-forward networks, and residual connections. Deeper networks learn more abstract representations but require more training.
Hyperparameters
d_model 256
Encoder / Decoder layers 6 + 6
Attention heads 8
FFN inner dim (d_ff) 2,048
Vocab size (shared BPE) 30,000
Max sequence length 512
Dropout 0.1

Parameter Breakdown
Encoder embedding 7.68M
Encoder (6 blocks) 7.89M
Decoder embedding 7.68M
Decoder (6 blocks) 9.47M
Output projection 7.71M
Total Parameters 40,433,968
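The table above maps onto a model definition roughly like the sketch below. It mirrors the configuration using PyTorch's built-in nn.Transformer rather than the project's custom modules, and omits positional encodings and masks for brevity:

import torch.nn as nn

VOCAB, D_MODEL = 30_000, 256

class TranslationModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(VOCAB, D_MODEL)        # encoder embedding, ~7.68M params
        self.tgt_embed = nn.Embedding(VOCAB, D_MODEL)        # decoder embedding, ~7.68M params
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            dim_feedforward=2048, dropout=0.1, batch_first=True,
        )
        self.out = nn.Linear(D_MODEL, VOCAB)                  # output projection, ~7.71M params

    def forward(self, src_ids, tgt_ids):
        src = self.src_embed(src_ids)                         # positional encodings omitted here
        tgt = self.tgt_embed(tgt_ids)
        h = self.transformer(src, tgt)
        return self.out(h)

model = TranslationModel()
print(sum(p.numel() for p in model.parameters()))             # ~40.4M, in line with the total above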
What is Cross-Attention?
In a Transformer decoder, cross-attention connects the decoder to the encoder. At each step of generating a Hindi word, the model looks back at every English word and decides which ones matter most. The heatmap below visualizes these "attention weights" — brighter cells mean stronger connections.
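Stripped of the learned query/key/value projections, the cross-attention step can be sketched as follows; the returned weight matrix is the quantity the heatmap displays (the heatmap shows it with English tokens as rows, i.e. transposed):

import torch.nn.functional as F

def cross_attention(decoder_states, encoder_memory):
    # decoder_states: (tgt_len, d_model), the Hindi positions generated so far
    # encoder_memory: (src_len, d_model), the encoded English sentence
    d = decoder_states.size(-1)
    scores = decoder_states @ encoder_memory.T / d ** 0.5     # (tgt_len, src_len)
    weights = F.softmax(scores, dim=-1)                       # each Hindi step distributes attention over English words
    context = weights @ encoder_memory                        # English information pulled into each Hindi step
    return context, weights                                   # 'weights' is what the heatmap visualizes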
Reading the Heatmap
Each row is an English input token. Each column is a Hindi output token. A bright cell at row "water" and column "पानी" means the model strongly attended to "water" when producing "पानी". The strongest connections are highlighted so you can trace the model's reasoning at a glance.
Why 6 Layers × 8 Heads?
This model has 6 decoder layers, each with 8 attention heads. The heatmap below averages all 48 attention maps (6 × 8) into a single view. Deeper layers capture abstract relationships (e.g., word order, grammar), while earlier layers focus on surface-level word matching.
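A small sketch of that averaging step, assuming the per-layer, per-head cross-attention weights have been collected into one tensor (the random values here are dummies standing in for real weights):

import torch

# hypothetical collected cross-attention weights: (6 layers, 8 heads, tgt_len, src_len)
attn_maps = torch.rand(6, 8, 5, 7).softmax(dim=-1)            # dummy values for illustration

combined = attn_maps.mean(dim=(0, 1))                         # average the 6 × 8 = 48 maps → (tgt_len, src_len)
strongest = combined.argmax(dim=-1)                           # for each Hindi token, its most-attended English token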
Greedy vs Beam Search
Both decoding modes produce an attention heatmap. Greedy picks the most likely word at each step (faster). Beam search (Ctrl+Enter) explores multiple candidate translations and returns alternative outputs — useful for comparing translation nuances.