NeuralTranslate v1.0 · EN→HI
Transformer-Based English → हिन्दी Translation Engine
Custom Transformer ("Attention Is All You Need", Vaswani et al., 2017) · PyTorch · FastAPI · BLEU 23.64
Neural Network Architecture
Transformers are deep learning models that process sequential data using attention mechanisms, allowing them to weigh the importance of different parts of the input simultaneously, unlike older RNN/LSTM models that process data step by step.
Self-Attention Mechanism
The core innovation: each word in a sentence can directly "attend" to every other word. This lets the model understand context, like knowing "bank" means river bank or financial bank based on surrounding words.
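A minimal sketch of this idea, scaled dot-product self-attention for a single head with no masking (the projection matrices w_q, w_k, w_v are illustrative stand-ins, not the project's actual weights):

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) token embeddings for one sentence
    q = x @ w_q          # queries: what each token is looking for
    k = x @ w_k          # keys:    what each token offers
    v = x @ w_v          # values:  the content to be mixed together
    d_k = q.size(-1)
    # every token scores every other token, so "bank" can attend to "river"
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v                              # context-mixed representations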
Parallel Processing
Unlike sequential models, Transformers process all tokens at once, enabling massive parallelization on GPUs. This is why they scale to billions of parameters and massive datasets.
Step 1: Encoding
English text is split into tokens (words/subwords), converted to numerical vectors, and passed through encoder layers. Each layer builds increasingly rich representations of meaning and grammar.
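A rough sketch of that pipeline using PyTorch's built-in encoder blocks, with the dimensions from the parameter table further down; the token ids shown are hypothetical, and the real project uses its own custom modules rather than nn.TransformerEncoder:

import torch
import torch.nn as nn

D_MODEL, VOCAB = 256, 30_000                      # values from the parameter table below
embed = nn.Embedding(VOCAB, D_MODEL)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, dim_feedforward=2048,
                               dropout=0.1, batch_first=True),
    num_layers=6,
)

token_ids = torch.tensor([[5, 812, 94, 3]])       # hypothetical ids for an English sentence
x = embed(token_ids)                              # (1, seq_len, 256) numerical vectors
# sinusoidal positional encodings would be added here (see "Positional Encoding" below)
memory = encoder(x)                               # (1, seq_len, 256) contextual representations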
Step 2: Decoding
The decoder generates Hindi text one token at a time, attending to both the encoded English representation and previously generated Hindi words. Each step predicts the most probable next word.
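One decoding step might look like the sketch below. The decoder, target embedding, and output projection are hypothetical handles (built analogously to the encoder above); the point is that the step sees both the English memory and the Hindi tokens generated so far, and returns a probability for every word in the Hindi vocabulary:

import torch
import torch.nn as nn

def next_token_probs(decoder, output_proj, embed, memory, hindi_ids):
    # hindi_ids: Hindi tokens generated so far, shape (1, t)
    t = hindi_ids.size(1)
    tgt = embed(hindi_ids)                                    # embed the previous Hindi words
    causal = nn.Transformer.generate_square_subsequent_mask(t)  # no peeking at future positions
    h = decoder(tgt, memory, tgt_mask=causal)                 # attends to itself AND to the English memory
    logits = output_proj(h[:, -1])                            # scores over the whole Hindi vocabulary
    return torch.softmax(logits, dim=-1)                      # probability of each possible next word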
Vocabulary & Tokens
The model uses a fixed, shared BPE vocabulary (30,000 subword units covering both languages; see the parameter table below). Rare words are broken into subword units (e.g., "unhappiness" → "un" + "happiness"), enabling translation of unseen words.
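A toy illustration of subword splitting. Real BPE applies learned merge rules over the 30,000-piece vocabulary; the greedy longest-match segmenter and tiny vocabulary here are purely for intuition:

def subword_split(word, vocab):
    # greedy longest-match segmentation over a toy subword vocabulary
    # (illustrative only; the real tokenizer uses learned BPE merges)
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):          # try the longest remaining piece first
            piece = word[i:j]
            if piece in vocab or j == i + 1:       # fall back to single characters
                pieces.append(piece)
                i = j
                break
    return pieces

toy_vocab = {"un", "happiness", "happy", "ness"}
print(subword_split("unhappiness", toy_vocab))     # ['un', 'happiness']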
What is BLEU?
Bilingual Evaluation Understudy: an automated metric that compares machine translations to human reference translations. Scores range from 0 to 100, with higher being better.
How to Interpret Scores
20 to 30: Understandable translations with errors. 30 to 40: Good quality, minor issues. 40 to 50: Very good, near-human. 50+: Professional quality. Our score of 23.64 indicates functional translations with room for improvement.
How It's Calculated
BLEU measures n-gram precision (how many 1 to 4 word sequences match references), penalizes overly short translations, and geometrically averages scores. It's fast but imperfect; it doesn't measure fluency or meaning directly.
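For reference, the standard BLEU formula, where p_n is the modified n-gram precision, c the candidate length, and r the reference length:

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{4} w_n \log p_n\right), \qquad w_n = \tfrac{1}{4}

\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}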
Greedy Decoding
At each step, picks the single most probable next word. Fast and simple, but can miss better overall translations, like always choosing the locally best option without considering the global picture.
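A minimal greedy loop, reusing the hypothetical next_token_probs step sketched earlier; the BOS/EOS token ids are assumptions:

import torch

def greedy_decode(decoder, output_proj, embed, memory, bos_id, eos_id, max_len=100):
    ids = torch.tensor([[bos_id]])                            # start-of-sentence token
    for _ in range(max_len):
        probs = next_token_probs(decoder, output_proj, embed, memory, ids)
        next_id = probs.argmax(dim=-1, keepdim=True)          # locally best word, nothing else considered
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:                          # stop at end-of-sentence
            break
    return ids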
Beam Search
Maintains multiple candidate translations (controlled by "beam width k"). At each step, expands all candidates and keeps the top-k overall. Higher k = more thorough search but slower. Typically produces more fluent, accurate translations.
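A compact beam-search sketch under the same assumptions (length normalization and batching omitted for clarity):

import torch

def beam_search(decoder, output_proj, embed, memory, bos_id, eos_id, k=4, max_len=100):
    # each hypothesis is (token_ids, cumulative log-probability)
    beams = [(torch.tensor([[bos_id]]), 0.0)]
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            if ids[0, -1].item() == eos_id:                   # finished hypotheses carry over unchanged
                candidates.append((ids, score))
                continue
            probs = next_token_probs(decoder, output_proj, embed, memory, ids)
            top_p, top_i = probs[0].topk(k)                   # expand only the k best continuations
            for p, i in zip(top_p, top_i):
                new_ids = torch.cat([ids, i.view(1, 1)], dim=1)
                candidates.append((new_ids, score + torch.log(p).item()))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]   # keep the top-k overall
        if all(ids[0, -1].item() == eos_id for ids, _ in beams):
            break
    return beams[0][0]                                        # highest-scoring sequence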
Why Beam Search Wins
Translation isn't just picking words; it's finding the best sequence. Beam search explores multiple paths, avoiding dead-ends greedy decoding might fall into. Try both modes above to see the difference!
Multi-Head Attention
Instead of one attention mechanism, the model runs 8 parallel "heads," each learning different relationships (e.g., subject-verb agreement, pronoun resolution). Results are combined for richer understanding.
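A sketch of the head-splitting mechanics: with d_model = 256 and 8 heads, each head attends over its own 32-dimensional slice. The learned per-head projections and the final output projection are omitted here for brevity:

import torch
import torch.nn.functional as F

def multi_head_attention(q, k, v, n_heads=8):
    # q, k, v: (batch, seq_len, d_model); here d_model = 256, so each head gets 256 / 8 = 32 dims
    b, t, d = q.shape
    d_head = d // n_heads
    split = lambda x: x.view(b, t, n_heads, d_head).transpose(1, 2)   # (b, heads, t, d_head)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    weights = F.softmax(scores, dim=-1)                               # one attention map per head
    out = weights @ v
    # concatenate the heads back into a single 256-dim vector per token
    return out.transpose(1, 2).reshape(b, t, d)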
Positional Encoding
Since Transformers process all tokens simultaneously (not sequentially), they need explicit position information. Sinusoidal patterns are added to embeddings so the model knows word order: "dog bites man" vs "man bites dog."
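The standard sinusoidal encoding from the original paper, written out as a short sketch with this model's dimensions:

import torch

def sinusoidal_positions(max_len=512, d_model=256):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len).unsqueeze(1).float()          # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()                   # even dimension indices
    angles = pos / (10_000 ** (i / d_model))                  # (max_len, d_model / 2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe                                                 # added element-wise to the token embeddings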
Encoder-Decoder Stack
The model stacks 6 encoder layers and 6 decoder layers. Each layer has multi-head attention, feed-forward networks, and residual connections. Deeper networks learn more abstract representations but require more training.
Hyperparameters
d_model 256
Encoder / Decoder layers 6 + 6
Attention heads 8
FFN inner dim (d_ff) 2,048
Vocab size (shared BPE) 30,000
Max sequence length 512
Dropout 0.1

Parameter Breakdown
Encoder embedding 7.68M
Encoder (6 blocks) 7.89M
Decoder embedding 7.68M
Decoder (6 blocks) 9.47M
Output projection 7.71M
Total Parameters 40,433,968
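The table above maps onto a model definition roughly like the sketch below. It mirrors the configuration using PyTorch's built-in nn.Transformer rather than the project's custom modules, and omits positional encodings and masks for brevity:

import torch.nn as nn

VOCAB, D_MODEL = 30_000, 256

class TranslationModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(VOCAB, D_MODEL)        # encoder embedding, ~7.68M params
        self.tgt_embed = nn.Embedding(VOCAB, D_MODEL)        # decoder embedding, ~7.68M params
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            dim_feedforward=2048, dropout=0.1, batch_first=True,
        )
        self.out = nn.Linear(D_MODEL, VOCAB)                  # output projection, ~7.71M params

    def forward(self, src_ids, tgt_ids):
        src = self.src_embed(src_ids)                         # positional encodings omitted here
        tgt = self.tgt_embed(tgt_ids)
        h = self.transformer(src, tgt)
        return self.out(h)

model = TranslationModel()
print(sum(p.numel() for p in model.parameters()))             # ~40.4M, in line with the total above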
What is Cross-Attention?
In a Transformer decoder, cross-attention connects the decoder to the encoder. At each step of generating a Hindi word, the model looks back at every English word and decides which ones matter most. The heatmap below visualizes these "attention weights" — brighter cells mean stronger connections.
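Stripped of the learned query/key/value projections, the cross-attention step can be sketched as follows; the returned weight matrix is the quantity the heatmap displays (the heatmap shows it with English tokens as rows, i.e. transposed):

import torch.nn.functional as F

def cross_attention(decoder_states, encoder_memory):
    # decoder_states: (tgt_len, d_model), the Hindi positions generated so far
    # encoder_memory: (src_len, d_model), the encoded English sentence
    d = decoder_states.size(-1)
    scores = decoder_states @ encoder_memory.T / d ** 0.5     # (tgt_len, src_len)
    weights = F.softmax(scores, dim=-1)                       # each Hindi step distributes attention over English words
    context = weights @ encoder_memory                        # English information pulled into each Hindi step
    return context, weights                                   # 'weights' is what the heatmap visualizes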
Reading the Heatmap
Each row is an English input token. Each column is a Hindi output token. A bright cell at row "water" and column "पानी" means the model strongly attended to "water" when producing "पानी". The strongest connections are highlighted so you can trace the model's reasoning at a glance.
Why 6 Layers × 8 Heads?
This model has 6 decoder layers, each with 8 attention heads. The heatmap below averages all 48 attention maps (6 × 8) into a single view. Deeper layers capture abstract relationships (e.g., word order, grammar), while earlier layers focus on surface-level word matching.
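A small sketch of that averaging step, assuming the per-layer, per-head cross-attention weights have been collected into one tensor (the random values here are dummies standing in for real weights):

import torch

# hypothetical collected cross-attention weights: (6 layers, 8 heads, tgt_len, src_len)
attn_maps = torch.rand(6, 8, 5, 7).softmax(dim=-1)            # dummy values for illustration

combined = attn_maps.mean(dim=(0, 1))                         # average the 6 × 8 = 48 maps → (tgt_len, src_len)
strongest = combined.argmax(dim=-1)                           # for each Hindi token, its most-attended English token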
Greedy vs Beam Search
Both decoding modes produce an attention heatmap. Greedy picks the most likely word at each step (faster). Beam search (Ctrl+Enter) explores multiple candidate translations and returns alternative outputs — useful for comparing translation nuances.