
Getting Started with Prime RL: Building Your First Environment

TLDR: The RL ecosystem trains almost exclusively on English. IndicIFEval, a benchmark from AI4Bharat covering 14 Indic languages with fully deterministic, rule-based constraint checkers, is sitting on HuggingFace waiting to be used. I turned it into a Prime RL environment in one day. Here's the exact build, and why the reward design is the only thing that actually matters. Check out the environment here: indic-ifeval on the Hub

As AI becomes more democratized, the need keeps growing for models that are friendlier to native languages than the current ecosystem is. This is not just about what the model was trained on and whether it speaks the language; the model also has to understand instructions in that language and turn them into actual actions and tool calls.

Say you want to train a model on Hindi. You search HuggingFace. You find IndicIFEval --- a benchmark from the team at AI4Bharat with ~800 human-verified examples per language, verifiable rule-based constraints, two complementary subsets, a published paper. Released February 2026.

The first instinct is to look at the GitHub repo. The evaluation logic lives inside an lm-evaluation-harness integration --- custom config scripts, bare imports, sys.path.append() calls everywhere. It was built to run inside a specific harness, not to be imported or packaged.

Congratulations. You just found exactly why there are almost no Indic RL environments.

The benchmark exists. The constraint checkers exist. The dataset exists. The reward function doesn't. That's the only job left --- and it's one day of work.

First: Credit Where It's Due

Before the build, a shoutout to the AI4Bharat team.

IndicIFEval is not a quick dataset dump. It's ~800 human-verified examples per language across 14 Indic languages, built with native speakers as annotators for each language, with two complementary subsets — one translated and localized from English IFEval, one natively generated from Indic content. The constraint checkers are language-aware, handling script-specific tokenization and word boundaries correctly for languages like Hindi, Tamil, and Urdu that behave very differently from English.

This is serious benchmark infrastructure. The fact that it exists and is open-source is the entire reason this RL environment was possible to build.

If you train models using this environment, cite the paper: IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages.

What Prime RL's Environment Model Actually Is

Prime Intellect's verifiers library has a clean contract: write a load_environment() function that returns a SingleTurnEnv. That env wraps two things — a dataset and a rubric.

The rubric is where your reward functions live. A Rubric takes a list of async scoring functions and a list of weights:
```python
rubric = vf.Rubric(funcs=[strict_score, loose_score], weights=[1.0, 0.0])
```

Each scoring function receives (completion, answer, info) and returns a float between 0 and 1. That's the entire interface. The prime CLI handles evaluation loops, rollouts, model API calls, and reporting. You own the reward definition. Nothing else.
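To make that contract concrete, here's a toy scoring function with the same shape. Everything in it is illustrative (the function name, the `first_word` key in `info`) and not part of the actual environment:

```python
import asyncio

# Toy reward with the (completion, answer, info) -> float shape verifiers
# expects. It checks whether the response starts with a required word.
async def starts_with_score(completion, answer, info) -> float:
    required = info.get("first_word", "")
    # Completions may arrive as a raw string or a chat-message list.
    text = completion if isinstance(completion, str) else completion[-1]["content"]
    return 1.0 if text.strip().startswith(required) else 0.0

# A rubric would then weight it like any other scoring function:
# rubric = vf.Rubric(funcs=[starts_with_score], weights=[1.0])

score = asyncio.run(
    starts_with_score("आसानी से लिखा गया निबंध", None, {"first_word": "आसानी"})
)
print(score)  # 1.0
```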

This is the right abstraction. The hard part of building RL environments isn't infrastructure — it's finding tasks where "correct" is unambiguous enough to compute without a judge model.

Why IndicIFEval Is the Right Signal

RLVR (RL with verifiable rewards) has a well-known problem: most interesting tasks require a judge model to score, which introduces noise, latency, and a second source of failure. Math and code are the canonical exceptions — deterministic checkers, clean signal.

IndicIFEval is another exception, and it's underexploited.

Each example in the dataset embeds one or more verifiable constraints directly into the prompt:

  • "Your essay must begin with the word 'आसानी'."
  • "Do not use any commas."
  • "Write at least 300 words. Highlight 3 sections in markdown."

The checker either finds the required word at position one of paragraph one or it doesn't. The comma counter either finds commas or it doesn't. The word counter returns a number. These are not judgment calls.

The score is the fraction of constraints the model satisfied --- 0 to 1, computed deterministically.
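To show how little machinery this kind of scoring needs, here are minimal sketches of such checkers and the fraction-of-constraints score. These are illustrative stand-ins, not the actual IndicIFEval implementations (which are language-aware in ways these are not):

```python
# Illustrative constraint checkers -- not the vendored IndicIFEval code.

def check_no_commas(text: str) -> bool:
    # Fails on either the ASCII comma or the Urdu-style comma.
    return "," not in text and "،" not in text

def check_first_word(text: str, word: str) -> bool:
    # Does the first paragraph begin with the required word?
    first_para = text.strip().split("\n")[0]
    return first_para.strip().startswith(word)

def check_min_words(text: str, n: int) -> bool:
    # Naive whitespace word count; real checkers handle script-specific
    # tokenization for languages like Tamil and Urdu.
    return len(text.split()) >= n

def score(text: str, checks) -> float:
    # Fraction of constraints satisfied, 0 to 1.
    results = [fn(text) for fn in checks]
    return sum(results) / len(results)

text = "आसानी से यह हो गया"
checks = [
    check_no_commas,
    lambda t: check_first_word(t, "आसानी"),
    lambda t: check_min_words(t, 3),
]
print(score(text, checks))  # 1.0
```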

The research finding that makes this useful for RL training: models handle formatting constraints reasonably well, but struggle significantly with lexical and cross-lingual constraints in Indic languages. There's real headroom. The reward signal can measure improvement in exactly the dimensions where models are weak. That's the profile you want.

The Packaging Problem

The IndicIFEval checkers aren't on PyPI. They live as scripts inside an lm-evaluation-harness integration, designed to run with sys.path manipulation. Two problems to solve before they're importable:

Problem 1: Vendoring.

Copy the checker files into the environment package and make them a proper Python module tree:

```text
verification/
├── __init__.py
├── trans/          # checkers for the translated subset
│   ├── __init__.py
│   ├── instructions_registry.py
│   ├── utils.py
│   └── instructions/
│       ├── __init__.py
│       └── {lang}_instructions.py  (15 languages)
└── ground/         # checkers for the natively generated subset
    ├── __init__.py
    ├── instructions_registry.py
    ├── utils.py
    └── instructions/
        ├── __init__.py
        └── {lang}_instructions.py  (14 languages)
```

Problem 2: Imports.

The upstream code uses bare module imports throughout. Convert every import hi_instructions_util to from . import hi_instructions_util. Remove every sys.path.append(). Tedious but mechanical — maybe two hours.
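Most of that conversion can be scripted. A rough sketch with GNU sed, assuming the bare imports sit on their own lines and match the `*_instructions_util` naming (check the actual module names, and review the diff before committing):

```shell
# Rewrite bare sibling imports ("import hi_instructions_util") into relative
# imports ("from . import hi_instructions_util") across the vendored tree.
# Assumes GNU sed; the directory name and import pattern are assumptions.
find verification -name '*.py' -exec \
  sed -i 's/^import \([a-z_]*_instructions_util\)$/from . import \1/' {} +
```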

One more thing: the constraint checker has a print() statement inside the scoring loop. Suppress it or your eval output becomes unreadable:

```python
with contextlib.redirect_stdout(io.StringIO()):
    out = test_fn(inp, text, row["language"])
```

Writing the Reward Functions

With the checkers vendored and importable, the reward functions are six lines each:

```python
async def strict_score(completion, answer, info) -> float:
    utils = _UTILS[info["subset"]]
    return _score(utils, utils.test_instruction_following_strict, completion, info)

async def loose_score(completion, answer, info) -> float:
    utils = _UTILS[info["subset"]]
    return _score(utils, utils.test_instruction_following_loose, completion, info)
```

_score extracts text from the completion, runs the checker, and returns the fraction of constraints passed:

```python
def _score(utils_mod, test_fn, completion, row):
    inp = _make_input_example(utils_mod, row)
    text = _extract_text(completion)
    with contextlib.redirect_stdout(io.StringIO()):
        out = test_fn(inp, text, row["language"])
    n = len(out.follow_instruction_list)
    return 0.0 if n == 0 else sum(out.follow_instruction_list) / n
```

strict_score vs loose_score: Both run the same underlying checkers. The difference is preprocessing. strict_score checks the raw completion as-is — this is the RL reward, weight 1.0. loose_score tries 8 normalized variants of the response (strips markdown bold/italic, trims whitespace, normalizes punctuation) — diagnostic only, weight 0.0.
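To give a feel for what "normalized variants" means, here's a sketch of how such a variant set might be generated. This is in the spirit of the original IFEval loose scorer, not the vendored code, and the specific transformations are illustrative:

```python
import re

def loose_variants(text: str):
    # Generate normalized copies of the response; a constraint passes
    # "loosely" if any variant passes. Illustrative, not the vendored logic.
    stripped = text.strip()
    no_markdown = re.sub(r"(\*\*|\*|__|_)", "", stripped)
    lines = stripped.split("\n")
    no_first_line = "\n".join(lines[1:]) if len(lines) > 1 else stripped
    no_last_line = "\n".join(lines[:-1]) if len(lines) > 1 else stripped
    return {stripped, no_markdown, no_first_line, no_last_line}

for v in loose_variants("**आसानी** से"):
    print(repr(v))
```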

The gap between the two tells you something useful. In the smoke test: strict averaged 0.83, loose averaged 1.0 on the same completions. The model was satisfying the constraints in intent but failing on surface formatting. That's a learnable gap — the exact kind of signal RL training can close.

Wiring It Into verifiers

```python
def load_environment(language="hi", subset="trans", num_examples=-1, seed=42):
    dataset = _format_dataset(language, subset, num_examples, seed)
    rubric = vf.Rubric(funcs=[strict_score, loose_score], weights=[1.0, 0.0])
    return vf.SingleTurnEnv(dataset=dataset, rubric=rubric)
```

The dataset formatter loads the HuggingFace split (each language is its own split name, e.g. split="hi"), optionally shuffles and truncates, and builds the prompt/answer/info schema verifiers expects. No system prompt. Each example is a single user message with the instruction verbatim from the dataset.

The trans subset has ~490 examples per language including English. The ground subset has ~250–450 per language — no English, natively generated. The two subsets have subtly different checker implementations (e.g. ParagraphFirstWordCheck behaves differently between them). The environment handles this automatically by routing to the correct utils module per subset.

Total code: 89 lines.

Testing and Shipping

```shell
cd environments/indic_ifeval
uv pip install -e .

# Smoke test — 2 examples, 1 rollout
prime eval run indic-ifeval -n 2 -r 1

# Different language and subset
prime eval run indic-ifeval -n 5 -r 1 -a '{"language": "ta", "subset": "ground"}'

# Push to the Hub
prime env push
```

The CLI builds a wheel from the package, uploads it, registers it on the Environments Hub. Anyone can now pull and train against it with one command. No infra. No GPU required to build or test.

The Broader Point

The open RL ecosystem has deep coverage of English math, English code, and English instruction-following. Coverage drops sharply the moment you leave English — not because non-English tasks are harder to define, but because nobody has done the packaging work.

IndicIFEval was the obvious first target: published benchmark, clean deterministic checkers, 14 languages, real performance gaps. The AI4Bharat team did the hard research work — building the dataset, hiring native annotators, writing the evaluation logic, publishing the paper. The RL packaging on top of that took one day.

The same pattern applies anywhere a benchmark has rule-based, auto-verifiable evaluation logic. Find the checkers. Vendor them. Write the reward function. Push.

Prime Intellect's Hub has active bounties ($100–$5000+) for new environments. The gap in non-English coverage is wide open.

References:

  1. indic-ifeval on the Hub
  2. IndicIFEval paper (arXiv 2602.22125)
  3. AI4Bharat/IndicIFEval on HuggingFace
  4. Prime Intellect Docs
  5. verifiers library