Ranking for MIR 2024

Net Score

MIR scoring: (3 × correct - incorrect) / 3
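
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the site's own code: the function names are hypothetical, and the example inputs (195 correct, 5 incorrect) are an assumption chosen because they reproduce the 193.33 pts shown at the top of the chart. The displayed values (for example 189.66 and 186.66) suggest the page truncates to two decimals rather than rounding, so a truncating helper is included under that assumption.

```python
import math


def mir_net_score(correct: int, incorrect: int) -> float:
    """Net score per the stated rule: (3 * correct - incorrect) / 3.

    Each wrong answer costs one third of a point; unanswered
    questions do not appear in the formula and cost nothing.
    """
    return (3 * correct - incorrect) / 3


def truncate_2dp(score: float) -> float:
    """Drop digits past the second decimal (assumed display behaviour)."""
    return math.trunc(score * 100) / 100


# Illustrative inputs (assumed pairing, see the note above).
print(truncate_2dp(mir_net_score(195, 5)))
```

Because the penalty is one third of a point per wrong answer, a model that leaves a question blank scores higher than one that answers it incorrectly.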

193.33 pts

193.33 pts

193.33 pts

192.33 pts

189.66 pts

186.66 pts

186.66 pts

186.66 pts

Best human

186.66 pts

185.66 pts

182.00 pts

111

179.33 pts

126

178.00 pts

137

176.33 pts

144

174.00 pts

151

174.00 pts

152

173.66 pts

171.33 pts

170.66 pts

166

169.66 pts

172

168.33 pts

178

156.33 pts

213

155.66 pts

214

144.00 pts

242

111.66 pts

266

71.00 pts

287

60.66 pts

293

45.00 pts

302

44.66 pts

303

7.33 pts

318

Average: 153.10 pts

(320 models)

Correct Answers

Total number of correct answers

195

195

195

194

192

190

190

190

Best human

190

GLM 4.5

187

100

186

110

136

182

142

180

150

180

151

178

163

178

177

168

176

174

175

185

215

216

217

218

157

242

265

293

Llama 3 8B Lunaris

294

295

303

UnslopNemo 12B

304

305

306

Rnj 1 Instruct

307

317

Morph V3 Fast

318

Total: 52045

Average: 162.64

(320 models)

Incorrect Answers

Total number of incorrect answers

GPT-5 Chat

GPT-5.1

Gemini 2.5 Pro

o3 Pro

o3 Deep Research

GPT-5 Image Mini

GPT-5.4

DeepSeek V3 0324

GPT-5.4 Pro

Qwen3 Max

Qwen3 Max Thinking

o1-pro

Claude Sonnet 4

Claude Opus 4.6

Llama 4 Maverick

Kimi K2 Thinking

Best human

GLM 4.5

100

117

130

Devstral 2 2512

132

Palmyra X5

133

141

GPT-4 Turbo

142

143

Aurora Alpha

144

145

154

159

160

Sonar Pro

161

Pixtral Large 2411

166

179

Sonar

180

181

182

184

Jamba Large 1.7

185

186

Qwen3 32B

187

188

204

GLM 4 32B

205

206

211

212

GLM 4.5 Air

213

214

219

Olmo 3.1 32B Think

220

221

Medgemma

222

261

Qwen3 4B

262

263

281

300

LFM2-8B-A1B

301

302

303

304

Total: 9173

Average: 28.66

(320 models)

Accuracy Percentage

Proportion of correct answers over total
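
As a rough sketch, the percentage can be computed as follows. The total of 200 questions is an assumption, inferred from the figures on this page (195 correct answers alongside 97.5%, 190 alongside 95.0%); the function name and default parameter are hypothetical.

```python
def accuracy_pct(correct: int, total_questions: int = 200) -> float:
    """Share of correct answers over all questions, as a percentage.

    total_questions defaults to 200, an assumed exam length inferred
    from the displayed correct-answer counts and percentages.
    """
    return 100 * correct / total_questions


print(accuracy_pct(195))
print(accuracy_pct(190))
```

Unlike the net score, this measure does not penalize wrong answers, so two models with the same accuracy can still differ in net score.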

97.5%

97.5%

97.5%

97.0%

96.0%

95.0%

95.0%

95.0%

Best human

95.0%

94.5%

94.5%

GLM 4.5

94.5%

93.5%

100

93.0%

110

91.5%

91.5%

136

91.0%

142

90.0%

150

90.0%

151

89.0%

163

89.0%

88.5%

168

88.0%

174

87.5%

185

215

216

217

218

78.5%

242

65.5%

265

43.5%

295

37.0%

305

Average: 81.3%

(320 models)

Average response time

Average time for the model to respond to each question

1.8s

1.9s

3.2s

GPT-5.1-Codex

3.2s

3.6s

3.6s

LFM2-24B-A2B

3.7s

5.1s

7.0s

7.3s

7.3s

Nova 2 Lite

7.3s

10.7s

144

13.6s

13.6s

186

15.7s

17.4s

224

17.7s

225

20.8s

237

20.9s

238

21.1s

239

22.3s

244

22.4s

245

26.0s

263

28.1s

270

28.3s

271

85.5s

312

177.4s

Average: 17.7s

(318 models)

Average cost per question

Average cost in USD per evaluated question

LFM2-8B-A1B

$0.0000

$0.0001

MythoMax 13B

$0.0001

$0.0001

Mistral 7B Instruct

Ministral 3 8B 2512

$0.0003

$0.0003

UnslopNemo 12B

$0.0003

DeepSeek V3.2

$0.0005

$0.0005

$0.0005

$0.0006

$0.0006

Llama 4 Maverick

$0.0006

$0.0007

116

$0.0007

117

118

119

120

122

123

$0.0009

127

$0.0009

128

$0.0012

$0.0014

$0.0014

$0.0015

170

$0.0026

193

$0.0030

202

$0.0030

203

$0.0031

$0.0293

289

Average: $0.0100

(301 models)

Average confidence

Average confidence level reported by the model

GPT-5 Chat

o3 Pro

o1-pro

Claude Sonnet 4.5

Grok 4

99.8%

Aion-2.0

99.8%

99.8%

99.5%

99.4%

115

99.4%

116

99.2%

98.7%

98.4%

176

98.0%

197

97.6%

212

94.2%

268

93.9%

272

93.3%

273

93.3%

274

92.7%

281

87.0%

296

84.1%

300

84.0%

301

81.6%

302

57.2%

314

Average: 95.7%

(318 models)

Total Cost

Total cost in USD to evaluate all questions

LFM2-8B-A1B

$0.00

$0.01

MythoMax 13B

$0.01

$0.02

Mistral 7B Instruct

Ministral 3 8B 2512

$0.07

$0.07

UnslopNemo 12B

$0.07

DeepSeek V3.2

$0.09

$0.09

$0.09

$0.11

$0.11

Llama 4 Maverick

$0.12

$0.14

116

$0.15

117

$0.15

118

$0.15

119

120

122

123

$0.17

127

$0.18

128

$0.24

$0.29

$0.29

$0.30

170

$0.52

193

$0.59

202

$0.60

203

$0.62

$5.87

289

Total: $990.18

Average: $3.28

(301 models)

Reasoning Tokens

Tokens used in the reasoning process

238K

Kimi K2 Thinking

243K

252K

256K

257K

GLM 4.6V

258K

260K

270K

281K

282K

Kimi K2.5

287K

287K

QwQ 32B

291K

296K

Solar Pro 3

307K

363K

Qwen3.5 397B A17B

364K

Total: 94.9M

Average: 855K

(111 models)

Output Tokens

Tokens generated in responses

64K

Command R+ (08-2024)

Saba

73K

73K

Aion-RP 1.0 (8B)

73K

78K

84K

Pixtral 12B

GPT-4.1

93K

98K

100K

102

108K

127K

142K

172

142K

173

144K

174

153K

187

153K

188

179K

215

196K

222

280K

265

336K

275

343K

278

352K

279

443K

292

562K

306

1.6M

318