Grading

Basic principles

In all Science Fights, the Jury evaluates the Team performances by publicly showing integer scores called Grades or G.

The Grades G from all n Jurors in a Group are used to calculate the Average Point P. Two extreme Grades, one maximum and one minimum, are replaced with one grade equal to their arithmetic mean. In the next step, P is determined as the arithmetic mean of the new data set of n−1 Grades. This procedure has the advantage of weighting outliers less heavily. P is rounded to the nearest 0.1 points. An example below (see a pdf file) illustrates this procedure with some real data from the 2nd IYNT 2014.
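The procedure above can be sketched in a few lines of Python. This is a minimal sketch, not the official IYNT scoring code: the function name `average_point` and the sample Grades are hypothetical, and rounding exact ties upwards is an assumption, since the text only says that P is rounded to the nearest 0.1 points.

```python
import math
from fractions import Fraction


def average_point(grades):
    """Average Point P from the integer Grades G of all n Jurors."""
    n = len(grades)
    if n < 2:
        raise ValueError("at least two Grades are needed")
    # Replacing one maximum and one minimum by their arithmetic mean
    # lowers the sum of the data set by (max + min) / 2.
    total = Fraction(sum(grades)) - Fraction(max(grades) + min(grades), 2)
    mean = total / (n - 1)  # mean of the new data set of n - 1 Grades
    # Round to the nearest 0.1; exact ties go up (an assumption).
    return math.floor(mean * 10 + Fraction(1, 2)) / 10


print(average_point([10, 12, 14, 16, 20]))  # 14.3 (hypothetical Grades)
```

Exact fractions are used so that the rounding step does not depend on floating-point representation.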



Each Team completes three performances (Reporter, Opponent, and Reviewer) in each Science Fight and earns three Average Points P which are summed up to obtain the Sum of Points SP. In case the Team violates particular rules of the IYNT, Yellow Cards are issued and the SP is consequently reduced.

The values of SP in each Group are used to calculate the Criterion of Victory V which is set to V=1 for the Team with the highest SP and for one or two Teams which have an SP that differs from the top result by no more than 2 points (SP≥SPmax−2). For the Teams in the Group which have (SPmax−10)≤SP<(SPmax−2), the Criterion of Victory is set to V=½. For the Teams which have SP<SPmax−10, V=0. The Criterion of Victory, the primary parameter that determines the placing and rank of each Team, minimizes effects of statistical noise and Juror-to-Juror differences in grading.
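These thresholds can be sketched as follows, reading the first condition as SP≥SPmax−2 ("differs from the top result by no more than 2 points"). The function name `criterion_of_victory` and the sample SP values are hypothetical; SP is converted to integer tenths only to avoid floating-point surprises exactly at the thresholds.

```python
def criterion_of_victory(sp_values):
    """Return V (1, 0.5 or 0) for every Sum of Points SP in one Group."""
    # SP has a resolution of 0.1 points, so compare in integer tenths.
    tenths = [round(sp * 10) for sp in sp_values]
    top = max(tenths)
    result = []
    for sp in tenths:
        if sp >= top - 20:       # SP >= SPmax - 2
            result.append(1)
        elif sp >= top - 100:    # SPmax - 10 <= SP < SPmax - 2
            result.append(0.5)
        else:                    # SP < SPmax - 10
            result.append(0)
    return result


print(criterion_of_victory([45.0, 42.9, 34.9]))  # [1, 0.5, 0]
```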


Distributions of G and P

During the 1st, 2nd, 3rd, and 4th IYNTs taken together, 4098 Grades were delivered in 219 stages. Each of the 657 performances therefore obtained its Grades G from an average of n=6.2 Jurors. A total of 81 round-robin matches (Science Fights in a Group) were played with three or two Teams each. In them, the Teams collected 219 values of SP and 219 values of V (98 instances of V=1, 84 instances of V=½, and 37 instances of V=0; including the 1st IYNT 2013 retrospectively.)



The graph (hi-res image, raw ASCII data) shows a fitted histogram of the Grades for each type of performance (Report, Opposition, and Review.)

The spread in G along the X-axis is broad, indicating that nearly the whole spectrum of available G is used by the Jury, although the extreme G are inherently less frequent. We encourage each Juror to stay centered around 15 points for a Report, around 10 points for an Opposition, and around 5 points for a Review, each time weighing their G against what they believe an average IYNT performance is. In reality, however, the actual mean and standard deviation are 18.7±5.1 for a Report, 12.6±3.5 for an Opposition, and 6.8±1.9 for a Review, indicating that an average Juror shifts all their G to the right as compared to our guidelines. The distributions are moderately asymmetric, with median Grades of 19 for a Report, 13 for an Opposition, and 7 for a Review.

In turn, the mean and standard deviation for the Average Points P are 18.6±4.0 for a Report, 12.6±2.8 for an Opposition, and 6.8±1.4 for a Review (raw ASCII data.) Such distributions of P are narrower than the respective distributions of G.


Spread of the Grades

At this point it is important to realize that the IYNT requires a direct comparison of results from parallel Groups, whilst each Team does not play every other Team during one Science Fight. We should therefore be aware of the extent to which parallel boards of Jury can be influenced by fluctuations and what grading parameters are uniformly objective indications of the relative strengths of all IYNT contestants.

A consistent grading is extremely important for the IYNT as it must allow reliable identification of winners in each Group and the ultimate winners of the competition.

Let us consider three hypothetical and extreme cases, Example 1, Example 2, and Example 3.



In Example 1, all Jurors agree with each other and there is high confidence that Average Points P reflect objective differences between the two Teams. For each Juror and each Grade, G−P=0.

Example 2 shows a situation where both Teams obtain equal Grades from the middle of the spectrum. Although it is natural that some games can end in a tie, this scenario is less advantageous. If no Team can impress the Jury or the Jury lacks sensitivity to hidden variances between the Teams, it is difficult to rank all Teams from top to bottom.

Example 3 depicts an undesired event where different Jurors use radically different grading criteria. Team 1 earns 0.1 points more than Team 2, but the level of confidence in this difference is low because σG−P, the spread in G given to one performance by several Jurors, is much wider than these eventual 0.1 points. Although both Average Points P are very close or equal to those in Example 2, these two P do not reflect the serious differences between the Teams that each Juror has noticed and highlighted in their grading. Concluding that Team 1 shows a better performance than Team 2, or than either Team in Example 2, is inconsistent with this set of Grades G.

Our aim is that different Jurors give very similar Grades for one performance, while each Juror gives distinctly different Grades for different performances.


Grading criteria

During each IYNT, we brief Jurors and Teams on our grading and scoring criteria. Our guidelines have evolved since 2013, and as of now consist of four partial grading criteria. Our aim is to keep the guidelines clear and simple, and make sure that any Juror relies on the fixed, common criteria when evaluating performances across the parallel Groups. The criteria are printed directly on the individual Juror's protocols.



The Jurors are asked to add or subtract points from a starting grade (15, 10, or 5) and decide on their final grade G. Such a decision is individual, and in upwards of 99% of cases there is a spread of the Grades. No Grade can be corrected retroactively, and each Juror must justify any of their Grades upon the request of Team Captains or the Chairperson. Each G is public.

Find below the blank Juror's protocols used at the 4th IYNT 2016, as well as the slides from the most recent introductory briefing for Jurors and Teams.

  • Blank individual Juror's Protocol, A4 size (2016/03/29) [pdf]
  • Briefing for Jurors and Teams, slides by Ilya Martchenko (2016/07/17) [pdf]


Distributions of G−P

It is now good to look at the real data from the four previous IYNTs and analyze the spread in the G given by different Jurors to one performance in one Science Fight.



The graph (hi-res image, raw ASCII data) shows each of the 4098 Grades G given during the first four IYNTs. X-coordinate indicates which of the 30 possible Grades G was given. Y-coordinate indicates the difference between this particular G and the Average Point P that was calculated on its basis.

To interpret the spread along the Y-axis, one should remember that if Example 1 took place in each IYNT Stage, each G would be equal to its respective P, and G−P would globally collapse to zero. Luckily, the IYNT is not a paper-and-pencil exam, and its Jurors have opinions which result in a distribution of individual G around the P in each Stage.

The standard deviations of the distributions of G−P for the three types of performances are found as follows: 3.26 for a Report, 2.25 for an Opposition, and 1.21 for a Review. These particular values are calculated with two extreme G in each Group taken with the weight of 1, rather than ½.
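As an illustration, the pooled spread of the residuals G−P for one type of performance could be computed as below. This is a sketch under assumptions: the helper name `residual_spread` and the sample data are hypothetical, and the population standard deviation is used because the text does not say which estimator was applied.

```python
import statistics


def residual_spread(performances):
    """Standard deviation of all residuals G - P.

    `performances` is a list of (grades, P) pairs, one per performance;
    every Grade, including the two extremes, counts with weight 1.
    """
    residuals = [g - p for grades, p in performances for g in grades]
    return statistics.pstdev(residuals)


# Two hypothetical Reports with their Grades G and Average Points P.
reports = [([14, 15, 16], 15.0), ([17, 20, 23], 20.0)]
print(round(residual_spread(reports), 2))  # 1.83
```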


Statistical significance of SF results

Since the extreme G contribute to statistics of P with the weight of ½, we can prepare the working dataset of 4098−3×219=3441 processed grades that we label g (raw ASCII data from three years only.) g=G if the respective G is not extreme in the Group, and g=½(Gmax+Gmin) for the pairs of extreme Gmax and Gmin in each Group.
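The construction of the processed grades g for a single performance can be sketched like this; `processed_grades` is a hypothetical helper name and the Grades are made up for illustration.

```python
def processed_grades(grades):
    """Return the n - 1 processed grades g for one performance."""
    if len(grades) < 2:
        raise ValueError("at least two Grades are needed")
    ordered = sorted(grades)
    # One minimum and one maximum merge into a single value
    # g = (Gmax + Gmin) / 2; every other Grade is kept as g = G.
    return ordered[1:-1] + [(ordered[0] + ordered[-1]) / 2]


g = processed_grades([20, 12, 10, 16, 14])
print(g)                # [12, 14, 16, 15.0]
print(sum(g) / len(g))  # 14.25, the unrounded Average Point P
```

Each performance graded by n Jurors thus contributes n−1 values of g, which is why the working dataset is smaller than the set of raw Grades by one value per performance.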

The statistical parameters of the distributions of residuals g−P now provide crucial information on the statistical significance of each P and in turn the IYNT rankings.

In case the SF results do not permit rejecting the null hypothesis that a slightly higher SP in a round-robin Science Fight is observed by chance, more than one Team earns V=1. Unlike TSP, the Sum of Victories SV keeps track of such statistically significant cases wherein one or several SF winners step forward.

We can define the significance of V=1 as the level of statistical confidence for the interval [SP−2...60] in one Science Fight Group. This statistical significance depends only on the values of g−P and number of Jurors n in the Group, and does not directly depend on the absolute magnitudes of G. This has the advantage of placing focus on congruence between opinions of Jurors, rather than ranking of Teams.

To illustrate how to compute the confidence of [SP−2...60] and [SP−10...60], let us analyze in depth one example of a round-robin Science Fight (Finals of the most recent 4th IYNT 2016, see a hi-res pdf file.)



The standard deviations σ of the distributions of g−P for each of three types of performances in this SF are found as follows: σREP=1.91 for a Report, σOPP=1.36 for an Opposition, and σREV=0.65 for a Review (each from a sample of 27 residuals g−P.) As per the IYNT procedure, SP is calculated as a sum of Average Points P for these three performances. By assuming σSP²=σREP²+σOPP²+σREV², we can easily find σSP=2.43 in these Finals of the 4th IYNT. It is now easy to determine the root-mean-square deviation ρSP by defining ρSP²=σSP²/(n−1), where the number of Jurors is n=10.
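The arithmetic of this step can be reproduced with a short Python sketch. The function names are hypothetical; the input values are the ones quoted above for the Finals of the 4th IYNT 2016.

```python
import math


def sigma_sp(sigma_rep, sigma_opp, sigma_rev):
    """Sum the three variances, as assumed in the text, and take the root."""
    return math.sqrt(sigma_rep**2 + sigma_opp**2 + sigma_rev**2)


def rho_sp(sigma, n_jurors):
    """rho_SP from sigma_SP for a Group graded by n Jurors."""
    return sigma / math.sqrt(n_jurors - 1)


sigma = sigma_sp(1.91, 1.36, 0.65)
print(round(sigma, 2))              # 2.43
print(round(rho_sp(sigma, 10), 2))  # 0.81
```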

The value of ρSP=0.81 evaluates the standard error of the mean, i.e. the statistical uncertainty of SP earned by any Team in the SF. This standard error is inherent in estimating whatever true value of SP from limited statistics. If the difference between any two sample-based SPi and SPj is comparable to or less than ρSP, they can be assumed statistically indistinguishable.

In the next step, we can find the confidence level for interval [SP−2...60] as a function of the number of degrees of freedom in a representative sample (i.e. n−1) and Student's t-score (i.e. 2/ρSP.) This is based on assuming that Student's t-distribution is a valid approximation.

This calculation yields the confidence level of 98.2% for the interval [SP−2...60], well above a two-sigma significance threshold. In the Finals of other previous IYNTs, confidence levels for the interval [SP−2...60] were 94.8% in 2015, 98.8% in 2014, and 98.2% in 2013. A similar calculation yields the confidence level of 99.99997% for the interval [SP−10...60] in 2016 (well above five-sigma), 99.99986% in 2015, 99.99971% in 2014, and 99.99827% in 2013 (well above four-sigma in each case.)
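A calculation of this kind can be sketched in pure Python. The sketch assumes that the quoted confidence level is the one-sided cumulative probability of Student's t-distribution with n−1 degrees of freedom, evaluated at the margin (2 or 10 points) divided by ρSP. This reading is an inference from the numbers above, not a documented IYNT procedure. The function names are hypothetical, and the cumulative probability is obtained by Simpson integration of the density to avoid external libraries.

```python
import math


def student_t_cdf(t, df, steps=2000):
    """CDF of Student's t-distribution via Simpson's rule (steps is even)."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    # The density is symmetric, so the CDF at zero is exactly 0.5.
    return 0.5 + total * h / 3


def confidence_level(margin, rho_sp, n_jurors):
    """Confidence level in percent for a margin of `margin` points."""
    return 100 * student_t_cdf(margin / rho_sp, n_jurors - 1)


print(round(confidence_level(2, 0.81, 10), 1))  # 98.2
```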

The Table below summarizes the parameters calculated in a similar manner for all 81 round-robin Science Fights of the past four IYNTs. We cordially acknowledge Dmitriy Baranov for his help in processing the data.

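SF | σREP | σOPP | σREV | σSP | ρSP | t2 | n | V=1 | V=½
2013-1-A | 1.60 | 1.39 | 0.72 | 2.24 | 0.91 | 2.19 | 7 | 96.4% | 100.0%
2013-1-B | 3.02 | 2.34 | 0.70 | 3.88 | 1.74 | 1.15 | 6 | 84.9% | 99.89%
2013-1-C | 0.99 | 1.26 | 0.47 | 1.67 | 0.84 | 2.39 | 5 | 96.3% | 99.99%
2013-1-D | 1.18 | 1.08 | 0.53 | 1.69 | 0.75 | 2.65 | 6 | 97.7% | 100.0%
2013-1-E | 1.16 | 2.78 | 0.42 | 3.04 | 1.52 | 1.31 | 5 | 87.1% | 99.86%
2013-1-F | 1.23 | 1.41 | 0.35 | 1.91 | 0.85 | 2.34 | 6 | 96.7% | 100.0%
2013-2-A | 1.44 | 0.77 | 0.69 | 1.77 | 0.79 | 2.52 | 6 | 97.4% | 100.0%
2013-2-B | 1.67 | 0.96 | 0.77 | 2.08 | 0.93 | 2.15 | 6 | 95.8% | 99.99%
2013-2-C | 2.29 | 1.29 | 0.82 | 2.76 | 1.23 | 1.62 | 6 | 91.7% | 99.98%
2013-2-D | 1.32 | 0.85 | 0.68 | 1.71 | 0.86 | 2.34 | 5 | 96.0% | 99.98%
2013-2-E | 1.44 | 1.10 | 0.46 | 1.87 | 0.84 | 2.39 | 6 | 96.9% | 100.0%
2013-2-F | 0.99 | 1.61 | 0.80 | 2.05 | 0.84 | 2.39 | 7 | 97.3% | 100.0%
2013-3-A | 1.08 | 0.71 | 0.55 | 1.40 | 0.70 | 2.85 | 5 | 97.7% | 99.99%
2013-3-B | 0.98 | 0.85 | 0.30 | 1.33 | 0.67 | 3.00 | 5 | 98.0% | 99.99%
2013-3-C | 1.78 | 0.98 | 1.06 | 2.29 | 1.15 | 1.74 | 5 | 92.2% | 99.95%
2013-3-D | 0.88 | 0.66 | 0.70 | 1.30 | 0.65 | 3.07 | 5 | 98.1% | 99.99%
2013-3-E | 0.48 | 2.21 | 0.46 | 2.30 | 1.15 | 1.74 | 5 | 92.1% | 99.95%
2013-3-F | 1.46 | 0.87 | 0.77 | 1.87 | 0.84 | 2.39 | 6 | 96.9% | 100.0%
2013-4-A | 1.34 | 0.95 | 0.53 | 1.73 | 0.77 | 2.59 | 6 | 97.6% | 100.0%
2013-4-B | 0.20 | 0.86 | 0.35 | 0.95 | 0.47 | 4.22 | 5 | 99.3% | 100.0%
2013-4-C | 1.12 | 1.23 | 0.74 | 1.82 | 0.91 | 2.20 | 5 | 95.3% | 99.98%
2013-4-D | 1.21 | 0.77 | 0.37 | 1.47 | 0.66 | 3.03 | 6 | 98.6% | 100.0%
2013-4-E | 1.43 | 1.00 | 0.46 | 1.81 | 0.90 | 2.21 | 5 | 95.4% | 99.98%
2013-4-F | 1.89 | 0.77 | 0.88 | 2.23 | 1.00 | 2.01 | 6 | 94.9% | 99.99%
2013-S-A | 1.36 | 0.97 | 0.91 | 1.90 | 0.85 | 2.35 | 6 | 96.7% | 100.0%
2013-S-B | 1.64 | 1.44 | 0.79 | 2.32 | 0.88 | 2.28 | 8 | 97.2% | 100.0%
2013-S-C | 1.50 | 1.08 | 0.66 | 1.96 | 0.74 | 2.70 | 8 | 98.5% | 100.0%
2013-F-A | 1.81 | 1.26 | 1.02 | 2.43 | 0.81 | 2.47 | 10 | 98.2% | 100.0%
2014-1-A | 2.42 | 1.95 | 0.71 | 3.18 | 1.42 | 1.40 | 6 | 89.0% | 99.95%
2014-1-B | 2.55 | 0.87 | 0.67 | 2.78 | 1.24 | 1.61 | 6 | 91.6% | 99.98%
2014-2-A | 1.75 | 1.18 | 0.52 | 2.17 | 0.97 | 2.06 | 6 | 95.3% | 99.99%
2014-2-B | 1.20 | 0.84 | 0.69 | 1.62 | 0.72 | 2.76 | 6 | 98.0% | 100.0%
2014-3-A | 2.32 | 1.17 | 0.35 | 2.62 | 1.17 | 1.71 | 6 | 92.6% | 99.98%
2014-3-B | 2.04 | 1.03 | 0.68 | 2.39 | 1.07 | 1.87 | 6 | 94.0% | 99.99%
2014-4-A | 1.02 | 1.15 | 0.77 | 1.72 | 0.77 | 2.60 | 6 | 97.6% | 100.0%
2014-4-B | 2.04 | 1.15 | 0.45 | 2.39 | 1.07 | 1.87 | 6 | 94.0% | 99.99%
2014-F-A | 1.29 | 0.86 | 0.56 | 1.65 | 0.67 | 2.97 | 7 | 98.8% | 100.0%
2015-1-A | 2.96 | 1.23 | 0.67 | 3.28 | 1.47 | 1.36 | 6 | 88.5% | 99.95%
2015-1-B | 1.37 | 0.95 | 0.97 | 1.93 | 0.86 | 2.32 | 6 | 96.6% | 100.0%
2015-1-C | 1.42 | 0.94 | 0.57 | 1.80 | 0.80 | 2.49 | 6 | 97.2% | 100.0%
2015-1-D | 1.06 | 0.82 | 0.58 | 1.46 | 0.73 | 2.74 | 5 | 97.4% | 99.99%
2015-2-A | 3.02 | 3.28 | 1.23 | 4.62 | 1.89 | 1.06 | 7 | 83.5% | 99.91%
2015-2-B | 1.57 | 1.01 | 0.83 | 2.05 | 0.84 | 2.39 | 7 | 97.3% | 100.0%
2015-2-C | 1.77 | 1.26 | 0.75 | 2.30 | 1.03 | 1.95 | 6 | 94.5% | 99.99%
2015-2-D | 0.77 | 1.43 | 0.68 | 1.77 | 0.79 | 2.53 | 6 | 97.4% | 100.0%
2015-3-A | 2.49 | 1.26 | 0.41 | 2.82 | 1.15 | 1.74 | 7 | 93.4% | 99.99%
2015-3-B | 0.90 | 1.42 | 0.59 | 1.78 | 0.80 | 2.51 | 6 | 97.3% | 100.0%
2015-3-C | 2.14 | 1.16 | 0.71 | 2.54 | 1.13 | 1.76 | 6 | 93.1% | 99.98%
2015-3-D | 2.03 | 1.09 | 0.55 | 2.37 | 1.06 | 1.89 | 6 | 94.1% | 99.99%
2015-4-A | 1.12 | 1.13 | 0.89 | 1.82 | 0.81 | 2.45 | 6 | 97.1% | 100.0%
2015-4-B | 1.47 | 0.99 | 0.63 | 1.88 | 0.84 | 2.38 | 6 | 96.8% | 100.0%
2015-4-C | 1.95 | 1.34 | 0.62 | 2.45 | 1.09 | 1.83 | 6 | 93.6% | 99.99%
2015-4-D | 2.24 | 0.73 | 0.63 | 2.44 | 1.09 | 1.83 | 6 | 93.7% | 99.99%
2015-S-A | 2.30 | 0.95 | 1.12 | 2.73 | 1.11 | 1.79 | 7 | 93.9% | 99.99%
2015-S-B | 2.34 | 0.84 | 1.00 | 2.68 | 1.09 | 1.83 | 7 | 94.1% | 100.0%
2015-F-A | 3.21 | 1.76 | 0.82 | 3.75 | 1.13 | 1.77 | 12 | 94.8% | 100.0%
2016-1-A | 3.45 | 1.96 | 0.84 | 4.06 | 1.82 | 1.10 | 6 | 84.0% | 99.87%
2016-1-B | 2.66 | 1.70 | 0.76 | 3.25 | 1.45 | 1.38 | 6 | 88.7% | 99.95%
2016-1-C | 2.62 | 1.52 | 0.95 | 3.17 | 1.42 | 1.41 | 6 | 89.1% | 99.96%
2016-1-D | 1.90 | 1.58 | 0.66 | 2.56 | 1.15 | 1.75 | 6 | 92.9% | 99.98%
2016-1-E | 2.29 | 1.54 | 1.26 | 3.04 | 1.36 | 1.47 | 6 | 90.0% | 99.96%
2016-1-F | 2.40 | 2.13 | 1.15 | 3.41 | 1.52 | 1.31 | 6 | 87.7% | 99.94%
2016-2-A | 1.97 | 1.33 | 0.66 | 2.47 | 1.11 | 1.81 | 6 | 93.5% | 99.99%
2016-2-B | 0.95 | 0.69 | 0.48 | 1.27 | 0.57 | 3.52 | 6 | 99.2% | 100.0%
2016-2-C | 1.79 | 1.00 | 0.38 | 2.08 | 1.04 | 1.92 | 5 | 93.6% | 99.97%
2016-2-D | 2.04 | 1.77 | 0.72 | 2.80 | 1.25 | 1.60 | 6 | 91.5% | 99.98%
2016-2-E | 2.40 | 1.45 | 0.66 | 2.88 | 1.44 | 1.39 | 5 | 88.1% | 99.89%
2016-2-F | 1.59 | 2.19 | 1.30 | 3.00 | 1.34 | 1.49 | 6 | 90.2% | 99.97%
2016-3-A | 1.99 | 1.40 | 0.60 | 2.51 | 1.02 | 1.95 | 7 | 95.1% | 100.0%
2016-3-B | 3.09 | 2.55 | 1.25 | 4.19 | 1.88 | 1.07 | 6 | 83.3% | 99.84%
2016-3-C | 2.37 | 1.58 | 0.71 | 2.94 | 1.20 | 1.68 | 7 | 92.7% | 99.99%
2016-4-A | 3.68 | 1.44 | 0.69 | 4.01 | 2.00 | 1.00 | 5 | 81.3% | 99.62%
2016-4-B | 2.33 | 1.24 | 0.78 | 2.75 | 1.23 | 1.63 | 6 | 91.8% | 99.98%
2016-4-C | 2.39 | 1.73 | 0.57 | 3.01 | 1.50 | 1.33 | 5 | 87.3% | 99.87%
2016-4-D | 2.52 | 1.65 | 0.70 | 3.09 | 1.38 | 1.45 | 6 | 89.6% | 99.96%
2016-4-E | 0.76 | 1.06 | 1.05 | 1.67 | 0.74 | 2.67 | 6 | 97.8% | 100.0%
2016-4-F | 1.38 | 1.46 | 0.82 | 2.17 | 1.08 | 1.84 | 5 | 93.1% | 99.96%
2016-S-A | 2.19 | 1.63 | 0.88 | 2.87 | 1.17 | 1.71 | 7 | 93.1% | 99.99%
2016-S-B | 3.48 | 2.03 | 1.22 | 4.21 | 1.72 | 1.16 | 7 | 85.6% | 99.94%
2016-S-C | 2.10 | 1.10 | 0.60 | 2.45 | 1.00 | 2.00 | 7 | 95.4% | 100.0%
2016-F-A | 1.91 | 1.36 | 0.65 | 2.43 | 0.81 | 2.47 | 10 | 98.2% | 100.0%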

These results justify the importance of the Criterion of Victory V and the importance of the fact that no IYNT Stage has ever been graded by fewer than 5 Jurors. As argued below, besides having different opinions, individual Jurors may also work on different grading scales. At all times, when looking at the IYNT scores, we ask whether their difference is representative of a real difference between Teams or whether it is a statistical fluke. An especially high level of significance is demanded if grading parameters are used to resolve the placing of eventual Semi-Finalists and Finalists.

In a typical Science Fight, earning a V=½ or above is a four-sigma event, with an expected confidence level of (99.97±0.05)% for the interval [SP−10...60]. Earning a much more rewarding V=1 is a two-sigma event, with an expected confidence level of (94±4)% for the interval [SP−2...60]. Earning each V is a statistically independent event, and earning several V=1 further contributes to the confidence in the placing of top IYNT Teams.

By assuming that the grading parameters of Jurors would not improve considerably before the 5th IYNT 2017, we may estimate the average expected level of confidence for [SP−2...60] as a function of the number of Jurors n randomly selected to one Group. In this table, we re-calculate ρSP from a historically global σSP=2.5618.

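No. Jurors, n | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
DFs, n−1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
Std error, ρSP | 2.6 | 1.8 | 1.5 | 1.3 | 1.1 | 1.0 | 1.0 | 0.9 | 0.9 | 0.8 | 0.8
[SP−2...60], % | 71 | 81 | 87 | 90 | 93 | 95 | 96 | 97 | 98 | 98 | 99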


Grading parameters of individual Jurors

For reference purposes, this table summarizes the individual grading parameters (within one IYNT) of all existing 100 Jurors. We cordially acknowledge Dmitriy Baranov for his help in processing the data. The presented parameters give a glimpse of the Jurors' perceptions of the grading scale in the IYNT.

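Yr, St | Name | ⟨G⟩ | σG | σG−P | ⟨G−P⟩ | nG | nSF | nT | κ
2016 | Ilya Martchenko | 10.3 | 6.1 | 3.0 | -1.9 | 42 | 5 | 8 | +1.9
2015 | Ilya Martchenko | 12.7 | 6.1 | 2.8 | -0.1 | 48 | 6 | 8 | +0.9
2014 | Ilya Martchenko | 12.2 | 6.7 | 2.1 | -1.0 | 39 | 5 | 5 | +1.7
2013 | Ilya Martchenko | 12.5 | 6.0 | 2.1 | -0.3 | 48 | 6 | 10 | +0.9
2016 | Mladen Matev | 11.8 | 5.6 | 2.4 | -1.1 | 51 | 6 | 11 | +0.8
2015 | Mladen Matev | 12.1 | 5.1 | 2.1 | -1.3 | 48 | 6 | 9 | +0.2
2014 | Mladen Matev | 10.9 | 5.3 | 2.5 | -1.4 | 30 | 4 | 4 | +0.9
2013 | Mladen Matev | 12.5 | 5.5 | 3.0 | +0.0 | 39 | 5 | 10 | +0.4
2015 | Gur. Mikaberidze | 11.9 | 4.6 | 1.9 | -0.9 | 39 | 5 | 7 | -0.3
2014 | Gur. Mikaberidze | 11.0 | 5.6 | 2.2 | -1.6 | 24 | 4 | 4 | +1.1
2013 | Gur. Mikaberidze | 11.1 | 5.1 | 2.0 | -0.3 | 33 | 4 | 10 | +0.6
2016 | Evgeny Yunosov | 13.3 | 5.5 | 1.6 | +0.1 | 18 | 2 | 5 | +0.1
2015 | Evgeny Yunosov | 14.3 | 6.1 | 1.8 | +0.4 | 33 | 4 | 9 | +0.3
2014 | Evgeny Yunosov | 14.4 | 6.7 | 1.7 | +1.2 | 39 | 5 | 5 | +0.8
2016 | Andrei Klishin | 10.8 | 6.0 | 3.0 | -1.8 | 51 | 6 | 10 | +1.6
2015 | Andrei Klishin | 13.5 | 6.9 | 3.5 | -0.4 | 48 | 6 | 9 | +1.4
2016 | Dina Izadi | 11.9 | 7.2 | 1.6 | +0.8 | 9 | 1 | 3 | +2.3
2013 | Dina Izadi | 10.4 | 4.9 | 2.1 | -1.0 | 48 | 5 | 11 | +0.7
2016 | Alena Kastenka | 10.5 | 5.9 | 2.5 | -1.0 | 30 | 4 | 8 | +1.6
2015 | Alena Kastenka | 11.5 | 5.6 | 2.2 | -0.5 | 33 | 4 | 6 | +0.9
2016 | Dmitry Zhukalin | 12.8 | 6.9 | 2.9 | +0.1 | 24 | 3 | 6 | +1.7
2015 | Dmitry Zhukalin | 13.6 | 5.7 | 1.4 | +0.7 | 36 | 5 | 7 | +0.1
2015 | Aleks. Dimić | 12.6 | 5.9 | 1.4 | -0.3 | 33 | 4 | 7 | +0.8
2014 | Aleks. Dimić | 12.9 | 6.1 | 1.9 | -0.3 | 30 | 4 | 4 | +0.8
2016 | Danko Marušić | 11.2 | 5.1 | 3.7 | -1.8 | 39 | 5 | 7 | +0.5
2015 | Danko Marušić | 11.5 | 5.3 | 2.8 | -0.7 | 39 | 5 | 8 | +0.6
2015 | Milen Kadiyski | 14.6 | 6.6 | 1.9 | +1.1 | 48 | 6 | 9 | +0.6
2014 | Milen Kadiyski | 16.3 | 6.9 | 1.8 | +1.1 | 42 | 5 | 5 | +0.2
2016 | Nika Sabashvili | 11.5 | 5.0 | 1.9 | -1.0 | 30 | 4 | 8 | +0.3
2015 | Nika Sabashvili | 12.6 | 5.4 | 1.9 | -0.1 | 33 | 4 | 7 | +0.3
2014 | Dmitriy Agarkov | 14.6 | 6.2 | 1.3 | -0.9 | 39 | 5 | 4 | +0.2
2013 | Dmitriy Agarkov | 12.7 | 5.6 | 1.4 | -0.6 | 45 | 5 | 10 | +0.4
2015 | Andrey Kravtsov | 13.6 | 5.6 | 2.1 | +1.1 | 21 | 3 | 7 | +0.0
2014 | Andrey Kravtsov | 15.1 | 6.7 | 2.1 | -0.4 | 39 | 5 | 4 | +0.5
2016 | Som. Mahmoodi | 12.6 | 7.1 | 1.3 | +0.7 | 33 | 4 | 9 | +2.0
2016 | N. Seliverstova | 12.8 | 7.2 | 1.5 | +0.2 | 12 | 2 | 4 | +2.0
2016 | Nikita Datsuk | 11.8 | 6.7 | 2.8 | -1.0 | 33 | 4 | 8 | +1.9
2015 | D. Radovanović | 12.3 | 6.8 | 3.6 | -0.4 | 39 | 5 | 8 | +1.8
2016 | Ivan Syulzhyn | 9.8 | 5.6 | 1.3 | -1.7 | 33 | 4 | 7 | +1.6
2015 | Ivan Reznikov | 11.7 | 6.4 | 3.3 | -1.1 | 48 | 6 | 9 | +1.6
2016 | Jalil Sedaghat | 11.7 | 6.4 | 3.1 | +0.2 | 24 | 3 | 7 | +1.6
2016 | Samuel Byland | 11.1 | 5.9 | 2.2 | -1.4 | 51 | 6 | 10 | +1.4
2013 | Igor Evtodiev | 10.1 | 5.4 | 2.0 | -1.4 | 42 | 5 | 9 | +1.3
2013 | Alina Astakhova | 11.8 | 6.1 | 2.2 | +0.1 | 57 | 6 | 14 | +1.3
2013 | Naime Arslan | 12.4 | 6.4 | 1.9 | +1.9 | 9 | 1 | 3 | +1.3
2016 | Ahmad Sheikhi | 13.1 | 6.6 | 3.0 | +1.4 | 36 | 4 | 10 | +1.3
2013 | Ismail Kiran | 11.7 | 6.0 | 1.8 | -0.1 | 42 | 5 | 8 | +1.2
2016 | Af. Montakhab | 12.0 | 6.1 | 1.8 | -0.1 | 21 | 3 | 6 | +1.2
2015 | Aleks. Suvorova | 14.2 | 7.0 | 3.0 | +0.8 | 45 | 6 | 7 | +1.2
2016 | Roya Radgohar | 14.3 | 7.0 | 2.7 | +1.9 | 30 | 4 | 8 | +1.2
2016 | Azizolah Azizi | 14.7 | 7.2 | 2.2 | +1.7 | 48 | 6 | 10 | +1.2
2016 | Laura Guerrini | 10.4 | 5.3 | 2.3 | -2.2 | 45 | 5 | 10 | +1.1
2013 | Celalettin Baykul | 12.7 | 6.3 | 2.9 | +0.4 | 18 | 2 | 5 | +1.1
2013 | Jeyhun Jabarov | 14.6 | 7.1 | 2.5 | +0.7 | 39 | 5 | 10 | +1.1
2016 | Dmitii Dorofeev | 10.2 | 5.2 | 2.3 | -2.1 | 51 | 6 | 9 | +1.0
2013 | Jevhen Olijnyk | 10.8 | 5.4 | 1.5 | -0.7 | 39 | 5 | 8 | +1.0
2013 | Ersin Karademir | 11.8 | 5.8 | 1.6 | +0.3 | 39 | 5 | 9 | +1.0
2016 | Marzieh Afkhami | 13.4 | 6.5 | 1.9 | +1.5 | 24 | 3 | 6 | +1.0
2016 | M. Sadat Tahami | 13.5 | 6.5 | 2.7 | +1.1 | 42 | 5 | 10 | +1.0
2014 | D. Karashanova | 15.4 | 7.3 | 1.6 | +0.6 | 33 | 4 | 5 | +1.0
2013 | Diana Kovtunova | 13.0 | 6.2 | 1.2 | -0.1 | 42 | 5 | 10 | +0.9
2013 | Ahmet Çabuk | 13.1 | 6.2 | 1.5 | +0.3 | 27 | 3 | 9 | +0.9
2013 | Antoan. Nikolova | 13.3 | 6.3 | 1.9 | +0.7 | 42 | 5 | 10 | +0.9
2015 | Vesna Vasić | 15.7 | 7.3 | 1.7 | +2.3 | 9 | 1 | 3 | +0.9
2013 | Aliaks. Mamoika | 12.2 | 5.8 | 1.6 | -0.4 | 42 | 5 | 10 | +0.8
2013 | Vlad. Vanovskiy | 13.2 | 6.2 | 2.2 | +0.0 | 54 | 6 | 12 | +0.8
2013 | Val. Lobyshev | 13.5 | 6.3 | 2.0 | +1.3 | 36 | 4 | 10 | +0.8
2015 | Dušan Dimić | 14.9 | 6.9 | 2.2 | +1.0 | 48 | 6 | 9 | +0.8
2013 | Buras Boljiev | 11.4 | 5.4 | 1.7 | -0.3 | 30 | 4 | 6 | +0.7
2016 | Ban. Rastegari | 13.9 | 6.4 | 1.9 | +1.3 | 27 | 3 | 7 | +0.7
2016 | Tatiana Fursova | 11.2 | 5.2 | 2.2 | -0.6 | 27 | 3 | 7 | +0.6
2015 | Jelena Vračević | 11.8 | 5.4 | 2.0 | +0.5 | 24 | 3 | 7 | +0.6
2016 | Jaf. Vatanparast | 12.3 | 5.6 | 1.6 | +1.1 | 36 | 4 | 8 | +0.6
2015 | Viktor Nechaev | 13.3 | 6.0 | 1.9 | +0.1 | 27 | 4 | 8 | +0.6
2016 | Sed. Forootan | 13.9 | 6.3 | 1.2 | +2.8 | 9 | 1 | 3 | +0.6
2013 | Ek. Mendeleeva | 14.3 | 6.4 | 1.4 | +0.7 | 48 | 5 | 11 | +0.6
2016 | Giorgi Khomeriki | 7.7 | 3.6 | 2.3 | -3.1 | 24 | 3 | 7 | +0.5
2013 | Timothy Timur | 9.8 | 4.5 | 2.1 | +0.3 | 15 | 2 | 5 | +0.5
2014 | Nasko Stamenov | 11.6 | 5.2 | 2.2 | -0.6 | 30 | 4 | 4 | +0.5
2013 | Alexander Sigeev | 11.7 | 5.3 | 1.7 | -0.9 | 54 | 6 | 9 | +0.5
2013 | Emel Alğin | 11.9 | 5.4 | 2.0 | -0.3 | 57 | 6 | 13 | +0.5
2013 | Sergey Sabaev | 12.0 | 5.4 | 1.7 | -0.1 | 9 | 1 | 3 | +0.5
2013 | Özge Özşen | 12.4 | 5.6 | 2.0 | +0.2 | 36 | 4 | 10 | +0.5
2013 | Ayset Yurt Ece | 13.0 | 5.8 | 2.6 | +0.7 | 33 | 4 | 9 | +0.5
2015 | Wang Sihui | 13.3 | 5.9 | 1.5 | -0.3 | 27 | 4 | 6 | +0.5
2016 | Som. Haj. Gooki | 15.1 | 6.7 | 2.3 | +2.4 | 45 | 5 | 9 | +0.5
2013 | Siarhei Seniuk | 11.2 | 5.0 | 1.6 | -0.8 | 42 | 5 | 13 | +0.4
2016 | Roya Pournejati | 12.5 | 5.5 | 1.5 | +0.2 | 33 | 4 | 10 | +0.4
2013 | Dursun Eser | 12.8 | 5.6 | 1.2 | +0.6 | 36 | 4 | 10 | +0.4
2016 | Zahra Yazdgerdi | 12.8 | 5.6 | 2.9 | -0.2 | 24 | 3 | 7 | +0.4
2013 | Louis D. Heyns | 13.2 | 5.8 | 1.6 | -0.3 | 42 | 5 | 12 | +0.4
2016 | Afshan Mohajeri | 14.1 | 6.2 | 2.8 | +1.4 | 27 | 3 | 7 | +0.4
2013 | Hakan Dal | 16.7 | 7.2 | 2.3 | +2.3 | 6 | 1 | 2 | +0.4
2013 | Sergey Zelenin | 12.8 | 5.5 | 1.5 | +0.1 | 42 | 5 | 9 | +0.3
2015 | Kirill Volosnikov | 12.9 | 5.6 | 2.3 | -0.5 | 45 | 6 | 9 | +0.3
2013 | Ebru Ataşlar | 14.9 | 6.4 | 2.3 | +1.7 | 36 | 4 | 9 | +0.3
2014 | Vasilka Krasteva | 17.1 | 7.3 | 1.8 | +1.5 | 39 | 5 | 4 | +0.3
2015 | Tatiana Besedina | 12.3 | 5.2 | 1.9 | -0.5 | 39 | 5 | 7 | +0.2
2016 | Mas. Tor. Azad | 13.3 | 5.6 | 2.2 | -0.1 | 51 | 6 | 12 | +0.2
2015 | Elena Chernova | 13.4 | 5.7 | 1.5 | +0.9 | 36 | 4 | 8 | +0.2
2016 | Maryam Bahrami | 14.4 | 6.1 | 2.1 | +1.8 | 45 | 5 | 10 | +0.2
2016 | Stan. Krasulin | 10.1 | 4.2 | 2.5 | -2.3 | 18 | 2 | 6 | +0.1
2016 | Milad Zangiabadi | 13.3 | 5.5 | 2.0 | +0.6 | 24 | 3 | 7 | +0.1
2015 | Yury Kartynnik | 10.4 | 4.2 | 1.3 | -0.9 | 27 | 3 | 8 | +0.0
2016 | M. R. Moghadam | 11.2 | 4.6 | 1.2 | +0.4 | 15 | 2 | 4 | +0.0
2013 | Fatih Akay | 12.8 | 5.2 | 1.7 | -0.3 | 42 | 5 | 10 | +0.0
2013 | Vladimir Shiltsev | 13.4 | 5.5 | 2.6 | -0.5 | 12 | 1 | 4 | +0.0
2015 | Drag. Jovković | 15.4 | 6.3 | 1.9 | +2.1 | 27 | 4 | 7 | +0.0
2016 | M. D. Aseman | 13.2 | 5.3 | 1.7 | +1.1 | 27 | 3 | 8 | -0.1
2013 | Marina Sergeeva | 13.5 | 5.4 | 1.5 | +0.3 | 51 | 6 | 12 | -0.1
2016 | A. Poostforush | 14.8 | 5.9 | 2.0 | +2.1 | 39 | 5 | 7 | -0.1
2014 | Elena Trufanova | 15.3 | 6.1 | 2.0 | +0.5 | 39 | 5 | 4 | -0.1
2015 | Lucija Papa | 13.1 | 5.1 | 1.3 | +0.1 | 21 | 3 | 5 | -0.2
2013 | Pınar Aytar | 13.5 | 5.3 | 1.3 | +0.2 | 30 | 4 | 10 | -0.2
2016 | Hassan Eslahi | 13.6 | 5.4 | 1.8 | +0.8 | 36 | 4 | 9 | -0.2
2015 | Oscar Rabinovich | 12.7 | 4.9 | 1.7 | -0.5 | 48 | 6 | 7 | -0.3
2013 | Natalia Borodina | 12.3 | 4.6 | 1.1 | +0.0 | 30 | 4 | 8 | -0.4
2013 | Necmettin Caner | 12.8 | 4.7 | 1.2 | -0.3 | 9 | 1 | 3 | -0.5
2013 | Sertaç Eroğlu | 11.7 | 3.9 | 1.9 | +0.5 | 15 | 2 | 5 | -0.9
2013 | Sabahattin Esen | 15.0 | 4.8 | 1.1 | +1.6 | 6 | 1 | 2 | -1.3
--- | Average | 12.7 | 5.8 | 2.0 | +0.1 | 34 | 4 | 8 | +0.6
--- | Best records | 10.1 | 7.3 | 1.1 | +0.0 | 57 | 6 | 14 | +2.3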

  • Color-coded status tags reflect various roles of the IYNT Juror: violet tag is the Juror who acted as Chairperson while green tag is the Juror who did not act as Chairperson; blue tag is the independent Juror while red tag is the Team Leader who acted as Juror;
  • ⟨G⟩ is arithmetic mean of all Grades delivered during the IYNT; ⟨G⟩=(5+10+15)/3=10 is our target for any Juror to allow for uniform and equal weight grading scales throughout the parallel Groups; only 3 Jurors went below the target, while 97 Jurors went above the target;
  • σG is standard deviation of all Grades delivered during the IYNT; if σG is large, the Juror uses a broader spectrum of Grades and has more differentiation within their evaluation scale; note that σG does not necessarily reflect clearer separation of the Teams, cf. two notable hypothetical extremes of σG=13.7 for the set {30; 1; 1; 30; 1; 1} and σG=11.1 for the set {30; 1; 1; 1; 20; 10} in a two-Team SF; various limits for three types of performances result in a trend that a higher σG is more likely to appear for Jurors with a higher ⟨G⟩; theoretical slope of a trend, or baseline, is σG/⟨G⟩ for the set {30; 20; 10} or 0.40825; parameter κ corrects for this baseline;
  • σG−P is standard deviation of all residuals G−P; if σG−P is small, the Grades are less scattered respective to the Grades of other Jurors and contribute to a smaller ρ; an implausible theoretical minimum is σG−P=0; real-life maximum and minimum records are 3.7 and 1.1;
  • ⟨G−P⟩ is arithmetic mean of all residuals G−P; if ⟨G−P⟩ is close to zero, the grading scale is less shifted respective to the individual scales of other Jurors; it is easy to notice moderate statistical noise that hinders the inherent correlation between ⟨G−P⟩ and ⟨G⟩;
  • nG is the number of Grades given; greater nG means better statistics; theoretical caps depend on exact tournament brackets and were nG=60 in 2013, nG=45 in 2014, and nG=54 in 2015 and 2016, though not accessible even for the Jurors working in each Science Fight due to such constraints as distribution of two-Team Groups;
  • nSF is the number of Science Fights judged; greater nSF means better statistics; a theoretical cap with Semi-Finals is nSF=6;
  • nT is the number of Teams judged; greater nT means more opportunities to observe stronger and weaker Teams and thus have a more comparative judgment; a theoretical cap is number of Teams N but no more than 18; note that some Teams are judged more than once by the same Juror within one IYNT;
  • κ is standard deviation of all delivered Grades corrected for the average Grade ⟨G⟩ via κ=σG−0.40825×⟨G⟩; κ reflects relative width of the spectrum of Grades used by the Juror and can be more suitable for comparison of Jurors with distinctly different ⟨G⟩; note that κ=⟨G⟩×(σG/⟨G⟩−0.40825), so for a given ⟨G⟩ it depends linearly on the relative value of σG/⟨G⟩.
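The quantities quoted in this list can be checked numerically. The sketch below uses the population standard deviation, which is an assumption chosen because it reproduces the quoted values 13.7, 11.1 and 0.40825; the function name `kappa` is hypothetical.

```python
import statistics

# Baseline slope sigma_G / <G> for the set {30; 20; 10}.
BASELINE = statistics.pstdev([30, 20, 10]) / statistics.mean([30, 20, 10])


def kappa(sigma_g, mean_g):
    """Spread of a Juror's Grades corrected for their average Grade."""
    return sigma_g - BASELINE * mean_g


print(round(BASELINE, 5))                                  # 0.40825
print(round(statistics.pstdev([30, 1, 1, 30, 1, 1]), 1))   # 13.7
print(round(statistics.pstdev([30, 1, 1, 1, 20, 10]), 1))  # 11.1
print(round(kappa(6.1, 10.3), 1))  # 1.9 for sigma_G=6.1 and <G>=10.3
```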

The names are initially sorted by number of IYNTs judged, then by κ, then by ⟨G⟩.

There is a complex interplay between each of the calculated parameters, and some of the crucial parameters depend not only on what G each Juror gives, but also on what G other Jurors in the same Group give. Other parameters depend on the lot, tournament brackets, or appointing decisions of the General Council, and are beyond control of the individual Juror. Persisting regularities seen for Jurors who worked at more than one IYNT suggest that any shifts in σG or ⟨G⟩, observed consistently in several Jurors, may reflect objective differences between separate IYNTs, viz. in diversity or average strength of participants.

It is interesting to notice that although many listed Jurors demonstrate similar values, there are particular Juror-to-Juror differences which are recognizable in separate IYNTs and not obscured by limited statistics. These differences in particular explain why Jurors and Teams rotate between the Groups, and V is a more representative derivative grading parameter than SP.

Whilst the Criterion of Victory V already alleviates any scaling differences, it would allow for extracting further fine-grained data if each future IYNT Juror

  • is comfortable with lower Grades for weaker performances, and therefore stays centered closer to ⟨G⟩=10, with preferably ⟨G⟩<13;
  • at the same time, works in a broader spectrum of high and low Grades, and thus has a larger σG, with preferably σG>6;
  • at the same time, is balanced to have a moderate σG−P, with preferably σG−P<3.

These three goals can naturally clash with each other. At this point it is important to realize that each Juror must be focused only on assessing immediate performances and sticking to uniform, scientific, merit-based grading criteria, and that furthermore each G must be independent and given individually.


Effects of individual grading parameters

To illustrate the potential consequences of the spread in these grading parameters, let us consider a Gedankenexperiment with three Teams competing in one Science Fight.

Team 1 shows a relatively strong performance and receives the Grades which sit on the upper end of the [⟨G⟩−½σG…⟨G⟩+½σG] interval. In other words, should G be distributed normally for each selected Juror, the performances of Team 1 would be better than Ф(⟨G⟩+½σG)=0.69 of all performances the Juror grades in the IYNT. Such a Team would potentially end up as a Finalist.

Team 2 shows an average performance and receives average Grades ⟨G⟩ from each Juror.

Team 3 shows a relatively weak performance and receives the Grades which sit on the lower end of the [⟨G⟩−½σG…⟨G⟩+½σG] interval. In other words, their performances are weaker than very approximately Ф=0.69 of all IYNT performances graded by the Juror. Such a Team would potentially not qualify for Semi-Finals.
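The value Ф=0.69 is the standard normal cumulative probability half a standard deviation above the mean. It can be checked with the Python standard library; this small sketch relies on the same normality assumption as the text.

```python
from statistics import NormalDist

phi = NormalDist().cdf(0.5)  # probability of lying below mean + 0.5 sigma
print(round(phi, 2))         # 0.69: share of performances below Team 1
print(round(1 - phi, 2))     # 0.31: share of performances below Team 3
```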

These three Teams are graded simultaneously by two boards of selected Jurors. One board is composed of six Jurors with some of the lowest observed ⟨G⟩, while the other board is composed of six Jurors with some of the highest observed ⟨G⟩. It is easy to determine the results of this hypothetical Science Fight because ⟨G⟩ and σG are publicly known for each Juror.



These results illustrate the level of tolerance of the Criterion of Victory V and Sum of Points SP to the most severe effects of improbably unbalanced boards of Jurors. As seen from this calculation, the strong Team 1 graded by low-⟨G⟩ Jurors obtains fewer points than the average Team 2 graded by high-⟨G⟩ Jurors and ties with the weak Team 3. The weak Team 3 graded by high-⟨G⟩ Jurors, respectively, earns more points than the average Team 2 graded by low-⟨G⟩ Jurors. An artificial selection of Jurors in this test leads to unrealistically small σg−P and ρSP in both boards of Jurors.

There is however no negative effect on the Criterion of Victory V and consequently the results of the Science Fight.

In the next Gedankenexperiment, let us rotate and evenly distribute the same Jurors, as is routinely done before each real Science Fight.



Though the Jurors give the same Grades G as in the first experiment, their balanced distribution now mitigates the effects on SP of Juror-to-Juror differences in grading. Note that this happens at the cost of increased σg−P and ρSP, which both now fall in the range of typical IYNT values despite an artificial, bimodal distribution of ⟨G⟩. Although we cannot generalize from one example, the respective Sums of Points SP from both boards of Jurors now differ by only 1.5, 0.6, and 0.6 points.

In this extreme value analysis, we test a statistically improbable scenario which would have some of the worst impacts on the stability of Science Fight results. We test extreme values of ⟨G⟩ and the most unrealistic distribution of Jurors, and observe that the fluctuations in SP always fall within a 2-point threshold.


Grading parameters of separate IYNTs

The table below provides an overview of statistical parameters for the Grades given by all Jurors within one IYNT, and the overall statistics for all four IYNTs.

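Year | ⟨G⟩ | σG | sJ | sSV4 | sTSP4 | σG−P | nG | nst | nT | nJ | nch | V=1 | V=½
2013 | 12.5 | 5.8 | 0.11 | 0.43 | 0.12 | 2.0 | 1422 | 78 | 16 | 41 | 11 | (96±3)% | (99.98±0.03)%
2014 | 14.1 | 6.7 | 0.15 | 0.52 | 0.20 | 2.2 | 423 | 23 | 5 | 12 | 5 | (95±3)% | (99.99±0.01)%
2015 | 13.0 | 6.0 | 0.10 | 0.29 | 0.09 | 2.4 | 969 | 49 | 10 | 27 | 10 | (94±3)% | (99.99±0.02)%
2016 | 12.3 | 6.2 | 0.13 | 0.58 | 0.24 | 2.8 | 1284 | 69 | 16 | 40 | 9 | (91±5)% | (99.94±0.08)%
All | 12.7 | 6.1 | 0.12 | 0.47 | 0.19 | 2.4 | 4098 | 219 | 47 | 100 | 23 | (94±4)% | (99.97±0.05)%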

  • ⟨G⟩ is arithmetic mean of all Grades delivered during the IYNT by all Jurors;
  • σG is standard deviation of all Grades delivered during the IYNT by all Jurors;
  • sJ=σ⟨G⟩J/⟨⟨G⟩J⟩ is relative spread in ⟨G⟩ between all Jurors at the IYNT, i.e. the standard deviation of the individual Jurors' ⟨G⟩ divided by their mean; lower values correspond to less Juror-to-Juror grading differences;
  • sSV4=σSV4/⟨SV4⟩ is relative spread in the Sum of Victories after SF 4; higher values correspond to stronger diversity between Teams and better separation in the ranking;
  • sTSP4=σTSP4/⟨TSP4⟩ is relative spread in Total Sum of Points after SF 4; higher values correspond to stronger diversity between Teams in terms of points and contribute to better separation in the ranking;
  • σG−P is standard deviation of all residuals G−P during the IYNT;
  • nG is the number of Grades given;
  • nst is the number of Stages in the IYNT;
  • nT is the number of Teams in the IYNT;
  • nJ is the number of individual Jurors in the IYNT;
  • nch is the number of individual Chairpersons in the IYNT;
  • V=1 is the mean and standard deviation of the confidence levels for the interval [SP−2...60] in all SFs;
  • V=½ is the mean and standard deviation of the confidence levels for the interval [SP−10...60] in all SFs.

By comparing sSV4 with sTSP4, we can see that the Criterion of Victory prevents a melting pot effect in TSP where Total Sums of Points can converge to rather similar values for many Teams.


Grading parameters of separate SFs (upd 2015)

Teams and Jurors alike rapidly learn from SF to SF. One may argue that the Teams may start showing less diverse performances, or the Jurors may start giving less diverse Grades G. The following data, for three IYNTs together, is of interest to assess the importance of these two effects. Note that Semi-Finals and Finals obviously include the data for the respective participants only.

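SF No. | ⟨SP⟩ | σSP | sSP | ⟨V⟩ | σV | sV
Selective SF 1 | 35.7 | 9.1 | 0.26 | 0.63 | 0.34 | 0.53
Selective SF 2 | 38.0 | 6.9 | 0.18 | 0.66 | 0.39 | 0.59
Selective SF 3 | 37.8 | 7.0 | 0.18 | 0.63 | 0.36 | 0.57
Selective SF 4 | 38.3 | 6.1 | 0.16 | 0.68 | 0.35 | 0.52
Semi-Finals | 38.9 | 3.3 | 0.09 | 0.69 | 0.30 | 0.44
Finals | 44.8 | 4.7 | 0.11 | 0.70 | 0.40 | 0.57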


Summary

Overall, the presented results define the extent to which the results of single paired comparisons of SP are not yet obscured by statistical noise. With the available data, we can conclude that the IYNT procedures and in particular the Criterion of Victory V alleviate Group-to-Group and Juror-to-Juror scaling differences, and allow separation of each Team in the IYNT with a two-sigma significance threshold.



Comparative results of real IYNT Teams

The Teams are initially sorted by Criterion of Victory in the Finals (VF), then by Sum of Points in the Finals (SPF), then by Criterion of Victory in the Semi-Finals (VsF), then by Sum of Points in the Semi-Finals (SPsF), then by Sum of Victories after Selective SF 4 (SV4), then by Total Sum of Points after Selective SF 4 (TSP4). Final Rank (RF) and the type of Medal (M) reflect the results of each IYNT according to the regulations valid at the time.

Year Team name SV4 TSP4 VsF SPsF VF SPF RF M
2013Belarus-Universum4177.1143.3150.81
2014Georgia "Georgians"3190.6150.51
2014Bulgaria-Sofia4206.1149.32
2016Georgia-Georgians4172.4145.0146.71
2015China3163.0140.6145.31
2015Georgia-Georgians165.4141.4145.12
2015Croatia175.6140.7144.23
2014Serbia184.3½46.53
2013Georgia-Raveko3174.4145.1½42.82
2016Belarus-Pahonia4165.1142.5½42.42
2016Croatia4173.5140.1½41.13
2013Turkey-Bahçeşehir PES4173.8½38.3037.23
2013Moldova-Eco Generation154.6141.7036.44
2013Russia-TMOLimpiycy167.5140.07
2016Iran-Gifted132.7138.84
2013Bulgaria-Bulgaria3170.0½39.16
2015Serbia-1174.3½39.04
2013Russia-MG 123160.1½39.08
2016Iran-Khashayar2115.5½37.48
2013Russia-RLC3173.0½37.15
2015Serbia-2154.4½36.06
2013Afghanistan-Ariana3155.5½35.89
2016Iran-Mehr155.4½35.75
2015Belarus-Spectrum3174.4½32.95
2016Iran-Maple145.3½31.47
2013Bulgaria-Science Girls3153.4032.510
2016Russia-Voronezh3147.7030.46
2016Iran-Besat 1113.0030.29
2014Russia "Vinegret"146.44
2014Bulgaria-Kyustendil "R/"½113.45
2016Iran-Black Intelligence3161.910
2013Turkey-Fatih Eskişehir146.911
2015Bulgaria140.97
2013Kazakhstan-NIS D/team130.315
2015Russia-Voronezh-12161.68
2015Russia-Voronezh-22143.39
2013Iran-Iran145.912
2016Iran-Besat 2116.711
2015Russia-MG 121138.410
2013Kyrgyzstan-Alatoo1137.613
2013Ukraine-Richelieu1137.514
2016Iran-Free Thought187.912
2016Iran-Amordad½118.213
2016Iran-Paramount Notion½106.414
2016Iran-Farhang½85.215
2016Iran-Velayat½75.516
2013Turkey-Samanyolu0107.916