CTA Algorithm 1 Examples
Last Updated: September 13, 2018; First Released: January 09, 2015
Author: Kevin Boyle, President, DevTreks
Version: DevTreks 2.1.6
A. Algorithm 1 Introduction
The sibling reference, Conservation Technology Assessment (CTA), introduces the background numerical techniques for completing CTAs. This reference introduces examples of CTAs completed using Algorithm 1, MathNet and System.Math Algorithms (1*).
Algorithm 1 is a front end to custom algorithms developed by DevTreks. All of these algorithms employ the System.Math and MathNet mathematical libraries (refer to the references). The goal of most algorithms is to produce confidence intervals for an Indicator’s QTM, QTL, and QTU properties and a Score’s ScoreM, ScoreL, and ScoreU properties (most likely, low, and high estimates). The advantages of algorithms that use these libraries include: 1) fine-tuned control by developers over how an algorithm works and how results get displayed, 2) compiled binary code which can be optimized for cloud computing performance, and 3) ease of stepping outside the boundaries of conventional statistical libraries.
The following subalgorithms are currently available with this algorithm. Appendix A gives examples for each subalgorithm.
subalgorithm1: Monte Carlo: Example 1a and the examples in the Resource Stock Calculation reference use this option to introduce basic risk analysis. This algorithm uses the distribution of QT (QT, QTD1, and QTD2) or TEXT datasets with a mathematical library to produce QTM, QTL, and QTU. Unlike a full probabilistic risk analysis, this algorithm does not account for correlations between indicators.
subalgorithm2: Normal Copula: Example 1b uses this option to carry out probabilistic risk analysis, accounting for correlations between indicators.
subalgorithm3: Eigen Copula with Normal Distribution: Examples 1c, 1d, and 1e use this option to carry out probabilistic risk analysis, accounting for correlations between indicators.
subalgorithm4: Eigen Copula with Uniform Distribution. Example 1f uses this option to demonstrate a slight variation of subalgorithm3. The only difference is that subalgorithm3 uses a Normal distribution in the calculations while subalgorithm4 uses a Uniform distribution. The resulting calculations have slight differences that might be significant in some circumstances (e.g. health and safety Indicators).
subalgorithm5: Simulated Annealing. Example 1g uses this option to introduce combinatorial optimization analysis.
subalgorithm6: Regression. Example 1h uses this option to introduce probabilistic statistics that employ regression analysis.
subalgorithm7: Neural Network. Example 1l uses this option to introduce prediction and classification analysis.
subalgorithm8: Anova. Example 1m uses this option to introduce probabilistic statistics that employ analysis of variance to analyze randomized experimental data.
subalgorithm9: Disaster Risk Reduction (DRR): The associated Conservation Technology Assessment 2 tutorial demonstrates that this algorithm uses Disaster Risk Reduction algorithms to calculate confidence intervals for Benefit Cost Ratios and Cost Effectiveness Ratios. This algorithm focuses on measuring the direct monetary savings from disaster prevention interventions, especially those associated with climate change.
subalgorithm10: Disaster Risk Index (DRI): The associated Conservation Technology Assessment 2 tutorial demonstrates that this algorithm uses Disaster Risk Index algorithms to calculate confidence intervals for Benefit Cost Ratios and Cost Effectiveness Ratios. These Indexes measure both the direct and indirect savings from disaster prevention interventions, especially those associated with climate change.
subalgorithm11: Risk Management Index (RMI): The associated Conservation Technology Assessment 2 tutorial demonstrates that this algorithm uses Risk Management Index algorithms to calculate confidence intervals for Cost Effectiveness Ratios and Multi-Criteria Assessment Ratings. These Indexes measure a community’s ability to manage disasters, especially those associated with climate change.
subalgorithm12: Resiliency Index (RI): The associated Conservation Technology Assessment 2 tutorial demonstrates that this algorithm uses Resiliency Indexes algorithms to calculate confidence intervals for Cost Effectiveness Ratios and Multi-Criteria Assessment Ratings. These Indexes measure a community’s ability to monitor and evaluate their disaster prevention goals.
subalgorithm13, 14, 15, 16, and 17: Resource Conservation Accounting (RCA) Value Framework: The Performance and Social Performance Analysis tutorials document that Version 2.1.0+ uses these algorithms in an RCA Value Framework to measure social performance.
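The basic Monte Carlo idea behind subalgorithm1 can be sketched in a few lines of Python. This is an illustration only, not the DevTreks implementation (which uses MathNet). The function name monte_carlo_summary, the seed, and the hypothetical Indicator are assumptions, as is the choice of a normal-approximation interval around the sample mean.

```python
import random
import statistics

def monte_carlo_summary(draw, iterations=10_000, confidence=0.90, seed=1):
    """Return (QTM, QTL, QTU) for samples produced by draw(rng)."""
    rng = random.Random(seed)
    samples = [draw(rng) for _ in range(iterations)]
    mean = statistics.fmean(samples)
    # Assumption: the interval is a normal-approximation interval for the mean.
    std_err = statistics.stdev(samples) / iterations ** 0.5
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * std_err, mean + z * std_err

# Hypothetical Indicator: normal distribution with QTD1 = mean and QTD2 = sd.
qtm, qtl, qtu = monte_carlo_summary(lambda rng: rng.gauss(100.0, 15.0))
print(f"QTM={qtm:.4f} QTL={qtl:.4f} QTU={qtu:.4f}")
```

Each Indicator is sampled on its own here, which is why this approach cannot account for correlations between Indicators.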
Calculator Patterns
Versions 2.1.4 and 2.1.6 upgraded the calculator patterns used by these algorithms. The primary pattern enforced in the upgrade is to place greater emphasis on using the Indicator.URL property to store data and/or scripts for the specific Indicator holding the URL. This promotes consistency with how calculations are run for the remaining algorithms (i.e. R and Python).
The legacy pattern of using a combination of the Score.DataURL and Score.JointDataURL properties to run joint calculations has been retained and simplified. This pattern allows all Indicator and Score data to be stored in 1 dataset and is useful when separate Indicator datasets are considered overkill or when the separate Indicators must be calculated together. The examples in Appendix A demonstrate that this pattern allows different combinations of Score and Indicator properties to be used to fill in Indicator and Score MathResults.
Summary and Conclusion
This reference demonstrates how to use algorithms based on the System.Math and MathNet libraries to complete CTAs. CTAs may help people to reach decisions that improve their lives and livelihoods in sustainable ways.
Footnotes
1. Version 2.0.2 supports additional mathematical libraries discussed in the sibling CTA references. The CTA 2 reference, R algorithms, supports the use of R and Intel Math Kernel mathematical libraries. The CTA 3 reference, Python algorithms, supports the use of Python mathematical libraries. The CTA 4 reference, Azure Machine Learning algorithms, supports the use of AML mathematical libraries. Porting subalgorithms from 1 math library to another may not be particularly difficult. The Version 2.0.4 and 2.0.6 releases upgraded the Monitoring and Evaluation tools so that they can use these algorithms.
2. The confidence interval generated for this gamma distribution is too narrow to be useful for modeling the uncertainty of this Indicator. Networks should consider including experts in statistics, mathematics, or domain-specific fields such as disaster assessment, to provide uniform guidance to their clubs about how to use specific CTA algorithms (i.e. our role is to demonstrate what you should be doing rather than what you are doing). For example, the gamma’s shape and scale parameters can also be estimated using maximum likelihood methods similar to the following:
shape = (1 / (4A)) * (1 + (1 + (4A / 3))^0.5)
scale = mean(x) / shape
where A, a transformed value for n observations, is calculated:
A = ln(mean(x)) - (sum of ln(x) / n)
3. The McCaffrey reference makes it clear that the algorithms employed are introductory examples. The neural network algorithm is a good example. The author points out ways that developers may want to improve the algorithm (e.g. stopping techniques). Several major information technology companies (e.g. Facebook, Google, and Microsoft) are starting to make their algorithms available as open source. These are logical replacements for several of the algorithms introduced in this reference.
4. Some M&E practitioners may legitimately disagree. The jobs and economic activity associated with private investments do have legitimate societal benefits –capitalism works. A fuller example is left to those practitioners or a future release.
5. These types of algorithms need the full time commitment of staff who specialize in thoroughly understanding their use and abuse. That’s not necessarily the role of software developers, but it is the responsibility of the “owners of the algorithms” who generally will be members and clubs or full time DevTreks, or DevTreks-like, staff. That is, staff who work in nonconventional institutions and work hard to “do it right”.
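As an illustration of the estimation technique mentioned in Footnote 2, the following Python sketch applies a commonly used closed-form approximation to the gamma maximum likelihood estimates (often attributed to Thom), in which the inner term is a square root. The function name gamma_mle_approx and the simulated observations are hypothetical, and the sketch is not taken from DevTreks source code.

```python
import math
import random

def gamma_mle_approx(xs):
    """Approximate maximum likelihood estimates of gamma shape and scale."""
    n = len(xs)
    mean_x = sum(xs) / n
    # A = ln(mean(x)) - mean(ln(x)); positive unless every x is identical.
    a = math.log(mean_x) - sum(math.log(x) for x in xs) / n
    shape = (1.0 / (4.0 * a)) * (1.0 + math.sqrt(1.0 + 4.0 * a / 3.0))
    scale = mean_x / shape
    return shape, scale

rng = random.Random(7)
# Hypothetical observations drawn from a gamma with shape 4 and scale 25.
observations = [rng.gammavariate(4.0, 25.0) for _ in range(5000)]
shape, scale = gamma_mle_approx(observations)
print(f"shape={shape:.3f} scale={scale:.3f}")
```

The estimates need real observations (all greater than zero), which is the main practical difference from the method of moments approach that only needs a mean and standard deviation.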
References
Anderson, John; Harri, Ardian; Coble, Keith. Simulation from Mixed Marginal Distributions with Application to Whole-Farm Revenue Simulation. Journal of Agricultural and Resource Economics. April, 2009.
Brebbia, C.A. Risk Analysis VII. 2013 (last accessed on the web in December, 2014).
Azure Machine Learning (AML). Microsoft. 2015 (last accessed August, 2016 at http://azure.microsoft.com/en-us/documentation/services/machine-learning/)
MathNet. Last accessed April 24, 2018:
https://numerics.mathdotnet.com/
https://github.com/mathnet/mathnet-numerics
McCaffrey, James. Microsoft Developers Network (MSDN) Magazine (various issues)
Mendenhall, William, Sincich, Terry. A Second Course in Business Statistics: Regression Analysis. Third Edition. Dellen Publishing Company. 1989
Piwcewicz, Bartosz. Assessment of Diversification Benefit in Insurance Portfolios. Institute of Actuaries of Australia, 2005
Studenmund, A.H. Using Econometrics: A Practical Guide. 2nd edition. HarperCollins Publishers. 1992
System.Math. Last accessed April 24, 2018:
https://msdn.microsoft.com/en-us/library/system.math(v=vs.110).aspx
https://docs.microsoft.com/en-us/dotnet/api/?view=netstandard-2.0&term=math
https://github.com/dotnet/standard
References Note
We try to use references that are open access or that do not charge fees.
Improvements, Errors, and New Features
Please notify DevTreks (devtrekkers@gmail.com) if you find errors in these references. Also please let us know about suggested improvements or recommended new features.
A video tutorial explaining this reference can be found at:
https://www.devtreks.org/commontreks/preview/commons/resourcepack/Technology Assessment 1/1526/none
Appendix A. Algorithm 1 Examples.
These datasets are owned by the Natural Resource Stock club in the GreenTreks network group (if testing on localhost, switch clubs). Some of these algorithms, such as the neural network algorithm, have been replaced by Machine Learning algorithms introduced in the Social Performance Analysis 3 reference. The latter reference also demonstrates more advanced algorithms than the regression and ANOVA algorithms introduced here. These algorithms will be retained as examples of “homegrown” algorithms that are built from scratch, as contrasted to relying on prebuilt algorithms in statistical packages, such as R and Python.
Version 2.1.6 upgraded security, with http://localhost:5000 URLs being redirected, automatically, to https://localhost:5001 URLs.
Example 1. Algorithm 1. Subalgorithm 1. Monte Carlo Simulation with Uncertain Net Benefits
URLs
https://www.devtreks.org/greentreks/preview/carbon/resourcepack/Conservation Technology Assessments Media/1534/none
Uncertain Output Net Benefits
https://www.devtreks.org/greentreks/preview/carbon/outputseries/NIST 451 Net Benefits/2141212685/none
http://localhost:5000/greentreks/preview/carbon/outputseries/NIST 5-4-1 Net Benefits/2141212696/none
Uncertain Total Benefits
https://www.devtreks.org/greentreks/preview/carbon/outputseries/NIST 451 Total Benefits/2141212686/none
https://www.devtreks.org/greentreks/preview/carbon/input/NIST 451 Total Costs/2147397542/none
https://www.devtreks.org/greentreks/preview/carbon/component/NIST 451 Total Costs/2194/none
https://www.devtreks.org/greentreks/preview/carbon/outcome/NIST 451 Total Benefits/5766/none
https://www.devtreks.org/greentreks/preview/carbon/investment/NIST 451 Net Benefits/429/none
https://localhost:5000/greentreks/preview/carbon/investment/NIST 451 CTA/433/none
http://localhost:5000/greentreks/preview/carbon/outputseries/NIST 5-4-1 Total Benefits/2141212697/none
Uncertain Correlated Net Benefits
https://www.devtreks.org/greentreks/preview/carbon/outputseries/NIST 451 Net Benefits, SubAlg3/2141212687/none
http://localhost:5000/greentreks/preview/carbon/outputseries/NIST 5-4-1 Net Benefits Correlations/2141212698/none
Examples 1 to 5 in the introductory Resource Stock Calculation reference use algorithm1 and subalgorithm1 to demonstrate how to use Monte Carlo techniques to calculate uncertain emission and environmental performance indicators. Examples in the M&E Calculation reference demonstrate these techniques for malnutrition project performance indicators. This example focuses on economic performance indicators.
The following example derives from Section 5.4.1, Storage Facility Simulation Example, found in the NIST (1988) reference. The reference introduces the example as follows:
A private investor wants to compute the Net Benefits (NB), Benefit Cost Ratio (BCR), and Adjusted Internal Rate of Return (AIRR) measures of worth to evaluate the economic merits of constructing small scale warehouse storage facilities for rent. … Examples of uncertain inputs that might affect the profitability of a warehouse are rental receipts, operating costs, resale value of the facility at the end of its holding period, and construction costs.
Further explanations for these “measures of worth”, or Performance Measures, can be found in the Performance Analysis tutorial. In the context of CTA, this investment could be for any public goods purpose, such as carbon, energy, health, or water, conservation technologies. The difficulty of measuring the returns from investments in public goods will be addressed in related tutorials (i.e. the CTA-Prevention and Social Performance Analysis tutorials). This example demonstrates the following 3 techniques for conducting an economic evaluation of this capital investment:
1. Uncertain Output Net Benefits: This method calculates Net Benefits for 1 Output base element. The Net Benefit is a type of uncertain Output revenue. This technique is appropriate for quick, summary, economic evaluations. Investments that lose money should use Inputs and be treated as a type of uncertain Input cost.
2. Uncertain Capital Budget Net Benefits: This method calculates Net Benefits using 1 Input that calculates the uncertainty of the total costs and 1 Output that calculates the uncertainty of the total benefits. The Input and Output are added to a Capital Budget and a Resource Stock Totals analysis is used to calculate the uncertainty of the final Net Benefits of the investment. This technique is appropriate for formal, full, economic evaluations.
3. Uncertain Output Net Benefits with Correlated Indicators: This method is similar to Method 1 except that the Units Rented, Resale Value, and Operating Costs, Indicators in the analysis are correlated. Examples 1b to 1e should be completed before reviewing this method.
The following image compares the results of the 3 methods. The results for Method 2 may reflect slightly different Indicator properties –Method 1 and 3’s properties were fine-tuned after Method 2 was already run. Given the random samples that are used to generate these numbers, the results do not appear to be significantly different (which is an empirical question that can be further tested).
Method 1. Uncertain Output Net Benefits
The following Indicators have been added to 1 Output base element. The probability distributions for four of the Indicators can be found in the NIST reference. A fictitious distribution was used for the Net Benefit Indicator. Selected properties of each Indicator are highlighted. The confidence interval for these indicators (x%) is defined using the Score.ConfidenceLevel property.
Units Rented Indicator 1: This Indicator uses a gamma distribution of the number of units rented to calculate uncertain revenues. The distribution has a mean of 564 and a standard deviation of 12. Selected properties include:
Q1 = 600 units
Q2 = 0.94 occupancy rate
Math Expression = I1.Q1 * I1.Q2
QT = 564 units rented
Distribution Type: gamma. The following shape and scale parameters were derived from the method of moments estimations from the 1.3.6.6.11.Gamma Distribution in the US NIST Engineering Statistics Handbook. Footnote 2 discusses limitations with the resultant confidence interval.
QTD1: 2209 shape parameter = (mean / sd) ^2
QTD2 = 3.917 the inverse scale parameter = 1 / (sd ^2 / mean)
QTM = 563.9099 units rented
QTL = 563.6759 lower x% ci
QTU = 564.1438 upper x% ci
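The method of moments arithmetic for this Indicator can be checked with a short Python sketch. It derives QTD1 and QTD2 from the mean and standard deviation and then draws a sample to confirm the parameters reproduce them. The seed and variable names are assumptions, and Python's random.gammavariate stands in for the MathNet distribution used by DevTreks.

```python
import random
import statistics

mean, sd = 564.0, 12.0
shape = (mean / sd) ** 2        # QTD1: shape parameter
rate = 1.0 / (sd ** 2 / mean)   # QTD2: inverse scale parameter
print(shape, round(rate, 3))

rng = random.Random(42)
# random.gammavariate takes (shape, scale), so pass 1 / rate as the scale.
samples = [rng.gammavariate(shape, 1.0 / rate) for _ in range(10_000)]
sample_mean = statistics.fmean(samples)
sample_sd = statistics.stdev(samples)
print(f"sample mean={sample_mean:.2f} sample sd={sample_sd:.2f}")
```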
Unit Rental Income Indicator 2: This Indicator is not uncertain. Selected properties include:
Q1 = 1200 rent per unit per year
Q2 = 10 years
Q3 = .0386 real discount rate (derived from 3.86 / 100)
Math Expression = I2.Q1 * ((((1 + I2.Q3)^I2.Q2) - 1) / (I2.Q3 * ((1 + I2.Q3)^I2.Q2)))
QT and QTM = 9,801.2643 uniform present value of rent per unit over 10 years
QTM has to be manually added
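The Math Expression for this Indicator is a uniform present value factor applied to the annual rent. A minimal Python sketch of the same arithmetic follows. The function name is hypothetical, and the inputs are the Q1, Q2, and Q3 values listed above.

```python
def uniform_present_value_factor(rate, years):
    """Present value of 1 unit received at the end of each year for `years` years."""
    growth = (1.0 + rate) ** years
    return (growth - 1.0) / (rate * growth)

rent_per_unit = 1200.0   # I2.Q1
years = 10               # I2.Q2
discount_rate = 0.0386   # I2.Q3

rent_pv = rent_per_unit * uniform_present_value_factor(discount_rate, years)
print(f"present value of rent per unit: {rent_pv:,.4f}")
```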
Resale Value Indicator 3: This Indicator uses a normal distribution of the warehouse resale value to calculate uncertain revenues. Selected properties include:
Q1 = 1,980,000 resale value in 10 years
Q2 = 10 years
Q3 = .0386 real discount rate (derived from 3.86 / 100)
Math Expression = I3.Q1 * (1 / (1 + I3.Q3) ^ I3.Q2)
QT = 1,350,547 present value of resale value
Distribution Type: normal
QTD1: 1,355,758 (mean)
QTD2 = 230,479 (standard deviation)
QTM = 1,356,235.5922 present resale value
QTL = 1,351,742.0740 lower x% ci
QTU = 1,360,729.1105 upper x% ci
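The Resale Value Math Expression is a single present value calculation. The Python sketch below (function name hypothetical) discounts the resale value using the Q1, Q2, and Q3 values listed above. The result is within a dollar or two of the value entered as the distribution mean (QTD1).

```python
def single_present_value_factor(rate, years):
    """Present value of 1 unit received once, `years` years from now."""
    return 1.0 / (1.0 + rate) ** years

resale_value = 1_980_000.0  # I3.Q1
years = 10                  # I3.Q2
discount_rate = 0.0386      # I3.Q3

resale_pv = resale_value * single_present_value_factor(discount_rate, years)
print(f"present value of resale value: {resale_pv:,.0f}")
```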
Operating Cost Indicator 4: This Indicator uses a normal distribution of the operating costs to calculate uncertain costs. Selected properties include:
Q1 = 156,000 annual operating costs
Q2 = 10 years
Q3 = .0386 real discount rate (derived from 3.86 / 100)
Math Expression = I4.Q1 * ((((1 + I4.Q3)^I4.Q2) - 1) / (I4.Q3 * ((1 + I4.Q3)^I4.Q2)))
QT = 1,274,164.3621 uniform present value of operating costs
Distribution Type: normal
QTD1: 1,274,161 (mean)
QTD2 = 127,416 (standard deviation at 10% of mean)
QTM = 1,274,053.6717 operating costs
QTL = 1,271,559.6809 lower x% ci
QTU = 1,276,547.6625 upper x% ci
Construction Costs Indicator 5: This Indicator uses a lognormal distribution of the construction costs to calculate uncertain costs. The distribution has a mean of 1,800,000 and a standard deviation of 180,000. Selected properties include:
Q1 = 900,000 site preparation
Q2 = 900,000 construction costs
Math Expression = I5.Q1 + I5.Q2
QT = 1,800,000.0000
Distribution Type: lognormal
QTD1: 14.3983 = shape parameter = LN(mean / (1 + (variance / mean^2))^0.5) or LN(mean) – 0.5 * (scale^2) using the LN function of a calculator (not Excel LOG)
QTD2 = 0.0997 = scale parameter = (LN(1 + (variance / mean^2)))^0.5
QTM = 1,799,956.0168 construction cost
QTL = 1,796,459.5882 lower x% ci
QTU = 1,803,452.4454 upper x% ci
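The lognormal parameters can be derived from the arithmetic mean and standard deviation with the standard moment relations for a lognormal distribution. The Python sketch below is an illustration under that assumption. The function name lognormal_params is hypothetical, and the round trip at the end simply confirms that the two parameters reproduce the original mean and standard deviation.

```python
import math

def lognormal_params(mean, sd):
    """Return (mu, sigma) of ln(X) for a lognormal X with the given mean and sd."""
    ratio = 1.0 + (sd / mean) ** 2          # 1 + variance / mean^2
    sigma = math.sqrt(math.log(ratio))
    mu = math.log(mean) - 0.5 * sigma ** 2  # same as ln(mean / sqrt(ratio))
    return mu, sigma

mu, sigma = lognormal_params(1_800_000.0, 180_000.0)
print(f"mu={mu:.4f} sigma={sigma:.5f}")

# Round trip: recover the arithmetic mean and sd from (mu, sigma).
mean_back = math.exp(mu + 0.5 * sigma ** 2)
sd_back = mean_back * math.sqrt(math.exp(sigma ** 2) - 1.0)
print(f"mean={mean_back:,.2f} sd={sd_back:,.2f}")
```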
Net Benefit Indicator 6: This indicator is used to subtract the cost Indicators from the benefit Indicators to calculate uncertain Net Benefits. It is also used to update the Output.Price to the calculated Net Benefits.
Q1 = 180,000 land purchase cost
Q2 = 700,000.0000 tax adjustment cost (for the TX variable in the NIST NB formula)
Math Expression = ((I1.QTM * I2.QTM) + I3.QTM) - (I4.QTM + I5.QTM + I6.Q1 + I6.Q2)
QT = 2,930,624.5261
Distribution Type: normal
QTD1: 2,925,000.0000 (mean)
QTD2 = 300,000 (standard deviation)
QTM = 2,931,071.7309 net benefits
QTL = 2,925,154.7017 lower x% ci
QTU = 2,936,988.7601 upper x% ci
BaseIO = benprice (updates the base element’s Output.Price property with QTM)
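The Net Benefit Math Expression can be evaluated directly from the QTM values reported for Indicators 1 to 5. The Python sketch below uses those reported values, with illustrative variable names. Because each calculator run draws new random samples, the result is close to, but not identical to, the QT shown for this Indicator.

```python
# QTM values reported above for Indicators 1 to 5.
i1_qtm = 563.9099        # units rented
i2_qtm = 9_801.2643      # present value of rent per unit
i3_qtm = 1_356_235.5922  # present resale value
i4_qtm = 1_274_053.6717  # operating costs
i5_qtm = 1_799_956.0168  # construction costs
# Q1 and Q2 of Indicator 6.
i6_q1 = 180_000.0        # land purchase cost
i6_q2 = 700_000.0        # tax adjustment cost

net_benefits = ((i1_qtm * i2_qtm) + i3_qtm) - (i4_qtm + i5_qtm + i6_q1 + i6_q2)
print(f"net benefits: {net_benefits:,.2f}")
```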
The Score properties have been set to return the same results as the Net Benefits Indicator. The following image displays the Resource Stock calculated results for this Output.
The following partial image of the equivalent M&E calculated results shows that the Score is treated as just another Indicator located in the zero index, or Indicator 0, position of the collection of Indicators.
The following image is the cumulative distribution function (CDF) for the Operating Cost Indicator.
The following image is the cumulative distribution function for the Net Benefits Indicator.
The following image displays the updated Output.Price for this Output. The price was automatically updated by the calculator.
Input and Output Amounts can also be updated automatically by the calculator, but after being added to an Operation, Component, Outcome, Operating Budget, or Capital Budget, their Amounts are not automatically updated to the base Input or Output Amounts. Unlike prices, Input and Output quantities must be manually adjusted in budgets (or Operations, Outcomes, and Components) to define a specific technology. That’s why most DevTreks references recommend using unit Inputs and Outputs.
Although the Monitoring and Evaluation Indicators and Scores include the BaseIO property, this version does not automatically update the underlying base element properties. The current thinking is that the Resource Stock calculators should be used for that purpose because they aggregate their results in the same manner as base element costs and benefits –accumulating data from descendants to ancestors.
Method 2. Uncertain Capital Budget Net Benefits for Resource Stock calculations
The cost Indicators used in Method 1 have been added to an Input and the benefit Indicators to an Output. An additional Indicator was added to each base element to calculate Total Costs or Total Benefits. The Total Costs Indicator was used to update the Input.CAPPrice and Total Benefits Indicator updated the Output.Price. The Input and Output have been added to a Component and Outcome and those base elements have been added to a Capital Budget.
The following image displays the Resource Stock Totals Analysis of this capital investment. The Total Benefits Indicator and Total Costs Indicator can be used to communicate the probability of these uncertain benefits and costs.
The Score for the Input was given a negative number equal to Total Costs, while the Score for the Output had a positive number equal to Total Benefits, resulting in a Net Benefits Score as well. The uncertainty of that Performance Measure has been modeled independently and the results can also be communicated to decision makers. Example 1j demonstrates how to include emissions indicators in this type of analysis to analyze an uncertain Cost Effectiveness Performance Measure. The latter measure is often used in HTAs (substituting QALY or DALY Indicators for the emissions Indicators –see the Ireland HIQA reference and the Social Performance Analysis Examples reference).
The following image displays the Net Present Value analysis of this budget. In this example, the NPV totals don’t equal the Net Benefits in the Score because the Inputs and Outputs in the NPV calculations have discounted interest added to them. The NPV totals and nets should also be reported to decision makers.
Method 2. Uncertain Capital Budget Net Benefits for M&E calculations
This warehouse investment may not be a good example demonstrating full M&E analysis because, as the M&E references demonstrate, these analyses are typically carried out in the context of public, rather than private, investments (4*). The following image, for a Health and Sanitation civil engineering project taken from the M&E Introduction reference, is a more typical example.
In order to replicate the results of the warehouse investment contained in a full Capital Budget Resource Stock Analysis, new M&E Indicators would have to be added to the Input, Output, Outcome, Component, Time Period, and Investment, base elements. Those Indicators would then replicate the cumulative totals contained in the Stock Analysis. Given that Resource Stock Totals Analysis generates those numbers automatically, that’s overkill for most analyses. Instead, as the previous image shows, M&E should be carried out in the manner it’s supposed to be carried out. For example, Time Period M&E Indicators can be calculated that account for the impact of the investment –has money been spent effectively in improving the lives and livelihoods of the intended beneficiaries? Both sets of Indicators complement one another and should be used together to improve decision making.
Given that the M&E Introduction reference already includes a complete example of a public goods investment in a malnutrition project, this reference will refrain, at this time, from providing another complete example.
Method 3. Uncertain Output Net Benefits with Correlated Indicators
Please review examples 1b to 1e before reviewing this method. Example 1e demonstrates how to calculate Scores using correlated Indicator sample observations. That technique will be used with correlated Units Rented, Resale Value, and Operating Costs, Indicators. This method is probably closest to the NIST example because the correlated Indicator sampled data is used to set the Score, or Net Benefits, Performance Measure. Method 2 can’t be used because Indicator calculations can only be run for 1 base element at a time –Inputs and Outputs can’t be calculated jointly yet.
The only properties that differ from Method 1 are the following Resource Stock Score properties:
* Score Math Type and Math Sub Type: algorithm 1 and subalgorithm3, Eigen Copula with Normal Distribution
* Stock Joint Data URL or M&E Score URL: The following fictitious Pearson correlation matrix has been uploaded to a Resource as a csv TEXT file and that URL has been added to this property. The NIST reference did not include this data. The only logic for these specific numbers is that the reference mentioned that these Indicators might have a positive correlation. The basic logic is that when the number of units rented increases, operating costs increase. Resale value increases when the units are maintained well, that is, when operating costs increase.
pearson
UR,RV,OC
1,0.5,0.5
0.5,1,0.5
0.5,0.5,1
* Score Distribution Type: Set to none. The correlated indicator distributions are used to fill in the final Scores.
* ScoreD1 and ScoreD2: Both properties are set to zero. The correlated indicator distributions are used to fill in the final Scores.
* Score Math Expression: The following Score Math Expression is used to generate a Score for each row in the random sample data matrix. Only the 3 correlated Indicators with QT are in the sample matrix.
((I1.QT * I2.QTM) + I3.QT) - (I4.QT + I5.QTM + I6.Q1 + I6.Q2)
Example 1e mentions that the remaining non-correlated Indicators will be the same value for each row in the sample matrix (i.e. I2.QTM, I5.QTM, I6.Q1, I6.Q2). The QTMs derive from calculations that are run for the non-correlated Indicators prior to using them in this Math Expression.
* Confidence Interval: 90 (this is a new property in version 1.8.6)
* ScoreM, ScoreL, and ScoreU: Mean of Score with upper and lower 90% confidence intervals. The indicator distributions used 10,000 iterations, so 10,000 Scores were calculated and that vector generates these final results.
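The row-by-row Score calculation described in these properties can be sketched in Python. The sketch makes several simplifying assumptions that differ from DevTreks: it uses a Cholesky factor of the correlation matrix rather than an eigen decomposition, it treats all 3 correlated Indicators as normally distributed (Units Rented uses a gamma distribution above), and the helper name, seed, and normal-approximation interval are illustrative.

```python
import math
import random
import statistics

def cholesky(matrix):
    """Lower-triangular L with L * L^T = matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

corr = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]  # UR, RV, OC
means = [564.0, 1_355_758.0, 1_274_161.0]
sds = [12.0, 230_479.0, 127_416.0]
# Non-correlated values are the same for every row of the sample matrix.
I2_QTM, I5_QTM, I6_Q1, I6_Q2 = 9_801.2643, 1_799_956.0168, 180_000.0, 700_000.0

rng = random.Random(3)
lower = cholesky(corr)
scores = []
for _ in range(10_000):
    z = [rng.gauss(0.0, 1.0) for _ in range(3)]
    # Correlated standard normals are L * z; rescale each to its Indicator.
    ur, rv, oc = (means[i] + sds[i] * sum(lower[i][k] * z[k] for k in range(i + 1))
                  for i in range(3))
    scores.append(((ur * I2_QTM) + rv) - (oc + I5_QTM + I6_Q1 + I6_Q2))

score_m = statistics.fmean(scores)
half_width = (statistics.NormalDist().inv_cdf(0.95)
              * statistics.stdev(scores) / math.sqrt(len(scores)))
score_l, score_u = score_m - half_width, score_m + half_width
print(f"ScoreM={score_m:,.0f} ScoreL={score_l:,.0f} ScoreU={score_u:,.0f}")
```

One Score is computed per sampled row, and the vector of 10,000 Scores is then summarized as a mean with a 90% interval.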
The following images for a Resource Stock calculation demonstrate that the Score properties for Method 3 follow the techniques explained in Example 1e for calculating correlated indicators. The results will be more meaningful when real correlation matrix numbers, or observational data sets, are used.
The following image for an equivalent M&E calculation shows similar results. In this example, the $3,000 difference in the Most Likely Estimate is within acceptable bounds for numbers generated from random samples. Note the use of the Score URL instead of the Stock calculator’s Joint Data URL.
Example 2. Algorithm 1. Subalgorithm 2. Probabilistic Risk: Normal Copulas
URLs:
https://www.devtreks.org/greentreks/preview/carbon/outputseries/CTA Examples 1, Probabilistic Risk/2141212678/none
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7940/PCOR.csv
Corresponding localhost URLs can be found by switching to the Carbon Emission club in the GreenTreks network group, searching for Outputs and following the data hierarchy: CTA Output Examples => Resource Stock Risk Examples => Output Stock Calculators => the 4 Output Series
The NIST (1988) reference used in the previous example mentions that simulation algorithms can also handle interdependencies between Indicators, such as when the Operating Cost Indicator is dependent on the Units Rented Indicator. A probabilistic risk assessment must account for these interdependencies, or correlations, among indicators when generating random samples of numbers to evaluate. Appendix B discusses this further. Otherwise, it’s possible for an indicator to decrease, or not increase enough, when a positively correlated indicator increases. Example 1a demonstrates how Operating Costs may be expected to increase, and definitely not decrease, when the number of Units Rented increases.
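The difference between independent and correlated sampling can be illustrated with a small Python sketch. It draws Units Rented and Operating Costs samples twice, once independently and once with a target correlation, and compares the sample Pearson correlations. The 0.5 correlation, the normal distributions, the seed, and the helper name pearson are assumptions made for illustration only.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(11)
rho = 0.5   # hypothetical correlation between the two Indicators
n = 10_000
z1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
z2 = [rng.gauss(0.0, 1.0) for _ in range(n)]

units = [564.0 + 12.0 * a for a in z1]
# Independent sampling ignores the relationship between the Indicators.
costs_independent = [1_274_161.0 + 127_416.0 * b for b in z2]
# Correlated sampling mixes the two standard normal draws before rescaling.
costs_correlated = [1_274_161.0 + 127_416.0 * (rho * a + math.sqrt(1.0 - rho ** 2) * b)
                    for a, b in zip(z1, z2)]

r_independent = pearson(units, costs_independent)
r_correlated = pearson(units, costs_correlated)
print(f"independent r={r_independent:.3f} correlated r={r_correlated:.3f}")
```

With independent samples, a high Units Rented draw is as likely to be paired with a low Operating Costs draw as with a high one, which is the problem that correlated sampling avoids.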
Six correlated price and yield indicators are entered as follows:
* Distribution Type: The first simulation used normal distributions for prices and yields. A second simulation used the lognormal price and beta yield distributions shown in Table 2 of the Anderson 2009 reference. The second simulation was not replicated after version 1.8.7 because of Indicator property changes –time was better spent preparing the CTA-Prevention reference.
* Q1 to Q5: Standard variables used in Math Expressions. In this example, Q1 has been set equal to theoretical prices and yields.
* QT: For correlated indicators, this is the correlated parameter that is represented in the correlation matrix and calculated as the mean of the random sample columns. For non-correlated indicators, it is the result of the Math Expression.
* QTD1 and QTD2: The normal simulation uses fictitious yield and price distributions. For tutorial purposes, the lognormal distribution uses the values in Table 2 of the Anderson 2009 reference (which wasn’t carried out again after 1.8.7). Correlated indicators require correct distributions because they derive their initial random sample vectors from this distribution. Non correlated indicators are calculated in the regular manner.
* Math Type: algorithm1, Sub Math Type: subalgorithm1 (Monte Carlo): Each indicator starts with a random sample of numbers that are distributed according to the Distribution Type property.
* Math Expression: The Math Expression identifies which column of data to include in analyses and is used to calculate a QT for each row of data. Indicator 1 uses the following Expression. This example does not store data in a Data URL TEXT file, so the expression simply sets QT = Q1.
I1.Q1
* QTM, QTU, QTL: Mean of QT with upper and lower x% confidence intervals. Set from each indicator’s correlated random samples explained shortly.
The Score properties are set to the following:
* Math Type: algorithm1, Sub Math Type: subalgorithm2 (Normal Copula). All correlated probabilistic risk analyses must use the Score Math Type, not each separate indicator’s Math Type. Appendix B explains the basic steps used in the calculation.
* Stock Joint Data URL or M&E Score URL: The following correlation matrix is saved as a csv TEXT file, uploaded to a Base Resource element, and the URL is copied to this property. In this example, the Price and Yield indicator correlations are taken from a Pearson correlation matrix (derived from the Rank correlation matrix shown in Table 1 of the Anderson 2009 reference).
The first line must contain the name of the correlation matrix to use. The options are:
pearson = Pearson correlation matrix
spearman = Spearman rho correlation matrix
The second line must contain a comma-separated list of the correlated indicator labels:
P1,Y1,P2,Y2,P3,Y3
The remaining lines must either contain a comma-separated real correlation matrix or be blank. If TEXT data referenced by the Data URL is being used and multiple observations of matched indicators are available, a blank matrix can be added and the software automatically generates the correlation matrix (see Example 1d). The Data URL data must contain at least 3 matched data elements for each observation. In this example, a known correlation matrix is derived from the techniques recommended by the IPCC 2006 and NASA 2011 references. The csv file appears as follows:
pearson
P1,Y1,P2,Y2,P3,Y3
1,-0.3645,0.5176,-0.1569,0.1047,-0.0524
-0.3645,1,-0.3129,0.7167,-0.0838,0.3129
0.5176,-0.3129,1,-0.4363,0.2922,-0.1256
-0.1569,0.7167,-0.4363,1,-0.0733,0.2611
0.1047,-0.0838,0.2922,-0.0733,1,-0.2091
-0.0524,0.3129,-0.1256,0.2611,-0.2091,1
More than one Joint Data URL can be used (by using semicolon-delimited data urls), but if data files are being used to set the correlation matrices, each Joint Data URL must have a corresponding Data URL file and they must be in the same order. The latter feature has not been tested with actual datasets.
* Score Math Expression: The result of this Expression is the mean revenue for the three price and yield combinations.
((I1.QT*I2.QT) + (I3.QT*I4.QT) + (I5.QT*I6.QT)) / 3
Refer to the third set of images to understand why the following expression doesn’t work with this algorithm. The last version of this reference actually used this expression rather than the correct expression and showed the second set of images (5*).
((I1.QTM * I2.QTM) + (I3.QTM * I4.QTM) + (I5.QTM * I6.QTM)) / 3
* Score Distribution Type: Both simulations used a normal distribution.
* ScoreD1 and ScoreD2: Set manually by some type of expert logic known about the Score (i.e. run the calculator to get the Score, set ScoreD1 and ScoreD2, and run the calculator again to get the final results). Example 1d shows how, by setting these to zero, the underlying indicator distributions can be used to fill in ScoreM, ScoreL, and ScoreU.
* ScoreM, ScoreL, and ScoreU: Mean of Score with upper and lower x% confidence intervals.
* ScoreMUnit: unit of measure for ScoreM.
* Iterations: Number of random samples to generate.
* Confidence Interval: 90
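The Joint Data URL file layout described above (correlation type, indicator labels, then an optional matrix) can be sketched as a small parser. This is a minimal illustration in Python, not the DevTreks implementation. The function name `parse_joint_data` and the validation rules (square, unit diagonal, symmetric) are assumptions made for this sketch, and the sample text uses only the P1 and Y1 block of the matrix.

```python
import csv
import io

def parse_joint_data(text):
    """Parse a Joint Data URL style file: type line, labels line, optional matrix."""
    rows = [r for r in csv.reader(io.StringIO(text)) if any(c.strip() for c in r)]
    corr_type = rows[0][0].strip().lower()
    if corr_type not in ("pearson", "spearman"):
        raise ValueError(f"unknown correlation type: {corr_type}")
    labels = [c.strip() for c in rows[1]]
    matrix = [[float(c) for c in r] for r in rows[2:]]
    if matrix:  # an empty matrix means it must be derived from Data URL observations
        n = len(labels)
        if len(matrix) != n or any(len(r) != n for r in matrix):
            raise ValueError("matrix must be square and match the label count")
        for i in range(n):
            if abs(matrix[i][i] - 1.0) > 1e-9:
                raise ValueError(f"diagonal entry for {labels[i]} must be 1")
            for j in range(i):
                # A correlation matrix must be symmetric (assumed check).
                if abs(matrix[i][j] - matrix[j][i]) > 1e-9:
                    raise ValueError(f"asymmetric entry for {labels[i]},{labels[j]}")
    return corr_type, labels, matrix

corr_type, labels, matrix = parse_joint_data("pearson\nP1,Y1\n1,-0.3645\n-0.3645,1\n")
print(corr_type, labels, matrix)
```

A symmetry check of this kind is a cheap way to catch a transposed or mistyped coefficient before any random samples are generated.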
The calculator uses the following steps:
* Step 1. Run an asynchronous loop that simultaneously iterates through each dataset in the Joint Data URL TEXT file. Parse the data and determine the type of correlation matrix to build, the labels of the correlated indicators, and optionally, an initial correlation matrix. Errors with datasets will be added to the Calculator.Description property.
* Step 2. If the Data URL holds TEXT datasets, load the dataset corresponding to the Joint Data URL property. Use that matrix to automatically calculate a Pearson or Spearman correlation matrix.
* Step 3. Use the techniques explained in Appendix B to generate a matrix of correlated random samples.
* Step 4. Use the correlated random sample matrix to generate descriptive statistics for each vector of indicators. Set each correlated indicator’s QTM, QTL, and QTU from the statistics and add a summary of the statistics to the Math Result property.
* Step 5. When all of the calculations are completed, set ScoreM, ScoreL, and ScoreU using the regular properties of the Score.
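The steps above can be sketched for two correlated indicators. This is a hedged Python illustration of a normal copula, not the DevTreks source code. A Cholesky factor is used to correlate the standard normal draws. The means and standard deviations (22 and 1 for price, 11 and 0.5 for yield), the seed, and the use of empirical percentiles for the lower and upper bounds are all assumptions made for this sketch.

```python
import math
import random
import statistics

def cholesky(c):
    """Lower-triangular factor L with L * L^T == c (c must be positive definite)."""
    n = len(c)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(c[i][i] - s)
            else:
                low[i][j] = (c[i][j] - s) / low[j][j]
    return low

def correlated_samples(corr, marginals, iterations, seed=0):
    """Correlate standard normals, map them to uniforms, then to each marginal."""
    rng = random.Random(seed)
    low = cholesky(corr)
    std = statistics.NormalDist()
    rows = []
    for _ in range(iterations):
        z = [rng.gauss(0.0, 1.0) for _ in corr]
        x = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(corr))]
        rows.append([m.inv_cdf(std.cdf(v)) for m, v in zip(marginals, x)])
    return rows

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def mean_with_interval(values, confidence=90):
    """Mean plus empirical lower and upper percentiles (an assumed interval style)."""
    ordered = sorted(values)
    tail = (100 - confidence) / 200.0
    lower = ordered[int(tail * (len(ordered) - 1))]
    upper = ordered[int((1 - tail) * (len(ordered) - 1))]
    return statistics.fmean(ordered), lower, upper

corr = [[1.0, -0.3645], [-0.3645, 1.0]]  # P1,Y1 block of the example matrix
marginals = [statistics.NormalDist(22.0, 1.0), statistics.NormalDist(11.0, 0.5)]  # hypothetical
samples = correlated_samples(corr, marginals, iterations=20000)
scores = [price * yield_ for price, yield_ in samples]  # I1.QT * I2.QT for each row
score_m, score_l, score_u = mean_with_interval(scores, confidence=90)
print(round(pearson([r[0] for r in samples], [r[1] for r in samples]), 3))
```

The printed coefficient should land close to, but not exactly on, the starting value of -0.3645, which mirrors the slight difference between the starting and regenerated matrices noted in this example.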
The following images demonstrate that each indicator is calculated using random samples of correlated numbers and that the Pearson or Spearman coefficient matrix generated from those numbers is added to the Score Math Result. Note the slight difference between the starting Pearson matrix and the new matrix.
The following images show that, for correlated indicators, the Indicator.QT property is calculated as the mean of the appropriate column of random samples. The initial random samples derive from the Indicator Distribution properties, so those properties must be correct. That’s also how the Indicator.QTM is set. For non-correlated indicators, QT is the result of the Math Expression.
The following image shows a new MathExpression used to test Version 2.0.4. DevTreks hadn’t run the correlated indicator examples in a while and assumed this expression was “better” than the original expression in its use of QTMs. Wrong assumption. The Score Math Expression uses the random sample matrix to calculate Scores. That matrix uses the QT, rather than QTM, properties to store numbers in the matrix. All this expression is doing is repeating the same numbers from the existing Indicator QTMs –it is not using the random sample data at all (5*).
((I1.QTM * I2.QTM) + (I3.QTM * I4.QTM) + (I5.QTM * I6.QTM)) / 3
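The difference between the QT and QTM forms of the expression can be seen on a tiny sample matrix. The four rows below are hypothetical numbers chosen for easy arithmetic, and the sketch is only an illustration of the reasoning in this paragraph, written in Python.

```python
import statistics

# Hypothetical sample matrix R: each row holds I1.QT (price) and I2.QT (yield).
R = [[20.0, 12.0], [22.0, 11.0], [24.0, 10.0], [22.0, 11.0]]

# QT form: one Score per row, so the Scores have a distribution.
row_scores = [price * yield_ for price, yield_ in R]

# QTM form: the column means are constants, so every row repeats the same Score.
qtm = [statistics.fmean(column) for column in zip(*R)]
qtm_scores = [qtm[0] * qtm[1] for _ in R]

print(row_scores, statistics.pstdev(row_scores))
print(qtm_scores, statistics.pstdev(qtm_scores))
```

Because the QTM form has zero spread, no meaningful ScoreL and ScoreU can be derived from it.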
The following image, from an earlier version, displays one potential way to communicate the results of this analysis to decision makers.
Example 3. Algorithm 1. Subalgorithm 3. Probabilistic Risk: Eigen Copulas with Normal Distributions
URL:
https://www.devtreks.org/greentreks/preview/carbon/outputseries/CTA Examples 2, Probabilistic Risk/2141212679/none
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7943/SCOR.csv
Corresponding localhost URLs can be found by switching to the Carbon Emission club in the GreenTreks network group, searching for Outputs and following the data hierarchy: CTA Output Examples => Resource Stock Risk Examples => Output Stock Calculators => the 4 Output Series
This example uses algorithm 1 and subalgorithm3, Eigen Copula with Normal Distribution, for the Score Math Type and Score Math Sub Type properties. The changes from Example 1 include different Indicator distributions, the use of an Eigen decomposition function from the math library, and a Spearman correlation. Appendix B shows the source code. The Anderson 2009 reference explains the steps used with this algorithm (i.e. using the square root of the eigenvalues). Note that a chapter in the Brebbia (2013) reference finds faults with this algorithm (but that reference is not open access and therefore of limited usefulness in this context).
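The idea of factoring a correlation matrix with the square root of its eigenvalues can be sketched for the two-indicator case, where the eigen decomposition has a closed form. This Python sketch is an illustration only and is not the Appendix B source code. The function names are hypothetical, and the rank-to-Pearson conversion shown is the standard bivariate normal relationship, which is assumed here rather than confirmed as the conversion used by the references.

```python
import math

def spearman_to_pearson(rho):
    """Bivariate normal relationship: r = 2 * sin(pi * rho / 6)."""
    return 2.0 * math.sin(math.pi * rho / 6.0)

def eigen_factor_2x2(r):
    """Factor [[1, r], [r, 1]] as A * A^T, with A = V * sqrt(Lambda)."""
    eigenvalues = [1.0 + r, 1.0 - r]
    s = 1.0 / math.sqrt(2.0)
    vectors = [[s, s], [s, -s]]  # columns are the unit eigenvectors
    return [[vectors[i][k] * math.sqrt(eigenvalues[k]) for k in range(2)]
            for i in range(2)]

def times_transpose(a):
    """Return a * a^T, used to confirm the factor rebuilds the matrix."""
    n = len(a)
    return [[sum(a[i][k] * a[j][k] for k in range(len(a[0]))) for j in range(n)]
            for i in range(n)]

r = spearman_to_pearson(-0.35)
factor = eigen_factor_2x2(r)
rebuilt = times_transpose(factor)
print(round(r, 4), [[round(v, 4) for v in row] for row in rebuilt])
```

Multiplying a vector of independent standard normal draws by this factor gives correlated draws, in the same way a Cholesky factor does. With this conversion, rank correlations of -0.35 and 0.5 give about -0.3645 and 0.5176, which agree with entries in the Pearson matrix of the previous example.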
The following images display a normal distribution test using the Spearman correlation matrix.
Example 3. Algorithm 1. Subalgorithm 3. Probabilistic Risk: Eigen with Data URL and no Score Distribution
URL:
https://www.devtreks.org/greentreks/preview/carbon/outputseries/CTA Examples 3, Probabilistic Risk/2141212680/none
Corresponding localhost URLs can be found by switching to the Carbon Emission club in the GreenTreks network group, searching for Outputs and following the data hierarchy: CTA Output Examples => Resource Stock Risk Examples => Output Stock Calculators => the 4 Output Series
Stock datasets (note the use of the Indicator Label)
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7939/PCORPYs.csv
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7942/PYs.csv
http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1753/PYs.csv
http://localhost:5000/resources/network_carbon/resourcepack_166/resource_1743/PCORPYs.csv
M&E datasets (note the use of the Indicator index position)
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_9106/PCORPYs.csv
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_9107/PYs.csv
http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1878/PYs.csv
http://localhost:5000/resources/network_carbon/resourcepack_166/resource_1879/PCORPYs.csv
This example changes the techniques employed in Example 1c by using a blank correlation matrix stored in a Joint Data URL property and a sample dataset of Prices and Yields stored in a Data URL property. In addition, Indicator distributions are used to set Scores, rather than Score Distributions. Normal price and yield distributions are used. The following properties differ from Example 1c:
* Indicator.Math Expression. The Expression must include terms that are associated with data column names using the Ix.Qx.ColName convention. This example uses only one price or yield variable, so the Math Expression simply sets QT = Q1:
I1.Q1.PriceorYield
* Stock Joint Data URL or M&E Score URL: This file holds 2 lines of csv data for the first simulation. The first line specifies the type of correlation, the second line specifies the labels of the correlated Indicators.
Stock data
pearson
P1, Y1, P2, Y2, P3, Y3
M&E data [must follow DataURL dataset conventions]
pearson
2, 1, 4, 3, 6, 5
* Score Distribution Type: Set to none. The indicator distributions are used to fill in the final Scores.
* ScoreD1 and ScoreD2: Both properties are set to zero. The indicator distributions are used to fill in the final Scores.
* Score Math Expression: The correlated random sample matrix (R) generated in Examples 1 and 2 contains each indicator’s QT property only. Sample statistics generated from the indicator vectors in matrix R are used to set the final QTM, QTL, and QTU properties. A Score can be set for each row of R by using a Score Math Expression that includes each row of indicator QTs. The following Score Math Expression is used to generate a Score for each row in the R matrix:
((I1.QT*I2.QT) + (I3.QT*I4.QT) + (I5.QT*I6.QT)) / 3
Additional indicator properties can be included in the expression, but they won’t come from data in the R matrix –they’ll come directly from each indicator (i.e. they’ll be the same for each row). Non-correlated Indicator calculations are run before this Math Expression so that QTM terms can be used in the Expression (see Example 1a).
* Data URL: This csv file holds 11 fictitious data observations for each of the six price and yield indicators. This data is used to automatically generate the appropriate correlation matrix. Once the correlation matrix is built, this data is not used again because each indicator’s distribution properties are used to run calculations. The following is the first row used in this example.
label, date, output, none, PriceorYield
The actual data starts on the second line. The QT value will be calculated for each row of data using the Indicator.MathExpression and whatever Qx properties are in the equation. The calculated QT columns are used to build a new correlation matrix. The first lines of each dataset are as follows:
Stock dataset
P1,12/30/2014,barley,0,22
P1,12/31/2014,barley,0,21
P1,1/1/2015,barley,0,21.25
P1,1/2/2015,barley,0,21.5
P1,1/3/2015,barley,0,21.75
P1,1/4/2015,barley,0,23
P1,1/5/2015,barley,0,22.75
P1,1/6/2015,barley,0,22.5
P1,1/7/2015,barley,0,22.25
P1,1/8/2015,barley,0,21.9
P1,1/9/2015,barley,0,22.1
Y1,12/30/2014,barley,0,11.5
Y1,12/31/2014,barley,0,10.5
Y1,1/1/2015,barley,0,10.75
M&E dataset
index, date, output, none, PriceorYield
2,12/30/2014,barley,0,22
2,12/31/2014,barley,0,21
…
1,12/30/2014,barley,0,11.5
1,12/31/2014,barley,0,10.5
More than one Data URL can be used (by using semicolon-delimited data urls), but each Data URL must have a corresponding Joint Data URL file and they must be in the same order. All algorithms with multiple datasets are run asynchronously and simultaneously.
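How a blank correlation matrix might be filled in from Data URL observations can be sketched as follows. This Python sketch is an assumption-laden illustration, not the DevTreks code: it treats the last column as QT (as in the I1.Q1.PriceorYield expression), matches observations between indicators by their date, and computes Pearson coefficients only. The sample text holds the three P1 and Y1 rows that share dates in the Stock dataset above.

```python
import csv
import io
import math
from collections import defaultdict

def observations_by_label(text):
    """Group the last column (QT) by indicator label, keyed by date for matching."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    series = defaultdict(dict)
    for row in reader:
        if row:
            series[row[0].strip()][row[1].strip()] = float(row[-1])
    return series

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(series, labels):
    """Pearson matrix over the dates that every listed indicator shares."""
    shared = sorted(set.intersection(*(set(series[label]) for label in labels)))
    if len(shared) < 3:
        raise ValueError("at least 3 matched observations are required")
    columns = [[series[label][d] for d in shared] for label in labels]
    return [[pearson(a, b) for b in columns] for a in columns]

DATA = """label, date, output, none, PriceorYield
P1,12/30/2014,barley,0,22
P1,12/31/2014,barley,0,21
P1,1/1/2015,barley,0,21.25
Y1,12/30/2014,barley,0,11.5
Y1,12/31/2014,barley,0,10.5
Y1,1/1/2015,barley,0,10.75
"""
matrix = correlation_matrix(observations_by_label(DATA), ["P1", "Y1"])
print([[round(v, 4) for v in row] for row in matrix])
```

In these three rows each yield happens to equal the price minus 10.5, so the computed coefficient is 1. A realistic dataset would give a coefficient strictly between -1 and 1.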
The following images show these properties.
Stock Score
M and E Score
Example 4. Algorithm 1. Subalgorithm 4. Probabilistic Risk: Eigen Copulas with Uniform Distributions
URL:
https://www.devtreks.org/greentreks/preview/carbon/outputseries/NIST 451 Net Benefits, SubAlg4/2141212688/none
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7978/NIST451Pearson.csv
This algorithm uses a slight variation of subalgorithm3 that is explained in Appendix B. Since the same exact techniques are used, only the results of an example calculation will be used to explain this algorithm.
The following image shows the results of running this algorithm for the same data used with Example 1a, Method 3. That example used subalgorithm3. The resultant differences are slight, may not be statistically significant, but may be useful for some Indicators. The reason for including it is that some of Appendix B’s references recommend this technique.
Example 5. Algorithm 1. Subalgorithm 5. Combinatorial Optimization: Simulated Annealing (3*)
URL:
https://www.devtreks.org/greentreks/preview/carbon/output/CTA Examples 5, Combinatorial Optimization/2141223457/none
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7944/SimAnn1.csv
http://localhost:5000/greentreks/preview/carbon/output/Resource Stock Combo Optimization Examples/2141223462/none
Stock dataset
http://localhost:5000/resources/network_carbon/resourcepack_166/resource_1746/SimAnn1.csv
M&E dataset
http://localhost:5000/resources/network_carbon/resourcepack_166/resource_1874/SimAnn2.csv
This algorithm derives from the McCaffrey reference (January, 2012 issue). That reference contains examples of dozens of numeric algorithms, often used by software developers, to solve numeric problems involving cost minimization, best scheduling, machine learning, and artificial intelligence. This reference will include additional examples of these algorithms in future upgrades.
The simulated annealing algorithm solves combinatorial optimization problems. It was developed by engineers to calculate the best way to cool down specific materials, such as molten metals. They used the algorithm to figure out the best, or optimum, amount of energy that could be used for the cool down.
In this example, it tries to minimize the total amount of time spent by 5 workers to complete 6 tasks. Each task must be assigned to one worker and each task can be completed by one of three workers. The total potential combinations of worker/tasks is 3^6, or 729. The usefulness of the algorithm increases as the potential combinations increase –the McCaffrey reference uses the example of 20 tasks that can be completed by 12 workers, or 12^20 combinations. The algorithm loops through a sample size defined by an Iterations property to keep selecting better and better (lower number of hours) combinations of workers and tasks.
The algorithm can be applied in creative ways. The author found a recent reference (not cited) where the algorithm has been proposed as a better way to generate random samples of correlated numbers than conventional copula methods (by minimizing differences in Pearson coefficients).
The following properties are the initial parameters used by the indicators being calculated. This example uses the terms “workers, tasks, and hours” for illustrative purposes. The terms “rows, columns, and best result” are also appropriate.
* Label: For Stock calculators, must correspond to the Label used in the Joint Data URL property. For M&E calculators does not need to correspond –but the Indicator must be in the Index position identified in the Score URL dataset.
* Distribution Type: none. Results will not be expressed in terms of confidence intervals.
* Q1: current temperature
* Q2: alpha, the cooling rate
* Q3: in this example, a penalty when a worker has more than 1 task to complete
* Math Type algorithm1, Sub Math Type: subalgorithm5 (simulated_annealing): Run a custom DevTreks algorithm to carry out simulated annealing analysis.
* Math Expression: blank. Q1 to Q3 will be passed to the algorithm and the algo will generate QT and QTM.
* QT and QTM: QT and QTM are equal and reflect the total number of hours (i.e. energy) required in the final solution.
* Math Result: this algorithm requires deleting previous Math Results or the new Math Results get added to the previous results
The Stock Score properties are set to the following:
* Score.DataURL: The following table shows that this style of algorithm employs a standard TEXT dataset structure for jointly calculated Indicators. Unlike a Data URL, the numeric data in these datasets do not usually coincide with QT to Q1 properties. In addition, the Math Expression does not need to identify which columns of data to analyze. The data is saved as a csv TEXT file, uploaded to a Base Resource element, and the URL is copied to this property.
The first line, or header row, substitutes a Row Name column for a Data URL’s Date column and allows up to 11 columns of numbers to be analyzed. Columns and rows throughout algorithms are restricted until testing reveals more about their consequences.
Indicator Label, Row Name, none, Col 1 Amount, Col 2 Amount, Col 3 Amount, …, Col 11 Amount
CO2, W1, 0, 6.543, 7.000, 3.26, …, 1.500
The actual data for the 2 indicators being analyzed appears as follows:
label,rowname,none,T1, T2, T3, T4, T5, T6
TW1,W1,,7.5,3.5,2.5,0,0,0
TW1,W2,,0,1.5,4.5,3.5,0,0
TW1,W3,,0,0,3.5,5.5,3.5,0
TW1,W4,,0,0,0,6.5,1.5,4.5
TW1,W5,,2.5,0,0,0,2.5,2.5
TW2,W1,,7,2.5,3.5,0,0,0
TW2,W2,,0,3.5,1.5,3.5,0,0
TW2,W3,,0,0,3.5,5.5,2.5,0
TW2,W4,,0,0,0,6.5,1.5,4
TW2,W5,,2.5,0,0,0,2.5,2.5
* Math Type: algorithm1 and subalgorithm1. In this example, the score is running a standard probabilistic risk function.
* Score Math Expression: I1.QTM + I2.QTM
* Score Distribution Type: normal
* Score and ScoreM, ScoreLow, ScoreHigh: The result of the Math Expression and Score algorithm
* ScoreMUnit: unit of Measure for ScoreM.
* Score.MathResult: The result of running the Score algorithm or none.
The following M&E Score properties differ from the Stock Score properties:
* Score.DataURL: The following table shows this dataset differs from the Stock dataset by using the Indicators’ Index position, rather than Label, to identify the Indicator where the calculations get run.
Indicator Index, Row Name, none, Col 1 Amount, Col 2 Amount, Col 3 Amount, …, Col 11 Amount
1, W1, 0, 6.543, 7.000, 3.26, …, 1.500
The actual data for the 2 indicators being analyzed appears as follows:
index,rowname,none,T1, T2, T3, T4, T5, T6
1,W1,,7.5,3.5,2.5,0,0,0
1,W2,,0,1.5,4.5,3.5,0,0
1,W3,,0,0,3.5,5.5,3.5,0
1,W4,,0,0,0,6.5,1.5,4.5
1,W5,,2.5,0,0,0,2.5,2.5
2,W1,,7,2.5,3.5,0,0,0
2,W2,,0,3.5,1.5,3.5,0,0
2,W3,,0,0,3.5,5.5,2.5,0
2,W4,,0,0,0,6.5,1.5,4
2,W5,,2.5,0,0,0,2.5,2.5
The calculator uses the following steps:
* Step 1. Run an asynchronous loop that simultaneously iterates through each dataset in the Joint Data URL TEXT file. Parse the data into a double[,] array. Errors with datasets will be added to the Calculator.Description property.
* Step 2. Use the indicator Q1 to Q3 properties to initiate a simulated annealing object. Pass in a double[,] array holding the initial worker/task hours. Run simultaneous simulations.
* Step 3. Set Indicator 1’s QT and QTM properties to the solution’s best energy amount. Set the MathResult property to the final optimum worker/task matrix.
* Step 4. Loop to the next indicator and carry out Steps 1 to 3 for that indicator.
* Step 5. Set the Score properties from the indicators properties when all the simulations have been completed.
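The steps above can be sketched with the TW1 worker and task hours. This Python sketch is a generic simulated annealing loop under stated assumptions and is not the DevTreks or McCaffrey code. The starting temperature (10000), cooling rate (0.995), penalty (3.5), iteration count, seed, and the exact form of the penalty (one penalty per extra task held by a worker) are all hypothetical values chosen for illustration.

```python
import math
import random

# TW1 hours: rows are workers W1..W5, columns are tasks T1..T6.
# A zero means the worker cannot complete the task.
HOURS = [
    [7.5, 3.5, 2.5, 0, 0, 0],
    [0, 1.5, 4.5, 3.5, 0, 0],
    [0, 0, 3.5, 5.5, 3.5, 0],
    [0, 0, 0, 6.5, 1.5, 4.5],
    [2.5, 0, 0, 0, 2.5, 2.5],
]

def energy(assignment, hours, penalty):
    """Total hours plus a penalty for each extra task held by the same worker."""
    total = sum(hours[w][t] for t, w in enumerate(assignment))
    counts = [assignment.count(w) for w in range(len(hours))]
    return total + penalty * sum(c - 1 for c in counts if c > 1)

def anneal(hours, temperature, alpha, penalty, iterations, seed=0):
    rng = random.Random(seed)
    n_tasks = len(hours[0])
    able = [[w for w in range(len(hours)) if hours[w][t] > 0] for t in range(n_tasks)]
    current = [rng.choice(able[t]) for t in range(n_tasks)]
    current_e = energy(current, hours, penalty)
    best, best_e = current[:], current_e
    for _ in range(iterations):
        candidate = current[:]
        t = rng.randrange(n_tasks)
        candidate[t] = rng.choice(able[t])  # reassign one task to an able worker
        cand_e = energy(candidate, hours, penalty)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature cools.
        if cand_e <= current_e or rng.random() < math.exp((current_e - cand_e) / temperature):
            current, current_e = candidate, cand_e
            if current_e < best_e:
                best, best_e = current[:], current_e
        temperature = max(temperature * alpha, 1e-9)  # cool down, avoid division by zero
    return best, best_e

best, best_e = anneal(HOURS, temperature=10000.0, alpha=0.995, penalty=3.5, iterations=5000)
print([f"T{t + 1}->W{w + 1}" for t, w in enumerate(best)], best_e)
```

In this sketch `best_e` plays the role of the value written to QT and QTM, and the printed assignment corresponds to the worker/task matrix that would be summarized in the Math Result.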
The following image shows the results. This example is not particularly important in itself. The importance lies in the context in which the algorithm can be used.
Example 7. Algorithm 1. Subalgorithm 7. Classification and Prediction: Neural Network (3*)
URLs:
Simulation 1. 4 input variables with 3 possible output values
https://www.devtreks.org/greentreks/preview/carbon/output/CTA Examples 7a, Neural Network/2141223458/none
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7955/NeuralEx1.csv
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7956/NeuralEx2.csv
http://localhost:5000/greentreks/preview/carbon/output/Resource Stock Neural Network Example 1/2141223463/none
Stock dataset
http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1755/NeuralEx1.csv
M&E dataset
http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1876/NeuralEx1.csv
Version 2.1.4, as documented in the Social Performance Analysis 3 reference, demonstrates the latest techniques employed by DevTreks for running machine learning algorithms. This algorithm is being retained as another example of how to run joint calculations.
This algorithm derives from the McCaffrey reference (July, 2012 issue). This algorithm classifies and predicts the value of an output variable, given a set of predictive input variables. The neural network term refers to the techniques employed by the algorithm to learn how to classify and predict an output value based on an initial dataset of observed, or expertly surmised, input variables.
Botanists first developed this algorithm to predict the classification of a plant’s species based on a species’ input characteristics (i.e. sepal length, sepal width, petal length, petal width). The McCaffrey reference cites examples where the algorithm has been used to classify a loan applicant’s credit score based on their income and expenses, and a hospital patient’s cancer status based on blood test variables.
Although the algorithm supports classifications that are either numeric or textual, the nature of these Resource Stock indicators means that only numeric outcome values are supported in this version. Analysts should provide multimedia support (graphics, tables) that communicates the raw data numeric results in a manner that decision makers will understand.
McCaffrey uses data containing 4 fictitious input variables that are used to classify or predict 3 possible flower colors. The predicted output can take the values: 0=red, 1=green, 2=blue, and the code turns each value into an N-1 vector (i.e. red = {1, 0, 0}). The referenced code has been modified to accept up to 9 input variables and 10 potential values for the output being predicted (the modified code was not debugged in version 1.9.2 because it will be either upgraded or replaced at an appropriate time). The size of the neural network used to make predictions directly relates to the number of input variables and output options (hence the restrictions). This version of the algorithm uses 500 iterations to train the neural network.
The algorithm employs a Particle Swarm Optimization algorithm to minimize the error associated with training the neural network. The algorithm uses random numbers for what it calls “cognitive and social randomizations”. As a result, running the algorithm consecutive times returns different results. To test the sensitivity of the results, one test ran the calculator 10 consecutive times for one output and one output series (20 results in total) with this example’s data. The test generated the following results: 0 predicted 10 out of 20 times with 90% average accuracy, 2 predicted 5 out of 20 times with 90% average accuracy, and 1 predicted 5 out of 20 times with 86% average accuracy. The reference points out that alternative ways for training the neural network can be used.
The author is not an expert using this type of algorithm, but assumes these types of results may be acceptable for some types of data (i.e. predicting advertisement use). He further assumes those types of results support a predicted output value of 0 with 90% accuracy (rather than 50%). As pointed out in Footnote 3, these algorithms will evolve.
A simulation that uses 1 indicator is displayed in the following image.
The following Indicator properties are set using a similar pattern to Example 5:
* Label: Must correspond to the Labels used in the Joint Data URL property.
* Distribution Type: none. Results will not be expressed in terms of confidence intervals.
* Q1 to Q5 Amounts: Not required because 20% of the data rows are test data –the first row of that test data can correspond to Qx variables and the calculated result is displayed in the Math Result. Rather than leaving these set to zero, the values for the first row of test data can be used to improve reporting. When 2 indicators are used, the second indicator will contain the final calculations.
* Q1 to Q5 Units: Not required, but rather than leaving empty, they can correspond to the first row of test data.
* Math Type algorithm1, Sub Math Type: subalgorithm7 (neural_network): Run a custom DevTreks algorithm to carry out neural network analysis.
* Math Expression: blank. The Qx properties will be passed to the algorithm and the algo will generate QT and QTM.
* QT and QTM Amounts: The QT and QTM Amounts will be equal and reflect the classified/predicted output variable. In this example, Q1 is the guessed output variable.
* QTL: If the Distribution Type is set to none, records the percent accuracy of the network in making predictions.
The following Score properties are set using a similar pattern to Example 5:
* Score.DataURL: The following table shows that this style of algorithm employs the same data structure as algorithm 5. In addition, the first 80% of rows are used to train the neural network while the remaining 20% are used to test the network. The prediction results for the first 3 rows of test data are included in the Math Result and can correspond to specific Qxs used to make predictions. The data is saved as a csv TEXT file, uploaded to a Base Resource element, and the URL is copied to this property. In this algorithm, the Col 0 Amount must be an output value. The input values go into Col 1 to Col 10.
Indicator Label, Species, Row Name, Col 1 Amount, Col 2 Amount, Col 3 Amount, …, Col 11 Amount
C1, rose, red, 1, 6.543, 7.000, …, 1.500
Indicator 1. 4 input variables and 3 output value options.
This data contains 100 rows of artificial data taken directly from the McCaffrey reference. The following table displays the first few lines of data. The output value, with 3 possible numeric values, is stored in the 4th column with the header “color”. The input values are in the remaining 4 columns.
label,species,color2,color,length,width,slength,swidth
C1,rose,green,2,8,5,9,5
C1,rose,blue,1,9,5,2,2
C1,rose,red,0,6,9,4,6
C1,rose,blue,1,9,2,3,3
C1,rose,red,0,7,6,9,8
Indicator 2. 8 input variables and 5 output value options.
This data contains 100 rows of artificial data. The following table displays the first few lines of data. The output value, with 5 possible numeric values, is stored in the 4th column with the header “color”. The input values are in the remaining 8 columns.
label,species,color2,color,length,width,slength,swidth,var5,var6,var7,var8
C2,rose,green,2,8,5,9,5,8,5,9,5
C2,rose,blue,1,9,5,2,2,9,5,2,2
C2,rose,red,0,6,9,4,6,6,9,4,6
C2,rose,white,4,9,2,3,3,9,2,3,3
C2,rose,red,0,7,6,9,8,7,6,9,8
C2,rose,red,0,9,8,8,6,9,8,8,6
C2,rose,blue,1,3,8,6,2,3,8,6,2
* Iterations: This property is not used by the algorithm (the algorithm uses 500 iterations). This property can be used if confidence intervals are being generated.
The M&E Score properties are as follows:
* Score.DataURL: M&E calculators use the same property to store these types of datasets, but they place an Indicator’s Index position, rather than Label, in the first column. The Score’s Index position, 0, can be included in these datasets.
index,species,color2,color,length,width,slength,swidth,var5,var6,var7,var8
1,rose,green,2,8,5,9,5,8,5,9,5
1,rose,blue,1,9,5,2,2,9,5,2,2
The calculator uses the following steps:
* Step 1. Run an asynchronous loop that simultaneously iterates through each dataset in the Joint Data URL TEXT file. Parse the data stored in the Joint Data URL file (which must be uploaded to a base Resource element).
* Step 2. Use one or more indicator Q1 to Q5 properties as an observed set of output and input variables. Parse the data into a matrix and pass the output/input matrix to the algorithm. Run the simulation.
* Step 3. Use the first 80% of the data observations to train the neural networks. This data is normalized to values between -1.0 and 1.0 using the intercept and slope of the min and max values in the data. Use the final 20% of observations to see how well the new neural network matrix predicts and classifies the output variable. Add the Qx variables to the last row of the test matrix. Use this last row to set the indicator’s QT and QTM properties. Do not use that row when determining the percent accuracy of the neural network. Add summary data from the simulation to the MathResult property.
* Step 4. Loop to the next unique indicator and carry out Steps 1 to 3 for that indicator.
* Step 5. When all of the calculations have been completed, set the Score properties from the indicators properties.
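The data preparation in Step 3 can be sketched without the neural network itself. This Python sketch is an illustration only: the helper names are hypothetical, the five rows are the first rows of the Indicator 1 dataset shown above, and scaling each column over all rows (rather than over the training rows only) is an assumption. The network training is omitted.

```python
def split_train_test(rows, train_fraction=0.8):
    """First 80% of rows train the network; the remaining rows test it."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def scale_column(values):
    """Map values linearly so the minimum becomes -1.0 and the maximum 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    slope = 2.0 / (hi - lo)
    intercept = -1.0 - slope * lo
    return [slope * v + intercept for v in values]

def one_hot(value, n_classes):
    """Encode an output value as a vector, e.g. 0 -> [1.0, 0.0, 0.0]."""
    return [1.0 if i == value else 0.0 for i in range(n_classes)]

def winner_takes_all(probabilities):
    """Index of the largest predicted probability."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)

# color, length, width, slength, swidth from the first Indicator 1 rows
ROWS = [
    [2, 8, 5, 9, 5],
    [1, 9, 5, 2, 2],
    [0, 6, 9, 4, 6],
    [1, 9, 2, 3, 3],
    [0, 7, 6, 9, 8],
]
inputs = [scale_column(list(col)) for col in zip(*[r[1:] for r in ROWS])]
scaled_rows = [list(r) for r in zip(*inputs)]
targets = [one_hot(r[0], 3) for r in ROWS]
train, test = split_train_test(list(zip(scaled_rows, targets)))
print(len(train), len(test), winner_takes_all([0.1, 0.7, 0.2]))
```

The `winner_takes_all` helper mirrors how the Predicted row in the Math Result is read: the output value with the highest predicted fraction is the prediction.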
The following images show the Math Result property of the first simulation. In the Examples of the network accuracy, the first row of numbers, Input, holds the indicator Qx results (using data that has been normalized to values between -1 and 1). The second row, Output, uses an indexed value of 1 to show the actual Output value found in the dataset. The third row, Predicted, uses fractions to show the network’s predicted probabilities for each output value. The fraction with the highest value is the predicted output (i.e. a winner takes all approach). The predicted color is 0, or red. In general, having 0 as an option is discouraged because the results appear as if nothing happened. The network’s prediction accuracy is around 85%. Decision makers can decide if that level of accuracy for making predictions is good or bad (i.e. it might be bad for a health and safety-related indicator, but acceptable for an advertising sales prediction). Note that the exact same calculator properties were run at the same time for the child output series and produced the same predicted color with a 76% accuracy.
Example 6. Algorithm 1. Subalgorithm 6. Probabilistic Statistics: Regression Analysis
https://www.devtreks.org/greentreks/preview/carbon/input/DevTreks OLS 1/2147397534/none
http://localhost:5000/greentreks/preview/carbon/inputseries/Example 6, Regression/2147380293/none
Stock datasets
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7951/Ex6.csv
http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1749/Ex6.csv
M&E datasets
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_9108/Ex6.csv
http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1875/Ex6.csv
This algorithm uses regression analysis to estimate probable values for an indicator’s QTM property (i.e. dependent variable), given its Qx values (i.e. independent variables). It also makes predictions about that value. At the Score level, the algorithm uses whatever indicator properties are used in the Score.MathExpression and Score.MathType to generate Score, ScoreM, ScoreL, and ScoreU properties. Up to 10 explanatory variables can be analyzed. Use algorithm 2 or 3 when more explanatory variables need to be analyzed. Multiple base elements can be calculated at one time using standard techniques.
An indicator’s QT Amount is the dependent variable, y, of a typical regression expressed by the following equation. The x variables that will be estimated and predicted are added to the last 3 rows of datasets and serve as a scoring dataset. All of the data being analyzed must be added to a comma-separated-value file found using the Data URL property.
y = b0 + b1x1 + … + bnxn
In the context of regression analysis and CTA, the probabilistic model for any particular observed value of y is:
y = (mean value of y for a given value of xi) + (random error)
y = b0 + b1x1 + … + bnxn + e
This version requires all data being analyzed to be transformed (log, negative numbers, polynomials, weights) before being used in the algorithm.
More than one indicator can be used in the regression analysis by using appropriate Math Expressions and dataset columns.
The Indicator.MathExpression terms must correspond to column names used in the dataset. The math terms must use the standard Ix.Qx with a required “.commoncolname” suffix. The “commoncolname” suffix must exactly match a dataset column name, without the period delimiter. The following two Math Expressions show that this convention allows columns in the dataset, and Ix.Qx variables, to be ignored by not including their name in the Math Expression.
Math Expression with 2 variables in 1 indicator:
I1.Q1.housesize1 + I1.Q2.housesize2
Math Expression with 10 variables in 1 indicator (calculators don’t display Q6 to Q10 properties, but datasets can contain up to 10 independent variables).
I1.Q1.housesize1 + I1.Q2.housesize2 + I1.Q3.locationamenity1 + I1.Q4.locationamenity2 + I1.Q5.locationamenity3 + I1.Q6.lotsize1 + I1.Q7.lotsize2 + I1.Q8.meansalesprice + I1.Q9.constructionqualityrating + I1.Q10.locationrating
The first example derives from a college text book (Mendenhall and Sincich, 1989). The text book example provided datasets, regression examples, mathematical matrix examples, and SAS reports, needed to develop and test every major feature in this version of the algorithm. The dependent variable, y, is household monthly energy use. The first independent variable, x1, is size of house. The second independent variable, x2, is size of house squared.
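The Ix.Qx.commoncolname convention can be sketched as a lookup from Math Expression terms to dataset columns. This Python sketch is a minimal illustration, not the DevTreks parser. The regular expression, the function name, and the sample header row (including its `energyuse` and `unused` columns) are hypothetical.

```python
import re

# Matches terms such as I1.Q2.housesize2 (indicator, Q number, column name).
TERM = re.compile(r"I(\d+)\.Q(\d+)\.(\w+)")

def expression_columns(expression, header):
    """Return (indicator, q, column index) for each term in the expression."""
    names = [h.strip() for h in header.split(",")]
    terms = []
    for indicator, q, column in TERM.findall(expression):
        if column not in names:  # the suffix must exactly match a column name
            raise ValueError(f"no dataset column named {column!r}")
        terms.append((int(indicator), int(q), names.index(column)))
    return terms

header = "label,date,energyuse,housesize1,housesize2,unused"  # hypothetical header
print(expression_columns("I1.Q1.housesize1 + I1.Q2.housesize2", header))
```

Columns that no term names, such as `unused` here, are simply never selected, which is how the convention lets dataset columns be ignored.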
Abstract equation: y = B0 + B1x1 + B2x1^2 + e
Math Expression: I1.Q1.housesize1 + I1.Q2.housesize2
The following images display the completed properties.
These properties are set as follows:
* Label: For Stock calculators, must correspond to the Label used in the Joint Data URL property. For M&E calculators, the Label does not need to correspond, but the Indicator must be in the Index position identified in the Score URL dataset.
* Description: Explanation of the regression equation being analyzed (y = B0 + B1x1 + B2x1^2 + e) and the results.
* Distribution Type: none. This algorithm automatically generates confidence intervals for the estimated QTM. The Math Result includes additional confidence intervals for both estimated and predicted QTM amounts.
* Q1 to Q5 Amounts: Set of independent, or explanatory, variables used to estimate and predict the dependent variable, QTM. These are not actually used in any calculation –the last 3 rows of data are used to “score” the model. The last row of data is used to complete the QTM, QTL, and QTU properties. Because Q1 to Q5 are displayed in Stock Total Analyses, set them equal to the five most significant independent variables in the last dataset row. In this example, the calculation is trying to estimate and predict the energy use for a 1500 square foot house.
* Q1 to Q5 Units: Units of measurement for the independent variables. They do not need to match their corresponding dataset names.
* QT Amount and Unit: Prior guess about the amount of energy use associated with the independent variables. Every dataset must include data for the QT Amount property as the observed values of the dependent variable.
* Math Type: algorithm1, Sub Math Type: subalgorithm6 (Regression): Each indicator uses regression analysis to generate its QTM, QTL, and QTU properties.
* URL: Regressions are run for indicators found in datasets, not for individual indicators. The regression examples demonstrate the required data conventions. The Resource Stock Calculation reference explains the following conventions more thoroughly.
Version 2.1.4 and 2.1.6 refactored examples 1 and 5 to 8 by placing greater emphasis on using the Indicator.URL property to store datasets. The 1st column of data in the following dataset uses the standard R and Python data convention of including row identifiers (i.e. 1, 2, 3, …) in the label column.
The actual data starts on the second line. The third column, which supports a custom data column, has been left blank because it is not being used. The last 3 rows of data are not used in the regression analysis but are used to score the model –in this case, in a sensitivity analysis of the confidence interval. The last row of data is used to complete the QTM, QTL, and QTU properties.
label,date,none,energyuse,housesize1,housesize2
1,1/30/2015,,1182,1290,1664100
2,2/30/2015,,1172,1350,1822500
3,3/30/2015,,1264,1470,2160900
The previous pattern of using the Score.DataURL property can still be used. In that case, the label column must correspond to an Indicator.Label.
label,date,none,energyuse,housesize1,housesize2
A1,1/30/2015,,1182,1290,1664100
A1,2/30/2015,,1172,1350,1822500
A1,3/30/2015,,1264,1470,2160900
If the Score.DataURL property is being used with the M&E calculators, the following dataset shows that the label column must store an integer identifying the index position of an indicator.
index,date,none,energyuse,housesize1,housesize2
1,1/30/2015,,1182,1290,1664100
1,2/30/2015,,1172,1350,1822500
Math Expression: In the case of regression, this property is used to identify which columns of data in the TEXT file, and which independent variables, to include in the analysis. It’s not actually parsed and run independently. The algorithm only uses the generic y = b0 + b1x1 + ... + bnxn expression to run the regression. If a column name is not found in the Math Expression terms, that data column is ignored in the regression analysis. Each column of data in the TEXT file being analyzed must be transformed appropriately in the dataset.
In the following Math Expression, the I1.Q2.housesize2 term is represented by a column of independent variable data that has already been raised to the second power.
I1.Q1.housesize1 + I1.Q2.housesize2
The terms used in the Math Expression must end in a corresponding column name used in the TEXT data file. They must start with the conventional Ix.Qx syntax. The following column names are used in the TEXT file:
housesize1, housesize2
* QTM Amounts: The QTM Amount will be calculated from the algorithm as the estimated value of the dependent variable. The last row of data in the TEXT file is used to set QTM, QTL, and QTU properties.
* QTL and QTU Amounts: These properties will be calculated from the algorithm as the lower and upper x% confidence interval for the estimated QTM Amount. The calculation uses a 2 sided T statistic test (t025). The Math Results display confidence intervals for both the estimated and predicted amounts.
* Math Result: This property includes standard descriptive statistics for regression analysis, including T statistics for each coefficient and R squared and F statistics for the estimated QTM. It also includes confidence intervals for estimated and predicted amounts for the Qx variables for the last 3 rows of data in the Data URL Text file. The results matched the referenced text book. The CTA 2 reference demonstrates that the results also match R and Python regression results.
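The dataset conventions described in these properties (a header row, the dependent variable in the fourth column, explanatory variables in the remaining columns, and the last 3 rows reserved for scoring) can be sketched in Python as follows. The function name split_dataset is hypothetical. Rows 1 to 3 are the rows shown in this reference; rows 4 to 6, including their placeholder energyuse values of 0, are hypothetical scoring rows added only so the sketch runs.

```python
import csv
import io

# Rows 1-3 come from this reference; rows 4-6 are HYPOTHETICAL scoring rows.
CSV_TEXT = """label,date,none,energyuse,housesize1,housesize2
1,1/30/2015,,1182,1290,1664100
2,2/30/2015,,1172,1350,1822500
3,3/30/2015,,1264,1470,2160900
4,4/30/2015,,0,1400,1960000
5,5/30/2015,,0,1450,2102500
6,6/30/2015,,0,1500,2250000
"""

def split_dataset(csv_text, scoring_rows=3):
    """Split a dataset into training rows and the trailing scoring rows.

    Each parsed row is (y, [x1, x2, ...]); y is read from the fourth column
    and the explanatory variables from the fifth column onward.
    """
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header, data = rows[0], rows[1:]
    y_name, x_names = header[3], header[4:]
    parsed = [(float(r[3]), [float(v) for v in r[4:]]) for r in data]
    return y_name, x_names, parsed[:-scoring_rows], parsed[-scoring_rows:]

y_name, x_names, training, scoring = split_dataset(CSV_TEXT)

print(y_name, x_names)
print(len(training), "training rows;", len(scoring), "scoring rows")
print("row used for QTM, QTL, QTU:", scoring[-1][1])
```

In this sketch the final scoring row describes the 1500 square foot house mentioned above, which is the row the reference says fills in QTM, QTL, and QTU.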
The following images demonstrate the Score properties.
The following properties show how Stock Score properties are set.
* Score.MathExpression: The following expression is just the result of the Indicator 1 QTM:
I1.QTM
* Score Math Type Properties: none. The Score can be expressed as a confidence interval by using the properties explained in the previous examples, including this regression example (i.e. use the Score.JointDataURL to store the dataset).
The calculator uses the following steps:
* Step 1. Run an asynchronous loop that simultaneously iterates through each Indicator. Parse the data and build a vector of dependent variables (y) and a matrix of independent variables (xn). Errors with datasets will be added to the Calculator.Description property. The memory requirements for this technique must be considered when deciding which algorithm to use in a CTA.
* Step 2. Pass the data to the regression algorithm and run a regression for the indicator corresponding to the iteration loop. Errors with regression calculations will be added to the Math Result property of each indicator. To the extent possible, algorithms with multiple datasets run their calculations asynchronously and simultaneously.
* Step 2a. Use one or more indicator Q1 to Q5 properties as a set of independent variables from which a dependent variable can be estimated and predicted. Generate the dependent variable (QTM) and x% confidence intervals (QTL and QTU) and Score.ConfidenceInterval % prediction intervals for these specific values.
* Step 3. Add the results of the regression to each Indicator’s Math Result, QTM, QTL, and QTU properties.
* Step 4. Set the Score properties from the Indicators’ properties. Score results may not always be meaningful if the regressions are primarily being run for their own sake, rather than for generating a Score. The Score is always reported, so we recommend using some meaningful combinations of Indicators.
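The flow of these steps can be sketched in Python with asyncio. This is only an illustration under stated assumptions, not the calculator's code: the function names, the result keys, and both datasets are hypothetical, a single explanatory variable is used for brevity, and the confidence intervals (QTL and QTU) are omitted.

```python
import asyncio

def simple_ols(xs, ys):
    """Least-squares fit of y = b0 + b1*x for one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

async def run_indicator(label, xs, ys, score_x):
    """Steps 1 to 3 for one Indicator: fit, score the last row, store results."""
    try:
        # Run the regression in a worker thread so Indicators proceed concurrently.
        b0, b1 = await asyncio.to_thread(simple_ols, xs, ys)
    except ZeroDivisionError as err:
        # Regression errors are reported in the Indicator's Math Result.
        return label, {"MathResult": f"regression error: {err}"}
    return label, {"QTM": b0 + b1 * score_x, "MathResult": f"b0={b0:.3f}, b1={b1:.3f}"}

async def run_all(datasets):
    results = await asyncio.gather(*(run_indicator(*d) for d in datasets))
    return dict(results)

# HYPOTHETICAL datasets: (label, x values, y values, x value of the scoring row).
datasets = [
    ("A1", [1.0, 2.0, 3.0, 4.0], [5.0, 8.0, 11.0, 14.0], 5.0),  # exactly y = 2 + 3x
    ("A2", [1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 3.0], 5.0),    # exactly y = 1 + 0.5x
]

indicators = asyncio.run(run_all(datasets))

# Step 4: a Score.MathExpression of I1.QTM simply reuses Indicator 1's QTM.
score = indicators["A1"]["QTM"]
print(indicators)
print("Score:", score)
```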
Example 6a. Algorithm 1. Subalgorithm 6. Probabilistic Statistics: Multiple Regression with Uncertain Costs
URLs
The following Input contains 2 Input Series. One sets OC and AOH Prices. The other sets CAP and AOH Prices. Run the calculation at the Parent Input level and insert the calculator into the 2 children Input Series, then go into 1 of the Series, adjust the “wrong” price, rerun and save the calculation. Remember not to overwrite that child Series when running the Input calculation in the future. If calculations need to be rerun in the future and Input or Output prices must be updated in the base element, calculations must be run at the Series level –just updating them from the parent won’t work unless they are completely overwritten.
https://www.devtreks.org/greentreks/preview/carbon/input/DevTreks OLS 3/2147397536/none
The following Operation uses the Input with the OC and AOH price adjustments.
https://www.devtreks.org/greentreks/preview/carbon/operation/Example 1I, Regression/2091557278/none
The following Component uses the Input with the CAP and AOH price adjustments.
https://www.devtreks.org/greentreks/preview/carbon/component/CTA Regress 1/1194/none
The following dataset is used with the regression analysis.
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7973/Ex6eshipping.csv
http://localhost:5000/greentreks/preview/carbon/input/Example 1hb, Regression/2147409823/none
Score datasets
http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1779/Ex6eshipping.csv;http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1777/Ex6dshipping.csv
M and E datasets
http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1881/Ex6eshipping.csv;http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1882/Ex6dshipping.csv
This example extends Example 1i by calculating uncertain Input costs. The shipping cost dataset in that example will be used to demonstrate how to update an Input’s quantitative properties. Two Indicators will be added to an Input to demonstrate how to update an Input’s allocated overhead costs and operating costs. In the context of CTA, a logical way to use this technique is to calculate carbon prices and damages associated with climate change.
Only the Stock calculators can be used to update base Input and Output element properties. Although the M&E calculators include the BaseIO property, the current version does not use that property to update the underlying base element properties.
The following image demonstrates that the BaseIO property for the first Indicator is updating the Input.OCPrice. The amount of the Input will be changed once the Input is added to an Operation or Component and operating costs will be calculated by multiplying the fixed Input.OCPrice by the adjusted Input.Amount. The Resource Stock and M&E Calculation references recommend using “Unit Inputs and Outputs” so that these elements can be reused in any budget. In order to update the base Input with a “Unit OCPrice”, the independent variables’ amounts have been changed from the amounts used with Indicator 1 so that the Input.OCPrice corresponds to an Input.OCUnit of “shipping price per pound per mile”. That required changing the independent variables in the scoring dataset (the last 3 rows of data). For consistency, the five most important independent variables were added to the Qx properties, but these properties are only used in Resource Stock Totals Analyses.
Although not shown, the second Indicator had the exact same properties as the first Indicator except the Label matches its dataset and the BaseIO property is set to aohprice so that the Input.AOHPrice can be updated to calculate uncertain allocated overhead costs.
The following image displays the updated $0.32 Input.OCPrice and $0.32 Input.AOHPrice. The calculator automatically filled in these properties. No unit in the OC Unit list matches the actual “shipping price per pound per mile” unit so a default unit of “each” has been set.
The following image demonstrates that after this Input has been added to an Operation, the Input’s OCAmount and AOHAmount are changed from 0 to 500 (i.e. 5 pounds * 100 miles) to calculate the full Input costs. This is an important step in any technology assessment because it defines the exact nature of the technology.
The following image displays a Resource Stock Total Analysis for an Operation containing this Input. The total operating cost for this Input is the QT Most Amount shown for the first Indicator ($159.50 ≈ 500 * 0.32). The uncertainty for this cost is defined using the QT Low (-$477.65) and QT High ($796.65) Amounts. The uncertainty of the Allocated Overhead costs is displayed in the second Indicator.
The image also demonstrates that the Input.OCAmount is used as a stock multiplier when conducting Operation and Operating Budget Stock Analysis (Qxs = 500). For simplicity, all of the final calculated quantities are multiplied by 500. The Input.AOHAmount is never used as a stock multiplier. Certain types of allocated overhead costs may need to be dealt with differently in future upgrades.
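The stock multiplier arithmetic can be checked with a short Python sketch that works backward from the totals reported above. Dividing the reported totals by the multiplier of 500 suggests an unrounded unit price of about $0.319, which would display as $0.32; that reading is an inference from the numbers, not something the analysis states. The variable names are illustrative.

```python
# Input.OCAmount used as the stock multiplier: 5 pounds * 100 miles.
oc_amount = 5 * 100

# Totals reported in the Resource Stock Total Analysis for the first Indicator.
qt_most, qt_low, qt_high = 159.50, -477.65, 796.65

# Per-unit amounts implied by dividing each total by the multiplier.
unit_most = qt_most / oc_amount
unit_low = qt_low / oc_amount
unit_high = qt_high / oc_amount

# The interval is symmetric around the most likely amount.
half_width = (qt_high - qt_low) / 2

print("multiplier:", oc_amount)
print("implied unit price:", round(unit_most, 3), "displayed as", round(unit_most, 2))
print("implied unit interval:", round(unit_low, 4), "to", round(unit_high, 4))
print("half width of the total interval:", round(half_width, 2))
```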
Most persons using this analysis to make decisions about how to ship packages, or reduce climate change damages, will not be looking at this raw data. They’ll be looking at the summarized tables and graphics of the analysis that have been referenced using the Media URL property. Those communication aids will include an explanation for all uncertain costs and benefits.
The following image displays the associated Net Present Value calculation for this Operation.
Two Indicators were added to a sibling Input Series to demonstrate how to update an Input’s allocated overhead costs and capital costs. The only change made to the previous example is to set Indicator 1’s BaseIO property to caprice, rather than ocprice. The following Resource Stock Total Analysis and Net Present Value Analysis for a Component that uses this Input display the resultant allocated overhead costs, capital costs, and stock totals. The Resource Stock Totals analysis for this Component is used to communicate the uncertainty of these costs.
The Resource Stock Totals Analysis also demonstrates that the Input.CAPAmount is used as a stock multiplier when conducting Component and Capital Budget Stock Analysis (Qxs = 500). The Input.AOHAmount is never used as a stock multiplier.
Example 6b. Algorithm 1. Subalgorithm 6. Probabilistic Statistics: Multiple Regression with 10 Explanatory Variables and Uncertain Performance Measures
URLs
https://www.devtreks.org/greentreks/preview/carbon/input/DevTreks OLS 4/2147397541/none
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7974/ozone1.csv
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7975/Cost-Per-Unit-Pollution-cdf.PNG
http://localhost:5000/greentreks/preview/carbon/input/Example 1hc, Regression/2147409830/none
Datasets
http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1778/ozone1.csv
http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1880/ozone1.csv
This example demonstrates how to include up to 10 explanatory variables in a regression and to calculate the uncertainty of Performance Measures. The 10 variable limit is arbitrary –the R and Python algorithms don’t impose this restriction. The following technique can be used when using more than 5 explanatory variables in an analysis:
1. Q6 to Q10 terms in Math Expressions: Although calculators don’t contain Q6 to Q10 properties, these terms can still be used in Math Expressions to identify columns of data to include in analyses.
The following dataset contains 10 explanatory variables and 330 observations. The dependent variable is the amount of ozone measured in the atmosphere. The explanatory variables are weather and other factors influencing ozone concentrations. This algorithm uses the last 3 rows of data in the TEXT Data URL dataset to score the statistical model. The remaining rows are used to train the statistical model. The last row of data is used to fill in the QTM, QTL, and QTU properties. This is the same way that algorithms 2 and 3 use datasets.
Indicator.URL datasets:
label,date,location,O3,vh,wind,humidity,temp,ibh,dpg,ibt,vis,doy,ampm
1,5/15/2015,N45'37.75W121'46.25,3,5710,4,28,40,2693,-25,87,250,33,0
2,5/16/2015,N45'37.75W121'46.26,5,5700,3,37,45,590,-24,128,100,34,0
3,5/17/2015,N45'37.75W121'46.27,5,5760,3,51,54,1450,25,139,60,35,0
4,5/18/2015,N45'37.75W121'46.28,6,5720,4,69,35,1568,15,121,60,36,0
Score.DataURL datasets (label column must correspond to Indicator.Label for Stocks or Indicator index position for M&E):
label,date,location,O3,vh,wind,humidity,temp,ibh,dpg,ibt,vis,doy,ampm
O3,5/15/2015,N45'37.75W121'46.25,3,5710,4,28,40,2693,-25,87,250,33,0
O3,5/16/2015,N45'37.75W121'46.26,5,5700,3,37,45,590,-24,128,100,34,0
O3,5/17/2015,N45'37.75W121'46.27,5,5760,3,51,54,1450,25,139,60,35,0
O3,5/18/2015,N45'37.75W121'46.28,6,5720,4,69,35,1568,15,121,60,36,0
The Math Expression for Indicator1 identifies the 10 columns of data to include in the analysis.
I1.Q1.vh + I1.Q2.wind + I1.Q3.humidity + I1.Q4.temp + I1.Q5.ibh + I1.Q6.dpg + I1.Q7.ibt + I1.Q8.vis + I1.Q9.doy + I1.Q10.ampm
The following image displays the results of this regression. In the context of CTA, a logical way to use this technique is to calculate the uncertainty of a Pollution Index (Score) that tracks several types of emission Indicators.
To demonstrate using this technique to communicate the uncertainty of Performance Measures, a cost Indicator (I2) value from the previous example has been added to the Score.MathExpression to calculate uncertain operating costs. In practice, the cost Indicator is included as a separate Indicator and the Score.MathExpression includes the calculated cost. The following image demonstrates that the Score has been used to calculate the uncertainty of a Performance Measure –in this case, Cost per Unit Pollution Index. The Math Expression shows that the Cost Indicator (I2) is being divided by the Emissions Indicator (I1).
The following image demonstrates one way to communicate the results of this type of Performance Measure (the GAO 2009 and NASA 2011 references explain using this type of cumulative distribution function to communicate the uncertainty of costs to decision makers).
Example 8. Algorithm 1. Subalgorithm 8. Differences among Means: Analysis of Variance (ANOVA)
This algorithm analyzes whether different subgroups of experimental data have statistically significant differences among their means. It also generates confidence intervals showing the exact difference between the means of specific subgroups of data. More explanation is offered for this algorithm because it introduces the analysis of randomized experimental data –an important characteristic of many CTAs. Three different experimental data designs can be used with the algorithm:
1. Example 1ma. Completely randomized data: The explanatory variables, or factors, are categorized into different levels, or treatments. The null hypothesis that the treatment means are equal is tested against the hypothesis that at least two of the means differ.
2. Example 1mb. Randomized block data: Besides treatments, blocks are used to further subdivide the data being analyzed. The null hypothesis that the treatment and/or block means are equal is tested against the hypothesis that at least two of the means differ.
3. Example 1mc. Randomized factorial data: Besides treatments and blocks, usually referred to as factors and levels, interactions between all factors are used to further analyze the data. The null hypothesis that the factor, level, and interaction effects means are equal is tested against the hypothesis that at least two of the means differ.
4. Other random experimental data: Examples of additional ANOVA techniques such as split-plot and covariance, are not included yet.
Each of the following 3 examples is taken from a college text book (Mendenhall and Sincich, 1989) demonstrating ANOVA. Because the Score and Indicator properties are similar to the regression algorithm (subalgorithm6), only selected properties for the 3 examples will be presented. All of the examples demonstrate running the first 3 ANOVAs:
1. Method 1. Standard ANOVA: This method carries out the analysis without using matrix mathematics and requires uniform sizes for treatments (and blocks, factors, levels). The primary test statistics for discerning differences among means are F statistics.
2. Method 2. Regression ANOVA: This method carries out the analysis using regression analysis. The test statistics for discerning differences among means are F statistics and coefficient t-ratios.
As of 1.9.2, the following 2 features are being retained, but not debugged, until more advanced RCT algorithms are developed that can either replace, or enhance, this feature.
3. [Method 3. Resource Stock Analysis ANOVA: Unlike Method 1 and Method 2, the treatments and factors being analyzed are contained in different base elements. The data from all of the base elements being analyzed is combined into 1 data file and that file is used to carry out the analysis. Unlike standard Resource Stock Analysis, which analyzes differences in totals, these analyses examine the statistical differences among the treatment and factor means.
4. Method 4. DevPacks Resource Stock Analysis ANOVA: Version 1.9.2 deemed this less important than other algorithms at this time (i.e. CTA Prevention), so debugging is put off until the next release. Appendix B, DevPacks Stock Analysis, in the Resource Stock Analysis reference demonstrates how to run these same analyses using DevPacks.]
Example 8a. Completely Randomized Data
URLs:
https://www.devtreks.org/greentreks/preview/carbon/input/Example 1ma, Anova, Complete Randomized/2147397543/none
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7983/Anova1.csv
https://www.devtreks.org/greentreks/preview/carbon/input/Example 1ma2, Anova, Complete Randomized/2147397544/none
https://www.devtreks.org/greentreks/preview/carbon/input/Example 1ma3, Anova, Complete Randomized/2147397545/none
http://localhost:5000/greentreks/preview/carbon/input/Example 1ma, Anova, Complete Randomized/2147409832/none
Datasets (Version 2.1.6 no longer requires Indicator.Labels in the 1st column of data when the Indicator.URLs hold the datasets)
http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1784/Anova1.csv
http://localhost:5000/resources/network_carbon/resourcepack_526/resource_1877/Anova1.csv
Method 1: Standard ANOVA: The following dataset shows a subset of the amount of debt owed (the y column) by delinquent credit card customers for 3 different income groups: A = under $12,000, B = $12,000 to $25,000, and C = over $25,000. The debts have been randomly selected from the 3 different groups. This algorithm tests whether the 3 different groups owe significantly different mean amounts of debts. The 3 income groups are referred to as treatments or levels. This algorithm runs a standard ANOVA when the following conventions are followed:
1. Treatments: The column name for the fifth column of data must be “treatment”. The data in the column must be doubles (or integers) that distinguish each treatment.
2. Training Data: The column name for the fourth column of data can follow standard conventions for dependent, or output, variable data. The data in the column must be doubles that contain the treatment observed data.
3. Scoring Data: In order to stay consistent with other algorithms, the final 3 rows of data must be scoring data. These 3 rows are not currently used by this algorithm.
4. Indicator Math Expression: The expression should be identical to regression analysis expressions (I1.Q1.treatment)
Stock dataset (The label column can be row identifiers when Indicator.URLs are used)
label,date,income,y,treatment
D1,12/3/2015,12,148,1
D1,12/4/2015,12,76,1
D1,12/5/2015,12,393,1
D1,12/6/2015,12,520,1
D1,12/7/2015,12,236,1
D1,12/8/2015,12,134,1
D1,12/9/2015,12,55,1
D1,12/10/2015,12,166,1
D1,12/11/2015,12,415,1
D1,12/12/2015,12,153,1
D1,12/3/2015,12to25,513,2
D1,12/4/2015,12to25,264,2
M&E dataset (the same convention is followed in the remaining ANOVA datasets)
index,date,income,y,treatment
1,12/3/2015,12,148,1
1,12/4/2015,12,76,1
The following image shows the Math Result for the Indicator associated with this dataset. The results for the statistical tests and confidence intervals are explained in the following list. These results matched the reference text book.
1. F Test: The F Test Statistic, 3.48, is greater than the critical value for F at .05 = 3.35. The means are significantly different (at Score.ConfidenceLevel = 95). The p-value for the F statistic in the regression analysis also verifies this statistical significance.
2. Treatment 1 Mean (< 12,000 income): The first row of the ANOVA confidence interval shows that the mean for treatment 1, 229.6, lies somewhere in the interval 119.98 to 339.2. The terms used in the confidence interval, “base” and “xminus1”, derive from the Change by Resource Stock analyzers; Method 3, below, demonstrates how they get displayed during a Change by Id Resource Stock Analysis.
3. Treatment 2 Mean (12,000 to 25,000 income) - Treatment 1 Mean (< 12,000 income): The second and third rows of the ANOVA confidence interval show that the mean for the second treatment minus the mean for the first treatment, 80.3, lies somewhere in the interval -74.73 to 235.33.
4. Treatment 3 Mean (> 25,000 income) - Treatment 2 Mean (12,000 to 25,000 income): The third row of the ANOVA confidence interval shows that the mean for the third treatment minus the mean for the second treatment, 117.90, lies somewhere in the interval -37.13 to 272.93.
5. Treatment 3 Mean (> 25,000 income) - Treatment 1 Mean (< 12,000 income): The fifth row of the ANOVA confidence interval shows that the mean for the third treatment minus the mean for the first treatment, 198.20, lies somewhere in the interval 43.17 to 353.23.
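The results in this list can be reproduced, to rounding, with a minimal Python sketch of a one-way ANOVA. It uses the 30 training observations listed in the Method 2 dataset below (the 3 scoring rows are left out). The critical t value of 2.052 for 27 degrees of freedom is taken from a standard t table and is an assumption of this sketch; the calculator may use a more precise value, so interval endpoints can differ in the second decimal place.

```python
from statistics import mean

# Debt owed (y) by treatment (income group); the 3 scoring rows are excluded.
groups = {
    1: [148, 76, 393, 520, 236, 134, 55, 166, 415, 153],
    2: [513, 264, 433, 94, 535, 327, 214, 135, 280, 304],
    3: [335, 643, 216, 536, 128, 723, 258, 380, 594, 465],
}
T_025_27 = 2.052  # assumed table value: two-sided 95% t statistic, 27 df

n_total = sum(len(g) for g in groups.values())
grand_mean = sum(sum(g) for g in groups.values()) / n_total
means = {k: mean(g) for k, g in groups.items()}

# Sum of squares for treatments and for error.
sst = sum(len(g) * (means[k] - grand_mean) ** 2 for k, g in groups.items())
sse = sum((y - means[k]) ** 2 for k, g in groups.items() for y in g)
df_treatments = len(groups) - 1
df_error = n_total - len(groups)
mse = sse / df_error
f_stat = (sst / df_treatments) / mse

def mean_interval(k):
    """Confidence interval for a single treatment mean."""
    half = T_025_27 * (mse / len(groups[k])) ** 0.5
    return means[k] - half, means[k] + half

def difference_interval(a, b):
    """Difference of two treatment means (a minus b) with its confidence interval."""
    diff = means[a] - means[b]
    half = T_025_27 * (mse * (1 / len(groups[a]) + 1 / len(groups[b]))) ** 0.5
    return diff, diff - half, diff + half

print("F statistic:", round(f_stat, 2))
print("treatment 1 mean interval:", mean_interval(1))
print("treatment 2 - treatment 1:", difference_interval(2, 1))
print("treatment 3 - treatment 2:", difference_interval(3, 2))
print("treatment 3 - treatment 1:", difference_interval(3, 1))
```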
Method 2: Regression ANOVA: The following dataset replaces the “treatment” column with columns that use any other acceptable name (such as x2 and x3). This tells subalgorithm8 to run the model as a regression analysis. The column x2 is coded 1 when the treatment, or income group, is 12,000 to 25,000; otherwise it is coded 0. The column x3 is coded 1 when the treatment, or income group, is over 25,000; otherwise it is coded 0.
Stock dataset (M&E datasets use the conventions explained for DataURL datasets)
label,date,income,y,x2,x3
D1,12/3/2015,12,148,0,0
D1,12/4/2015,12,76,0,0
D1,12/5/2015,12,393,0,0
D1,12/6/2015,12,520,0,0
D1,12/7/2015,12,236,0,0
D1,12/8/2015,12,134,0,0
D1,12/9/2015,12,55,0,0
D1,12/10/2015,12,166,0,0
D1,12/11/2015,12,415,0,0
D1,12/12/2015,12,153,0,0
D1,12/3/2015,12to25,513,1,0
D1,12/4/2015,12to25,264,1,0
D1,12/5/2015,12to25,433,1,0
D1,12/6/2015,12to25,94,1,0
D1,12/7/2015,12to25,535,1,0
D1,12/8/2015,12to25,327,1,0
D1,12/9/2015,12to25,214,1,0
D1,12/10/2015,12to25,135,1,0
D1,12/11/2015,12to25,280,1,0
D1,12/12/2015,12to25,304,1,0
D1,12/13/2015,25,335,0,1
D1,12/14/2015,25,643,0,1
D1,12/15/2015,25,216,0,1
D1,12/16/2015,25,536,0,1
D1,12/17/2015,25,128,0,1
D1,12/18/2015,25,723,0,1
D1,12/19/2015,25,258,0,1
D1,12/20/2015,25,380,0,1
D1,12/21/2015,25,594,0,1
D1,12/22/2015,25,465,0,1
D1,12/23/2015,12,500,0,0
D1,12/24/2015,12to25,250,1,0
D1,12/25/2015,25,375,0,1
The following results demonstrate that the regression ANOVA returns the same results as the standard ANOVA.
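The equivalence between the two methods can be illustrated with a minimal Python sketch, assuming ordinary least squares solved through the normal equations (the calculator's own numerical method is not shown in this reference). With the x2 and x3 dummy coding above, the intercept equals the treatment 1 mean and the two slopes equal the differences of the treatment 2 and treatment 3 means from the treatment 1 mean. The helper names solve and ols are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(rows, ys):
    """Ordinary least squares with an intercept, via the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve(xtx, xty)

# Training observations from the dataset above, grouped by income treatment.
y_by_group = [
    [148, 76, 393, 520, 236, 134, 55, 166, 415, 153],    # x2 = 0, x3 = 0
    [513, 264, 433, 94, 535, 327, 214, 135, 280, 304],   # x2 = 1, x3 = 0
    [335, 643, 216, 536, 128, 723, 258, 380, 594, 465],  # x2 = 0, x3 = 1
]

rows, ys = [], []
for index, values in enumerate(y_by_group):
    for y in values:
        rows.append((1.0 if index == 1 else 0.0, 1.0 if index == 2 else 0.0))
        ys.append(float(y))

b0, b1, b2 = ols(rows, ys)
print("b0 (treatment 1 mean):", round(b0, 2))
print("b1 (treatment 2 - treatment 1):", round(b1, 2))
print("b2 (treatment 3 - treatment 1):", round(b2, 2))
```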
Method 3: Resource Stock Analysis ANOVA: As of 1.9.2, the following feature is being retained, but not debugged, until more advanced RCT algorithms are developed that can either replace, or enhance, this feature. The following 3 datasets are subsets of the data used in the standard ANOVA. Each dataset has been added to 3 sibling Input Series. Each Input Series and dataset represent a separate experimental treatment. The 3 rows of scoring data must be added to the last dataset (Series 3) in these types of analyses.
Stock dataset (M&E datasets use the conventions explained for DataURL datasets)
Input Series 1
label,date,income,y,treatment
D1,12/3/2015,12,148,1
D1,12/4/2015,12,76,1
D1,12/5/2015,12,393,1
D1,12/6/2015,12,520,1
D1,12/7/2015,12,236,1
D1,12/8/2015,12,134,1
D1,12/9/2015,12,55,1
D1,12/10/2015,12,166,1
D1,12/11/2015,12,415,1
D1,12/12/2015,12,153,1
Input Series 2
label,date,income,y,treatment
D1,12/3/2015,12to25,513,2
D1,12/4/2015,12to25,264,2
…
Input Series 3 (the 3 rows of scoring data must be added to the last dataset but are not currently used)
label,date,income,y,treatment
D1,12/13/2015,25,335,3
D1,12/14/2015,25,643,3
…
The following image shows that the properties of Indicator 1 in Input Series 1 have been set to a) cancel out the non-mathematical treatment column (I1.Q1.treatment - 1) or (I1.Q1.treatment * 0), and b) use the dataset’s y, or dependent variable, column (+ I1.QT) to generate descriptive statistics. The properties of Input Series 2 and 3 were set in a similar manner (but subtracting 2 and 3 respectively in the expression, or multiplying by 0).
The following image demonstrates that running a Change by Id Resource Stock Analysis required setting the following analyzer properties (new in 1.8.8):
* Math Type = algorithm1
* Math Sub Type = subalgorithm8
* Math Expression = I1.QTM.treatment
* Confidence Interval = 95
The analysis fills in the following properties:
* Data = the data from each dataset is combined behind the scenes into 1 data file. All datasets are assumed to have the same column names and data content context.
* Data Column Names = the column names for the combined dataset are taken from the first dataset.
* Math Result: depending on the column names, displays the same results as the standard or regression ANOVA
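The way these properties are filled in can be sketched in Python as follows. This is an illustration only, not the analyzer's code: the function name combine_datasets is hypothetical, and each series below contains only the first two rows shown in this reference, not the full datasets.

```python
import csv
import io

# Only the rows shown in this reference; the full datasets contain more rows.
SERIES_1 = "label,date,income,y,treatment\nD1,12/3/2015,12,148,1\nD1,12/4/2015,12,76,1\n"
SERIES_2 = "label,date,income,y,treatment\nD1,12/3/2015,12to25,513,2\nD1,12/4/2015,12to25,264,2\n"
SERIES_3 = "label,date,income,y,treatment\nD1,12/13/2015,25,335,3\nD1,12/14/2015,25,643,3\n"

def combine_datasets(csv_texts):
    """Combine sibling datasets into one, taking column names from the first dataset."""
    combined_header = None
    combined_rows = []
    for text in csv_texts:
        rows = [r for r in csv.reader(io.StringIO(text)) if r]
        header, data = rows[0], rows[1:]
        if combined_header is None:
            combined_header = header
        elif len(header) != len(combined_header):
            # All datasets are assumed to share the same columns.
            raise ValueError("datasets must have the same number of columns")
        combined_rows.extend(data)
    return combined_header, combined_rows

column_names, data_rows = combine_datasets([SERIES_1, SERIES_2, SERIES_3])
print(column_names)
print(len(data_rows), "combined rows")
```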
The following image demonstrates that running a Change by Id Resource Stock Analysis returns the identical Score properties as standard Resource Stock Analyses, but the Indicators compare the differences between the statistical means for each treatment. They use the same calculations as the ones used to set the confidence intervals for the standard ANOVA and regression ANOVA, but use the standard comparators used by the Resource Stock analyzers. They also use the same stylesheet as the standard analyses, which is interpreted as follows:
1st Column
* Indicator.Total = Indicator.Mean
* Indicator.AmountChange = F Statistic for all Treatments
* Indicator.PercentChange = F Critical Value for all Treatments
* Indicator.BaseChange = Indicator.Mean
* Indicator.BasePercentChange = plus or minus confidence interval for the base mean alone
Remaining Columns
* Indicator.Total = Indicator.Mean
* Indicator.AmountChange = current Indicator Mean – xminus1 Indicator Mean
* Indicator.PercentChange = plus or minus confidence interval for the difference between the xminus1 means
* Indicator.BaseChange = current Indicator Mean – base Indicator Mean
* Indicator.BasePercentChange = plus or minus confidence interval for the difference between the base means
Example 8b. Randomized Block Data
URLs:
https://www.devtreks.org/greentreks/preview/carbon/input/Example 1mb, Anova, Randomized Block/2147397546/none
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7988/Anova2.csv
https://www.devtreks.org/greentreks/preview/carbon/input/Example 1mb2, Anova, Randomized Block/2147397547/none
https://www.devtreks.org/greentreks/preview/carbon/input/Example 1mb3, Anova, Randomized Block/2147397548/none
Method 1: Standard ANOVA: The object of the following dataset is to compare cost estimates for 3 cost estimators. Each cost estimator estimated the costs for the same 4 jobs. Randomized block data requires structuring data by treatments, the 3 cost estimators, and blocks, the 4 job cost estimates. This algorithm runs a standard ANOVA when the following conventions are followed:
1. Treatments: The column name for the fifth column of data must be “treatment”. The data in the column must be doubles (or integers) that distinguish each treatment.
2. Blocks: The column name for the sixth column of data must be “block”. The data in the column must be doubles (or integers) that distinguish each block.
3. Training Data: The column name for the fourth column of data can follow standard conventions for dependent, or output, variable data. The data in the column must be doubles that contain the treatment-block observed data.
4. Scoring Data: In order to stay consistent with other algorithms, the final 3 rows of data must be scoring data. These 3 rows are not currently used by this algorithm.
5. Indicator Math Expression: The expression should be identical to regression analysis expressions (I1.Q1.treatment + I1.Q2.block)
Stock dataset (M&E datasets use the conventions explained for DataURL datasets)
label,estimator,job,y,treatment,block
B1,1,1,4.6,1,1
B1,2,2,6.3,2,2
B1,3,3,5.4,3,3
B1,1,4,6.6,1,4
B1,2,1,4.9,2,1
B1,3,2,5.9,3,2
B1,1,3,5,1,3
B1,2,4,6.8,2,4
B1,3,1,4.4,3,1
B1,1,2,6.2,1,2
B1,2,3,5.4,2,3
B1,3,4,6.3,3,4
B1,1,1,7,1,1
B1,2,2,4,2,2
B1,3,3,5,3,3
The following image shows the Math Result for the Indicator associated with this dataset. The results for the statistical tests and confidence intervals are explained in the following list. These results match those in the reference textbook.
1. F Test Treatments: The F test statistic for the treatments, 4.176, is greater than the critical value for F at .10 = 3.463. The treatment means are significantly different (at Score.ConfidenceLevel = 90).
2. F Test Blocks: The F test statistic for the blocks, 72.464, is greater than the critical value for F at .10 = 3.463. The block means are significantly different (at Score.ConfidenceLevel = 90).
3. Treatment Confidence Intervals: The same confidence intervals are generated as in the completely randomized example, with one exception. Instead of calculating the difference between the 2nd treatment and the 1st treatment means, these confidence intervals subtract the 1st from the 2nd and the 3rd from the 2nd. Differences in blocks are not displayed because the stylesheet shared with the Resource Stock Analysis cannot display them (the analyses all share the same code).
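The F statistics in the list above can be cross-checked by hand. The following is a minimal sketch, not the DevTreks implementation: it assumes the textbook sums-of-squares formulas for a randomized block design with one observation per treatment-block cell, uses only the 12 training rows of the stock dataset, and introduces the hypothetical names ROWS and randomized_block_anova for illustration.

```python
from collections import defaultdict

# Training rows (estimator, job, y) from the stock dataset above.
# The final 3 scoring rows are omitted because the algorithm does not use them.
ROWS = [
    (1, 1, 4.6), (2, 2, 6.3), (3, 3, 5.4), (1, 4, 6.6),
    (2, 1, 4.9), (3, 2, 5.9), (1, 3, 5.0), (2, 4, 6.8),
    (3, 1, 4.4), (1, 2, 6.2), (2, 3, 5.4), (3, 4, 6.3),
]


def randomized_block_anova(rows):
    """Return (F for treatments, F for blocks), one observation per cell."""
    n = len(rows)
    grand_total = sum(y for _, _, y in rows)
    correction = grand_total ** 2 / n  # correction for the mean

    treatment_totals = defaultdict(float)
    block_totals = defaultdict(float)
    for treatment, block, y in rows:
        treatment_totals[treatment] += y
        block_totals[block] += y

    k = len(treatment_totals)  # number of treatments
    b = len(block_totals)      # number of blocks

    ss_total = sum(y * y for _, _, y in rows) - correction
    ss_treat = sum(t * t for t in treatment_totals.values()) / b - correction
    ss_block = sum(t * t for t in block_totals.values()) / k - correction
    ss_error = ss_total - ss_treat - ss_block

    ms_error = ss_error / ((k - 1) * (b - 1))
    f_treat = (ss_treat / (k - 1)) / ms_error
    f_block = (ss_block / (b - 1)) / ms_error
    return f_treat, f_block


f_treat, f_block = randomized_block_anova(ROWS)
print(f"F treatments = {f_treat:.3f}, F blocks = {f_block:.3f}")
```

Under these assumptions the two statistics come out at roughly 4.18 and 72.46, in line with the values reported in the Math Result. The critical values still have to be looked up in an F table or computed with a statistical library.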
Method 2: Regression ANOVA: The following dataset changes the column names from “treatment” and “block” to any other acceptable name (such as t1 and b1). This tells subalgorithm8 to run the model as a regression analysis. The column t1 is coded 0 when the treatment, or cost estimator, is number 1; otherwise it is coded 1. The column b1 is coded 0 when the block, or job estimate, is number 1; otherwise it is coded 1.
Stock dataset (M&E datasets use the conventions explained for DataURL datasets)
label,estimator,job,y,t1,b1
B1,1,1,4.6,0,0
B1,2,2,6.3,1,1
B1,3,3,5.4,1,1
B1,1,4,6.6,0,1
B1,2,1,4.9,1,0
B1,3,2,5.9,1,1
B1,1,3,5,0,1
B1,2,4,6.8,1,1
B1,3,1,4.4,1,0
B1,1,2,6.2,0,1
B1,2,3,5.4,1,1
B1,3,4,6.3,1,1
B1,1,1,7,0,0
B1,2,2,4,1,1
B1,3,3,5,1,1
The following image shows that running this dataset using subalgorithm8 will produce a standard regression analysis. In this analysis, the t-ratios for t1 (treatments, or cost estimators) and b1 (blocks, or job estimates) can be used to assess the differences among treatment and block means.
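The 0 and 1 coding used for Method 2 can be reproduced mechanically from the standard ANOVA dataset. The sketch below is illustrative only: dummy_code and recode are hypothetical helper names, and the rule applied (level 1 maps to 0, every other level maps to 1) is taken from the description of t1 and b1 above.

```python
# (estimator, job, y) rows from the standard ANOVA dataset, scoring rows included.
ROWS = [
    (1, 1, 4.6), (2, 2, 6.3), (3, 3, 5.4), (1, 4, 6.6), (2, 1, 4.9),
    (3, 2, 5.9), (1, 3, 5.0), (2, 4, 6.8), (3, 1, 4.4), (1, 2, 6.2),
    (2, 3, 5.4), (3, 4, 6.3), (1, 1, 7.0), (2, 2, 4.0), (3, 3, 5.0),
]


def dummy_code(level, base_level=1):
    """Return 0 for the base level and 1 for every other level."""
    return 0 if level == base_level else 1


def recode(rows):
    """Append the t1 (treatment) and b1 (block) dummy columns to each row."""
    return [(est, job, y, dummy_code(est), dummy_code(job)) for est, job, y in rows]


print("label,estimator,job,y,t1,b1")
for est, job, y, t1, b1 in recode(ROWS):
    print(f"B1,{est},{job},{y:g},{t1},{b1}")
```

Note that this coding collapses the three estimators into two groups (estimator 1 versus the rest), so the regression t-ratios assess that contrast rather than all pairwise differences.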
Method 3: Resource Stock Analysis ANOVA: As of 1.9.2, the following feature is being retained, but not debugged, until more advanced RCT algorithms are developed that can either replace, or enhance, this feature. The following 3 datasets are subsets of the data used in the standard ANOVA. Each dataset has been added to 3 sibling Input Series. Each Input Series and dataset represent a separate experimental treatment. An example of a Math Expression used to calculate Series 1 is (I1.Q1.treatment * 0) + (I1.Q2.block * 0) + I1.QT.
Stock dataset (M&E datasets use the conventions explained for DataURL datasets)
Input Series 1
label,estimator,job,y,treatment,block
B1,1,1,4.6,1,1
B1,1,4,6.6,1,4
B1,1,3,5,1,3
B1,1,2,6.2,1,2
Input Series 2
label,estimator,job,y,treatment,block
B1,2,2,6.3,2,2
B1,2,1,4.9,2,1
B1,2,4,6.8,2,4
B1,2,3,5.4,2,3
Input Series 3 (the 3 rows of scoring data must be added to the last dataset but are not currently used)
label,estimator,job,y,treatment,block
B1,3,3,5.4,3,3
B1,3,2,5.9,3,2
B1,3,1,4.4,3,1
B1,3,4,6.3,3,4
B1,1,1,7,1,1
B1,2,2,4,2,2
B1,3,3,5,3,3
The following image demonstrates that running a Change by Id Resource Stock Analysis returns the identical Score properties as standard Resource Stock Analyses, but the Indicators compare the differences between the statistical means for each treatment. They use the same calculations as the ones used to set the confidence intervals for the standard ANOVA and regression ANOVA, but use the standard comparators used by the Resource Stock analyzers. They currently use the same stylesheet as the standard analyses, which cannot display the additional differences among blocks. Use the Score.MathResult to view the complete statistical results.
Example 8c. Randomized Factorial Data
URLs:
https://www.devtreks.org/greentreks/preview/carbon/input/Example 1mc, Anova, Randomized Factorial/2147397549/none
https://devtreks1.blob.core.windows.net/resources/network_carbon/resourcepack_1534/resource_7993/Anova3.csv
https://www.devtreks.org/greentreks/preview/carbon/input/Example 1mc2, Anova, Randomized Factorial/2147397550/none
https://www.devtreks.org/greentreks/preview/carbon/input/Example 1mc3, Anova, Randomized Factorial/2147397551/none
Method 1: Standard ANOVA: The object of the following subset of a full dataset is to compare the mean profit per unit of raw material for 3 amounts of raw material and 3 ratios of raw materials allocated to manufacturing lines. Randomized factorial data requires structuring data by factors, the ratios and raw materials, and levels, the 3 different amounts for each factor. The dataset represents 3 replications of a complete 3 x 3 factorial experiment. This algorithm runs a standard ANOVA when the following conventions are followed:
1. Factors: The column name for the fifth column of data must be “factor1”. The data in the column must be doubles (or integers) that distinguish each level of factor1. The column name for the sixth column of data must be “factor2”. The data in the column must be doubles (or integers) that distinguish each level of factor2. Additional factors will be supported in future releases.
2. Training Data: The column name for the fourth column of data can follow standard conventions for dependent, or output, variable data. The data in the column must be doubles that contain the factor-level observed data.
3. Scoring Data: In order to stay consistent with other algorithms, the final 3 rows of data must be scoring data. These 3 rows are not currently used by this algorithm.
4. Indicator Math Expression: The expression should be identical to regression analysis expressions (I1.Q1.factor1 + I1.Q2.factor2)
Stock dataset (M&E datasets use the conventions explained for DataURL datasets)
label,ratioraw,rawmat,y,factor1,factor2
F1,0.50,15,23,0.50,15
F1,0.50,15,20,0.50,15
F1,0.50,15,21,0.50,15
F1,1.00,15,22,1.00,15
F1,1.00,15,20,1.00,15
F1,1.00,15,19,1.00,15
F1,2.00,15,18,2.00,15
F1,2.00,15,18,2.00,15
F1,2.00,15,16,2.00,15
F1,0.50,18,22,0.50,18
…
The following image shows the Math Result for the Indicator associated with this dataset. The results for the statistical tests and confidence intervals are explained in the following list. These results match those in the reference textbook.
1. F Test Factor1: The F test statistic for factor1, 1.71, is less than the critical value for F at .05 = 3.55. The factor1 means are not significantly different (at Score.ConfidenceLevel = 95).
2. F Test Factor2: The F test statistic for factor2, 4.20, is greater than the critical value for F at .05 = 3.55. The factor2 means are significantly different (at Score.ConfidenceLevel = 95).
3. F Test Interactions: The F test statistic for the interactive effects, 4.80, is greater than the critical value for F at .05 = 2.93. The interactive means are significantly different (at Score.ConfidenceLevel = 95).
4. Factor Confidence Intervals: The confidence intervals reflect differences among factor-level data cells in the following data table. Specifically, the differences in the means for cells in position (0,0), (1,1), and (2,2) are compared.
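The three F statistics in the list above can be cross-checked with the textbook formulas for a balanced two-factor factorial design with replication. The following is a minimal sketch, not the DevTreks implementation. Because the Method 1 listing is abbreviated, the cell observations are taken from the 27 training rows listed in the Method 2 regression dataset of this example (columns ratioraw, rawmat, and y). The names CELLS and factorial_anova are hypothetical.

```python
from collections import defaultdict

# 27 training observations keyed by (ratioraw, rawmat): 3 replications per cell.
CELLS = {
    (0.5, 15): [23, 20, 21], (1.0, 15): [22, 20, 19], (2.0, 15): [18, 18, 16],
    (0.5, 18): [22, 19, 20], (1.0, 18): [24, 25, 22], (2.0, 18): [21, 23, 20],
    (0.5, 21): [19, 18, 21], (1.0, 21): [20, 19, 22], (2.0, 21): [20, 22, 24],
}


def factorial_anova(cells):
    """Return F statistics (factor1, factor2, interaction) for a balanced design."""
    reps = len(next(iter(cells.values())))
    all_y = [y for ys in cells.values() for y in ys]
    correction = sum(all_y) ** 2 / len(all_y)  # correction for the mean

    f1_totals = defaultdict(float)
    f2_totals = defaultdict(float)
    for (f1, f2), ys in cells.items():
        f1_totals[f1] += sum(ys)
        f2_totals[f2] += sum(ys)
    a, b = len(f1_totals), len(f2_totals)  # levels of factor1 and factor2

    ss_total = sum(y * y for y in all_y) - correction
    ss_f1 = sum(t * t for t in f1_totals.values()) / (b * reps) - correction
    ss_f2 = sum(t * t for t in f2_totals.values()) / (a * reps) - correction
    ss_cells = sum(sum(ys) ** 2 for ys in cells.values()) / reps - correction
    ss_inter = ss_cells - ss_f1 - ss_f2
    ss_error = ss_total - ss_cells

    ms_error = ss_error / (a * b * (reps - 1))
    return (
        (ss_f1 / (a - 1)) / ms_error,
        (ss_f2 / (b - 1)) / ms_error,
        (ss_inter / ((a - 1) * (b - 1))) / ms_error,
    )


f_factor1, f_factor2, f_interaction = factorial_anova(CELLS)
print(f"{f_factor1:.2f} {f_factor2:.2f} {f_interaction:.2f}")
```

Under these assumptions the statistics come out at roughly 1.71, 4.20, and 4.80, matching the reported Math Result. Critical values still need an F table or a statistical library.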
Method 2: Regression ANOVA: The following dataset changes the column names from “factor1” and “factor2” to any other acceptable name (such as s1 and r1). This tells subalgorithm8 to run the model as a regression analysis. The column s1 is coded 0 when factor 1, or raw material ratio, is level 1, 0.5; otherwise it is coded 1. The column r1 is coded 0 when factor 2, or raw material amount, is level 1, 15; otherwise it is coded 1. The last column, sr, models the interactive effects between s1 and r1 and is coded as s1 * r1.
Note that adding interactive terms such as x1^2 and x2^2, while using 0 and 1 codes for independent variables, is not supported by the matrix mathematical techniques used by the regression algorithm; they return an error message stating that the matrix is not positive definite.
Stock dataset (M&E datasets use the conventions explained for DataURL datasets)
label,ratioraw,rawmat,y,s1,r1,sr
F1,0.50,15,23,0.00,0,0
F1,0.50,15,20,0.00,0,0
F1,0.50,15,21,0.00,0,0
F1,1.00,15,22,1.00,0,0
F1,1.00,15,20,1.00,0,0
F1,1.00,15,19,1.00,0,0
F1,2.00,15,18,1.00,0,0
F1,2.00,15,18,1.00,0,0
F1,2.00,15,16,1.00,0,0
F1,0.50,18,22,0.00,1,0
F1,0.50,18,19,0.00,1,0
F1,0.50,18,20,0.00,1,0
F1,1.00,18,24,1.00,1,1
F1,1.00,18,25,1.00,1,1
F1,1.00,18,22,1.00,1,1
F1,2.00,18,21,1.00,1,1
F1,2.00,18,23,1.00,1,1
F1,2.00,18,20,1.00,1,1
F1,0.50,21,19,0.00,1,0
F1,0.50,21,18,0.00,1,0
F1,0.50,21,21,0.00,1,0
F1,1.00,21,20,1.00,1,1
F1,1.00,21,19,1.00,1,1
F1,1.00,21,22,1.00,1,1
F1,2.00,21,20,1.00,1,1
F1,2.00,21,22,1.00,1,1
F1,2.00,21,24,1.00,1,1
F1,0.50,21,24,1.00,1,0
F1,1.00,21,24,1.00,1,0
F1,2.00,21,24,1.00,1,1
The following image shows that running this dataset using subalgorithm8 will produce a standard regression analysis. In this analysis, the t-ratios for s1 (factor1 or raw material ratios), r1 (factor2 or raw material amounts), and sr (interactive effects between s1 and r1) can be used to assess the differences among factors and interactive effect means.
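The coding of s1, r1, and sr, and a plausible reason why squared terms fail with 0 and 1 codes, can be illustrated with a short sketch. It is an assumption, not a statement about the regression algorithm's internals, that the "not positive definite" error arises because the squared column duplicates the original column and makes X'X singular. The names DESIGN, code_row, gram, and det3 are hypothetical.

```python
# Distinct (ratioraw, rawmat) design points of the 3 x 3 factorial.
DESIGN = [(ratio, amount) for amount in (15, 18, 21) for ratio in (0.5, 1.0, 2.0)]


def code_row(ratio, amount):
    """Return (s1, r1, sr): 0 for the first level of each factor, otherwise 1."""
    s1 = 0 if ratio == 0.5 else 1
    r1 = 0 if amount == 15 else 1
    return s1, r1, s1 * r1


def gram(columns):
    """Return X'X for a design matrix given as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


coded = [code_row(ratio, amount) for ratio, amount in DESIGN]
intercept = [1] * len(coded)
s1 = [row[0] for row in coded]
s1_squared = [value * value for value in s1]

# Squaring a 0/1 column reproduces the column, so the two columns are collinear
# and the determinant of X'X is zero (singular, hence not positive definite).
print(s1_squared == s1)
print(det3(gram([intercept, s1, s1_squared])))
```

A singular X'X cannot be factored by a Cholesky decomposition, which is one common source of a "not positive definite" message in regression code.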
Method 3: Resource Stock Analysis ANOVA: As of 1.9.2, the following feature is being retained, but not debugged, until more advanced RCT algorithms are developed that can either replace, or enhance, this feature. The following 3 datasets are subsets of the data used in the standard ANOVA. Each dataset has been added to 3 sibling Input Series. Each Input Series and dataset represent a separate experimental treatment. An example of a Math Expression used to calculate Series 1 is (I1.Q1.factor1 * 0) + (I1.Q2.factor2 * 0) + I1.QT.
Stock dataset (M&E datasets use the conventions explained for DataURL datasets)
Input Series 1
label,ratioraw,rawmat,y,factor1,factor2
F1,0.50,15,23,0.50,15
F1,0.50,15,20,0.50,15
F1,0.50,15,21,0.50,15
F1,0.50,18,22,0.50,18
F1,0.50,18,19,0.50,18
F1,0.50,18,20,0.50,18
F1,0.50,21,19,0.50,21
F1,0.50,21,18,0.50,21
F1,0.50,21,21,0.50,21
Input Series 2
label,ratioraw,rawmat,y,factor1,factor2
F1,1.00,15,22,1.00,15
F1,1.00,15,20,1.00,15
…
Input Series 3 (the 3 rows of scoring data must be added to the last dataset but are not currently used)
label,ratioraw,rawmat,y,factor1,factor2
F1,2.00,15,18,2.00,15
F1,2.00,15,18,2.00,15
…
The following images demonstrate that running a Change by Id Resource Stock Analysis returns the identical Score properties as standard Resource Stock Analyses, but the Indicator confidence intervals compare the differences between the statistical means for specific factor-level cells shown in the data table above. They use the same calculations as the ones used to set the confidence intervals for the standard ANOVA and regression ANOVA, but use the standard comparators used by the Resource Stock analyzers. They currently use the same stylesheet as the standard analyses, which cannot display the additional differences among all factor-level cells. Use the first image, the Score.MathResult, to view the complete statistical results.
Appendix B. Correlated Uncertain Numbers
All of the main probabilistic-risk references (GAO, IPCC, NASA) explain the importance of accounting for correlated indicators in PRA. Failure to do so results in random samples that don’t retain the correct correlations between indicators and therefore incorrect descriptive statistics. The following references, in particular, provide guidance about potential mathematical techniques that account for correlated indicators:
1. Piwcewicz of the Australian Actuaries Society (2005) provides an introduction to correlated multivariate analysis for practitioners. They introduce common algorithms, including pair-wise rank correlations (or Iman and Conover) and copula (mathematical formulas that manipulate matrixes) that analysts commonly use. Their approach is practical, recognizing that assumptions may have to be made about data, distributions, and correlations. They also explain common dangers posed by simulation techniques. The Brebbia (2013) reference provides more recent risk analysis techniques, such as using Kernel copulas (but it isn’t an open access publication and therefore of limited usefulness).
2. Anderson, Harri, and Cable (2009) use an agricultural economics example relevant to CTA analysis to explain the difference between the conventional pair-wise rank correlations (Iman and Conover) and a matrix manipulation technique (eigenvalue decomposition). Advantages to the latter approach include better simulated numbers and ease of computation (i.e. using modern mathematical libraries). References to this latter approach can be found in the finance and engineering literature, but are not cited here.
3. Although not cited, several references were found that used techniques such as genetic algorithms, simulated annealing algorithms, and several sequencing methods (i.e. Hammersley) for carrying out these simulations. Potential advantages with some of the techniques involve speed and large data manipulation. References to these approaches can be found in the computer science literature, but are not cited here.
Searches on the web reveal that statisticians, software developers, and the mathematically inclined frequently answer the question of how to simulate correlated variables in their forums. Many present statistical scripts, written for tools such as R or Matlab, with concrete datasets demonstrating the answer. Judging from their online blogs, it appears that most practitioners recommend using mathematical matrix manipulation techniques based on copulas.
The open source mathematical library (Math.Net) used in this version supports a wide assortment of mathematical matrix manipulations. Algorithm1, with subalgorithm2, subalgorithm3, or subalgorithm4, uses the following steps to generate statistics for correlated indicators:
a) Use Monte Carlo simulation to generate a sample of random numbers using each indicator’s Distribution Type, QT, QTD1, and QTD2. Combine each indicator’s vector of random samples (Fn) into a random sample matrix F. This matrix is also called marginal distributions.
b) Generate a Pearson or Spearman correlation matrix R from the data or from expert knowledge. The matrix is based on the calculated QT variable.
c) Use Monte Carlo simulation to generate a matrix Z of random numbers drawn from a standard normal distribution, N(0,1), with the same dimensions as F.
d) Select an n-copula C that is consistent with the data. A copula is a multivariate uniform (or normal) distribution that uses correlation matrixes to define dependence among the uniform (or normal) variables. Subalgorithm2, or pra_copulaN, generates matrix X by multiplying a Cholesky decomposition matrix of R times Z. Subalgorithm3, or pra_eigen, and subalgorithm4, or pra_eigenU, generate matrix X by multiplying an Eigen decomposition matrix of R times Z.
e) Subalgorithm3 uses a Normal distribution and subalgorithm4 uses a Uniform distribution for the random matrix used in the previous step. That is their only difference. The author found examples of both distributions being used and assumes the choice may matter. Tests confirm slightly different results.
f) Generate a correlated random sample matrix S by using the probability of each variable in X to evaluate the inverse cumulative distribution function of each corresponding variable in F.
g) Verify the accuracy of the random correlated sample matrix by generating another Pearson or Spearman correlation matrix from S. Add the result to the Score Math Result.
h) Generate sample statistics for each vector in S and add descriptive statistics to the associated indicator.
i) Set the Score using standard Score properties.
The following source code displays the calculations used by subalgorithm3. Subalgorithm4 uses a Uniform distribution for the matrix u. The source code demonstrates why these types of algorithms should not be considered daunting. As mentioned, the analysis of large data sets should consider using algorithms that don’t employ these exact techniques.
else if (HasMathType(MATH_TYPES.algorithm1, MATH_SUBTYPES.subalgorithm3))
{
    //eigenvalue decomposition
    //jointData is the correlation matrix
    var evd = jointData.Evd();
    if (evd == null)
    {
        this.SB1ScoreMathResult += string.Concat(" ", Errors.MakeStandardErrorMsg("MATRIX_BADEIGEN"));
        return null;
    }
    //take the square root of the diagonal eigenvalue matrix
    Matrix<double> squarerootEigenValues = evd.D.PointwisePower(0.5);
    //multiply the eigenvector matrix by the square root of the diagonal eigenvalue matrix
    var v = evd.EigenVectors.Multiply(squarerootEigenValues);
    //random standard normals (one row per iteration, one column per indicator)
    var u = Matrix<double>.Build.Random(this.SB1Iterations, cols);
    //generate correlated normal randoms by multiplying u by the transpose of v
    var X = u.TransposeAndMultiply(v);
    //random sample vector n = inverse of cumulative density function for Xn
    SetCorrelatedRandomSamples(X, sampleData, randomSampleData);
}
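Because the fragment above depends on members of its surrounding class, the same steps can also be sketched in a self-contained form. The following is an illustration under stated assumptions, not the DevTreks implementation: it handles only 2 indicators with a hypothetical correlation of 0.7, uses closed-form Cholesky and eigen factors for a 2 x 2 correlation matrix instead of a matrix library, assumes the standard normal cumulative distribution function is what converts X to probabilities in step f), and uses hypothetical marginals (Normal(100, 15) and Uniform(10, 20)). All function names are invented for the example.

```python
import math
import random
from statistics import NormalDist


def cholesky_factor(r):
    """Lower-triangular A with A A' = [[1, r], [r, 1]] (subalgorithm2 style)."""
    return [[1.0, 0.0], [r, math.sqrt(1.0 - r * r)]]


def eigen_factor(r):
    """A = V sqrt(D) from the eigen decomposition of [[1, r], [r, 1]]."""
    s = 1.0 / math.sqrt(2.0)
    root_hi, root_lo = math.sqrt(1.0 + r), math.sqrt(1.0 - r)
    # Eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2), scaled by sqrt(eigenvalue).
    return [[s * root_hi, s * root_lo], [s * root_hi, -s * root_lo]]


def correlated_samples(factor, inverse_cdfs, iterations, seed=1):
    """Steps c) to f): Z, then X = A z per row, then probabilities, then inverse CDFs."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    samples = []
    for _ in range(iterations):
        z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        x = [sum(factor[i][j] * z[j] for j in range(2)) for i in range(2)]
        # Clamp probabilities away from 0 and 1 so inverse CDFs stay defined.
        u = [min(max(std_normal.cdf(xi), 1e-12), 1.0 - 1e-12) for xi in x]
        samples.append([inv(ui) for inv, ui in zip(inverse_cdfs, u)])
    return samples


def pearson(xs, ys):
    """Sample Pearson correlation coefficient (step g)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical marginals: Normal(100, 15) and Uniform(10, 20).
inverse_cdfs = [NormalDist(100.0, 15.0).inv_cdf, lambda p: 10.0 + 10.0 * p]
draws = correlated_samples(eigen_factor(0.7), inverse_cdfs, iterations=20000)
check = pearson([d[0] for d in draws], [d[1] for d in draws])
print(f"correlation of the correlated samples: {check:.3f}")
```

Passing cholesky_factor(0.7) instead of eigen_factor(0.7) gives the Cholesky variant. The Pearson correlation recovered in the last step is typically a little below the target when a marginal is not Normal, which is one reason the verification in step g) is worth reporting.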