This article was originally published in the Summer 2011 issue of Methods & Tools
Everything You Always Wanted to Know About Software Measurement
Sue Rule & P. Grant Rule, s.rule @ smsexemplar.com
Software Measurement Services Ltd., www.smsexemplar.com
What Does Functional Size Measurement Mean?
When Grant Rule was working with early pioneers of agile methods back in the late 1980s, estimating was a big issue. No one, it seemed, knew how to determine accurately in advance how big a project was, how long it would take, or how much it would cost. This was particularly a problem for a software house delivering fixed price contracts. Focusing on rapid, iterative delivery to ensure the user gets the right software product is good; but at what price?
While advocates of Lean-Agile rightly emphasise the importance of delivering the right thing, it is essential that professional software managers also keep reliable accounts of software process performance for estimating costs, auditing performance, and improving productivity. Effectiveness is delivering optimum value for the minimum cost.
The solution to Grant’s estimating problem, and to many associated software project control issues, was the function point measurement method developed by Allan Albrecht in 1979 and now maintained by IFPUG. Yet for various reasons, functional size measurement has never been widely adopted as a software project management tool. Perhaps it seems arcane - difficult - part of that top-heavy bureaucracy development teams want to be rid of. As a result, estimating remains a widespread problem for Agile and non-Agile developers alike. Process performance remains hugely variable and unpredictable, and measurement is often still the most neglected area of process improvement. By stripping away much of the dysfunctional and redundant process control that can accumulate around the software process, the current interest in Agile is bringing many of these issues to the surface. Yet many IT professionals seem stuck in a rut: ploughing new land with a newly Agile team, but using the same blunt plough and running into the same obstacles.
CIOs, bid teams, development teams, vendor managers and legal advisors seeking better ways of contracting - particularly for Lean-Agile software development - still face the dilemmas Albrecht set out to tackle. How do you know your development teams - in-house or outsourced - are delivering best value for money? How does a supplier manage the risk of a fixed-price contract? How does it demonstrate competitive productivity levels and ongoing improvement? How will Agile benefit the organisation? How does the business manage the risk if requirements cannot be detailed up-front? How do you measure the benefits of outsourcing IT? Which technology should be chosen? Will a change of technology affect the value delivered? What are the relative merits of buying, building or outsourcing?
Allan Albrecht’s answer to these issues was to measure software size in terms of the output delivered to the user - its functionality - rather than simply in terms of the time and effort put into the development. Deriving a software size from the user’s requirements, which can then be mapped to the time, resources and budget needed to realise those requirements, gives you the objectivity needed to tackle scope, project and performance control issues. It also takes you at least part of the way to measuring the business value and benefit delivered by software. At the very least, you know how much it will cost and how long it will take to deliver the specified functionality; the software should do what it "says on the tin", and it should be delivered on schedule and within budget. If the customer has bought beans instead of tomatoes, or needed a three-course meal and not a tin of beans at all, something is amiss elsewhere in the process.
Functional size measurement means the ability to make objective performance comparisons across teams, across market sectors, and between different suppliers. It means reliable productivity and pricing benchmarks. It means accurate baselines and measures of improvement. It means early and accurate estimations of software project cost and duration. That was why Grant Rule first adopted the technique back in the 1980s - and remains a knowledgeable advocate of it today.
Estimating
The predictability of software projects remains a major issue for many business managers. Agile can exacerbate this if delivered software products or product components have to be routinely re-factored. Customers considering outsourced Agile development will rightly require some assurance of value for money, and suppliers need reliable oversight of costs to ensure prices remain competitive. The ‘Story Point’ measures used by many Agile teams to estimate and manage projects are little more than a new take on our old friend, the measure of input. Estimates using Story Points rely on team knowledge and capability, based on individual experience. The process of arriving at them is even known as Planning Poker, an indication of the level of personal skill and informed guesswork involved. Story Points cannot be compared between one team and another (let alone between market sectors, or between one supplier and another). Their use either as a reliable estimating tool or as an objective, comparable measure of productivity is therefore very limited.
In 2010, Grant Rule published a short paper on using the most modern of the functional size methods - COSMIC - to derive accurate estimates from User Stories (used by Agile teams to capture requirements). As measurement specialists, SMS recommend COSMIC as the quickest, most reliable and most easily learned method of introducing robust software estimating practice, for developers using Agile or more traditional development practices.
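To make the COSMIC approach concrete, the sketch below sizes two hypothetical User Stories by counting data movements. This is an illustrative simplification, not the procedure from Rule's paper: in COSMIC, each functional process is sized as the number of data movements it contains - Entries (E), eXits (X), Reads (R) and Writes (W) - at one COSMIC function point (cfp) per movement. The stories and their movement counts are invented for illustration.

```python
# Illustrative sketch (not the published SMS/COSMIC procedure): sizing
# User Stories by counting COSMIC data movements. Each functional
# process is sized at 1 cfp per data movement: Entry (E), eXit (X),
# Read (R), Write (W).

# Hypothetical data movements identified for two user stories.
stories = {
    "As a customer, I can place an order": {
        "E": 1,  # order details entered
        "R": 2,  # read product and customer data
        "W": 1,  # write the new order
        "X": 1,  # confirmation shown to the user
    },
    "As a customer, I can view my order status": {
        "E": 1,  # order id entered
        "R": 1,  # read the order
        "X": 1,  # status shown to the user
    },
}

def cosmic_size_cfp(movements: dict) -> int:
    """Functional size of one functional process: 1 cfp per data movement."""
    return sum(movements.values())

total = sum(cosmic_size_cfp(m) for m in stories.values())
for story, m in stories.items():
    print(f"{cosmic_size_cfp(m):>2} cfp  {story}")
print(f"{total:>2} cfp  total")
```

With a calibrated delivery rate (e.g. hours per cfp from past projects), such a total converts directly into an effort estimate.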
Size Matters - Accurate Early Estimating
Project failure rates have been shown to relate strongly to the size of project undertaken, with the largest projects accounting for the highest failure rate. Rule’s Relative Size Scale was a table developed by Grant Rule to provide a quick and easy early range estimate of project size and cost in terms of clothing sizes - Small, Medium, Large, Extra Large. The scale uses benchmark data to provide a valuable and reliable early-stage feasibility check.
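The idea of a relative size scale can be sketched as a simple banding function. Note that the band boundaries below are invented placeholders for illustration only - Rule's published scale is calibrated against benchmark data, which is not reproduced here.

```python
# Hypothetical sketch of a relative size scale: mapping an early
# functional size estimate (in cfp) to a clothing-size band.
# The boundaries are INVENTED placeholders, not Rule's calibrated scale.
SIZE_BANDS = [
    (100, "Small"),
    (500, "Medium"),
    (2000, "Large"),
    (float("inf"), "Extra Large"),
]

def size_band(cfp: float) -> str:
    """Return the first band whose upper bound covers the estimate."""
    for upper, label in SIZE_BANDS:
        if cfp <= upper:
            return label

print(size_band(80))    # Small (within the hypothetical 100 cfp bound)
print(size_band(1200))  # Large
```

In practice each band would also carry a benchmark-derived cost and duration range, turning a rough size into an early feasibility check.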
COCOMO II Profiling - Accounting for Non-Functional Requirements
The Constructive Cost Model (COCOMO) is an algorithmic software cost estimation model developed by Barry Boehm. The model uses a basic regression formula, with parameters that are derived from historical project data and current project characteristics. COCOMO enables more refined estimates to be derived from simple unadjusted function points by taking into account differences in non-functional project attributes (Cost Drivers). Detailed COCOMO can also account for the influence of individual project phases.
COCOMO II was released in 2000 to adapt the original model to modern development conditions such as incremental development and the Rational Unified Process. COCOMO II continues to evolve to keep pace with changing software methods.
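The core COCOMO II effort equation can be sketched as follows. The constants A = 2.94 and B = 0.91 are from the published COCOMO II.2000 calibration; the scale factor ratings and the 10 KSLOC size in the example are illustrative assumptions (in practice the size might be backfired from a function point count).

```python
import math

# Sketch of the COCOMO II.2000 effort equation:
#   Effort (person-months) = A * Size^E * product(effort multipliers)
#   where E = B + 0.01 * sum(scale factors), Size in KSLOC.
# Constants A and B are the published COCOMO II.2000 calibration values.
A, B = 2.94, 0.91

def cocomo2_effort(ksloc, scale_factors, effort_multipliers):
    e = B + 0.01 * sum(scale_factors)
    em = math.prod(effort_multipliers)
    return A * ksloc ** e * em

# Hypothetical project: 10 KSLOC, illustrative scale factor ratings,
# all cost drivers left at their nominal value (multiplier 1.0).
effort = cocomo2_effort(
    ksloc=10,
    scale_factors=[3.72, 3.04, 4.24, 3.29, 4.68],  # illustrative ratings
    effort_multipliers=[1.0],                      # all drivers nominal
)
print(f"Estimated effort: {effort:.1f} person-months")
```

The effort multipliers are where the non-functional cost drivers enter the estimate: a project rated "very high" for required reliability, say, multiplies the nominal effort accordingly.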
What are Function Points?
Function Points measure the end user’s requirements in terms of data movements. This means that an early functional size estimate can be derived before any code has been written. The same measure is then used to record the actual cost of developing the requirements, which can be compared directly with the estimate and used to calibrate a scale for a particular development environment. If development productivity figures are known, the cost of developing new software can therefore be estimated early enough to make direct comparisons with the cost of buying a software package, or of using a non-technical solution. Function Point measures are independent of technology, so they can be used to compare different suppliers, technologies and teams, giving an objective basis for a cost-benefit decision.
Because they encompass both cost (development effort consumed) and benefit (requirements fulfilled), Function Points unlock the door to measuring productivity, velocity, quality and value in software development.
- The capacity to measure the amount of output produced for input provided gives an objective productivity figure.
- Measured against time - e.g. FP delivered per elapsed month - function points provide objective and comparable measures of velocity.
- The number of defects detected over the total size of the delivery - i.e. number of defects delivered over number of FP delivered - gives a measure of quality.
- By comparing costs, effort and time expended per function point delivered, we can also demonstrate the huge waste generated by ineffective software processes compared to the standards of measurable efficiency shown by effective, Right-shifted organisations. Although a long way from being a comprehensive measure of value delivered to the customer, this offers a measure of the component of value for which the software developer is solely responsible. Ensuring the functionality delivered is aligned to business need is the joint responsibility of the business user and the development team and outside the scope of functional size measurement.
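The derived measures listed above are simple ratios once a functional size is available. A minimal sketch, using invented figures for a hypothetical delivery:

```python
# Sketch of the derived measures described above, for a hypothetical
# delivery measured in function points (fp). All figures are invented.
fp_delivered = 400      # functional size delivered
effort_hours = 3200     # total project effort consumed
elapsed_months = 5      # calendar duration
defects_delivered = 12  # defects found in the delivered product

productivity = fp_delivered / effort_hours         # fp per person-hour
velocity = fp_delivered / elapsed_months           # fp per elapsed month
defect_density = defects_delivered / fp_delivered  # defects per fp

print(f"Productivity:   {productivity:.3f} fp/hour")
print(f"Velocity:       {velocity:.1f} fp/month")
print(f"Defect density: {defect_density:.3f} defects/fp")
```

Because the numerator (functional size) is technology-independent, these ratios can be compared across teams and suppliers in a way that Story Point ratios cannot.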
A Comparison of Functional Size Methods
The most widely used function point methods are IFPUG, NESMA, Mk II and COSMIC. Of these, IFPUG is based on the original method designed by Allan Albrecht. COSMIC is the newest method, developed by an international team of metrics experts to support modern software.
Where organisations have already made some investment in IFPUG capability, it may make sense to use this more widely known method. It is actively managed and maintained by the International Function Point User Group, with the latest version 4.3 released in January 2010. The Certified Function Point Specialist qualification is a recognised standard for IFPUG counting competency. There is considerable IFPUG benchmarking data available, although it is questionable how valuable some of the older data is in terms of benchmarking modern software performance.
While the older functional size measures - typically, IFPUG - can be used for Agile projects, there are significant problems. IFPUG counting is not the easiest process to learn or apply, and even with trained practitioners, time can be wasted debating fuzzy areas. Interpretation of the IFPUG counting rules for complex modern software environments can be tricky, sometimes resulting in differences of opinion between supplier and customer.
COSMIC offers all the advantages of functional size measurement, without some of the shortcomings of earlier methodologies which were not designed with 21st century software in mind. It will prove easier to introduce, easier for both business users and software developers to understand and apply, and a more effective communication mechanism for negotiating the delivery of value. It is an established standard with recognised training qualifications. There is a growing repository of benchmark data available.
See the table below for a more detailed comparison.
Types of measurement scale and permissible operations using them
The type of scale depends on the nature of the relationship between values on the scale. Four types of scale are commonly defined:
Nominal – arbitrary labels, classification data, no ordering – the measurement values are categorical, but it makes no sense to state that one category is ‘greater than’ another. For example: Yes/No; Black/White/Yellow/Red; male/female; animal/vegetable/mineral; the classification of defects by their type.
Ordinal – ordered but differences between values are not important – the measurement values are rankings. For example: restaurant ‘star’ ratings; political parties on left to right of the spectrum are given labels Red, Orange, Blue; Likert scales that rank ‘user satisfaction’ on a scale of 1..5; the assignment of a severity level to defects.
Interval – ordered, constant scale, but no natural zero – the measurement values have equal distances corresponding to equal quantities of the attribute. For example: dates, or temperature on the Celsius or Fahrenheit scales – differences make sense, but ratios do not (e.g. 30°−20° = 20°−10°, but 20° is not twice as hot as 10°!). Another example: cyclomatic complexity has a minimum value of one, but each additional path increments the count by one.
Ratio – ordered, constant scale, natural zero – the measurement values have equal distances corresponding to equal quantities of the attribute, and the value of zero corresponds to none of the attribute. For example: height; weight; age; length; temperature on the Kelvin scale (e.g. absolute zero = 0 K, and 200 K is twice as hot as 100 K); the size of a software source listing in terms of Non-Commentary Source Statements (or Source Lines Of Code).
The method of measurement usually affects the type of scale that can be used reliably with a given attribute. For example, subjective methods of measurement usually only support ordinal or nominal scales.
Only certain operations can be performed on certain scales of measurement. The following list summarizes which operations are legitimate for each scale. Note that you can always apply operations from a 'lesser scale' to any particular data, e.g. you may apply nominal, ordinal, or interval operations to an interval scaled datum.
Nominal Scale. You are only allowed to examine whether a nominal scale datum is equal to some particular value, or to count the number of occurrences of each value. For example, gender is a nominal scale variable: you can examine whether the gender of a person is F (female), or count the number of Ms (males) in a sample. Valid statistics: mode, chi square.
Ordinal Scale. You are also allowed to examine whether an ordinal scale datum is less than or greater than another value. Hence, you can 'rank' ordinal data, but you cannot 'quantify' differences between two ordinal values. For example, political party is an ordinal datum, with the Liberal Democratic Party to the left of the Conservative Party, but you can't quantify the difference. Another example is preference scores, e.g. ratings of eating establishments where 10 = good and 1 = poor: the difference between an establishment with a 10 rating and one with an 8 rating can't be quantified. Valid statistics: mode, chi square, median, percentile.
Interval Scale. You are also allowed to quantify the difference between two interval scale values, but there is no natural zero. For example, temperature scales are interval data: 25°C is warmer than 20°C, and a 5°C difference has physical meaning. Note that 0°C is arbitrary, so it does not make sense to say that 20°C is twice as hot as 10°C. Valid statistics: mode, chi square, median, percentile, mean, standard deviation, correlation, regression, analysis of variance.
Ratio Scale. You are also allowed to take ratios among ratio scaled variables. Physical measurements of height, weight and length are typically ratio variables. It is now meaningful to say that 10 metres is twice as long as 5 metres. This ratio holds true regardless of the units in which the object is measured (e.g. metres or yards), because there is a natural zero. Valid statistics: mode, chi square, median, percentile, mean, standard deviation, correlation, regression, analysis of variance, geometric mean, harmonic mean, coefficient of variation, logarithms.
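The interval-versus-ratio distinction above can be demonstrated in a few lines, using the temperature example from the definitions:

```python
# Sketch illustrating why scale type limits the operations you can use.
celsius = [10, 20]                       # interval scale: arbitrary zero
kelvin = [c + 273.15 for c in celsius]   # ratio scale: natural zero

# Differences are meaningful on both scales: a 10-degree gap is a
# 10-degree gap whichever zero point you choose.
assert abs((kelvin[1] - kelvin[0]) - (celsius[1] - celsius[0])) < 1e-9

# Ratios only make sense on the ratio scale: 20C is NOT "twice as hot"
# as 10C, as the corresponding Kelvin ratio shows.
print(celsius[1] / celsius[0])  # 2.0 -- a meaningless number
print(kelvin[1] / kelvin[0])    # ~1.035 -- the physically meaningful ratio
```

The same trap applies to software measures: taking ratios or means of values on a nominal or ordinal scale (such as weighted function type categories) produces numbers that look precise but have no defensible meaning.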
A comparison of the most common Functional Size Measurement (FSM) Methods
General Information

Origin
- IFPUG FPA r4.3: Created by Allan Albrecht at IBM in 1978; the latest release (January 2010) of the original method.
- NESMA FPA v2.0: Believed to have been created by NESMA (aka NEFPUG) in the mid-1980s; derived from IFPUG.
- Mark II FPA r1.3.1: Created by Charles Symons at Nolan Norton in 1984 (put into the public domain in 1991); updated method for use with DBMS, structured methods, CASE tools, etc.
- COSMIC FSM r3.0.1: Created by an international consortium of industry subject matter experts and academics from 19 countries in 1997; updated method for use with OOA/D, layered architectures, Web 2.0, lean/agile, etc.

Counting Practices Manual
- IFPUG FPA r4.3: Available to IFPUG members.
- NESMA FPA v2.0: Available for sale.
- Mark II FPA r1.3.1: Available - public domain.
- COSMIC FSM r3.0.1: Available - public domain.

Counting Practices Manual - languages available
- IFPUG FPA r4.3: English & some other language versions available to members.
- NESMA FPA v2.0: Dutch-language version.
- Mark II FPA r1.3.1: English-language version.
- COSMIC FSM r3.0.1: 9 language versions.

Used by
- IFPUG FPA r4.3: Public & private sector organisations, large & small, both customers & vendors, around the world. Mostly MIS users. Stable user base - international.
- NESMA FPA v2.0: Public & private sector organisations, large & small, both customers & vendors, primarily in The Netherlands. Mostly MIS users. Declining user base - mostly The Netherlands.
- Mark II FPA r1.3.1: Originally HM Government’s preferred method for sizing & estimating software. Now used by a few public sector customers & their vendors.
- COSMIC FSM r3.0.1: Public & private sector organisations, large & small, both customers & vendors, around the world. MIS and engineering users. Growing user base - international.

Terminology used
- IFPUG FPA r4.3: Founded in the 1970s.
- NESMA FPA v2.0: Founded in the 1970s.
- Mark II FPA r1.3.1: Uses structured methods terminology.
- COSMIC FSM r3.0.1: Compatible with OOA/D & software engineering principles.

Availability
- IFPUG FPA r4.3: Available only to members of IFPUG (but the organisation is easy to join).
- NESMA FPA v2.0: Public domain - download from NESMA.
- Mark II FPA r1.3.1: Public domain - download from UKSMA.
- COSMIC FSM r3.0.1: Public domain - download from COSMIC.

Design Authority (independent of vendors)
- IFPUG FPA r4.3: International Function Point Users Group (IFPUG)
- NESMA FPA v2.0: Netherlands Software Metrics Association (NESMA)
- Mark II FPA r1.3.1: United Kingdom Software Metrics Association (UKSMA)
- COSMIC FSM r3.0.1: COmmon Software Measurement International Consortium (COSMIC)
Common Features
1. Compliance
All four methods comply with the international standard for Functional Size Measurement Methods, ISO/IEC 14143:
- IFPUG FPA r4.3: ISO/IEC 20926:2003. The ISO standard applies only to …
- NESMA FPA v2.0: ISO/IEC 24570:2005. The ISO standard applies only to …
- Mark II FPA r1.3.1: ISO/IEC 20968:2002. Recommended method for HM Government (UK).
- COSMIC FSM r3.0.1: ISO/IEC 19761:2003/2010. BCS Technology Award winner in 2006. Recognised as a national standard in Spain & Japan.
2. Certification
All four methods operate certification schemes for training measurement staff:
- IFPUG FPA r4.3: Yes
- NESMA FPA v2.0: –
- Mark II FPA r1.3.1: Yes
- COSMIC FSM r3.0.1: Yes
3. Benchmarking Data
All four methods are supported by the International Software Benchmarking Standards Group (ISBSG). There are differences, however, in the size of the comparative data pool:
- IFPUG FPA r4.3: Large data set compiled over many years - the utility of antique data is questionable.
- NESMA FPA v2.0: Large.
- Mark II FPA r1.3.1: Small.
- COSMIC FSM r3.0.1: Moderate and growing.
All four methods share the following characteristics:
- Oriented toward user-required functionality
- Helps verify consistency & completeness of user-required functionality
- Analyses can be used as basis for construction of tests independent of code & test activities
- Measures functional size of dynamic (behavioural) aspects of system (expressed as e.g. use cases, conversational dialogues, user stories, epics & themes, etc)
- Measures development of new requirements
- Measures adaptive maintenance (enhancements)
- Designed for MIS systems - flat & indexed files, batch systems, OLTP systems
- Can be used to measure Functional User Requirements before design, code & test
- Can be used to measure Functional User Requirements after design, code & test
- Can be used to (re)estimate during product life-cycle
- Size can be used as input into top-down software cost models such as COCOMO II.2000, SLIM, SEER, Price-S, etc
- Can be used to construct product burndown charts, calculate takt time, #sprints, etc
- Independent of product non-functional requirements
- Independent of project constraints
- Independent of developer experience
- Independent of process, project management & development methods
- Early estimates of functional size can be made based on incomplete knowledge of Functional User Requirements – enabling consistent use of one size scale for estimating & measurement throughout project:
All four methods (IFPUG FPA r4.3, NESMA FPA v2.0, Mark II FPA r1.3.1, COSMIC FSM r3.0.1) can produce early estimates using various methods.
NONE of the four methods deliver the following characteristics:
- Measures corrective maintenance (fixes)
- Measures perfective maintenance (refactoring for improved performance)
- Measures algorithmic complexity
- Measures reuse of code
Differences between the four main methods:
Measures functional size of static (data storage) aspects of a system (expressed as files, tables, entity types, classes, etc)
- IFPUG FPA r4.3: Yes
- NESMA FPA v2.0: Yes
- Mark II FPA r1.3.1: Regarded as ‘double accounting’
- COSMIC FSM r3.0.1: Regarded as ‘double accounting’

Compatible with modern methods of requirements analysis
- IFPUG FPA r4.3: Partially
- NESMA FPA v2.0: Partially
- Mark II FPA r1.3.1: Yes
- COSMIC FSM r3.0.1: Yes
Designed for MIS systems - relational DBMS
- IFPUG FPA r4.3: No
- NESMA FPA v2.0: No
- Mark II FPA r1.3.1: Yes
- COSMIC FSM r3.0.1: Yes

Designed to be applicable to real-time and/or embedded systems
- IFPUG FPA r4.3: No
- NESMA FPA v2.0: No
- Mark II FPA r1.3.1: No
- COSMIC FSM r3.0.1: Yes

Can be used to measure complex, layered architectures
- IFPUG FPA r4.3: No
- NESMA FPA v2.0: No
- Mark II FPA r1.3.1: Yes
- COSMIC FSM r3.0.1: Yes
Scale type
- IFPUG FPA r4.3: ‘Nominal/Ordinal’ scale; unequal intervals between Low & Average, and between Average & High.
- NESMA FPA v2.0: ‘Nominal/Ordinal’ scale; unequal intervals between Low & Average, and between Average & High.
- Mark II FPA r1.3.1: ‘Ordinal/Interval’ scale; weights derived so that …
- COSMIC FSM r3.0.1: Ratio scale; empirical data suggests …

Permissible arithmetic & statistical operations
- IFPUG FPA r4.3: Categories assigned relative weights; data can be 'ranked', but 'quantifying' differences between values is difficult due to the ‘cut off’ (Low is c. half of High).
- NESMA FPA v2.0: Categories assigned relative weights; data can be 'ranked', but 'quantifying' differences between values is difficult due to the ‘cut off’ (Low is c. half of High).
- Mark II FPA r1.3.1: Ordered, synthetic scale with a natural zero; data can be ranked, and differences & ratios between values can be quantified within limits, but are problematic due to the use of weights.
- COSMIC FSM r3.0.1: Ordered, constant scale with a natural zero; data can be ranked; differences between values can be quantified; ratios make sense (i.e. 20 is twice the size of 10, and 2000 cfp is twice 1000 cfp).
Accounts for information processing by
- IFPUG FPA r4.3: Sizing static data and dynamic behaviour
- NESMA FPA v2.0: Sizing static data and dynamic behaviour
- Mark II FPA r1.3.1: Sizing dynamic behaviour, the use of data
- COSMIC FSM r3.0.1: Sizing dynamic behaviour, the use of data

Models the functional user requirements as
- IFPUG FPA r4.3: File Types
- NESMA FPA v2.0: File Types
- Mark II FPA r1.3.1: Logical Transactions
- COSMIC FSM r3.0.1: Functional Processes

Equivalent of
- IFPUG FPA r4.3: Elementary Process, either: …
- NESMA FPA v2.0: Elementary Process, either: …
- Mark II FPA r1.3.1: Logical Transaction (LT); all stimulus/response message pairs regarded as an LT irrespective of ‘primary purpose’
- COSMIC FSM r3.0.1: Functional Process (FP); all stimulus/response message pairs regarded as an FP irrespective of ‘primary purpose’

Rules for measuring size
- IFPUG FPA r4.3: Different rules apply depending on elementary process type
- NESMA FPA v2.0: Different rules apply depending on elementary process type
- Mark II FPA r1.3.1: Same rules apply to all logical transactions
- COSMIC FSM r3.0.1: Same rules apply to all functional processes
Base Functional Component(s)
- IFPUG FPA r4.3: Internal Logical File
- NESMA FPA v2.0: Internal Logical File
- Mark II FPA r1.3.1: Input Data Element
- COSMIC FSM r3.0.1: Data Movement

Contributors to functional size
- IFPUG FPA r4.3: Per File Type: … Per Transaction Type: …
- NESMA FPA v2.0: Per File Type: … Per Transaction Type: …
- Mark II FPA r1.3.1: Per Logical Transaction: …
- COSMIC FSM r3.0.1: Per Functional Process: the movement (Entry, eXit, Read or Write) of one Data Group

Unit of measure
- IFPUG FPA r4.3: Different weights assigned to 5 function types depending on their relative ‘complexity’. Unit = 1 fp (IFPUG).
- NESMA FPA v2.0: Different weights assigned to 5 function types depending on their relative ‘complexity’. Unit = 1 fp (NESMA).
- Mark II FPA r1.3.1: Weights assigned to the ‘minimum size logical transaction’ add to 2.5, to establish comparability between MkII and IFPUG. Unit = 1 fp (MkII).
- COSMIC FSM r3.0.1: 1 Data Movement. Unit = 1 cfp.
Sensitivity to small changes to requirements
- IFPUG FPA r4.3: Low (only detects changes at boundaries between the Low, Average and High categories)
- NESMA FPA v2.0: Low (only detects changes at boundaries between the Low, Average and High categories)
- Mark II FPA r1.3.1: High (detects changes of single data element types and single entity references)
- COSMIC FSM r3.0.1: Moderate (detects changes to single data groups)

Integrity of measures (how well do the measures reflect the thing measured?)
- IFPUG FPA r4.3: Artificial limits (weights, thresholds, uneven intervals) limit the size of function types measured.
- NESMA FPA v2.0: Artificial limits (weights, thresholds, uneven intervals) limit the size of function types measured.
- Mark II FPA r1.3.1: No artificial limits imposed on the size of a functional process.
- COSMIC FSM r3.0.1: No artificial limits imposed on the size of a functional process.

Sensitivity to variation in functional size of the dynamic model of the system (i.e. functional processes)
- IFPUG FPA r4.3: Stepped: …
- NESMA FPA v2.0: Stepped: …
- Mark II FPA r1.3.1: Stepped: …
- COSMIC FSM r3.0.1: Accommodates size variation from zero to infinity in steps of 1 cfp

Sensitivity to variation in functional size of the static model of the system (i.e. data stores)
- IFPUG FPA r4.3: Stepped: …
- NESMA FPA v2.0: Stepped: …
- Mark II FPA r1.3.1: Data stores are considered to deliver functionality only when the data is referenced in transactions
- COSMIC FSM r3.0.1: Data stores are considered to deliver functionality only when the data is used in functional processes
Smallest feasible functional process
- IFPUG FPA r4.3: 3 fp
- NESMA FPA v2.0: 3 fp
- Mark II FPA r1.3.1: 2.5 fp
- COSMIC FSM r3.0.1: 2 cfp

Smallest feasible enhancement
- IFPUG FPA r4.3: 3 fp
- NESMA FPA v2.0: 3 fp
- Mark II FPA r1.3.1: 0.26 fp
- COSMIC FSM r3.0.1: 1 cfp