Agile Data Modeling: Evolving Toward Excellence

June 22nd, 2011 | Author: Ken Collier

TDWI’s BI This Week newsletter ran this short article on June 22, 2011, and I’ve included it below. It provides a glimpse at the important technical practices that enable you to safely evolve your data models in ways that improve the design quality rather than degrading it. My book Agile Analytics has an entire chapter devoted to evolving excellent design, but this article provides an overview. I’d love your feedback on this piece.

The strongest resistance I encounter when helping organizations adopt agile BI comes from technical architects and data modelers who say something like, “Agile makes sense as long as the data models are developed and settled.” Modelers often incorrectly believe that data models must be designed, developed, and locked down before building applications that use the data. The idea of evolving a data model incrementally can strike fear in the hearts of modelers and architects. Sometimes it is a fear of rework; data modelers prefer the goal of getting it right once. Other times it is the fear of unintended side effects or the risk of creating a spaghetti mess.

What’s agile data modeling? Agile modeling calls for a minimally sufficient design up front to establish a reference model that guides the delivery team’s incremental development activities. Aspects of the logical and physical models are completed just in time to support the BI features under development. Agile modelers avoid detailing aspects of the model that aren’t immediately needed. Combined with good data modeling discipline, this style produces the right data model for its intended purpose. The model evolves to support future requirements as those become reality. Scott Ambler’s book Agile Modeling (see recommended reading at the end of the article) covers agile modeling principles and practices in depth.

Why is agile data modeling a good idea? To paraphrase Ron Jeffries, one of Extreme Programming’s founders, the best way to implement a DW/BI system is to implement less of it. The best way to have fewer defects in your DW/BI system is to have a smaller/simpler one. The problem with comprehensive up-front modeling is that you must design for all contingent requirements, both known and speculated. This inevitably results in an overdesigned model that costs more to implement, is costlier to maintain, is more likely to contain defects, and is more difficult to understand.

Agile data models are as simple as possible while being sufficiently detailed, accurate, and consistent. They also fulfill a well-understood purpose and provide positive value (i.e., their benefit outweighs the cost of keeping them updated).

How to Safely Evolve Data Models

These concerns about evolutionary modeling resulting in unnecessary rework, unintended side effects, and design degradation are legitimate. Additionally, the prospect of making a data model change to a high-volume data warehouse in production can be scary. Agile data modeling calls for a new set of practices that enable the safe evolution of models, even those in production. I’ll summarize those practices here. Consider this list a brief introduction; each deserves a deeper study to gain proficiency.

Data Model Patterns: Data models evolve toward excellence when we take advantage of tried and proven designs. Design patterns enable us to benefit from mature solutions that have previously been developed. Effective application of patterns relies on familiarity with pattern catalogs and the ability to use patterns appropriately and sparingly.

Michael Blaha’s Patterns of Data Modeling is the most recent catalog of patterns, but David Hay first started cataloging data model patterns in 1996 with his Data Model Patterns: Conventions of Thought, and followed in 2006 with Data Model Patterns: A Metadata Map. An extension of data modeling patterns is the adaptive data model (ADM), a generalized data model designed to accommodate multiple domains. The ADM has been successfully used in data warehouse design, and I write about it in detail in the Cutter executive report, The Message Driven Warehouse.

Technical Debt Management: Data models evolve toward excellence when changes are easy to make due to low technical debt. Technical debt is common in data warehousing. It is the entropy that occurs in any system over time due to development shortcuts, suboptimal design choices, maintenance activities, and so on. Like financial debt, a little technical debt is acceptable as long as we monitor it and pay it back quickly. When technical debt accrues unabated, the cost of change becomes unacceptably high. Agile data modelers continuously identify, prioritize, and monitor technical debt in the data model, seeking to eliminate it to assist with fast response to new requirements.

Database Test Automation: Data models evolve toward excellence when we have continuous confidence that our ideas are working. Agile BI practitioners work in short iterations delivering business value every few weeks. We need confirmation that what we build in later iterations doesn’t break what we built in early ones. The only practical way to accomplish this is with an automated test suite. Automated database tests validate data structures, data content and quality, schema constraints and integrity, data derivations, and so on. We can run automated tests quickly and simply at any time to confirm that everything still works. Tests are added as data model changes are made, so the test suite grows alongside the model. I devote an entire chapter to this topic in my book, Agile Analytics.
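As a minimal sketch of what such automated checks might look like, the following Python example uses the standard-library sqlite3 module against an in-memory database. The table names (dim_customer, fact_sales), the columns, and the three checks are hypothetical and chosen only for illustration; a real suite would more likely be built with a tool such as DbFit or DbUnit against the actual warehouse platform.

```python
import sqlite3

def build_schema(conn):
    # Hypothetical star-schema fragment, for illustration only.
    conn.executescript("""
        CREATE TABLE dim_customer (
            customer_key INTEGER PRIMARY KEY,
            customer_name TEXT NOT NULL
        );
        CREATE TABLE fact_sales (
            sale_id INTEGER PRIMARY KEY,
            customer_key INTEGER NOT NULL,
            amount REAL NOT NULL CHECK (amount >= 0)
        );
    """)

def column_names(conn, table):
    # PRAGMA table_info returns one row per column; index 1 is the name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})").fetchall()]

def orphaned_fact_rows(conn):
    # Count fact rows that point at a customer missing from the dimension.
    return conn.execute("""
        SELECT COUNT(*) FROM fact_sales f
        LEFT JOIN dim_customer d ON d.customer_key = f.customer_key
        WHERE d.customer_key IS NULL
    """).fetchone()[0]

def run_tests(conn):
    results = {}
    # Structure test: the fact table has exactly the expected columns.
    results["structure"] = column_names(conn, "fact_sales") == [
        "sale_id", "customer_key", "amount"]
    # Integrity test: no fact row references a missing dimension row.
    results["integrity"] = orphaned_fact_rows(conn) == 0
    # Constraint test: a negative amount must be rejected by the schema.
    try:
        conn.execute("INSERT INTO fact_sales VALUES (99, 1, -5.0)")
        results["constraint"] = False
    except sqlite3.IntegrityError:
        results["constraint"] = True
    return results

conn = sqlite3.connect(":memory:")
build_schema(conn)
conn.execute("INSERT INTO dim_customer VALUES (1, 'Acme')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 100.0)")
print(run_tests(conn))
```

Each model change would add or adjust a check like these, so that rerunning the whole suite after every change confirms that earlier work still holds.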

Database Refactoring: Data models evolve toward excellence when we can safely make changes to the design, even if it is in production. Database refactoring is a technical discipline that enables the safe evolution of data models without breaking previously working features and components. In their book Refactoring Databases, Scott Ambler and Pramod Sadalage explain database refactoring as “… a simple change to a database schema that improves its design while retaining both its behavioral and informational semantics.” Refactoring combines automated regression testing with a change transition period to ensure that changes haven’t broken anything. The change transition period is a window of time during which the model revision lives alongside the former version to ensure that nothing breaks. Together, refactoring and test automation are the central disciplines for effectively evolving data models.
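To make the transition period concrete, here is a small sketch of a column rename, again using Python's sqlite3 module. The customer table, the old column cust_nm, the new column customer_name, and the function names are all hypothetical. The sketch assumes legacy code keeps writing only the old column during the transition, so the triggers synchronize in one direction only; a production refactoring would need to consider writers of the new column as well.

```python
import sqlite3

def start_rename(conn):
    """Begin the transition: add the new column and keep it in sync."""
    conn.executescript("""
        ALTER TABLE customer ADD COLUMN customer_name TEXT;
        UPDATE customer SET customer_name = cust_nm;
        CREATE TRIGGER sync_name_insert AFTER INSERT ON customer
        BEGIN
            UPDATE customer SET customer_name = NEW.cust_nm WHERE id = NEW.id;
        END;
        CREATE TRIGGER sync_name_update AFTER UPDATE OF cust_nm ON customer
        BEGIN
            UPDATE customer SET customer_name = NEW.cust_nm WHERE id = NEW.id;
        END;
    """)

def finish_rename(conn):
    """End the transition: rebuild the table without the old column.

    Dropping the old table also drops its synchronization triggers.
    """
    conn.executescript("""
        CREATE TABLE customer_new (id INTEGER PRIMARY KEY, customer_name TEXT);
        INSERT INTO customer_new SELECT id, customer_name FROM customer;
        DROP TABLE customer;
        ALTER TABLE customer_new RENAME TO customer;
    """)

def columns(conn, table):
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})").fetchall()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, cust_nm TEXT)")
conn.execute("INSERT INTO customer (cust_nm) VALUES ('Acme')")

start_rename(conn)
# A legacy writer still uses the old column during the transition period.
conn.execute("INSERT INTO customer (cust_nm) VALUES ('Globex')")
print(conn.execute("SELECT customer_name FROM customer ORDER BY id").fetchall())

finish_rename(conn)
print(columns(conn, "customer"))
```

Between start_rename and finish_rename, both columns return the same values, which gives dependent code time to move to the new name while the regression tests keep running.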

Recognizing Data Model Smells: Data models evolve toward excellence when we recognize where they need improvement. Experienced data modelers develop a nose for elegant designs … and stinky ones. Discovering smells in the model is an essential precursor to improving it. Smells may include smart keys, multipurpose columns and tables, data redundancy, very large tables, and so on. By learning to pay attention to these smells, you can focus your attention on possible problem areas and consider them as candidates for technical debt management.
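Some smells can be hinted at mechanically. As one illustrative heuristic (not a technique taken from the books cited here), the sketch below flags columns whose stored values mix several data types, which can indicate a multipurpose column. It relies on SQLite's typeof() function and uses a hypothetical account table; other database platforms enforce column types strictly, so a different heuristic would be needed there.

```python
import sqlite3

def mixed_type_columns(conn, table):
    """Return the columns whose non-null values use more than one storage type."""
    suspects = []
    column_rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    for row in column_rows:
        col = row[1]  # index 1 of table_info is the column name
        types = {t for (t,) in conn.execute(
            f"SELECT DISTINCT typeof({col}) FROM {table} WHERE {col} IS NOT NULL")}
        if len(types) > 1:
            suspects.append(col)
    return suspects

conn = sqlite3.connect(":memory:")
# attr_value is declared without a type, so SQLite stores values as given.
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT, attr_value)")
conn.execute("INSERT INTO account VALUES (1, 'A', 42)")
conn.execute("INSERT INTO account VALUES (2, 'B', '2011-06-22')")
print(mixed_type_columns(conn, "account"))
```

A flagged column is only a candidate for review; deciding whether it is really a problem still takes a modeler's judgment.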

Change Deployment: Data models evolve toward excellence when we can quickly deploy changes at any time without fear. Agile BI practitioners develop data model changes using techniques that safeguard production deployment. All data model changes are scripted and those scripts are kept under version control. Database schemas are versioned and scripts are developed to roll forward to the next version, and roll back to the previous version in case things don’t work as expected. Data migration scripts ensure no loss or corruption of production data. Everything is automated and tested carefully in preproduction. Automated deployment, automated testing, and database refactoring disciplines support frequent, fast and fearless deployments.
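The roll-forward and roll-back idea can be sketched as an ordered list of paired scripts plus a stored schema version. The example below is a simplified illustration using Python's sqlite3 module; the MIGRATIONS list, the table names, and the migrate function are hypothetical, and SQLite's user_version pragma stands in for whatever version table a real warehouse would keep.

```python
import sqlite3

# Each entry: (version, roll-forward SQL, roll-back SQL). Hypothetical schema.
MIGRATIONS = [
    (1, "CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT)",
        "DROP TABLE dim_product"),
    (2, "CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, "
        "product_key INTEGER, amount REAL)",
        "DROP TABLE fact_sales"),
    (3, "CREATE INDEX idx_sales_product ON fact_sales(product_key)",
        "DROP INDEX idx_sales_product"),
]

def current_version(conn):
    return conn.execute("PRAGMA user_version").fetchone()[0]

def migrate(conn, target):
    """Roll the schema forward or back, one version at a time, to `target`."""
    if not 0 <= target <= len(MIGRATIONS):
        raise ValueError(f"unknown schema version {target}")
    version = current_version(conn)
    while version < target:
        _, up, _ = MIGRATIONS[version]        # script that creates version + 1
        conn.execute(up)
        version += 1
        conn.execute(f"PRAGMA user_version = {version}")
    while version > target:
        _, _, down = MIGRATIONS[version - 1]  # script that undoes this version
        conn.execute(down)
        version -= 1
        conn.execute(f"PRAGMA user_version = {version}")
    conn.commit()
    return version

conn = sqlite3.connect(":memory:")
print(migrate(conn, 3))  # roll forward to the latest version
print(migrate(conn, 2))  # roll back one step
```

These roll-back scripts simply drop objects, which is acceptable only because the example holds no data. Real roll-back scripts would be paired with data migration scripts so that nothing in production is lost.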

Take Small Steps: Data models evolve toward excellence as a series of small, easily understood changes. It’s easier to undo a sequence of little changes than one big, complicated one. A side benefit of agile BI is that short iterations force us to plan in small steps. Agile data modelers quickly learn to change only what is needed to support the BI features currently in development.

The Final Word

Combined, these techniques form a strong safety net for evolutionary data modeling. Moreover, they aren’t exclusive to agile BI; they are modern data warehousing practices that should be among the skills of every data modeler and data warehouse practitioner, agile or otherwise.

Recommended Reading

Ambler, S. W. (2002). Agile Modeling: Effective Practices for eXtreme Programming and the Unified Process. New York: John Wiley & Sons, Inc.

Ambler, S. W., & Sadalage, P. J. (2006). Refactoring Databases: Evolutionary Database Design. Boston: Addison-Wesley.

Blaha, M. (2010). Patterns of Data Modeling. Boca Raton: CRC Press.

Collier, K. (2011). Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing. Boston: Addison-Wesley Professional.

Collier, K., & O’Leary, D. (2009). The Message Driven Warehouse. Cambridge: Cutter Consortium, Inc.

Hay, D. C. (1996). Data Model Patterns: Conventions of Thought. New York: Dorset House Publishing.

Hay, D. C. (2006). Data Model Patterns: A Metadata Map. San Francisco: Morgan Kaufmann.

Longman, C. (2005, December 7). Data Warehousing Meeting – December 7, 2005. Retrieved November 16, 2008, from DAMA UK – Data Management Association: http://www.damauk.org/Building%20the%20adaptive%20data%20warehouse%20-%20Cliff%20Longman.pdf


Ken Collier:

February 20, 2013 at 1:13 pm

Thank you for your comments and questions, Jeff. I’m always excited to meet folks doing database development who are eager to get more rigorous than simple daily standup meetings and story cards. Unfortunately, the database community in general lags many years behind the software community in terms of agile adoption. As a result, the tool vendors are only just now attempting to respond with agile-enabling technologies for database development.

Having said that, I am a huge fan of XP for data warehouse and BI development, and I have worked on teams (and with teams) who are successfully using the 12 practices of XP. No special tooling is required to do this, and I’ve had reasonable success with build automation and continuous integration tools like Ant, Hudson, Jenkins, and CruiseControl. For test automation I’ve used DbFit, SQLUnit, DbUnit, and a few others with good success.

My book has chapters devoted to database test driven development, version control, project automation, and other “XP style techniques”. Unfortunately, like many average agile software teams, many agile BI teams limit themselves to the simple practices and eschew the challenging ones. As you might imagine, the results are good but not always great. I hope that helps a bit.

Lars Rönnbäck:

June 23, 2012 at 6:03 am

This was an interesting read. However, it fails to mention Anchor Modeling, which is an agile modeling technique particularly suited for evolving data environments. It was designed to solve the problems with agility in database modeling, and it has been proven in practice with many successful implementations. There is a plethora of useful information on the technique here:

http://www.anchormodeling.com

Jeff Winchell:

February 20, 2013 at 11:42 am

Are there any tools that support eXtreme Programming with databases (BI in particular, but heck, I’ll take anything that is database related)? This was a problem when I first started to redesign an existing production database, automate the build, and put in unit testing back in 2000 with coaching from Ward C (while reading discouraging words from Ron J). As I jump back into this environment a dozen years later, I haven’t seen any progress from vendors. (Just having support for a rename refactoring would be major progress in tools.)

I’m all for agile BI, but that seems to be mostly shorthand for iterative development, which is a far cry from the robust development one does with eXtreme Programming.
Jeff Winchell

Terry Liddell:

April 15, 2013 at 3:46 pm

One problem I’ve seen is that teams often don’t understand the quality of their data. They may believe the data is more complete, accurate, or consistent than it really is. They may not understand the inconsistent data rules across various OLTP systems. In such cases validating with a small set of test data fails to expose these issues. However, testing with a full set of production data may overwhelm a team. The risk is that all code works just as expected but still produces unexpected results because of unexpected data.

Any suggestions for helping teams better understand the data they actually have (versus what they think they have) so they can do more realistic testing?

April 15, 2013 at 4:37 pm

Thanks for your comment. Agile testing is always a balance between testing-for-what-you-know-today and adding-tests-for-new-discoveries. In other words, it’s folly to think that you can comprehensively analyze all of the discrepancies and inconsistencies in production data. Moreover, tomorrow’s production data may contain new anomalies that don’t exist in today’s data. Therefore, a rigorous agile testing environment includes test data that is a solid but manageable representative sample of live production data. When we discover problems in the data that current tests don’t catch, then we:
1. Add new data to the test set that represent the anomalous data
2. Add new test cases to the test suite and watch them fail on the new test data
3. Fix the ETL (or other) code to make the new tests pass
4. Analyze the root cause of the problem to determine if there may be other related problems
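The cycle above can be sketched in a few lines of Python. Everything here is hypothetical: parse_amount_v1 stands in for an existing ETL rule, the last two test rows represent anomalies discovered in production, and parse_amount_v2 is the fix that makes the new tests pass.

```python
def parse_amount_v1(raw):
    """Original ETL rule: assumes amounts are plain decimal strings."""
    return float(raw)

def parse_amount_v2(raw):
    """Fixed ETL rule: also accepts a currency symbol and thousands separators."""
    return float(raw.strip().lstrip("$").replace(",", ""))

# Representative test data as (raw input, expected value).
# The last two rows were added after anomalies were found (step 1).
TEST_ROWS = [
    ("19.99", 19.99),
    ("100", 100.0),
    ("1,250.00", 1250.0),
    ("$42.50", 42.5),
]

def failing_rows(parse):
    """Run every test row through `parse` and return the inputs that fail."""
    failures = []
    for raw, expected in TEST_ROWS:
        try:
            ok = parse(raw) == expected
        except ValueError:
            ok = False
        if not ok:
            failures.append(raw)
    return failures

# Step 2: the new test cases fail against the old rule.
print(failing_rows(parse_amount_v1))
# Step 3: the fixed rule makes every test pass.
print(failing_rows(parse_amount_v2))
```

Step 4, root cause analysis, remains a human activity; here it might mean asking which source system emits formatted amounts and whether other numeric fields are affected.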

It’s all about evolving and adapting, not about exhaustive analysis. Does that help?