Refactoring Databases

Scott W. Ambler, Pramod J. Sadalage

Mentioned 37

Brings agile/object developers and database administrators together to develop a common language for better database design and change.

Mentioned in questions and answers.

I've had a hard time trying to find good examples of how to manage database schemas and data between development, test, and production servers.

Here's our setup. Each developer has a virtual machine running our app and the MySQL database. It is their personal sandbox to do whatever they want. Currently, developers will make a change to the SQL schema and do a dump of the database to a text file that they commit into SVN.

We're wanting to deploy a continuous integration development server that will always be running the latest committed code. If we do that now, it will reload the database from SVN for each build.

We have a test (virtual) server that runs "release candidates." Deploying to the test server is currently a very manual process, and usually involves me loading the latest SQL from SVN and tweaking it. Also, the data on the test server is inconsistent. You end up with whatever test data the last developer to commit had on his sandbox server.

Where everything breaks down is the deployment to production. Since we can't overwrite the live data with test data, this involves manually re-creating all the schema changes. If there were a large number of schema changes or conversion scripts to manipulate the data, this can get really hairy.

If the problem were just the schema, it'd be an easier problem, but there is "base" data in the database that is updated during development as well, such as metadata in security and permissions tables.

This is the biggest barrier I see in moving toward continuous integration and one-step-builds. How do you solve it?

A follow-up question: how do you track database versions so you know which scripts to run to upgrade a given database instance? Is a version table like Lance mentions below the standard procedure?

Thanks for the reference to Tarantino. I'm not in a .NET environment, but I found their DataBaseChangeMangement wiki page to be very helpful. Especially this Powerpoint Presentation (.ppt)

I'm going to write a Python script that checks the names of *.sql scripts in a given directory against a table in the database and runs the ones that aren't there, in order, based on an integer that forms the first part of the filename. If it is a pretty simple solution, as I suspect it will be, then I'll post it here.

I've got a working script for this. It handles initializing the DB if it doesn't exist and running upgrade scripts as necessary. There are also switches for wiping an existing database and importing test data from a file. It's about 200 lines, so I won't post it (though I might put it on pastebin if there's interest).
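The original script was not posted, so the following is only a minimal sketch of the idea described above, not the author's code. It assumes SQLite (via Python's standard sqlite3 module) instead of MySQL so that it runs on its own, and the table name applied_scripts and the function names are hypothetical.

```python
import re
import sqlite3
from pathlib import Path


def pending_scripts(conn, directory):
    """Return unapplied *.sql files, ordered by their integer filename prefix."""
    # Hypothetical tracking table: one row per script that has already run.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_scripts (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM applied_scripts")}
    numbered = []
    for path in Path(directory).glob("*.sql"):
        match = re.match(r"(\d+)", path.name)
        if match and path.name not in applied:
            numbered.append((int(match.group(1)), path))
    # Sort numerically so that 10_... runs after 2_..., not before it.
    return [path for _, path in sorted(numbered)]


def upgrade(conn, directory):
    """Run every pending script in order and record it as applied."""
    ran = []
    for path in pending_scripts(conn, directory):
        conn.executescript(path.read_text())
        conn.execute("INSERT INTO applied_scripts (name) VALUES (?)", (path.name,))
        conn.commit()
        ran.append(path.name)
    return ran
```

Calling upgrade(conn, "path/to/scripts") a second time returns an empty list, because every script name is already recorded in the tracking table.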

The book Refactoring Databases: Evolutionary Database Design might give you some ideas on how to manage the database. A short version is readable also at

In one PHP+MySQL project I've had the database revision number stored in the database, and when the program connects to the database, it will first check the revision. If the program requires a different revision, it will open a page for upgrading the database. Each upgrade is specified in PHP code, which will change the database schema and migrate all existing data.

This question is more or less programming language agnostic. However, as I'm mostly into Java these days, that's where I'll draw my examples from. I'm also thinking about the OOP case, so if you want to test a method you need an instance of that method's class.

A core rule for unit tests is that they should be autonomous, and that can be achieved by isolating a class from its dependencies. There are several ways to do it and it depends on if you inject your dependencies using IoC (in the Java world we have Spring, EJB3 and other frameworks/platforms which provide injection capabilities) and/or if you mock objects (for Java you have JMock and EasyMock) to separate a class being tested from its dependencies.

If we need to test groups of methods in different classes* and see that they are well integrated, we write integration tests. And here is my question!

  • At least in web applications, state is often persisted to a database. We could use the same tools as for unit tests to achieve independence from the database. But in my humble opinion I think that there are cases when not using a database for integration tests is mocking too much (but feel free to disagree; not using a database at all, ever, is also a valid answer as it makes the question irrelevant).
  • When you use a database for integration tests, how do you fill that database with data? I can see two approaches:
    • Store the database contents for the integration test and load them before starting the test. Whether they're stored as an SQL dump, a database file, XML or something else would be interesting to know.
    • Create the necessary database structures by API calls. These calls are probably split up into several methods in your test code and each of these methods may fail. It could be seen as your integration test having dependencies on other tests.

How are you making certain that database data needed for tests is there when you need it? And why did you choose the method you chose?

Please provide an answer with a motivation, as it's in the motivation the interesting part lies. Remember that just saying "It's best practice!" isn't a real motivation, it's just re-iterating something you've read or heard from someone. For that case please explain why it's best practice.

*I'm including one method calling other methods in (the same or other) instances of the same class in my definition of unit test, even though it might technically not be correct. Feel free to correct me, but let's keep it as a side issue.

I prefer creating the test data using API calls.

At the beginning of the test, you create an empty database (in-memory or the same kind that is used in production), run the install script to initialize it, and then create whatever test data the test needs. Creation of the test data may be organized, for example, with the Object Mother pattern, so that the same data can be reused in many tests, possibly with minor variations.

You want to have the database in a known state before every test, in order to have reproducible tests without side effects. So when a test ends, you should drop the test database or roll back the transaction, so that the next test can recreate the test data the same way every time, regardless of whether the previous tests passed or failed.
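As a rough illustration of this approach, here is a small sketch using Python's unittest and an in-memory SQLite database. The customer table, the add_customer function (standing in for the application's real API) and the CustomerMother class are all hypothetical names made up for the example.

```python
import sqlite3
import unittest

# Stand-in for the real install script that initializes an empty database.
INSTALL_SCRIPT = """
CREATE TABLE customer (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    active INTEGER NOT NULL
);
"""


def add_customer(conn, name, active=True):
    """Stand-in for the application's own API call that creates a customer."""
    cur = conn.execute(
        "INSERT INTO customer (name, active) VALUES (?, ?)", (name, int(active))
    )
    return cur.lastrowid


class CustomerMother:
    """Object Mother: named factory methods for commonly needed test data."""

    @staticmethod
    def active_customer(conn, name="Alice"):
        return add_customer(conn, name, active=True)

    @staticmethod
    def inactive_customer(conn, name="Bob"):
        return add_customer(conn, name, active=False)


class CustomerQueryTest(unittest.TestCase):
    def setUp(self):
        # Fresh empty database for every test, initialized by the install script.
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript(INSTALL_SCRIPT)

    def tearDown(self):
        # Closing the in-memory database discards whatever the test created.
        self.conn.close()

    def test_counts_only_active_customers(self):
        CustomerMother.active_customer(self.conn)
        CustomerMother.inactive_customer(self.conn)
        (count,) = self.conn.execute(
            "SELECT COUNT(*) FROM customer WHERE active = 1"
        ).fetchone()
        self.assertEqual(count, 1)
```

Because the test data is created through add_customer rather than loaded from a dump, a schema change that alters the API surfaces in one place in the test code.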

The reason why I would avoid importing database dumps (or similar), is that it will couple the test data with the database schema. When the database schema changes, you would also need to change or recreate the test data, which may require manual work.

If the test data is specified in code, you will have the power of your IDE's refactoring tools at your hand. When you make a change which affects the database schema, it will probably also affect the API calls, so you will anyways need to refactor the code using the API. With nearly the same effort you can also refactor the creation of the test data - especially if the refactoring can be automated (renames, introducing parameters etc.). But if the tests rely on a database dump, you would need to manually refactor the database dump in addition to refactoring the code which uses the API.

Another thing related to integration testing the database, is testing that upgrading from a previous database schema works right. For that you might want to read the book Refactoring Databases: Evolutionary Database Design or this article:

I am looking for something to integrate to my CI workflow.
I've heard of dbdeploy but I'm looking for something else. The reason I don't like dbdeploy is I don't want to install java on my server.

I would prefer of course that the solution doesn't involve stringing some shell scripts together.

An important part of Refactoring Databases is the migrations part. A .NET migrations solution that does not require EF or Java is Rob Reynolds' Roundhouse.

Might be worth checking out.

It's not a tool, but Ambler and Sadalage's book, Refactoring Databases: Evolutionary Database Design is quite good.

What is the best way for developing a database based application? We can have two approaches.

  1. One common database for all the developers.
  2. Separate database for all the developers.

What are the pros and cons of each? And which one is better way?

Edit: More than one developer is supposed to update the database and we already have SqlExpress 2005 on each developer machine.

Edit: Most of us are suggesting a common database. However, suppose one of the developers has modified the code and the database schema. He has not committed the code changes, but the schema changes have gone to the common database. Won't that possibly break the other developers' code?

You might also want to look at Refactoring Databases. Aside from discussing database changes, it includes discussions on going from development to production in a way that reduces risk.

Being stuck with a legacy database schema that no longer reflects your data model is every developer's nightmare. Yet with all the talk of refactoring code for maintainability I have not heard much of refactoring outdated database schemas.

What are some tips on how to transition to a better schema without breaking all the code that relies on the old one? I will propose a specific problem I am having to illustrate my point but feel free to give advice on other techniques that have proven helpful - those will likely come in handy as well.

My example:

My company receives and ships products. Now a product receipt and a product shipment have some very different data associated with them so the original database designers created a separate table for receipts and for shipments.

In my one year working with this system I have come to the realization that the current schema doesn't make a lick of sense. After all, both a receipt and a shipment are basically a transaction, they each involve changing the amount of a product, at heart only the +/- sign is different. Indeed, we frequently need to find the total amount that the product has changed over a period of time, a problem for which this design is downright intractable.

Obviously the appropriate design would be to have a single Transactions table with the Id being a foreign key of either a ReceiptInfo or a ShipmentInfo table. Unfortunately, the wrong schema has already been in production for some years and has hundreds of stored procedures, and thousands of lines of code written off of it. How then can I transition the schema to work correctly?

This book (Refactoring Databases) has been a God-send to me when dealing with legacy database schemas, including when I had to deal with almost the exact same issue for our inventory database.

Also, having a system in place to track changes to the database schema (like a series of alter scripts that is stored in the source control repository) helps immensely in figuring out code-to-database dependencies.

I was just wondering if a storage engine type existed that allowed you to do version control on row level contents. For instance, if I have a simple table with ID, name, value, and ID is the PK, I could see that row 354 started as (354, "zak", "test")v1 then was updated to be (354, "zak", "this is version 2 of the value")v2 , and could see a change history on the row with something like select history (value) where ID = 354.

It's kind of an esoteric thing, but it would beat having to keep writing these separate history tables and functions every time a change is made...
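For context, the hand-written history table approach mentioned above might look roughly like the following sketch. It assumes SQLite triggers and uses hypothetical table names (item, item_history); it is not a storage engine feature, just the manual workaround the question hopes to avoid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, value TEXT);

-- Hypothetical history table: one row per version of each item row.
CREATE TABLE item_history (id INTEGER, version INTEGER, name TEXT, value TEXT);

-- The first version is recorded when the row is inserted.
CREATE TRIGGER item_insert AFTER INSERT ON item BEGIN
    INSERT INTO item_history VALUES (NEW.id, 1, NEW.name, NEW.value);
END;

-- Each update appends the new state with the next version number.
CREATE TRIGGER item_update AFTER UPDATE ON item BEGIN
    INSERT INTO item_history
    SELECT NEW.id, COALESCE(MAX(version), 0) + 1, NEW.name, NEW.value
    FROM item_history WHERE id = NEW.id;
END;
""")

conn.execute("INSERT INTO item VALUES (354, 'zak', 'test')")
conn.execute(
    "UPDATE item SET value = 'this is version 2 of the value' WHERE id = 354"
)

# The change history of row 354, oldest version first.
history = conn.execute(
    "SELECT version, value FROM item_history WHERE id = 354 ORDER BY version"
).fetchall()
```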

The book Refactoring Databases has some insights on the matter.

But it also points out there is no real solution currently, other than carefully making changes and managing them manually.

Are there any formal techniques for refactoring SQL similar to this list here that is for code?

I am currently working on a massive query for a particular report and I'm sure there's plenty of scope for refactoring here which I'm just stumbling through myself bit by bit.

There is a book on the subject: "Refactoring Databases". I haven't read it, but it got 4.5/5 stars on Amazon and is co-authored by Scott Ambler, which are both good signs.

As a database architect, developer, and consultant, there are many questions that I can answer. One that I was asked recently, though, and still can't answer well is...

"What is one of, or some of, the best methods or techniques to keep database changes documented, organized, and yet able to roll out effectively either in a single-developer or multi-developer environment."

This may involve stored procedures and other object scripts, but especially schemas - from documentation, to the new physical update scripts, to rollout, and then full-circle. There are applications to make this happen, but require schema hooks and overhead. I would rather like to know about techniques used without a lot of extra third-party involvement.

The easiest way I have seen this done without the aid of an external tool is to create a "schema patch", if you will. The schema patch is just a simple T-SQL script. The schema patch is given a version number within the script, and this number is stored in a table in the database that receives the changes.

Any new changes to the database involve creating a new schema patch that you can then run in sequence which would then detect what version the database is currently on and run all schema patches in between. Afterwards the schema version table is updated with whatever date/time the patch was executed to store for the next run.

A good book that goes into details like this is called Refactoring Databases.

If you wish to use an external tool you can look at Ruby's Migrations project or a similar tool in C# called Migrator.NET. These tools work by creating C# or Ruby classes with a "Forward" and a "Backward" migration. These tools are more feature rich because they know how to go forward as well as backwards in the schema patches. As you stated, however, you are not interested in an external tool, but I thought I would add that for other readers anyway.

Let's say we have a continuous integration server. When I check in, the post-hook pulls the latest code, runs the tests, packages everything. What is the best way to also automate the database changes?

Ideally, I'd build an installer that could either build a database from scratch or update an existing one using some automated syncing method.

I would caution against using a db backup as a development artifact. Most CI best practices suggest that you manage the schema, procedures, triggers, and views as first-class development artifacts. A side effect is that you can take this one step further and use them to build a new database whenever you want; ideally you also have some data that can be pushed into the database.

Here is a cliff notes version to get your feet wet, but there is lots out there in this space:

I like some of the ideas that Scott Ambler has here as well, the site is good but the book is surprisingly deep for such a difficult set of problems.

I recently started a job knowing that the company's practices needed help: no source control, no bug tracking, definitely no automated testing. They told me that their application code wasn’t good.

What an understatement.

Their code is an unholy mess of classic asp, non object oriented (hello recordsets!), embedded stylesheets, etc. The database is a nightmare: 500 odd tables, half of which appear to be old garbage, hundreds and hundreds of undocumented sprocs. The sprocs are full of business logic, ui logic (lots of PIVOT commands), etc.

There are four overlapping mission critical web applications that use this mess, and needless to say, most of the work was done by outside consultants who no longer return the company’s calls. (Who can blame them?)

Management is baffled at their web team’s inability to deliver on business requirements. I have had some success in explaining what went wrong: “imagine a ball of duct tape and wires seven stories high…”

Naturally, they want answers, not complaints. I considered a Big Rewrite, but that seems fraught with risk. I’m being encouraged to make a big win early, but I’m not sure where to start. I would appreciate any suggestions that y’all might be able to offer.

I’m thinking of approaching it like this:

  • EDIT: Get TFS 2010
  • Kill all "classic" asp—if it can’t be compiled, we can’t even be sure if we have all the references right.
  • Find who uses which sprocs/tables via text search, delete unused stuff
  • Rewrite functionality in pieces- e.g. replace the asp shopping cart with something OO.

Or alternately, we just keep slapping duct tape on the existing mess while building the One True System on the side.

I would appreciate any suggestions/war stories/offers of alternate employment, etc.

Tackling just the database part of this, here is a book that will help you out:

You have to set things up properly to refactor a database and this book is invaluable.

There are also tools that can help you find the most poorly performing queries, start with those. If you can get a big win in performance on something that is annoying everyone, then it will give you more leverage to keep on fixing the other problems.

You might also look at a book on performance tuning specific to the database backend you have. There are a lot of known performance issues that relate to both database design and query design; knowing how to fix them can help you out immeasurably as you refactor this mess.

I know it's tempting to just throw it out and start over new, but you will be introducing major new bugs that way, taking a huge amount of time while the users see no improvement and possibly missing some very important business rules that must be enforced. While incrementally changing and refactoring appears to be harder in a case this bad, it really is the better choice.

You might talk to the users and find out what they perceive as the worst problems the system has. That's one place to start; after all, making the users happier is part of what you are there for.

Make sure that you document the performance improvements and other changes for your resume. Think how much better it will look to your next potential employer when you can give actual performance improvement figures. Actual accomplishments attached to figures that show how much you accomplished are rare on resumes; this job can really make you stand out in the future as someone who gets things done.

For those agile practitioners out there...

How do you manage changes to a database schema during a project? My assumption is that in an agile project the schema of any database involved will change and be refactored just as happens with the codebase.

Is this assumption correct? If so, do you have any particular tools or processes that you use to help keep things running smoothly?

The database structure is most likely to be a dependency of many parts of your code, and schema changes will have cascading effects. Kind of like making changes to the interface in a class which many classes extend. So be cautious about schema changes.

Agile methodology is no different than other methodologies in that it's to your benefit to design the database up front as much as possible, and you should seek to change it less frequently than the code. Not to say you can never change it, but it is costly to do so.

Migrations are a simple but effective tool for tracking schema changes, as others have noted. The concept is a set of scripts of CREATE and ALTER statements to upgrade from one revision of the schema to the next, accompanied by scripts of ALTER and DROP statements to downgrade the same changes. Ruby on Rails uses a database abstraction layer on top of this to make it easier to switch brands of database, but if you only need to support one brand you could simply use SQL files.
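To make the upgrade/downgrade pairing concrete, here is a minimal sketch that assumes SQLite and keeps the scripts inline as strings rather than in SQL files. The product table, the schema_revision table and the function names are hypothetical, and this is not how Rails or any particular tool implements migrations.

```python
import sqlite3

# Each revision pairs an upgrade script with the script that undoes it.
MIGRATIONS = [
    {"up": "CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);",
     "down": "DROP TABLE product;"},
    {"up": "CREATE INDEX idx_product_name ON product (name);",
     "down": "DROP INDEX idx_product_name;"},
]


def current_revision(conn):
    """Read the stored revision, creating the bookkeeping table if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_revision (revision INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT revision FROM schema_revision").fetchone()
    if row is None:
        conn.execute("INSERT INTO schema_revision VALUES (0)")
        conn.commit()
        return 0
    return row[0]


def migrate(conn, target):
    """Move the schema up or down, one revision at a time, until it is at target."""
    if not 0 <= target <= len(MIGRATIONS):
        raise ValueError(f"unknown revision {target}")
    revision = current_revision(conn)
    while revision != target:
        if revision < target:
            conn.executescript(MIGRATIONS[revision]["up"])
            revision += 1
        else:
            conn.executescript(MIGRATIONS[revision - 1]["down"])
            revision -= 1
        conn.execute("UPDATE schema_revision SET revision = ?", (revision,))
        conn.commit()
    return revision
```

With this layout, migrate(conn, len(MIGRATIONS)) brings any database to the latest revision, and migrate(conn, 0) walks the same changes back in reverse order.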

There's a highly regarded book on this subject (though I haven't gotten around to buying or reading it yet) called "Refactoring Databases: Evolutionary Database Design" by Scott Ambler.

I'm curious about how others have approached the problem of maintaining and synchronizing database changes across many (10+) developers without a DBA? What I mean, basically, is that if someone wants to make a change to the database, what are some strategies to doing that? (i.e. I've created a 'Car' model and now I want to apply the appropriate DDL to the database, etc..)

We're primarily a Python shop and our ORM is SQLAlchemy. Previously, we had written our models in such a way to create the models using our ORM, but we recently ditched this because:

  • We couldn't track changes using the ORM
  • The state of the ORM wasn't in sync with the database (e.g. lots of differences primarily related to indexes and unique constraints)
  • There was no way to audit database changes unless the developer documented the database change via email to the team.

Our solution to this problem was to basically have a "gatekeeper" individual who checks every change into the database and applies all accepted database changes to an accepted_db_changes.sql file, whereby the developers who need to make any database changes put their requests into a proposed_db_changes.sql file. We check this file in, and, when it's updated, we all apply the change to our personal database on our development machine. We don't create indexes or constraints on the models, they are applied explicitly on the database.

I would like to know what are some strategies to maintain database schemas and if ours seems reasonable.


You might find the book Refactoring Databases helpful, as it contains general strategies for managing databases, not just how to refactor them.

Its system expects that every developer will have their own copy of the database as well as some general test database used before deploying to production. Your situation is one of the easier situations the book describes, as you don't have a number of separate applications using the database (although you do need someone who knows how to describe database migrations). The biggest thing is to be able to build the database from information in source control and have changes described by small migrations (see @WoLpH's answer) rather than just making the change in the database. Also, you will find things easier if you have at least ORM <-> database tests to check that they are still in sync.

Because of the presence of type providers for accessing SQL data in F#, there's not much focus on the use of ORMs, micro or otherwise. I can definitely see the logic behind this.

It also seems that many examples of using F# to play with relational data plug into existing large databases that appear to have been created elsewhere.

It feels like there's a bit of a gap here: is there a nice way to manage schema creation and migration directly from F#? The example linked to above suggests running your manual schema creation script as the first step. Is this the only option?

I've recently started a little project that is F# from the outset, and I'm looking at storing some data in a relational database (sqlite for now). I don't have an existing schema to explore, I'm designing from scratch. Is there a friendlier or more idiomatic way to manage my schema (creation and then migration) in F#?

I've been trying to find a solution for this and I'm more and more inclined to fall back to the basics when it comes to SQL databases. One of the major parts described in the book Refactoring Databases is migrations. A tool that supports working with migrations solves a lot of problems and would probably be very suitable, as long as you can accept writing all the up and down scripts in plain ol' SQL.

One tool I've found for the .NET stack is Rob Reynolds' Roundhouse. I have yet to try it out but it seems promising, albeit somewhat dated.

Currently we're using hand-rolled SQL in Data-Access objects and a lot of stored procedures and triggers which amount to around 20k lines of code. We're finding that simple changes are causing a couple of days' work to fix, and it's causing deadlines to slip.

Changes include modifications to tables to cope with additional data, general refactoring of the schema based on QA/user reports, etc. It's a very active system that's being built to replace something old and slow.

We looked at the PHP ORM solutions available to try and limit the effects of these changes, but they were just too slow to cope with our schema; "simple" sql results were taking orders of magnitude longer to return than our custom queries and caused page views of ~.5s to take over 20s.

What best-practices/strategies could I look into to cope with schema evolution with relational databases, in a general context?

Edit: forgot to mention the triggers; we have a lot of data which relies on cascading changes, e.g. a price change here for this user updates a price there for that user, etc.

You might want to check out this book on Refactoring Databases: Evolutionary Database Design.

About a year ago, I picked up Scott Ambler's Refactoring Databases: Evolutionary Database Design. I was won over to the idea that just as you can develop your code with TDD, you probably should be covering your databases with unit tests (at least) or even writing database tests before you make a change to the schema so you do database work in a TDD style as well.

I really like the idea, and I have been doing this (OK, sometimes I do it) by hand for a while now, just writing regular unit tests that happen to connect to the database and check its structure against a given schema file. But I haven't found any good database change management tool-kits that might help automate this process. Does anyone know any?
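A hand-written structure check of the kind described here could look like the sketch below. It assumes SQLite and a hypothetical expected schema held in a Python dict (EXPECTED_COLUMNS) rather than a schema file; the function names are also made up for illustration.

```python
import sqlite3

# Hypothetical expected schema: table name -> {column name: declared type}.
EXPECTED_COLUMNS = {
    "customer": {"id": "INTEGER", "name": "TEXT", "email": "TEXT"},
}


def actual_columns(conn, table):
    """Map column name to declared type, as reported by SQLite."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {row[1]: row[2].upper() for row in rows}


def schema_mismatches(conn, expected):
    """Return a list of human-readable differences; empty means in sync."""
    problems = []
    for table, columns in expected.items():
        found = actual_columns(conn, table)
        for name, col_type in columns.items():
            if name not in found:
                problems.append(f"{table}.{name} is missing")
            elif found[name] != col_type:
                problems.append(
                    f"{table}.{name} is {found[name]}, expected {col_type}"
                )
        for name in sorted(found.keys() - columns.keys()):
            problems.append(f"{table}.{name} is not in the expected schema")
    return problems
```

In a TDD style, the expected schema is updated first so that schema_mismatches reports the missing column, and the check passes only once the schema change has been applied.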

I only know of two unit testing frameworks:

As for change management, these are some recommended tools:

Although I am not sure if this is really what you're looking for.

I am about to embark on the development of a web application project.

I'd like to get something up early and often for early adopters to play with and give feedback on.

But I envisage the data model changing as the project progresses and my understanding of the system improves.

How should I manage the dilemma of updating the data model appropriately and preventing data loss for early adopters? Should I simply put up a big warning saying "user beware", or should I put in the effort to create migration scripts?

Scott Ambler has written some nice stuff about agile databases. There's a book and a website.

I've shown up at a new job and discovered a database which is in dire need of some help. There are many, many things wrong with it, including

  • No foreign keys...anywhere. They're faked by using ints and managing the relationship in code.
  • Practically every field can be NULL, which isn't really true
  • Naming conventions for tables and columns are practically non-existent
  • Varchars which are storing concatenated strings of relational information

Folks can argue, "It works", which it does. But moving forward, it's a total pain to manage all of this with code and opens us up to bugs IMO. Basically, the DB is being used as a flat file since it's not doing a whole lot of work.

I want to fix this. The issues I see now are:

  1. We have a lot of data (migration, possibly tricky)
  2. All of the DB logic is in code (with migration comes big code changes)

I'm also tempted to do something "radical" like moving to a schema-free DB.

What are some good strategies when faced with an existing DB built upon a poorly designed schema?

Read Scott Ambler's book on Refactoring Databases. It covers a good many techniques for how to go about improving a database - including the transitional measures needed to allow both old and new programs to work with the changing design.
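As one small example of a transitional measure, the sketch below renames a table while leaving a view under the old name so that existing read queries keep working during the transition. It assumes SQLite, the table names cust and customer are hypothetical, and it is only an illustration of the general idea rather than a procedure quoted from the book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Legacy table with a poorly chosen name.
CREATE TABLE cust (id INTEGER PRIMARY KEY, nm TEXT);
INSERT INTO cust VALUES (1, 'Ada');

-- Refactoring: give the table a clearer name.
ALTER TABLE cust RENAME TO customer;

-- Transitional measure: a view under the old name keeps old readers working.
-- SQLite views are read-only, so old code that writes would need more than this.
CREATE VIEW cust AS SELECT id, nm FROM customer;
""")

# Old programs still query the old name; new programs use the new one.
old_rows = conn.execute("SELECT id, nm FROM cust").fetchall()
new_rows = conn.execute("SELECT id, nm FROM customer").fetchall()
```

Once every program has moved to the new name, the view can be dropped to end the transition period.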

Suppose we have a system (in production) written in an obsolete technology and difficult to adapt to changing business needs. The decision has been made to rewrite it in a newer technology. Should we start fresh with a new database schema that will accurately reflect the data models of the new system but accept the risks and costs of developing a database conversion (necessarily 2-way due to a phased implementation plan)? Or should we keep the same schema, even though it will complicate development since it does not reflect the new model, but gain the advantage of eliminating the conversion task?

I would only consider redoing the schema if you have a database professional to help you with the design. Application programmers in general do a poor job of designing performant databases that have all the checks and balances a business-critical system needs.

Changing the schema and successfully moving existing data is a much harder job than you may think. This will be a large effort taking months of full-time work and it's risky. The larger and more complex the existing database is the harder the redesign.

One thing I would consider is moving the old data to a data warehouse and then designing the new system for data going forward. It would then send data to the data warehouse periodically too for people to be able to query historical and current records. That way your new system can have constraints that maybe the old data didn't have and you won't have to try to figure out what values to put in the required field for old data which did not have a value.

If you are considering this, you may also want to read up on refactoring databases. Here is an excellent book on the subject:

Also, do not consider doing this without reading in depth about performance tuning the database you plan to use as your backend. There is no point in redesigning if you don't develop something that will perform well and scale up. Forget that trash about premature optimization - databases need to be designed from the start with performance as well as data integrity and security in mind. There are a lot of well-known techniques to create better performance that should be considered in any redesign.

Are there any guidelines on writing database tests so that you can refactor database "without fear" while doing evolutionary database design?

What aspects of a database should be put to test while developing it? Any example would be great.

Scott Ambler's "Refactoring Databases" is a very good book, analogous to Fowler's "Refactoring". It talks about testing.

I am looking to learn more about databases (MySQL for starters), particularly about choosing the right approach (e.g. stored procedures vs views and triggers for data aggregation) in a situation.

Maybe this example shows what I mean: Yesterday I asked a general SQL question about aggregation of data and received the answer that "it depends" – which turned out to be absolutely correct. Now I am looking for some database lectures that cover scenarios as opposed to focusing on syntax. Though I encourage everybody to suggest the obvious too, I think the standard normalization / relational stuff is easy to find – personally I am rather looking for practical examples in different kinds of applications.

People have mentioned SQL books which is a place for a beginner to start. But there is much more the dba needs to have under his or her belt.

This one is critical, you need to know how to design a database so it can be refactored:

You need to read a good book on performance tuning for the specific database backend you use. You will find there are many query techniques to avoid (and they are database specific) and many design structures to avoid. These are things you must know to effectively design a database and query against it.

Finally, and again it depends on the database which book to get, you need to get a good book on database admin if you are going to be a dba. There are many things that are not so obvious until you read how to administer.

The best book that helped me, without a doubt, was SQL: A Beginner's Guide. This had to be one of the most boring books I've read (very very plain) but I learned so much from it. Very, very useful book, I would not have learned SQL without it. Bonus is that it teaches you SQL, not MySQL, not T-SQL, not PostgreSQL, not SQL Server: just plain SQL. It does highlight nuances where necessary, but otherwise it's just SQL.

Coincidentally, the second book I read was Beginning MySQL, which was by the same author, Robert Sheldon (I didn't realize it until just now, lol). Again, a very helpful book. I really like Wrox Publishing, they do good books. I would also recommend any book by O'Reilly publishing or Apress as well (can't go wrong with either).

Once you get a handle on the SQL language itself, then you should (need to) read the MySQL 5.0 Certification Study Guide, especially if you plan on being a DBA. You'll want to get your CMDBA and this book will help. I just found out the other day at work that there aren't nearly as many CMDBAs in the country as I thought (I think I heard them say under 1000). We're a rare breed (yes I have my CMDBA), and they pay us well ;)

Lastly, after you have your CMDBA, read High Performance MySQL, the 2nd edition by Baron Schwartz, Peter Zaitsev, Vadim Tkachenko, Jeremy Zawodny, Arjen Lentz, and Derek J. Balling. Zaitsev, Tkachenko, and Schwartz work for Percona (CEO/Founder, CTO/Co-Founder, and VP of Consulting respectively). Percona pretty much has the most brilliant minds in the world when it comes to MySQL, and that's not an overstatement. We've worked with some of them before and words can't describe their brilliance.

I hope this helps to maybe give some direction. Good Luck.

I'm interested in database refactoring. I deal with several databases that don't have a large amount of data, just a few GB with at most a few hundred thousand rows. However, they have hundreds -- sometimes many hundreds -- of tables, views, sprocs and functions. In some places a divide-and-rule strategy using schemas has been implemented which has helped some problems of seeing ownership/usage of tables. However, it hasn't really helped object coupling.

We all read that integration via shared database isn't A Good Thing, but we also know that it is, at least for a while, a very productive thing as everything is in the database. We just don't apply the Single Responsibility Principle to databases like we do to objects.

Edit: I should add that I have no database performance issues. The tables are not large, the biggest has only a few hundred thousand rows. There is no real database performance issue; except when the database schema/logic/implementation is grotesquely inefficient (say requiring a cursor to do a sproc execution for each row in a result set in order to pre-process data for a report). Before you say I should change these, that is the whole point: I can't because the database is no longer in a state where the impact of changes can be assessed.

Clearly at some point you say "Enough!" and divide into multiple databases connected by messages, ETL, application tiers etc etc

The question is: how many is too many? What is the absolute upper limit of the number of sprocs/tables/functions that you can have before you go insane?

First, stop trying to think of databases in object oriented terms. Principles of object oriented programming simply do NOT apply to relational databases.

Shared databases are a very good thing from a business perspective. Multiple databases storing information that has to be transferred between them quickly becomes way more complex than your piddly many hundreds of objects. Data that is consistent between enterprise applications is priceless. Trying to reconcile if GE Corp and General Electric Corporation are really the same entity between two databases can be a nightmare.

Refactoring databases is a nice goal, but it is very complex in reality. Don't do it unless you have a major performance issue that needs to be addressed or unless you are willing to commit to a process of identifying all the code that might be affected by a change. Even then, consider if you can know all the code that might change (this is one reason why database people hate, hate, hate dynamic code!).

Often the best way to refactor is to add your change and start changing over to using your new field, sp etc. while leaving the old one in place until a set expiration date. Since you are on an annual cycle, you will need to manage those dates over a long period of time. To see if sps are being used, you can identify the ones you aren't sure of and add some code to them to insert a row into a table every time they are run. If, after your whole year cycle, they haven't been run, you can safely eliminate them. The cycle may be shorter depending on the sp.
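The usage-logging idea above could be sketched roughly like this. It is only an illustration in Python, with SQLite standing in for the real database; the `sp_usage_log` table and the `record_run` / `never_run` helpers are hypothetical names, and in a real system the insert would live inside each stored procedure you are unsure about.

```python
import sqlite3

def create_usage_log(conn):
    # Hypothetical log table: one row per execution of a tracked procedure.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sp_usage_log ("
        " sp_name TEXT NOT NULL,"
        " run_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )

def record_run(conn, sp_name):
    # Stand-in for the INSERT you would add at the top of each procedure.
    conn.execute("INSERT INTO sp_usage_log (sp_name) VALUES (?)", (sp_name,))
    conn.commit()

def never_run(conn, tracked_names):
    # After the full cycle, anything tracked but absent from the log
    # is a candidate for removal.
    seen = {row[0] for row in conn.execute("SELECT DISTINCT sp_name FROM sp_usage_log")}
    return sorted(name for name in tracked_names if name not in seen)
```

Calling `never_run` with the list of procedures you instrumented gives the ones that produced no log rows during the observation period.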

If I'm writing something that will only be run annually, I would normally put the word annual in the sp name. But that may not be true where you are, however, the function of the sp should give you an idea if it is something that should only be run periodically. I wouldn't expect usp_send email proc to only run once a year but I might expect that a usp_attendance_report might not be run often. Of course as I said, I would have named it something more like usp_annual_attendance_report and you can consider doing that sort of thing moving forward.

But be aware that any refactoring you do will have to take place on a long cycle to ensure that you don't delete something you need. If your code is in a source control system (and all database tables, sp, views, UDFs, triggers, etc should be), you can probably eliminate some things knowing that if they fail you can pretty instantly put them back. Again, I'd examine the object to determine the possible risk eliminating them would have.

Of course if you have good automated tests in place, eliminating something on dev and running the tests can help you find out if something is still being referenced.

If you are looking for an easy way to refactor, I don't know of one. Refactoring databases is a time-consuming, risky activity and one which may not show enough improvement for the powers that be to be willing to pay for it.

A good book on refactoring databases is:

Nowadays most programmers know about code refactorings.

What about refactorings of data structures, are there any good readings about it?

One paradigm that I can think of is the normalization process of a relational database.

Are there any other good examples?

xUnit Test Patterns: Refactoring Test Code is a good reference for refactoring your code to work well with unit tests. It's not exactly what you're asking for, but it's a good reference to keep on hand.

Refactoring Databases: Evolutionary Database Design seems to be a worthwhile read on the subject, judging from the fact that it won the 2007 Software Development Jolt Productivity Award in the Technical Books category. I haven't yet read it though, so I can't comment on it personally.

I am about to alter several tables in a massive system, of which I probably only understand around 10%.

I want to add three columns. One of these is just a rename of an existing column. Part of me wants to :-

  • Rename the column, but I'm worried about the impact on unknown parts of the system that use the old name.

  • Append the three columns to the table therefore making the old column redundant (over time).

Here's a good presentation on refactoring databases by Scott Ambler, the guy that literally wrote the book on refactoring databases.

I have a question, just looking for suggestions here.

So, my application is 'modernizing' a desktop application by converting it to the web, with an ICEFaces UI and server side written in Java. However, they are keeping around the same Oracle database, which at current count has about 700-900 tables and probably a billion total records in the tables. Some individual tables have 250 million rows, many have over 25 million.

Needless to say, the database is not scaling well. As a result, the performance of the application is looking to be abysmal. The architects / decision makers-that-be have all either refused or are unwilling to restructure the persistence. So, basically we are putting a fresh coat of paint on a functional desktop application that currently serves most user needs and does so with relative ease. The actual database performance is pretty slow in the desktop app now. The quick performance I referred to earlier was non-database related stuff (sorry I misspoke there). I am having trouble sleeping at night thinking of how poorly this application is going to perform and how difficult it is going to be for everyday users to do their job.

So, my question is, what options do I have to mitigate this impending disaster? Is there some type of intermediate layer I can put in between the database and the Java code to speed up performance while at the same time keeping the database structure intact? Caching is obviously an option, but I don't see that as being a cure-all. Is it possible to layer a NoSQL DB in between or something?

What you describe is something that Oracle should be capable of handling very easily if you have the right equipment and database design. It should scale well if you get someone on your team who is a specialist in performance tuning large applications.

Redoing the database from scratch would cost a fortune and would introduce new bugs and the potential for loss of critical information is huge. It almost never is a better idea to rewrite the database at this point. Usually those kinds of projects fail miserably after costing the company thousands or even millions of dollars. Your architects made the right choice. Learn to accept that what you want isn't always the best way. The data is far more important to the company than the app. There are many reasons why people have learned not to try to redesign the database from scratch.

Now there are ways to improve database performance. First thing I would consider with a database this size is partitioning the data. I would also consider archiving old data to a data warehouse and doing most reporting from that. Other things to consider would be improving your servers to higher performing models, profiling to find slowest running queries and individually fixing them, looking at indexing, updating statistics and indexes (not sure if this is what you do on Oracle, I'm a SQL Server gal but your dbas would know). There are some good books on refactoring old legacy databases. The one below is not database specific. There are also some good books on performance tuning (look for ones specific to Oracle, what works for SQL Server or MySQL is not what is best for Oracle). Personally I would get those and read them from cover to cover before designing a plan for how you are going to fix the poor performance. I would also include the DBAs in all your planning, they know things that you do not about the database and why some things are designed the way they are.

I'm currently working on a project where our product at the current version is deployed at more than one client site. More often than not we test and scrap (a.k.a. refactor :) the DB design in our dev/test environment. It's not possible to keep track of every incremental change to the DB to create the corresponding 'migration files'. And now when we are satisfied with a DB design (relatively stable) we need to update the production DBs to the current state.

Our system in production can't just be issued a DROP/CREATE SCHEMA command to update the schema (think of the jobs that would be lost, mine included :) So we land up either writing MySQL dumps or application wrappers to backup the data and re-enter it into the DB. Quite cumbersome, especially when multiple 'clients' are NOT at the same DB version. It's quite cumbersome to write one for every version that the clients have.

So what are some of the best practices that you employ to refactor databases in production especially when you CAN'T just delete that data but need to re-enter it and have the system seamlessly continue after some 'downtime'?

If it's of any use the backend is Java/Restlet and MySQL

You can use the database refactoring patterns described in the book Refactoring Databases and see the website for database refactoring. Using the migrations approach you don't need to drop and create the database. Tools such as [dbdeploy], [dbmaintain], [ibatis migrator], [liquibase] etc. can help you here.

Let us assume that I want to normalize a table into 2 tables, e.g. a table Person with columns PhoneNumber1...PhoneNumber5 into tables Person and PhoneNumbers, with a foreign key constraint from Person into PhoneNumber and deletion of the PhoneNumber1..PhoneNumber5 columns from the Person table. I want to preserve data, i.e. PhoneNumber1..PhoneNumber5 should end up as one record each for non-null data in the PhoneNumber table. Can any schema comparison tool help achieve this? I checked SSDT and it does not support it even though this "feature" is on their "roadmap".

This is just one simple scenario that I face in my day to day life and we have been using hand-coded SQL to manage this. In a more complicated scenario, one set of tables could change into another set of tables. Is there an easy way in any tool to manage these kinds of schema changes and at the same time manage the data mappings successfully?

The approach suggested by Peter sounds sensible, e.g.

  1. Create PhoneNumbers table
  2. Copy data into PhoneNumbers table from People table
  3. Modify People table to remove the PhoneNumber1-5 columns

Providing that the time required to copy the data is acceptable within your downtime that could be run as one 'migration'.
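As a rough illustration, the three steps could be written as a single migration like the one below. This is a sketch only, using Python with SQLite as a stand-in database; the `Id`, `Name`, `PersonId` and `Number` column names are assumptions, and step 3 rebuilds the table instead of dropping columns because older SQLite versions lack a drop-column statement.

```python
import sqlite3

def migrate_phone_numbers(conn):
    """Move PhoneNumber1..5 out of People into a PhoneNumbers child table."""
    cur = conn.cursor()
    # Step 1: create the new table. SQLite only enforces the foreign key
    # when PRAGMA foreign_keys is switched on, which this sketch does not do.
    cur.execute(
        "CREATE TABLE PhoneNumbers ("
        " Id INTEGER PRIMARY KEY,"
        " PersonId INTEGER NOT NULL REFERENCES People(Id),"
        " Number TEXT NOT NULL)"
    )
    # Step 2: copy each non-null old column value into its own row.
    for i in range(1, 6):
        cur.execute(
            f"INSERT INTO PhoneNumbers (PersonId, Number) "
            f"SELECT Id, PhoneNumber{i} FROM People "
            f"WHERE PhoneNumber{i} IS NOT NULL"
        )
    # Step 3: remove the old columns by rebuilding People without them.
    cur.execute("CREATE TABLE People_new (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
    cur.execute("INSERT INTO People_new (Id, Name) SELECT Id, Name FROM People")
    cur.execute("DROP TABLE People")
    cur.execute("ALTER TABLE People_new RENAME TO People")
    conn.commit()
```

On SQL Server or Oracle, step 3 would normally be a plain ALTER TABLE that drops the five columns; the rebuild here is specific to the stand-in database.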

If you have a vast amount of data to copy and need to avoid downtime then you could add an abstraction layer to read from both locations while you copy the data over little by little. Something like:

  1. Create PhoneNumbers table
  2. Deploy code to read from either People or PhoneNumbers, and write to new PhoneNumbers table only
  3. Copy data from People to PhoneNumbers in batches over a period of time
  4. Remove code that reads from old People location
  5. Modify People table to remove the PhoneNumber1-5 columns

You'd deploy changes 1 & 2, run step 3 for however long it takes or in quiet periods, then deploy changes 4 & 5.
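Steps 2 and 3 of that plan might look something like the following. It is a simplified sketch in Python with SQLite as a stand-in; `read_numbers` and `copy_batch` are hypothetical helper names, the column names are assumptions, and a real abstraction layer would also need to handle people who delete all their numbers through the new code path.

```python
import sqlite3

OLD_COLUMNS = "PhoneNumber1, PhoneNumber2, PhoneNumber3, PhoneNumber4, PhoneNumber5"

def read_numbers(conn, person_id):
    # Step 2 (read path): prefer the new table, fall back to the old columns.
    rows = conn.execute(
        "SELECT Number FROM PhoneNumbers WHERE PersonId = ? ORDER BY Id",
        (person_id,),
    ).fetchall()
    if rows:
        return [r[0] for r in rows]
    old = conn.execute(
        f"SELECT {OLD_COLUMNS} FROM People WHERE Id = ?", (person_id,)
    ).fetchone()
    return [n for n in (old or ()) if n is not None]

def copy_batch(conn, after_id=0, batch_size=1000):
    # Step 3: copy one batch of people with Id > after_id.
    # Returns the last Id handled, or None when nothing is left to copy.
    ids = [r[0] for r in conn.execute(
        "SELECT Id FROM People WHERE Id > ? ORDER BY Id LIMIT ?",
        (after_id, batch_size),
    )]
    if not ids:
        return None
    for pid in ids:
        # Skip people whose numbers were already written by the new code path.
        if conn.execute(
            "SELECT 1 FROM PhoneNumbers WHERE PersonId = ? LIMIT 1", (pid,)
        ).fetchone():
            continue
        old = conn.execute(
            f"SELECT {OLD_COLUMNS} FROM People WHERE Id = ?", (pid,)
        ).fetchone()
        conn.executemany(
            "INSERT INTO PhoneNumbers (PersonId, Number) VALUES (?, ?)",
            [(pid, n) for n in old if n is not None],
        )
    conn.commit()
    return ids[-1]
```

A scheduler would call `copy_batch` repeatedly during quiet periods, feeding each returned Id back in as `after_id`, until it returns `None`.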

I've found this book Refactoring Databases: Evolutionary Database Design by Scott Ambler and Pramodkumar Sadalage really helpful when planning changes like this.

In terms of tooling if you are using SQL Server (or Oracle) you might be interested in taking a look at Red Gate SQL Source Control. That could handle this change in a single deployment. Full disclosure - I do work for Red Gate.

SQL Source Control automatically detects changes made in a development database, and links these changes into your existing source control system. It can then generate the SQL required to synchronise these changes to another database. There is an advanced feature called Migrations, which allows you to convert a subset of the automatically generated changes into manual SQL steps for more complicated scenarios like this data migration. You could use Migrations to perform the change described at the top of this post.

In your development database SQL Source Control would have automatically detected the creation of a new PhoneNumbers table, and the deletion of the PhoneNumber1-5 in the People table. You could select those two changes in the tool and add extra SQL to copy the data over in between them. It would save that as a Migration script, which SQL Compare would recognise and run when deploying to another database (e.g. a QA/production database), alongside any other changes it finds automatically.

Setup is following: Drupal project, one svn repo with trunk/qa/production-ready branches, vhosts for every branch, post-commit hook that copies files from repository to docroots.

Problem is following: Drupal website often relies not only on source code but on DB data too (node types, their settings, etc.).

I'm looking for a solution to make these changes versionable. But not like 'diffing' all data in the database, instead something like fixtures in unit tests.

Fixture-like scripts with SQL data and files for content that should be versionable and be applied after main post-commit hook.

Is there anything written for that purpose, or maybe it would be easy to adapt some kind of build tool (like Apache Ant) or unit testing framework? And it would be great if this tool knew about Drupal, so in scripts I can do things like variable_set(), drupal_execute().

Any ideas? Or should I start coding right now instead of asking this? :)

You might want to check out the book Refactoring Databases.

The advice I heard from one of the authors is to have a script that will upgrade the database from version to version rather than building up from scratch each time.
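One common way to implement version-to-version upgrades is a version table plus an ordered list of scripts. The sketch below shows the idea in Python with SQLite as a stand-in; the `schema_version` table, the `MIGRATIONS` list and the sample `node_type` statements are all made up for illustration and are not taken from the book or from Drupal.

```python
import sqlite3

# Ordered, numbered upgrade steps. Each one is applied exactly once per database.
MIGRATIONS = [
    (1, "CREATE TABLE node_type (name TEXT PRIMARY KEY)"),
    (2, "ALTER TABLE node_type ADD COLUMN description TEXT"),
    (3, "INSERT INTO node_type (name, description) VALUES ('page', 'Basic page')"),
]

def upgrade(conn):
    """Apply every migration newer than the version recorded in the database."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    current = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM schema_version"
    ).fetchone()[0]
    applied = []
    for version, sql in MIGRATIONS:
        if version > current:
            conn.execute(sql)
            # Record the step so that a second run skips it.
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
            conn.commit()
            applied.append(version)
    return applied
```

Because both schema changes and base data (step 3 here) go through the same numbered list, the same script can bring a dev, QA or production database from whatever version it is at up to the latest one.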

Possible Duplicate:
Best Book for a new Database Developer

I'm looking to become an intermediate, eventually advanced SQL programmer. I'm currently halfway between beginner and intermediate. I'd like a book to cover at least beginner to intermediate, or all the way even.


Thank you all for your responses, I have chosen the MVP Deep Dive book from amazon, it has 5 stars and only 5 stars! Sounds good enough for me. I look forward to reading it.

The 2005 and 2008 editions of the inside SQL Server series by Itzik Ben-Gan, Kalen Delaney et al

And oh, did I forget... after all, I wrote 2 chapters in this book (chapters 8 and 20): SQL Server MVP Deep Dives

This one is not SQL Server specific, but if you are learning databases, it is good to understand how to refactor them and most books don't cover this:

I really enjoy the Apress books about SQL, especially when I had to work with SQL Server's Service Broker or its Full Text Search engine:


Entity Framework Code First will build the database for you if it doesn't exist and structure it based on your mapping objects. I believe Roundhouse will do the same thing with Fluent Mapping files using NHibernate.

Are there any other ORM's (or tools like Roundhouse) that will take care of all your SQL DDL creation and execution?

NHibernate does not need Fluent Mappings to generate database schema. This feature is built into the NHibernate core:

new SchemaExport(_configuration).Execute(false, true, false);

In my experience however this is mostly useful for in-memory integration tests or initial rollouts. Production databases need to be upgraded. If you stick around, then you will need to add and remove columns, tables and foreign keys without affecting data. There is a continuity and versioning aspect to it. NHibernate only knows your current mapping. It does not know for example that 2 months ago you stored your customer first and last name in a column called "CustomerName" and then you decided to split this into two columns "FirstName" and "LastName" (which is probably the most primitive change that can be made). NHibernate's job is to map your current schema to objects, not to remember data modeling choices from a few years ago.

In my experience there is no magic tool that will write upgrade scripts; they have to be written manually or at least reviewed by a developer. Tools can provide you with a framework for executing these scripts, like RoundhousE. Scott Allen has an excellent series about the 'forward-only, run-once' approach.
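For the CustomerName split mentioned above, a hand-written upgrade step might look roughly like this. It is a sketch in Python with SQLite as a stand-in; the `Customer` table, its `Id` column and the split-on-first-space rule are assumptions made for the example, and real name data would need a more careful rule plus a review of the results.

```python
import sqlite3

def split_customer_name(conn):
    # Hand-written upgrade step: add the new columns, then backfill them.
    conn.execute("ALTER TABLE Customer ADD COLUMN FirstName TEXT")
    conn.execute("ALTER TABLE Customer ADD COLUMN LastName TEXT")
    rows = conn.execute("SELECT Id, CustomerName FROM Customer").fetchall()
    for customer_id, full_name in rows:
        # Naive rule (an assumption): text before the first space is the first name.
        first, _, last = (full_name or "").strip().partition(" ")
        conn.execute(
            "UPDATE Customer SET FirstName = ?, LastName = ? WHERE Id = ?",
            (first, last.strip(), customer_id),
        )
    # The old CustomerName column is left in place so existing code keeps
    # working until it has been switched over.
    conn.commit()
```

The knowledge of how old data maps onto new columns lives in this script, which is exactly the part a mapping tool cannot infer from the current mapping alone.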

How do people recommend managing database structure updates?

e.g. new features are added that require existing tables being altered, new ones added etc.

Is there a good piece of software, or is it a case of writing ALTER type statements?

I'm going to add one thing. It is not enough to write an alter table statement to change a table structure. If you are changing a table structure, you had better be sure before you run it that you know exactly what other views, functions, tables, stored procs, triggers, SSIS (DTS) packages (for SQL Server) and dynamic code from the applications will be affected by the change. If you are not completely sure what other objects may be impacted, you are not ready to alter the table. I've seen way too many business critical functions break because someone haphazardly changed a table structure without considering what else used that structure. If you are considering making database structural changes, I suggest you read up on database refactoring before you do so.

Here is a good book to start with:

I just finished working on a project for the last couple of months. It's online and ready to go. The client is now back with what is more or less a complete rewrite of most parts of the application. A new contract has been drafted and payment made for the additional work involved.

I'm wondering what would be the best way to start reworking this whole thing. What are the first few things you would do? How would you rework the design in a way that you stay confident that the stuff you're changing does not break other stuff?

In short, how would you tackle drastic application design changes efficiently (both DB and code)?

This is NOT-A-NEW thing in software and people have done this and written a lot about this.

Try reading

The techniques explained here are invaluable to sustain any kind of long running IT projects.

It is looking increasingly like I'll have to go live with a website before I have had the time to tweak all the queries/tables etc. (it is already 6 months behind schedule, so although this is not the ideal scenario, that's how things are).

It's now a case of having to bite the bullet. It's just a case of trying to work out how big that bullet will be when we come to 'biting it'. Once the database goes live obviously we can't change the data on a whim, because it's live data. I am fairly confident about most of the db schema - e.g. the tables are mostly in 3rd and 4th normal form, and constraints are used to ensure data integrity. I have also put in some indexes on some columns that (I think) will be used a lot in queries, though this was done quite hurriedly and not tested - this is the bit I am worried about.

To clarify, I am not talking about wholesale structure change. The tables themselves are unlikely to change (if ever), however it is almost guaranteed that I will have to tune the tables at some stage (either personally or by hiring someone).

I want to know how much of a task this is. Specifically, assuming a database of a few gigabytes (so far roughly 300 tables)

Assuming 50% of the tables need tuning in the next few months:

  1. How long will it take to perform the tuning (I know this is a "how long is a piece of string" type question) - but what are the main determinants of the effort required, so I can work out how long it is likely to take?

  2. Is it possible to either lock sections of the database (or specific tables) whilst the indexes are being reworked, or does the entire database need to go offline? (I am using MySQL 5.x as the db)

  3. Is what I describe (going live before ALL tables perfectly tuned) outrageously risky/inadvisable ? (Does it justify the months of sleepless nights this has caused me so far) ?

In general it is much harder to fix a poor database design that is causing performance issues after going live because you have to deal with the existing records. Even worse, the poor design may not become apparent until months after going live when there are many records instead of a few. This is why databases should be designed with performance in mind (no, this is not premature optimization; there are known techniques which generally perform better than other techniques and they should be considered in the design) and databases should be tested against a test set of records that is close to or more than the expected level of records you would have after a couple of years.

As to how long it will take to completely fix a badly designed database, months or years. Often the worst part is something that is central to the design (like say an EAV table) and which will require almost every query/sp/view/UDF to be adjusted to move to a better structure. You then have to ensure all records are moved to the new better structure. The sooner you can fix a mistake like this the better. Far better to move a couple of thousand records to a new structure than 100,000,000.

If your structure is ok but your queries are bad, you are better off as you can take the top ten worst performing (Choose based not just on total time to run but time X no of times run) and fix, rinse and repeat.
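The "time X number of times run" ranking can be illustrated with a few lines of Python. This is only a sketch: the `(name, avg_seconds, executions)` tuples are an assumed input shape, and in practice the numbers would come from whatever query statistics your database or profiler exposes.

```python
def worst_queries(stats, top=10):
    """Rank queries by total cost: average run time multiplied by run count.

    stats is an iterable of (query_name, avg_seconds, executions) tuples.
    """
    ranked = sorted(stats, key=lambda s: s[1] * s[2], reverse=True)
    return [name for name, _avg, _runs in ranked[:top]]
```

With this ordering, a query that takes 0.05 seconds but runs 100,000 times outranks a 30-second report that runs twice, which is the point of weighting by frequency.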

If you are in the midst of fixing a poor database, this book might come in handy:

So I'm currently working on rebuilding an existing website that is used internally at my company for project management, at heart it is a bug tracking utility that has some customer support and accounting operations linked into it.

Currently the database model is very repetitive, a good example of this is, currently a UserId is linked into a record (FK relationship into a user table that contains all the information about the user) and then all the information about the user also exists in the table.

I've been tasked with improving the website and the functionality of the model; however, I want to reduce the repetition of data in the website (is this normalization or is that the breaking apart of unlinked items into separate tables?). I'm not sure what the best method of doing this would be. I'm thinking of generating the creation scripts for the database and creating a new database project in VS to then modify the database, then generating some scripts to populate the new database model from the old database.

I plan on using the Entity Framework and ASP.NET MVC 2 to build the website as I think it provides the most flexible model moving forward for the modification and maintenance of the website.

The reason I ask all of this is because I'm very familiar with using databases and modifying existing ones to be used in applications and websites but I'm trying to discover the best way to build one.

I'm curious if there is any material on the best way to do this or if I should be using a different tool to do this with?

Edit: Providing more information on the model

There are 4 major areas that we have that are used:

  1. Cases (Bugs, Features, Working Tasks, Etc)
  2. Tickets (Tech Support Events)
  3. Errors (Errors Generated from our logging Library, Basically a stack trace with customer information)
  4. License (Keeps track of each customers License allows modification to those licenses)

These are the Objects that are intermixed and used throughout the above 4 major areas.

  1. Users (People who use the system)
  2. Customers (People who use our software)
  3. Stores (Places where our customers use our software)
  4. Products (Our Software)


Cases: A Case has to have a User, can have a Customer, Store, Error, Ticket and/or Product

Tickets: A Ticket has to have a User and a Customer, can have a Store, Error and/or Product

Errors: An Error has to have a Product, can have a Case, Ticket, Store, and/or Product

Licenses: A License has to have a Product and Customer, can have a Store

Like I said very basic website, with a not super complex database, if done correctly.

Currently the database has no FK constraints, replication of lots of information across each table and lots of extra tables that are duplicates with different names.


Each Case type has a separate table so there is a FeatureRequest, Bug, Tasks, Completed, etc table that all contain the same information.

Normalization is about storing data without redundancy or anomalies.

One example of an anomaly could be when attributes about a user in your main table are not in sync with the users table. Someone changes information about that user in one table without reflecting the changes in the redundant copy. The problem is that it's hard to know which change is the correct one.
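To make that kind of anomaly concrete, a query like the one below can list rows whose redundant copy has drifted from the users table. It is a sketch in Python with SQLite as a stand-in; the `Cases.UserEmail` and `Users.Email` columns are invented for the example, since the question does not list the actual duplicated attributes.

```python
import sqlite3

def find_out_of_sync(conn):
    """Return cases whose copied user email no longer matches the users table."""
    # IS NOT is SQLite's NULL-safe inequality; other databases spell this differently.
    return conn.execute(
        """
        SELECT c.CaseId, c.UserEmail, u.Email
        FROM Cases AS c
        JOIN Users AS u ON u.UserId = c.UserId
        WHERE c.UserEmail IS NOT u.Email
        ORDER BY c.CaseId
        """
    ).fetchall()
```

Each returned row is a case where two versions of the same fact exist, and nothing in the data says which one is right.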

Some people think that normalization is just about breaking apart tables into smaller tables, because that's what they see as the most common type of change. But that's not the goal of normalization. It's just by coincidence that most mistakes of non-normalization involve stuffing too much data into one table where multiple tables would be correct.

It's hard to answer your question about whether to modify your database in-place or whether to create a whole new database and migrate to it.

What I would do in your case is to design a properly normalized database, and then examine the differences between that and your existing database. Imagine what you would have to do for each difference, to change your old database to the new one, versus a data migration. It could be that only a few changes are needed, only dropping the redundant columns. Or it could be that some major rework is needed. It's impossible to tell until you do the work of creating a normalized data model so you can compare.

The bigger task might be to adapt your application code that uses the database. One way to ease this transition is to create database views on top of the normalized database, which mimic your old non-normalized database. That way hopefully you don't have to rewrite every bit of code in your app all at once, you can keep some of it the same at least until you can refactor the code.
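A compatibility view of that kind could be set up roughly as follows. This is a sketch in Python with SQLite as a stand-in; the table and column names (`Users`, `CaseRecords`, `Cases`, `UserEmail`) are assumptions, with `Cases` playing the role of the old wide table that existing code still queries.

```python
import sqlite3

def create_normalized_schema(conn):
    conn.executescript(
        """
        CREATE TABLE Users (
            UserId INTEGER PRIMARY KEY,
            UserName TEXT NOT NULL,
            Email TEXT NOT NULL
        );
        -- Normalized table: only the key of the user is stored here.
        CREATE TABLE CaseRecords (
            CaseId INTEGER PRIMARY KEY,
            Title TEXT NOT NULL,
            UserId INTEGER NOT NULL REFERENCES Users(UserId)
        );
        -- The view takes over the old table name and the old, wide column list.
        CREATE VIEW Cases AS
        SELECT c.CaseId, c.Title, c.UserId, u.UserName, u.Email AS UserEmail
        FROM CaseRecords AS c
        JOIN Users AS u ON u.UserId = c.UserId;
        """
    )
```

Old read queries against `Cases` keep working unchanged, while user details now live in one place. Code that writes to the old table still has to be changed (or handled by triggers on the view, where the database supports them).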

Also having a good set of regression tests in place is ideal, so you can be sure your app still does all the tasks it is supposed to do, as you refactor the database and the code that uses the database.

Re your comment: You mention that you're adding new functionality to the user model at the same time. I would find it too confusing to try to do this simultaneously with refactoring. Refactoring typically does not change functionality, it only changes implementation. But refactoring adds value because it makes the code easier to maintain or debug, improves efficiency, or prepares you to make future functionality changes more easily.

I would recommend that you bite the bullet and add your new user model features to the old non-normalized database. It's good to get the benefit of new features in the short term, and also you need to develop those features first to understand them well enough to account for them in your big refactoring project.

Here are some suggestions for resources to help you truly understand what normalization means:

Here are a couple of resources for managing changes to a database:

So I'm working on a shopping basket application that requires a persistent basket and am trying to decide whether to store my basket/items as a blob in the database or break them out into multiple tables (*e.g. - tbl_basket, tbl_basket_items, tbl_basket_item_variants*). I have no need for sorting or filtering of basket items. I will simply query the basket based on soldto (btw, there could be multiple baskets per soldto). Baskets will only be valid for a relatively short period of time (6-12 months max). They could have several hundred line items (rare case), but I don't expect anything really all that big that it would degrade performance. The number of users is relatively small...400 concurrent users max. Typical usage would be somewhere around 50-100 concurrent users.

I'm leaning towards simply storing my basket as a blob simply because it's simple and relatively clean (yes I'm lazy). My question is, am I missing something? What are the drawbacks to this approach. What are the benefits? The one drawback that comes to mind is if my Basket object changes, it could be a problem for active baskets.

Thanks for any insight you might have.

You'd better not have a "smart column" in your database. Columns are supposed to be just data; that is what a database is for.

A smart column is a column which needs further processing to make use of it, e.g.: A1234 => A represents male, and 1234 is the serial number. Not to mention your "blob" shopping cart, which is way "smarter" than the example before.

refer to

Having just started working for a new client, I have noticed that their database schema could do with a serious overhaul. I've made some suggestions (naming conventions mainly) which would be acceptable for the new suite of applications we are developing going forward. However, the system would also have to support the legacy names used (i.e. by 400 or so externally hosted web applications).

Is there any way I could do the following:

  • Change the naming of tables and columns for use in the new application.
  • Create a permanent table alias or similar to allow the legacy code to remain unchanged?

Bit of a long shot.......

What people usually do when refactoring is rename the table, create a view with the old table name and structure, and then make changes to the renamed table.

For instance, if you renamed a column from person_id to personid (to enforce a consistent style), the view would select personid but alias it as person_id so old code doesn't break.
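A rough sketch of that idea, using Python's built-in sqlite3 module so it can be run anywhere. The new table name people and the sample row are assumptions for illustration. For portability across SQLite versions, the sketch copies the data into a newly named table rather than renaming in place; on a real server you would normally use the engine's own rename commands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Legacy schema that old code depends on.
conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO person (person_id, name) VALUES (1, 'Ada')")

# Refactor: move the data into a table that follows the new naming style
# (hypothetical name "people"), then free up the old table name.
conn.execute("CREATE TABLE people (personid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO people (personid, name) SELECT person_id, name FROM person")
conn.execute("DROP TABLE person")

# Compatibility view: old table name, old column name, new storage underneath.
conn.execute("CREATE VIEW person AS SELECT personid AS person_id, name FROM people")

legacy_rows = conn.execute("SELECT person_id, name FROM person").fetchall()  # old code
new_rows = conn.execute("SELECT personid, name FROM people").fetchall()      # new code
```

Note that in SQLite a view like this is read-only, so legacy code that writes through the old name would also need INSTEAD OF triggers on the view. Check how your own database engine handles updates through views before relying on this.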

If you are going to refactor a database and keep old code running, I highly recommend reading

Not sure the best way to do this or if there is even a single best answer but here is the problem:

I have a SQL Server 2000 database that consists of at least 13 tables. Some of the tables have a unique id (SysName) so that there can only be one unique SysName value (alpha001, alpha002, alpha003, etc).

With me so far?

Now, the client wants to be able to insert multiple rows with the same sysname value in the database (alpha001, alpha001, alpha001, alpha002, alpha002, alpha003, alpha003, alpha003, alpha003, alpha003, etc). On top of that, if you change the value for a sysname (alpha001 to beta001) in one table, the values in some of the other tables are also updated (I'm having a brain fart about what this is called at the moment).

I'd also like to add a new column (SysNameID) and set it to autoincrement. What is the best way to do this so that it will go through the database and just add the SysNameID values (1,2,3,4,5,6...) so that I do not have to rebuild the table?

I'm thinking I need to add a SysNameID column to all the affected tables and set the (brain fart term) between the tables to retain the correlation (for data integrity purposes).

If this request is unclear, please leave a comment and I'll do the best I can to answer it. Some of you guys are very smart so you may have to dumb it down for me. :)

Basic process:

Make sure you have a current backup, and do not attempt to do this on prod without a thorough test on development first. This is so extensive a change that you might want to restore prod to a new dev instance, because the work will be time-consuming and tricky and will interfere with other development while you do it.

You add the identity column to the parent table called SysNameID (see @marc_s' answer for details)

You add an int column to each child table also called SysNameId. It is not an autoincrementing column and must allow nulls.

You update this column by using the current sysname columns to find the id associated with that sysname.

Once all the columns are populated, you set the column to not allow nulls and create the foreign key to the parent table. The surrogate keys should not change, so you don't really need to cascade update.

Finally you drop the sysname column from the child tables and adjust all the code that uses it to join to the parent table and look it up. Alternatively, you rename each child table and create a view that joins the child table to the parent and gets the sysname column from there. That should ensure existing code doesn't break.
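The child-table part of that process can be sketched as follows. This is only an illustration, run against SQLite through Python's built-in sqlite3 module rather than SQL Server 2000. The table names (systems, readings, readings_base) and the value column are hypothetical, and the parent table is assumed to already have its SysNameID key. SQLite cannot add NOT NULL or a foreign key to an existing column, so those steps appear only as a check and a comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical parent (already carrying the surrogate key) and one child table.
conn.execute("CREATE TABLE systems (SysNameID INTEGER PRIMARY KEY, SysName TEXT NOT NULL)")
conn.execute("CREATE TABLE readings (reading_id INTEGER PRIMARY KEY, SysName TEXT NOT NULL, value INTEGER)")
conn.executemany("INSERT INTO systems (SysName) VALUES (?)",
                 [("alpha001",), ("alpha002",), ("alpha003",)])
conn.executemany("INSERT INTO readings (SysName, value) VALUES (?, ?)",
                 [("alpha001", 10), ("alpha003", 30), ("alpha001", 11)])

# Add a plain, nullable int column to the child table.
conn.execute("ALTER TABLE readings ADD COLUMN SysNameID INTEGER")

# Backfill it by looking up each row's sysname in the parent. This has to
# happen while SysName is still unique in the parent, or the lookup is ambiguous.
conn.execute("""
    UPDATE readings
    SET SysNameID = (SELECT s.SysNameID FROM systems s
                     WHERE s.SysName = readings.SysName)
""")

# Verify nothing was left unmatched before adding NOT NULL and the foreign key
# (on SQL Server that would be done with ALTER TABLE; not shown here).
unmatched = conn.execute(
    "SELECT COUNT(*) FROM readings WHERE SysNameID IS NULL"
).fetchone()[0]

# Alternative from the last step: rename the child table and put a view with
# the old name in front of it, taking SysName from the parent. The old SysName
# column is left in place in this sketch and is simply no longer read.
conn.execute("ALTER TABLE readings RENAME TO readings_base")
conn.execute("""
    CREATE VIEW readings AS
    SELECT r.reading_id, s.SysName, r.value
    FROM readings_base r
    JOIN systems s ON s.SysNameID = r.SysNameID
""")

# Renaming a system now touches only the parent row; no cascade is needed.
conn.execute("UPDATE systems SET SysName = 'beta001' WHERE SysName = 'alpha001'")
rows = conn.execute("SELECT SysName, value FROM readings ORDER BY reading_id").fetchall()
```

After the rename, the legacy query against readings sees beta001 for both affected rows even though the child table itself was never updated, which is why the surrogate key makes cascading updates unnecessary.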

There is no simple way to meet your requirement. You are changing the very fundamentals of how your database works. It has the potential to affect virtually every query against the child tables. It may affect reports (which are likely sorted by sysname which is now no longer unique). This is a major change and to do it properly could take months.

Suggested reading: