Tuesday, February 8, 2011

❤Database Management System❤

When you build an information system, of which a Web site is one example, you have to decide how much responsibility for data management your new custom software will take and how much you leave to packaged software and the operating system. This chapter explains what kind of packaged data management software is available, covering files, flat file database management systems, relational database management systems (RDBMSs), object-relational database management systems, and object databases. Because RDBMS is the most popular technology, it is covered in the most depth and the chapter includes a brief tutorial on SQL.

What's wrong with a file system (and also what's right)

The file system that comes with your computer is a very primitive kind of database management system. Whether your computer came with the Unix file system, NTFS, or the Macintosh file system, the basic idea is the same. Data are kept in big unstructured named clumps called files. The great thing about the file system is its invisibility. You probably didn't purchase it separately, you might not be aware of its existence, you won't have to run an ad in the newspaper for a file system administrator with 5+ years of experience, and it will pretty much work as advertised. All you need to do with a file system is back it up to tape every day or two.
Despite its unobtrusiveness, the file system on a Macintosh, Unix, or Windows machine is capable of storing any data that may be represented in digital form. For example, suppose that you are storing a mailing list in a file system file. If you accept the limitation that no e-mail address or person's name can contain a newline character, you can store one entry per line. Then you could decide that no e-mail address or name may contain a vertical bar. That lets you separate e-mail address and name fields with the vertical bar character.
So far, everything is great. As long as you are careful never to try storing a newline or vertical bar, you can keep your data in this "flat file." Searching can be slow and expensive, though. What if you want to see if "philg@mit.edu" is on the mailing list? Your computer must read through the entire file to check.
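A minimal sketch of that scheme in Perl (the file name mailing_list.text and the sample entry are illustrative assumptions, not anything from a real system):

    #!/usr/bin/perl
    # Hypothetical sketch: a mailing list in a flat file, one
    # "email|name" entry per line; mailing_list.text is an assumed name.
    use strict;

    # Append a new entry (assumes neither field contains "|" or newline).
    open(my $out, '>>', 'mailing_list.text') or die "cannot open: $!";
    print $out "philg\@mit.edu|Philip Greenspun\n";
    close($out);

    # Checking whether an address is on the list means reading
    # every line -- a linear scan over the whole file.
    open(my $in, '<', 'mailing_list.text') or die "cannot open: $!";
    my $found = 0;
    while (my $line = <$in>) {
        chomp $line;
        my ($email, $name) = split /\|/, $line;
        $found = 1 if $email eq 'philg@mit.edu';
    }
    close($in);
    print $found ? "on the list\n" : "not found\n";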
Let's say that you write a program to process "insert new person" requests. It works by appending a line to the flat file with the new information. Suppose, however, that several users are simultaneously using your Web site. Two of them ask to be added to the mailing list at exactly the same time. Depending on how you wrote your program, the particular kind of file system that you have, and luck, you could get any of the following behaviors (one defense, file locking, is sketched after the list):
  1. Both inserts succeed.
  2. One of the inserts is lost.
  3. Information from the two inserts is mixed together so that both are corrupted.
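Here is a sketch of that file-locking defense, assuming the mailing-list layout from above: an exclusive flock serializes the appends so two simultaneous inserts cannot be interleaved.

    #!/usr/bin/perl
    # Hypothetical sketch: serialize concurrent "insert new person"
    # requests with an exclusive lock so simultaneous appends cannot
    # be lost or mixed together.
    use strict;
    use Fcntl qw(:flock SEEK_END);

    sub add_person {
        my ($email, $name) = @_;
        open(my $fh, '>>', 'mailing_list.text') or die "cannot open: $!";
        flock($fh, LOCK_EX) or die "cannot lock: $!";  # wait our turn
        seek($fh, 0, SEEK_END);  # another process may have appended meanwhile
        print $fh "$email|$name\n";
        close($fh);              # closing the file releases the lock
    }

    add_person('joe@example.com', 'Joe User');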
In the third case, where the inserts are mixed together, the programs you've written to use the data in the flat file may no longer work. For a personal Web site the consequences of a corrupted database aren't very serious: you spend your nights and weekends fixing up the flat files with a text editor. But imagine that you build an online bank with Perl scripts and flat files.
All the checking account balances are stored in one file, checking.text, and all the savings balances are stored in another file, savings.text.
A few days later, an unlucky combination of events occurs. Joe User is transferring $10,000 from his savings to his checking account. Judy User is simultaneously depositing $5 into her savings account. One of your Perl scripts successfully writes the checking account flat file with Joe's new, $10,000 higher, balance. It also writes the savings account file with Joe's new, $10,000 lower, savings balance. However, the script that is processing Judy's deposit started at about the same time and began with the version of the savings file that had Joe's original balance. It eventually finishes and writes Judy's $5 higher balance but also overwrites Joe's new lower balance with the old high balance. Where does that leave you? $10,000 poorer and wishing you had concurrency control.
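The root problem is the unprotected read-modify-write cycle. One fix, sketched here under the assumption that all savings balances live in savings.text as account|balance lines: hold an exclusive lock across the entire cycle, so Judy's script cannot start from a stale copy of the file while Joe's transfer is in flight.

    #!/usr/bin/perl
    # Hypothetical sketch: lock savings.text for the whole
    # read-modify-write cycle, preventing the lost-update race.
    use strict;
    use Fcntl qw(:flock);

    sub deposit_savings {
        my ($account, $amount) = @_;
        open(my $fh, '+<', 'savings.text') or die "cannot open: $!";
        flock($fh, LOCK_EX) or die "cannot lock: $!";  # blocks until exclusive

        # Read every balance while holding the lock.
        my %balance;
        while (my $line = <$fh>) {
            chomp $line;
            my ($acct, $amt) = split /\|/, $line;
            $balance{$acct} = $amt;
        }
        $balance{$account} += $amount;

        # Rewrite the file in place, still holding the lock.
        seek($fh, 0, 0);
        truncate($fh, 0);
        print $fh "$_|$balance{$_}\n" for sort keys %balance;
        close($fh);  # releases the lock
    }

    deposit_savings('judy', 5);

Note that locking buys you isolation between scripts and nothing more; as the next scenario shows, it does not help when the power fails halfway through the rewrite.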
After a few months of programming and reading operating systems theory books from the 1960s that deal with mutual exclusion, you've solved your concurrency problems. Congratulations. However, let's suppose that this is an Internet business circa 2003 and therefore you're running it out of your mom's house with the server under the living room sofa. You're feeling sleepy, so you heat up some coffee in the microwave and simultaneously toast a bagel in the toaster oven. The circuit breaker trips. You hear the sickening sound of disks spinning down. You scramble to get your server back up and, glancing at the logs, notice that Joe User was back, transferring $25,000 from savings to checking. What happened to Joe's transaction?
The good news for Joe is that your Perl script had just finished crediting his checking account with $25,000. The bad news for you is that it hadn't really gotten started on debiting his savings account. Maybe it is time to look at what the transaction processing experts have come up with over the last 50 years...
What Do You Need for Transaction Processing?

Data processing folks like to talk about the "ACID test" when deciding whether or not a database management system is adequate for handling transactions. An adequate system has the following properties:

Atomicity
Results of a transaction's execution are either all committed or all rolled back. All changes take effect, or none do. That means, for Joe User's money transfer, that both his savings and checking balances are adjusted or neither are.
Consistency
The database is transformed from one valid state to another valid state. This defines a transaction as legal only if it obeys user-defined integrity constraints. Illegal transactions aren't allowed, and if an integrity constraint can't be satisfied, the transaction is rolled back. For example, suppose that you define a rule that, after a transfer of more than $10,000 out of the country, a row is added to an audit table so that you can prepare a legally required report for the IRS. Perhaps for performance reasons that audit table is stored on a separate disk from the rest of the database. If the audit table's disk is off-line and can't be written, the transaction is aborted.
Isolation
The results of a transaction are invisible to other transactions until the transaction is complete. For example, if you are running an accounting report at the same time that Joe is transferring money, the accounting report program will either see the balances before Joe transferred the money or after, but never the intermediate state where checking has been credited but savings not yet debited.
Durability
Once committed (completed), the results of a transaction are permanent and survive future system and media failures. If the airline reservation system computer gives you seat 22A and crashes a millisecond later, it won't have forgotten that you are sitting in 22A and also give it to someone else. Furthermore, if a programmer spills coffee into a disk drive, it will be possible to install a new disk and recover the transactions up to the coffee spill, showing that you had seat 22A.
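For contrast, here is roughly what Joe's transfer looks like once a transaction-processing database is in charge, sketched with Perl's DBI; the connection string, table, and column names are assumptions for illustration.

    #!/usr/bin/perl
    # Hypothetical sketch: the transfer as one atomic transaction.
    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=bank', 'username', 'password',
                           { AutoCommit => 0, RaiseError => 1 });
    eval {
        # Both updates commit together or not at all (atomicity);
        # other sessions never see the in-between state (isolation).
        $dbh->do(q{UPDATE accounts SET balance = balance - 10000
                   WHERE owner = 'joe' AND type = 'savings'});
        $dbh->do(q{UPDATE accounts SET balance = balance + 10000
                   WHERE owner = 'joe' AND type = 'checking'});
        $dbh->commit;
    };
    if ($@) {
        $dbh->rollback;  # a failure in either UPDATE undoes both
        die "transfer failed: $@";
    }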
That doesn't sound too tough to implement, does it? A "mere matter of programming" as our friend Jin likes to say. Well, you still need indexing.

Finding Your Data (and Fast)

One facet of a database management system is processing inserts, updates, and deletes. This all has to do with putting information into the database. Sometimes it is also nice, though, to be able to get data out. And with medium-popularity sites getting 20 requests per second, it pays to be conscious of speed.
Flat files work okay if they are very small. A Perl script can read the whole file into memory in a split second and then look through it to pull out the information requested. But suppose that your on-line bank grows to have 250,000 accounts. A user types his account number into a Web page and asks for his most recent deposits. You've got a chronological financial transactions file with 25 million entries. Crunch, crunch, crunch. Your server laboriously works through all 25 million to find the ones with an account number that matches the user's. While it is crunching, 25 other users come to the Web site and ask for the same information about their accounts.
You have two choices: buy a 1000-processor supercomputer or build an index file. If you build an index file that maps account numbers to sequential transaction numbers, your server won't have to search all 25 million records anymore. However, you have to modify all of your programs that insert, update, or delete from the database so that they also keep the index current.
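A sketch of that index-file idea, assuming a transactions.text flat file whose first vertical-bar-separated field is the account number; a DBM file via Perl's DB_File is one illustrative choice of index, not a prescription.

    #!/usr/bin/perl
    # Hypothetical sketch: an index mapping account numbers to byte
    # offsets in the flat file, so lookups can seek directly instead
    # of scanning all 25 million records.
    use strict;
    use DB_File;

    my %index;
    tie %index, 'DB_File', 'transactions.index' or die "cannot tie: $!";

    # Build the index in one pass, remembering where each account's
    # records start. Every insert/update/delete must also maintain it.
    open(my $in, '<', 'transactions.text') or die "cannot open: $!";
    my $offset = 0;
    while (my $line = <$in>) {
        my ($acct) = split /\|/, $line;
        $index{$acct} = defined $index{$acct}
                      ? "$index{$acct},$offset" : $offset;
        $offset = tell($in);    # start of the next line
    }
    close($in);

    # Lookup: seek straight to each of one account's records.
    open(my $txn, '<', 'transactions.text') or die "cannot open: $!";
    for my $off (split /,/, ($index{'4021'} // '')) {
        seek($txn, $off, 0);
        print scalar <$txn>;
    }
    close($txn);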
This works great until two years later when a brand new MBA arrives from Harvard. She asks for "a report of all customers who have more than $5,000 in checking or live in Oklahoma and have withdrawn more than $100 from savings in the last 17 days." It turns out that you didn't anticipate this query so your indexing scheme doesn't speed things up. Your server has to grind through all the data over and over again.
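This, as the rest of the chapter argues, is exactly where a relational database earns its keep: the MBA's question becomes one declarative query, and the RDBMS's own planner decides which indexes (if any) to use. A hedged sketch with an invented schema:

    #!/usr/bin/perl
    # Hypothetical sketch: the ad hoc report as a single SQL query.
    # All table and column names here are invented for illustration.
    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=bank', 'username', 'password',
                           { RaiseError => 1 });
    my $rows = $dbh->selectall_arrayref(q{
        SELECT c.name
        FROM customers c
        WHERE c.checking_balance > 5000
           OR (c.state = 'OK'
               AND (SELECT sum(w.amount)
                    FROM savings_withdrawals w
                    WHERE w.customer_id = c.id
                      AND w.withdrawn_at > current_date - 17) > 100)
    });
    print "$_->[0]\n" for @$rows;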
