Databases Archives - Samuel J. Greear

February 3, 2008

PowerDNS / PostgreSQL & Web Interfaces 2

After a bit of eat and drink, as well as a half hour of zOMG why is this not werking!?!?! (iptables), Supermaster/Superslave is operating famously. It seems to “just work”. No complaints thus far, which is, well… highly unusual for me to put it lightly.


Feb 03 05:35:55 Received NOTIFY for flixn.com from 72.232.239.130 for which we are not authoritative
Feb 03 05:35:55 Created new slave zone 'flixn.com' from supermaster 72.232.239.130, queued axfr
Feb 03 05:35:55 gpgsql Connection succesful
Feb 03 05:35:55 No serial for 'flixn.com' found - zone is missing?
Feb 03 05:35:55 AXFR started for 'flixn.com', transaction started
Feb 03 05:35:55 AXFR done for 'flixn.com', zone committed

February 3, 2008

PowerDNS / PostgreSQL & Web Interfaces

http://evilcode.net/sjg/infernal/PowerDNS/SCHEMA/

I have been looking at PowerDNS for a while now, and after regular confirmation that it is in fact performing extremely admirably over at DreamHost I decided that it was time to deploy it.

While PowerDNS is the least braindead DNS server I have ever come across, there were a couple of things that I was not 100% happy with, at least in terms of coupling it to a web frontend.

SOA records are stored space-delimited. This would hardly be a problem except that our serial is stored here. In its defense, PowerDNS has an alternate method of handling serials that is probably better in most circumstances. Hardly, but we would still have to break it apart and put it back together again to edit the minimum (default in practice) TTL, etc.

Record types are stored textually. Even when implemented as an enumerated value this still violates DRY, as you must re-state these values in your frontend code.

Everything must be represented fully qualified. This = FAIL from a normalization perspective.

Here I have come up with a somewhat optimal schema from the point of view of my web interface, and I have tied it to PowerDNS’s preferred table structure via domain logic. This could have been handled in other ways of course, but I tend to like this one for a number of reasons.

First, the alternative is to add custom queries to the PowerDNS configuration file to make it understand whatever schema we might have in place, PowerDNS actually makes this very easy.

Another alternative would be to use dynamic (normal) views.

On to the benefits, some being quite minor.

Querying against serialized views will have performance benefits versus the above two options, this of course has to be weighed against the cost of maintaining the views.

As mentioned, PowerDNS has two methods of handling serials, either in the SOA record, which we are keeping up to date with our domain logic. Alternatively PowerDNS will scan each record for you to find the most recently updated (if you maintain change_date). The former should logically be more performant, so we have implemented that option. This could have been handled either way in the domain logic, but most importantly we aren’t relying on our web frontend to keep our serials up to date.

Most importantly, namely for debuggability, data on master’s and slave’s “looks the same”.

To get you rolling your PowerDNS configuration file need not be any more complicated than this:


launch=gpgsql
gpgsql-host=hostname
gpgsql-user=powerdns
gpgsql-password=password
gpgsql-dbname=pdnstest
daemon=yes

I haven’t tried slaving yet, but I suspect it will work without a hitch. Will update here when I do and when this rolls out.

February 9, 2007

Introspecting PostgreSQL and MySQL, starting out

Starting at the top, PostgreSQL and MySQL have wildly different concepts of what constitutes a schemata. In MySQL information_schema is a database that can be used directly or via cross-database queries. Whereas in PostgreSQL, you cannot perform cross-database queries because you connect to a single database. Here, the information_schema is contained within what PostgreSQL calls a schema, or a seperate namespace within a database.


Common Fields (schemata):
catalog_name
    PostgreSQL: Always the name of the parent database
    MySQL: Always NULL
schema_name
    PostgreSQL: Name of the schema
    MySQL: Name of a database
default_character_set_name
    PostgreSQL: Feature not available
    MySQL: Default character set (utf8, latin1, etc.)

PostgreSQL:
schema_owner
    User (role) that owns the schema

MySQL:
default_collation_name
    Default character set collation (utf8_general_ci ...)

So, what we get out of all this is that for PostgreSQL, we connect to a database and schemata tells us what schema’s exist. In the case of MySQL, it tells us what databases exist. Of course application dependant, but in many cases you can treat these as the same thing.

Once we know all about our databases, or whether we wanted to know about them at all, the next thing we will probably want to dig into are tables.


Common Fields (tables):
table_catalog
    PostgreSQL: Always the name of the parent database
    MySQL: Always NULL
table_schema
    PostgreSQL: Name of the schema that contains the table
    MySQL: Name of the database
table_name
    Name of the table
table_type
    'BASE TABLE' for a regular persistent table
    'VIEW' for a view
    PostgreSQL: 'LOCAL TEMPORARY' for a temporary table
    MySQL: 'TEMPORARY' for a temporary table

MySQL:
table_comment
    Comment
   
To set comments on PostgreSQL tables:
    COMMENT ON TABLE table_name IS 'This is a table comment';
To retrieve comments we must dig into the system information schema:
    SELECT obj_description(
        (SELECT c.oid FROM pg_catalog.pg_class c
         WHERE c.relname='table_name'), 'pg_class')
         AS comment;

Next, we’ll want to dig into information about the columns of those tables


Common Fields (tables):
table_catalog
    PostgreSQL: Always the name of the parent database
    MySQL: Always NULL
table_schema
    PostgreSQL: Name of the schema that contains the table
    MySQL: Name of the database
table_name
    Name of the table
column_name
    Name of the column
ordinal_position
    Position of the column within the table, the ordering of the columns
column_default
    Default (expression) for the column
is_nullable
    Whether or not the column can be null, 'YES' or 'NO'
data_type
    Name of the built-in type
    PostgreSQL:
        'ARRAY' - see the element_types view
        'USER-DEFINED' - identified in udt_name+
        If domain-based, refers to the underlying
        type and the domain is identified in domain_name+
character_maximum_length
    Maximum length of character data if defined, otherwise NULL
character_set_name+collation_name
    MySQL: NULL if not defined

PostgreSQL:
udt_schema
    Schema the user-defined type is defined in
udt_name
    Name of the user-defined type

domain_schema
    Schema the domain is defined in
domain_name
    Name of the domain
   
MySQL:
column_type
    More specific column type than data_type (varchar(255)
column_key
     'PRI' or blank ?
column_comment
    Comment

To set comments on PostgreSQL columns:
    COMMENT ON COLUMN table_name.column_name IS 'This is a column comment';
To retrieve comments we must dig into the system information schema:
    SELECT col_description(
        (SELECT c.oid FROM pg_catalog.pg_class c
        WHERE c.relname='table_name'),
        (SELECT a.attnum
        FROM pg_catalog.pg_class c,
             pg_catalog.pg_attribute a
        WHERE a.attrelid=c.oid
        AND c.relname='table_name'
        AND a.attname='column_name'));

January 24, 2007

ORM: Metadata mapping

Using Metadata – Martin Fowler

The most time spent working on a site under exhibition was the model or database layer. So before really getting into the rewrite I wanted to see what else was out there that could simplify this. Obviously I was turned to Fowler’s Patterns of Enterprise Application Architecture and ActiveRecord (rails), Hibernate (java) etc. One common element with every ORM solution I have run across is that they partially violate the DRY principle. Those that don’t, or that give you the option of completely defining your schema in a format native to the ORM are not expressive enough to use the full power offered by the database. This is when I remembered that PostgreSQL implements an extension to the SQL standard information schema that allows you to add an arbitrary comment to nearly any database object, tables and columns inclusive. After some poking around I found that MySQL supports the same, although it is not nearly as well documented. SQLite does not share this non-standard extension. So, my thought at current is to express whether relations are one-to-one, many-to-one or many-to-many directly in my table declarations. The only drawback I can see is a lack of portability, but when it comes to PHP, how many people use anything other than PostgreSQL or MySQL? SQLite should be used a great deal more, in my opinion, but in practice I do not believe it is. If at some point I must be portable to something other than the common case, well, in that event I guess I can always just do what all of the ORM’s are already doing.

References:
http://www.postgresql.org/docs/8.2/interactive/sql-comment.html
http://dev.mysql.com/doc/refman/5.0/en/information-schema.html
http://dev.mysql.com/doc/refman/5.0/en/alter-table.html

March 4, 2006

Receiving asynchronous notifications of database changes , part 2

As I trudge forward with this explanation please keep in mind that Epidemic is a proof of concept. I’m not saying that it wouldn’t work fine in production as-is, but odds are it will bring down your entire infrastructure AND club all of your pet baby seals.

Let’s quickly step through the code, starting with a simple entry point mail_server.py. After pulling in the required classes and whatnot, the first thing this piece of code does is instance SQLNotifyDispatch which inherits SQLNotify. These two classes form the core of the code that does all the heavy lifting we wanted to avoid in our frontend. It cannot even fathom what to do, however, without a little help. Rather than telling it what tables to watch, and also implementing code to take action when something happens to a watched table, the framework has been designed to allow one to simply implement the code that takes action, and let it inform SQLNotifyDispatch what exactly needs to be watched.

This brings us to mail/server.py, in which I will focus on the mail_users class. This class performs actions when modifications are made to the mail_users table. The classname does not necessarily need to mimic the table name, as can be seen if you look back at mail_server.py. First, the mail_users class is instanced, and then it is registered with the dispatch object we created before. You will notice the first argument to the Register method is ‘mail_users’, this is where the name of the table to watch is defined. Back to the mail_users class, now that we’ve let the Dispatch end of things know what table to watch, we need to let it know what columns we are interested in. This is where the GetCol method of the registered class comes in. As you can see its implementation is very simple, return [‘gid’, ‘dba’, ‘username’, ‘quota_bytes’]. This tells Dispatch that we are interested in changes to those 4 columns, changes to other columns in the table are fantastic, but we don’t need to be notified of them.

The real meat is contained in backend/sql.py Digging down through SQLNotifyDispatch and into SQLNotify, you will notice a number of static definitions in the SQLNotify class, such as TableCreate, Function, Trigger and Rule. This is the heart of the whole operation. When SQLNotify is told the table and columns to watch, it crafts a table to log changes to, and a function/trigger/rule trio specific to the table being watched that serves two purposes. The first, it ensures that any operations happening on the watched table get stuffed into the log table. Second, it fires off PostgreSQL’s NOTIFY, to let us (SQLNotify) know that a change has happened. That’s right, it lets “US” know, because once this is all registered with the database it is out of our hands. This means that SQLNotify only has to make these changes the first time it watches a database. It also means, and much more importantly, that no matter what happens to “US”, the application which setup the watching, and is acting on changes, no changes will be lost. The application could crash, no matter, as soon as we come back up, we can poke into the log tables and see what has happened while we were away. It is possible in some cases to receive notification of changes a bit later than one would like, but you will always, always know if changes were made.

When a change does happen, PostgreSQL fires off NOTIFY as dictated by a rule the SQLNotify class created for us. As a result of this our application is asynchronously notified of the change (that’s right, no polling!). When SQLNotifyDispatch receives one of these notifications, it calls the appropriate method in our worker class, INSERT(), UPDATE() or DELETE(), dependant upon what the operation was that happened in the database. As you can see in mail/server.py, those three methods do various operations, such as creating directories on the filesystem, setting up or changing quota’s, or deleting directories.

It is all a just a bit more complicated than what is typically done when this type of functionality is needed, I admit. My personal experience dictates, however, that this type of approach actually ends up being much simpler and easier to maintain down the road when one starts to grow an infrastructure.

Those who are tied to MySQL are not prevented from using a nearly identical solution. With the introduction of MySQL 5.0, it is all very possible save the asynchronous notification features. While elegant and conducive to good performance, are certainly not required. Keep in mind that polling log tables which are consistently pruned is in general going to be faster than checking a datestamp or even an indexed “updated” boolean column on a very large table.

February 28, 2006

Receiving asynchronous notifications of database changes

The following is directly pertaining to a PostgreSQL-based internet mail solution, but be not discouraged. Most, if not all, of the techniques can be applied elsewhere, and I intend to explain how.

For about six months in 2003 I worked for a company based out of Spearfish, SD, called Altaire Enterprises, Inc., a small floundering dialup ISP. This is where my first real experiences with PostgreSQL took place, up until this time I had been a die-hard user of MySQL for all of my database needs, whether it be as a backing store for a web application or otherwise. The largest project I took on while I was there was the implementation of a database backed mail system. That is to say, all mail accounts were entirely virtual, no system accounts, and all data associated with them was stored in PostgreSQL. It could just as easily have been LDAP, but it wasn’t, and that isn’t the point of this little ditty I’m writing now. Architecting the mail system was the easy part, as I found, plenty of mail applications are perfectly happy to talk to PostgreSQL. There are two hard parts. Performance and Management.

I will get to performance later. Management comprises more than one would think at first. Obviously, you need some sort of frontend or tools to add users, domains, etc. to your mail database. In this case it was a collection of C applications, rather than the typical web frontend, because they could be executed by the internal billing system (platypus). In consideration of the scope of this text, how the data is being entered is moot.

There is a flip-side to management, and specifically putting information into the database in this scenario. There are cases where you need to know when modifications are made to that data, so that you can perform operations on disk, for instance. Such operations may be setting quota’s, or creating a users Maildir if your MTA doesn’t handle that for you. The typical way to do this is to have your frontend perform that action as well, which is logical in a small installation, and is exactly the method used at Altaire. The C applications performed whatever on-disk operations were necessary.

What happens when your (web) fronted is hosted on a different server, though? I suppose you could export a set of web services, or similar, from the mailserver, allowing your frontend to connect to it and perform the necessary operations. What then if you have 10 mail servers? The frontend has to decide which one to connect to, and does its thing, fine, but what if one of those web services runners dies, what do you do? Write additional logic to record failures, and play them back later? Oh lords, there must be a better way. Yes, yes.. Of course there is. There is another method that has been used time and again, it is proven and reliable. My first major experience with it was during the time I helped architect ITMom.com, a web hosting provider, back in 99/00. You create an addition column in the tables that will require external operations, and when modifications happen, you toggle that field on with your frontend. You can then have a runner that polls those tables watching for modifications, and performing the correct operation when one is found. This works well, with a number of drawbacks. One, you are polling. Two, you are littering your carefully constructed, optimized and normalized schema with columns whose sole purpose is to notify external applications of changes. You could easily avoid this by logging changes to seperate tables, sure, but now we come to three. You are stilly relying on your frontend to dictate what actions the backend should take.

Would it not be better to just let the management application do what it should do best, and insulate it from the underlying technical details? This becomes even more important in an organization where the developer(s), database administrator(s) and system administrator(s) are all different people.

Enter Epidemic, a proof-of-concept framework to easily make such things possible.

To Be Continued…

January 29, 2006

MySQL 5.0 standardized join syntax

I am sure the revised / SQL:2003 standardized join syntax in MySQL 5.0 is old news to many out there. My guess is they are in the minority, and most haven’t heard a thing about it. Some may have even upgraded only to be frustrated that their queries weren’t working as they should any longer. Here’s the skinny, taken directly from the MySQL manual.

Beginning with MySQL 5.0.12, natural joins and joins with USING, including outer join variants, are processed according to the SQL:2003 standard. These changes make MySQL more compliant with standard SQL. However, they can result in different output columns for some joins. Also, some queries that appeared to work correctly in older versions must be rewritten to comply with the standard.

I would like to applaud MySQL AB on their latest release. Throughout its history, MySQL has been vastly out-gunned in terms of useful features by many other commercial and free databases. It has also taken a great deal of heat on many occasions due to its poor standards conformance in comparison to the other options on the market. With this release, even if they have not completely closed that gap, they have narrowed it by an impressive margin.

I do have a major gripe, however. Whilst the previously mentioned changes improves standards conformance and portability, it breaks a large enough percentage of MySQL-bound applications to warrant serious scrutiny. Apparently MySQL AB has forgotten that not the entire world is open source, and that many of us must every day maintain databases accessed by scripts and applications controlled by a third party or to which the source code is simply not available. This makes it rather impossible for any of us in such a situation to move those databases to servers running 5.0.

In my particular case, I was looking forward to leveraging triggers, stored procedures and views to reduce my administrative burden and deprecate a number of external scripts (hacks) that we use to transform data for use by other applications. The primary application sitting on this database is commercial and Zend encoded, so “fixing” the broken queries is simply not an option. Yes, we have talked to the vendor. I find it hard to believe that I am the only one in this situation.

Seriously now, how hard would it have been to add an option to enable the legacy behavior?

Update: 2/19/2006, offending commit
http://mysql.bkbits.net:8080/mysql-5.0/patch%401.1886.80.1