At my job I'm currently working on a module for abstracting objects and relationships in a SQL database through (what I deem to be) a very clean and generalized object oriented interface. It is (I hope) DBD agnostic, using the DBI and allowing you to plug any DBD into it fairly seamlessly by sub-classing a configuration parser file. At present, I'm pretty happy with the organization of the class hierarchy and interactions, have written fairly extensive implementation and API documentation, and have automated tests that employ the Test::Unit framework. I have gotten the go ahead from my boss to release the module to CPAN, and will probably do so in a few weeks. However, I am presently concerned about "when in Rome" issues, and want very much for my module to conform to the various CPAN mores.
The first thing I'm wondering about is name spacing issues. Presently, the top level package name is just SQL, so I have sub-packages such as SQL::Object, SQL::Link, SQL::Statement, SQL::Object::ResultSet, SQL::Link::ResultSet, etc. I realize that this probably isn't a very good name, but I'm not sure what would be. The SQL name is nice and short, but not really all that descriptive of what the module is doing. What would be a sensible name? What is the best mechanism via which to receive feedback on naming issues?
With regards to testing, I'm fairly confident in my ability to write good tests as far as verifying correctness is concerned, but I also want to adhere to the various conventions for CPAN modules. Right now I am using the Test::Unit framework, and for each class Foo::Bar, I have a corresponding Foo::BarTest which is a subclass of Test::Unit::TestCase. Furthermore, in each directory of the library hierarchy, I have a package of the form Foo::TestSuite which subclasses Test::Unit::TestSuite, and contains all of the tests in that directory, as well as any TestSuite subclasses in subdirectories. This gives me a nice tree structure of tests that allows me both to focus on particular pieces of the code tree, while also allowing me to run all of the tests from a single command line invocation of the top level TestSuite. I realize, however, that while this is great for me, it might not meet the expectations of others... How much leeway do I have in this regard? Is it bad that I'm using Test::Unit which does not come standard with Perl? Should I be using something more mainstream and better maintained like Test::More? If I do that, am I going to have to scrap my current testing framework layout?
On a related thread, I'm curious as to how I should go about testing interactions with a database on a user's system. It would be very rude for the testing procedures of my module to carelessly stomp on the database residing on someone's machine. What precautions and paradigms should I employ so as both to thoroughly test my module on the user's machine and avoid trashing his environment? I've read horror stories of hapless users having rogue modules smash their data, and I'd rather not be a purveyor of such a module.
With regards to writing documentation, what are some tips for adhering to the Principle of Least Astonishment? I'm trying my best to avoid making the kind of assumptions that developers tend to make about code that they have written that results in unreadable documentation. I'll just have to hope that the world agrees. There are, however, more general issues... Is there any kind of de facto standard for laying out documentation? I've been writing POD in each library file, at the end, not mixing POD and code because I think that makes for hideously ugly source files. I have been using the following four sections: Name, Description, Synopsis, Methods. I guess I should probably also add "Bugs" and "See Also" sections. What kind of things should I be keeping in mind?
I'm also wondering about error handling, specifically how to trap and report errors to the user. I personally find it loathsome to have a script die deep within the bowels of DBI, leaving me with no indication as to what the root cause was. As such, I'm trying to trap third party errors by wrapping code in eval blocks, and handling errors when $@ is populated, re-throwing the exception with some contextual information of my own added to it. I'm not sure, though, how to fully go about this... Should I try to re-trap my own exceptions at successive levels, or would it be better just to call Carp::confess at the lowest level where I expect potential problems? Also, while I realize that for some errors it makes more sense to cite the error from the perspective of the caller, e.g. in the case of invalid subroutine parameters, I'm left with the the quandary of Carp::croak versus Carp::confess. It seems to me that confess is always more useful, with the only drawback being that it dumps a ton of information to the screen. It seems to me that the utility of croak quickly becomes diminished when invocation of the croaking method becomes deeply nested in library routines. What are good criteria for choosing one over the other, or should I always favor the stack trace of confess?
I'm sure there are many other important considerations when publishing a CPAN module, so please feel free to chime in about the things that I may have forgotten.
Presently, the top level package name is just SQL, so I have sub-packages such as SQL::Object, SQL::Link, SQL::Statement, SQL::Object::ResultSet, SQL::Link::ResultSet, etc.
This will be a problem as SQL::Statement is already taken, and possibly some of the others too. A quick search on http://search.cpan.org will tell you. A possible solution to this is to put your entire framework under a single top level namespace, this can usually be accomplished with a careful search and replace. Personally I prefer the top-level namespace method, as it tends to make it much easier to use a framework within other projects since I don't need to worry about namespace conflicts.
What is the best mechanism via which to receive feedback on naming issues?
The module mailing list (actually there are a two, one for module authors, and the other for announcements), you can find them http://lists.perl.org/. And of course, there is always here too, a node prefixed with "RFC" is all it takes, and I am sure you will get plenty of feedback.
With regards to testing, I'm fairly confident in my ability to write good tests as far as verifying correctness is concerned, but I also want to adhere to the various conventions for CPAN modules.
As long as it plays well with Test::Harness, you should be ok. I would assume Test::Unit does that. You can also talk to the people on the perl-qa list, you can find out more about that at http://qa.perl.org/. Personally I have never used Test::Unit, so thats about all I can offer on that subject.
On a related thread, I'm curious as to how I should go about testing interactions with a database on a user's system.
Try using mock objects, in particular you can use DBD::Mock.
With regards to writing documentation, ....
POD is the standard, your categories sound about right, there is no "must-have" set out there, its really up to you. Although I would recommend the BUGS and SEE ALSO sections, as well as a AUTHOR and COPYRIGHT AND LICENSE. I personally also like to have a CODE COVERAGE section to include my Devel::Cover report in.
I'm also wondering about error handling, specifically how to trap and report errors to the user
I prefer exceptions (with plain old die), but TIMTOWTDI. However, in larger frameworks I have written I have sometimes created exception objects with stack tracing abilities and such, but personally I have not decided myself on the best way to do that (Carp, Exception::Class, etc) so I am actually interesting in what the other monks would say about that.
I'm sure there are many other important considerations when publishing a CPAN module, so please feel free to chime in about the things that I may have forgotten.
Well, I would actually say that you have thought of/covered a lot of them. But to avoid putting my foot in my mouth, I will let the others comment on that :)
SK> a module for abstracting objects and relationships in a SQL database
JZ> Sorry to be pedantic, but you are asking about namespace so I'll get picky - there is no such thing as a "SQL database" - there are database management systems which use the SQL language to interact with data.
SK> The first thing I'm wondering about is name spacing issues.
JZ> You should cc these questions to dbi-dev@perl.org, that is the definitive place for DBI related namespace issues.
SK> Presently, the top level package name is just SQL, so I have sub-packages such as SQL::Object, SQL::Link, SQL::Statement, SQL::Object::ResultSet, SQL::Link::ResultSet, etc.
JZ> Several of those are already taken
SK> I realize that this probably isn't a very good name, but I'm not sure what would be. The SQL name is nice and short, but not really all that descriptive of what the module is doing.
JZ> As the author/maintainer of several CPAN modules in the SQL namespace (though this of course gives me no more rights to comment than anyone else), I would strongly prefer that the namespace be for modules whose primary purpose is dealing with the SQL language. Modules whose primary purpose is dealing with Databases or with Data or with DBI belong elsewhere.
SK> What would be a sensible name?
JZ> To answer that, we need to know what your module does. A decription like you provide "a module for abstracting objects and relationships in a SQL database" is so general that it's hard to comment. I suggest you come up with three different descriptions of your module - a 40 character one for use in the module list, a two sentence one for general descriptions like this posting, and a one page one for a README. Look through other CPAN modules and make sure that each of your descriptions not only describes what your module does, but does that in a way that distinguishes it from other modules. Post them back here, and I'll be glad to comment further. Good luck!
At present, this is the top level POD I have that describes the module as a whole in an attempt to give a broad view of its purpose and organization. Then there is POD within each class that describes its usage which is not posted here.
As someone else suggested, it would probably be best for me to simply move the top level name to some new unique name, so as to avoid collision with pre-existing things such as SQL::Statement.
Oh, and please don't hesitate to be pedantic and highly critical. Submitting my first CPAN module stems not just from the desire to contribute something useful, but also from selfish motivations to learn the process of doing so. You are explicitly authorized to pull no punches. :-)
Name
SQL
Description
This module serves as a framework for object persistence via a SQL
database. Interaction with an underlying database is accomplished via
Perl objects that abstract away all of the tedious construction,
issuance, and processing of SQL queries. Within this framework, users
may create, modify and delete objects, as well as link them together in
arbitrary ways, and perform a variety of querying operations.
Divided into several conceptual components, this module provides several
well isolated sub-components that are reasonably easy to understand on
their own. Short descriptions of these modules may be found in the
following section.
Modules
SQL::Object
If any class were said to lie at the center of this module, this
would be the one. This class serves as the base class for all object
types, and provides a collection of methods for performing all
common operations, such as saving, modifying, and deleting it.
Sub-classes to this class must provide a get_table_name() method,
and will probably also include various and sundry wrapper methods to
flesh out the functionality of the class. One can load single
objects via themselves, or obtain collections of objects by using
the SQL::Object::Query class.
SQL::Object::Query
This class is used to specify a query of a collection of objects.
Queries of this type are specified with conditions, bindings of
condition variables to values, and orderings. One may additionally
specify a list of SQL::Link::Query objects which will specify that
the objects loaded should be loaded with certain parent and/or
children objects. When the execute() method is invoked, a
SQL::Object::ResultSet object is returned.
SQL::Object::ResultSet
A SQL::Object::ResultSet object is the result of executing a query
specified by a SQL::Object::Query object. By repeatedly invoking the
get_next_result() method until it returns undef, one can iterate
over the results of a query, getting back instances of classes that
derive from SQL::Object.
SQL::Link
A SQL::Link object specifies a linking relationship in the database,
and also provides mechanisms for creating, modifying and deleting
relationships between objects as specified by the link in question.
SQL::Link::Query
A SQL::Link::Query object is used to specify a query of a collection
of linked objects. Queries of this type are specified with a
SQL::Link object, conditions, bindings of condition variables to
values, and orderings. When the execute() method is invoked, a
SQL::Link::ResultSet object is returned.
SQL::Link::ResultSet
A SQL::Link::ResultSet object is the result of executing a query
specified by a SQL::Link::Query object. By repeatedly invoking the
get_next_result() method until it returns undef, one can iterate
over the results of a query, getting back instances of the
SQL::Link::Result class.
SQL::Table
A SQL::Table object contains all available information about a
database table. It holds a collection of SQL::Column objects, and
indexes them in useful ways. Users will probably never have need to
create a SQL::Table object directly. Rather, the SQL::Object class
provides a get_table() method that returns an object of this type.
The table returned is contingent upon the derived class's
get_table_name() method. Using the name of the table, SQL::Row
constructs a SQL::Table object which populates itself with table
structure information by querying the database.
SQL::Column
A SQL::Column object contains all of the attributes of a column of a
database table. Such information includes name, type, and keying.
SQL::Statement
The SQL::Statement class is a base class for several derived
classes, each of which provides for a certain type of SQL statement.
This class takes table names, field names, conditions, orderings,
and other such statement parameters, and then via the stringify()
method returns a scalar that holds the statement string.
SQL::Util
This class holds a collection of static utility methods that allow
for performing common database tasks.
SQL::Conf
The SQL::Conf class is a base class for classes that parse
connection configuration information for various RDBMS, e.g.
SQL::Conf::MySQL.
SQL::DBH
The SQL::DBH class has but a single static method, create_handle(),
which creates database handles via calling DBI->connect(...).
Environment Variables
SQLRDBMS
This environment variable specifies the default value to be used by
SQL::DBH in specifying the type of database to which a connection
will be made. Presently the only supported value is 'MySQL'. This
value allows SQL::DBH::create_handle() to figure out which derived
class of SQL::Conf to instantiate, though one may instead pass an
argument to it to specify the RDBMS explicitly.
SQLCNXNCONF
This environment variable specifies the default value to be used by
SQL::DBH::create_handle() when using a SQL::Conf subclass to parse
the configuration information for a database connection.
It seems to me that your module uses SQL to accomplish things but *users* of your module will not be dealing with SQL at all. So therefore you should name the module something more related to what the users will be doing - using objects, persistence, relationships, etc. Take for example your SQL::Column - how is that at all related to SQL? You are providing information about the structure of the table not about the structure of a SQL statement.
I think your module belongs in the DBIx:: namespace since it is primarily an alternate programming interface to DBI. I'm afraid I don't get any sense from your POD of how your module differs from the dozens of other similar modules (some of which don't do any better of a job at distinguishing themselves in their PODs). Why would someone want to use your module rather than one of the others? I am not at all saying this to discourage you from CPANing your module, rather suggeting you get as good a sense as you can of how a potential user would react to your description of the module - the more you can define what your module does that is unique and helpful and usable, the better the chance that a potential user will look at it. The more you have a defined sense of how your module is unique, the easier it will be to pick a name for it.
This is an excellent point... DBIx does seem like a much more logical home for this module. Indeed SQL is merely a component of the underlying engine, not something with which the user has much interaction, apart from writing conditions for object lookups. SQL::Column, for example, should become something like DBIx::Foo::Table::Column, where Foo is whatever I end up naming my module.
Your point about the need to distinguish my module from the myriad others is also well taken. In retrospect, while I think my module's implementation and interfaces are quite good, you are dead on in saying that I've done a poor job selling its uniqueness. I shall take this to heart in the next revision. It doesn't matter how great my module is if people don't take the time to explore its capabilities.
Obviously this is the kind of thing that would be very useful, and as such I harbored no illusions that I'd be the first to market. I am aware of the existance of the Class::DBI module, and have spent some time looking at it. It is my hope to create a solid competing implementation that more closely approximates my vision of how such a module should work. While doubtless many will disagree that it is "better", I would hope that it will culminate in something to which people give serious consideration as an alternative to Class::DBI. Time will tell whether I will be shown up as a useless crank...
To be honest, I have always been less than impressed with the current state of RDBMS-OO mapping modules out there. There are some that are nicer than others, but none of them have felt comfortable to me. I welcome another player in the market, maybe skyknight's will be the one that strikes my fancy.
Of course, many will say that there is no point in re-inventing the wheel, but in the case of RDBMS-OO mappers, I am not sure that the current state of wheels (in any language) are really as best as they can be.
Here are some interesting links on the subject though:
Indeed... In fact, the inspiration for the module that I have written comes from a home grown module to which I was exposed at a former job. It had some very good concepts (that I've cherry picked as best I can), but also many serious flaws that were very problematic time and time again. On my own, I have written three different iterations, all inspired by that original module for various projects or consulting jobs. This will be the fourth iteration of my attempt to forge something that lives up to the potential of the original.
I've really struggled with this module to avoid kludgey compromises, and I hope that that will shine through in the (purported) elegance of the API. Execution efficiency has also been a big concern, as the database at my job which is serving as a pilot project is of the order of tens of millions of records.
One of the key issues with which I wrestled was how to instantiate a large collection of objects, associated with their children in one-to-many has-a relationships, while neither executing superfluous queries nor pulling in extra information in the join operations. The solution to this was to issue one query for the collection of objects being loaded, and one query for each of the has-a relationships, ordering the object query by its surrogate primary key, and each of the has-a link table queries by the associated foreign primary key. The best image I can conjure for this process is the zipping of an n-threaded zipper where different segments of the various zippers have to be shifted to match up. :-) Presumably other OO-mappers have successfully wrangled with this problem before me, but I was pretty happy with myself for working out all the details. The original module of which I spoke totally punted on the issue, instead assuming that the user would never simultaneously load multiple objects that had one-to-many relationships.
Oh, and both of your links seem like excellent resources to promote the uniqueness of my own module, something that, as jZed pointed out in another comment, is of utmost importance if I am going to stimulate any interest in my module.
Thanks again.
One of the key issues with which I wrestled was how to instantiate a large collection of objects, associated with their children in one-to-many has-a relationships, while neither executing superfluous queries nor pulling in extra information in the join operations.
I think with these kind of tools, you cannot avoid having some superfluous queries as well as occasionally pulling in extra information. If your tool is so fine grained that it loads too little, it will likely slow to a crawl when trying to retrieve a lot of information. However the same can be true if your tool is too coarse grained, and it grabs much more than it needs each time. Personally I would recommend two possible approaches (which themselves are not mutually exclusive):
Lazy Loading
Load your main objects completely, but all relationships lazily. This will work pretty well for your more basic of DB schemas which have few relationships and contain much of the information for an individual entity in a single row/table. This way when fetching a collection of objects, you can get only as much as you need. But when you really are looking for a single object and all its related information, you can take the time and fetch it all.
Multiple Views for each entity
The idea here is that you are not always going to need the same "view" of an entity and its relationships for all situations. For instance, sometimes you only need the user_name and password information for your User object (both fields contained in the same hypothetical table), while other times you need their first_name, last_name, address, zip-code and phone number as well (contained in one or many hypothetical tables linked to the user table). It makes sense maybe to have two kinds of User objects, one for verifying user_name and password upon login, and the other for printing out mailing labels, each optimized for their specific usages.
Oh, and both of your links seem like excellent resources to promote the uniqueness of my own module,...Don't get too caught up in the "uniqueness" of your module. There are many here who will cry "Your Re-Inventing the Wheel!", but the fact of the matter is that just because its been on CPAN for 5 years, and has made it into the core module set, blah blah blah, doesn't mean its still the best tool for the job (and in particular the job you are doing) and might not be old and krusty. There is always room for someone to "Build a better Mousetrap" out there, and unless you try, you will never know. However, always keep in mind that in order to build Mouse::Trap::Better.pm you should know about all the other mousetraps available out there so that you can learn from thier mistakes and build upon their success.
Don't get too caught up in the "uniqueness" of your module. There are many here who will cry "Your Re-Inventing the Wheel!"I agree that uniqueness should not be a deciding factor in whether or not to develop and CPAN the module and hope the OP didn't take my remarks that way. But in naming and describing the module it's best to focus on what its strengths are, what it shares with other similar modules, and how it differs from other similar modules.
Actually, the way I've done it you can have your cake and eat it too, in that you can choose precisely how lazily objects are loaded with respect to their relationships, and the only "waste" is the foreign key with which I perform my own "join" external to the database, which isn't really waste because it is necessary to match up objects and their children/parents. Perhaps "overhead" is a better word.
To query a collection of objects, you instantiate a SQL::Object::Query object. Its constructor takes as an argument an array ref that holds zero or more objects of type SQL::Link::Query. When you issue its execute method, it returns to you a SQL::Object::ResultSet object which is more or less an iterator.
Basically, what happens internal to the execute method is as follows... It invokes the execute method of each SQL::Link::Query object, but not before specifying that each of the results should be ordered by the primary key of the objects being specified by the SQL::Object::Query object. You basically end up with n+1 streams of results, where n is the number of SQL::Link::Query objects being used. For each item that you pull off the object stream, you look in each link stream and see if there is a run of one or more links related to that object, and if so you pull them off the stream and associate them with the object.
As an additional thing to note, you can specify that the queries pull in a maximum number of records at a time. When any given stream runs out of results, the stream is refilled by issuing the query again with the LIMIT parameter modified appropriately. In the case of relationships, this logic lives internal to SQL::Link::ResultSet. For objects, the logic lives within SQL::Object::ResultSet. As such, the whole process is transparent to the user. He thinks that he is getting a steady stream of objects, when under the hood my code is performing queries piece wise to avoid clobbering system memory, and sewing things together without his knowledge.
The laziness with which you load objects depends on how many SQL::Link::Query objects you pass into the SQL::Object::Query constructor, and you can ostensibly pass in none at all, resulting in complete relationship laziness. The SQL::Object objects are capable of later doing on-demand loading of relatives also by using a SQL::Link::Query object, this time specifying as a condition that the child id (or parent id, if you like) be equal to its own primary key.
Did all that make sense? If there is a logical hole in my implementation then I'd love to hear about it sooner rather than later. As best I can tell it meshes the need for performance and flexibility pretty well. I just hope that there isn't some kind of limiting flaw that I have missed that cripples its capabilities.
Did all that make sense?
For the most part, Yes. But without really seeing all the documentation and being able to try things out, I can't really say how much sense it makes, as I may be mis-interpreting you.
If there is a logical hole in my implementation then I'd love to hear about it sooner rather than later.
Again, nothing I can see from what you describe, but again, the sooner we can all read the real docs, the better we could tell.
I just hope that there isn't some kind of limiting flaw that I have missed that cripples its capabilities.
Thats what version 0.02 is for :). A few things I have learned after uploading several modules to CPAN (and there are more to come, I am converting our entire internal framework to open source CPAN modules). 1) Very few people will jump to use version 0.01 of a brand new modules from an unknown author (I know I don't), 2) You will never know who is using/downloading your module, as there is no way to tell really (well there are ways, but they will get your banned by merlyn) and 3) spell check your documentation :)
On a related thread, I'm curious as to how I should go about testing interactions with a database on a user's system. It would be very rude for the testing procedures of my module to carelessly stomp on the database residing on someone's machine. What precautions and paradigms should I employ so as both to thoroughly test my module on the user's machine and avoid trashing his environment? I've read horror stories of hapless users having rogue modules smash their data, and I'd rather not be a purveyor of such a module.Require DBI::SQLite as a dependency and run your own database, or perhaps DBD::Anydata.
At least, until your differences exceed what is required to share the same branch.
I believe that when there are so many alternatives, there must be something yet to discover that would finish the searching of alternatives. And so, it would then start improving what exists. Nevertheless, I am impressed to discover the level of any of these modules.
I would suggest adding dependencies at the starting of every sub, and every module and at the pod (to read before installing).Also, sub input and output ref. types.
The better documented it is, the less dissapointed should any user get!
Bone chance!
When I evaluate modules, the synopsis is a productivity tool. I cut and paste the synopsis to create a small test program. Modules with a bug in the synopsis, or with a synopsis that doesn't show the strength of the module, have trouble getting my attention.
I first watched the pod-synopsis cut-and-paste technique used by a tremendously strong perl programmer during a perl programming competition. I have used it ever since.
Your challenge will be to come up with a synopsis that functions without having a known database to work with. So if you follow BUU's advice and test the install with a small test database, could you use this same database in your synopsis?
perlmonks.org content © perlmonks.org and BUU, chanio, clscott, jZed, skyknight, stvn, toma
prlmnks.org © 2006 edmund von der burg (eccles & toad)
v 0.03