Both data science employers and candidates stress the eclectic nature of the required skills, programming in particular. Indeed, coding has acquired such an elevated role that it now substitutes for education. Aside from the socially destructive consequences of this trend, in the context of data management it is a regressive, self-fulfilling prophecy that obscures and disregards the core practical objective of database management: to minimize programming. It frequently surfaces in comments like:
"Anything you can model in a DBMS you can model in Java. The next paradigm shift is business rules centralized in Java business objects, rather than hard-coded in SQL for better manageability, scalability, etc. The only ones that should reside in a database are referential integrity (and sometimes even that isn't really necessary). Don't let pushy DBAs tell you otherwise -- integrity constraints slow down development as well as performance."
Upside down and backwards.
First, programming languages such as Java are computationally complete and, as such, are based on a logic higher than first-order predicate logic (FOPL). While expressively more powerful, they allow self-referencing and, therefore:
- Are not decidable, i.e., are susceptible to the halting problem
- Cannot support data independence -- the insulation of applications and users from physical and logical data reorganizations
- Are imperative (procedural)
- Are more complex than the relational algebra
What unfortunately remains misunderstood to this day is that the relational data model (RDM) was introduced precisely to address these drawbacks. FOPL-based relational data sublanguages are decidable, support data independence, are declarative (non-procedural), and are much simpler. Computational completeness for application development is achieved by hosting them in programming languages.
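To make that division of labor concrete, here is a minimal sketch of a declarative SQL expression hosted in Java via JDBC; the connection URL and the employees table are illustrative assumptions, not anything from the original text. The SQL string states only what data is wanted -- a decidable, declarative FOPL-based expression -- while the computationally complete host language supplies control flow and I/O:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HostedSublanguage {
    public static void main(String[] args) throws Exception {
        // The connection URL and the employees table are illustrative assumptions.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:postgresql://localhost/demo")) {
            // The declarative sublanguage: states WHAT is wanted, not HOW to
            // retrieve it -- no loops, no access paths, guaranteed to terminate.
            String query = "SELECT name, salary FROM employees WHERE salary > ?";
            try (PreparedStatement stmt = conn.prepareStatement(query)) {
                stmt.setInt(1, 50000);
                // The computationally complete host: supplies control flow,
                // I/O, and whatever else the sublanguage deliberately omits.
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("%s: %d%n",
                            rs.getString("name"), rs.getInt("salary"));
                    }
                }
            }
        }
    }
}
```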
Second, many years ago I wrote an article titled “Integrity is Not Only Referential” with a double meaning: I was criticizing a DBMS vendor that misleadingly claimed its product’s application-enforced integrity was relational DBMS-enforced integrity, and I was deploring data professionals’ poor grasp of this important distinction, which lets vendors get away with it. Sadly, not much has changed since then.
Of the several types of relational integrity constraints -- domain, attribute, tuple, multi-tuple, and database (multi-relation) -- data professionals are superficially familiar with only one type of multi-tuple constraint, the key constraint, and one type of database constraint, the foreign key (referential) constraint. Not only do they fail to demand DBMS support for all integrity constraints because they are unaware of them, but they often dismiss even the enforcement of the two they do know as unnecessary, without realizing the deleterious implications.
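For concreteness, here is a PostgreSQL-flavored sketch of how each constraint type might be declared for a hypothetical employees/departments schema; the tables, names, and rules are illustrative assumptions, and since SQL only approximates the relational taxonomy, the mapping in the comments is approximate as well:

```sql
-- Domain constraint: defines a pool of permissible values.
CREATE DOMAIN money_amount AS NUMERIC(10,2) CHECK (VALUE >= 0);

CREATE TABLE departments (
    dept_id INTEGER PRIMARY KEY,               -- multi-tuple (key) constraint
    budget  money_amount NOT NULL
);

CREATE TABLE employees (
    emp_id  INTEGER PRIMARY KEY,               -- multi-tuple (key) constraint
    hired   DATE NOT NULL,
    retired DATE,
    salary  money_amount NOT NULL,             -- attribute constraint: values
                                               -- drawn from the domain
    dept_id INTEGER NOT NULL
        REFERENCES departments (dept_id),      -- database (referential) constraint
    CHECK (retired IS NULL OR retired > hired) -- tuple constraint: relates values
                                               -- within a single row
);

-- Database (multi-relation) constraints beyond foreign keys -- e.g., "total
-- salaries in a department must not exceed its budget" -- would require the
-- SQL standard's CREATE ASSERTION, which major SQL DBMSs do not support:
-- a measure of how partial their integrity enforcement is.
```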
"Constraints out of data access code" is a regress to the pre-database days of application-enforced integrity and is the opposite of improving manageability and productivity. Aside from preventing subversion of constraints that are conducive to database inconsistency -- which should be every analyst’s concern -- DBMS-enforced integrity would actually improve performance by allowing the DBMS to integrate constraints into its optimization strategy, something which is difficult to do if the constraints are enforced with triggered procedures and impossible if they are scattered in application code.
It is argued that "The analyst shouldn't be worrying about which [DBMS] solution is being implemented, that is the data engineer's job, but rather what business value or insights can be extracted from the data." Well, those insights depend on the integrity of the data in databases, which in turn depends on how reliably that integrity is enforced. Unfortunately, SQL DBMSs do a very poor job of it and NoSQL DBMSs do practically none, which should put the analyst on guard.
This is an excerpt from The DBDebunk Guide to Misconceptions About Data Fundamentals, available at dbdebunk.com.