Database Fundamentals for Analysts
View Comments: Newest First | Oldest First | Threaded View
Page 1 / 2   >   >>
Re: A Difficult Situation
  • 7/11/2015 11:02:36 PM

"....I will go back and re-read as repetition is the mother of learning."

@ctevert    Agreed.   Good to know I am not the only one who uses this method of learning.

Re: A Difficult Situation
  • 7/11/2015 10:57:19 PM

" ... I'm disappointed over the lack of commentary on this topic.  I have a sick sense that most readers don't even grasp the implications of what is being discussed here."


I think most readers understand the issue, however understanding how to address it ( much less solve it ) is another issue altogether.   Which is in part why A2 was created and we have experts such as yourself and Fabian within the community.   Thank you for adding to our continuing education on this subject.


Re: A Difficult Situation
  • 6/30/2015 1:18:25 PM

What about creating another table like "Prev95_Admisions" or "Historic_Admissions", there all admissions before year 95 will be entered.

Also there are things like constraints, you can enter all data without constraints and then add the constraints with NOCHECK, if you try to clean absolutely all the data, it will take way too much effort for the business to appreciate it.

Later on once it is implemented you can hire some technicians or data entries to do a constant clean-up of the data.

One thing is the design of an R-DB and another the data migration into a new Relational DB. There are different tools that allow you to do the clean-up and look for some data integrity before doing a build entry into the new DB.

A DB design should follow Codd rules, and as far as possible it should not change just because you want to enter whatever you want, if that will be the case then better go for a NoSQL approach, or flat files like XML or JSON.

In this scenario it is not just the final RDB design, but also do like a staging phase for migration, then a cleanup and finally a load into the new RDB database.

Disable constraints or allow null values in important fields, or even adding more tables just to insert big chunks of data that does not comply with good integrity but that you also know will not be part of new transactions. Should be the last options when migrating or merging old DBs with new standardized DBs.

Re: A Difficult Situation
  • 6/25/2015 5:01:48 PM

Well, yes, but it should not be--the distinction between reality and our knowledge of it is obvious, but the implications for inferencing are not, they were not even to Codd, who initiated the problems with his 4VL that ended up messed up further by NULL's.




Re: "optional" tables?
  • 6/25/2015 4:53:26 PM

Something like that. However the whole idea of migrating the database was to get away from a "gateway" database person and complex design and retrieval. With variables both added and dropped, the schema changes continuously and The lack of support of ESS relationships makes things even more complex.

It's difficult to persuade analysts to tolerate this--they always prefer one big table with disregard for the consequences.

The core point, however, is this is associated with the RDM, rather than with the SQL products failing to implement it, making sure that it'll not get implemented.



Re: A Difficult Situation
  • 6/25/2015 4:49:27 PM

"The real world obeys two-valued logic (true/false) whether we know it or not but knowledge does not"

This is a frameable quote. 

"optional" tables?
  • 6/25/2015 4:32:53 PM

In the absence of a RDBMS that works like your paper #3, we could try something sort of like Darwen's solution to missing data?




With constraints created as best we can on the 3 tables, and deferred constraint checking, we can "sort of" stay consistent when we enter data into the system since either the whole transaction works or it rolls back. Our constraint logic might possibly require triggers or stored procedures depending on the SQL vendor's constraint support.

It's annyoing and frustrating to do this for any table where we have missing data though. But for an analyst that might be relying on THOSP_ID, it's probably better to make them explicitly state what they want. They can query the two tables independently and union the results (perhaps they are doing counts or sums based on THOSP...) Or create a view that gives them the results after an explicit discussion of what should be placed in the "missing" THOSP_ID value...




Re: A Difficult Situation
  • 6/23/2015 1:24:36 PM

The core difficulty with missing data is that we don't record reality in the database, but our knowledge of it, which is imperfect. The real world obeys two-valued logic (true/false) whether we know it or not but knowledge does not, which introduces undecidability and breaks FOPL and the RDM. The only way to preserve provably logically correct results is to (1) make inferences only from what is known and (2) monitor known unknowns in the database catalog.

If database professionals do not know and understand data fundamentals, can we expect analysts to? Neither can be blamed because there is no proper education and employers do not insist on such knowledge--in fact together with vendors pressure schools to focus on tool training. That's one of the reason I am skeptical of "data scientist"--how can somebody without scientific training and knowledge of what science means be a scientist? Complaining about the RDM being "just a theory" and "impractical" is the best evidence that I am right.


Re: A Difficult Situation
  • 6/23/2015 11:28:06 AM

>Well, you have my papers, they should educate you

Yes you are right.  I am especially intrigued by the meta-data solution to missing data proposed and it seemed to me, after having read it, the only viable solution coupled along with specialization (ie sub-typing) to avoid optional columns which do not apply in the logical schema.  The 4 part series you mention was actually my introduction to deeper study on the topic as it upended all my previous notions I had on the topic (as well as exposed my inability to write concisely).  I will go back and re-read as repetition is the mother of learning.

Missing data is *the* supreme difficulty.  You can accomplish creation of R-tables in all other ways using a SQL DBMS through discipline *except* in dealing with missing data.  I would simply not make columns NULLable until I realized how much time data analysts spent finding "suspicious" data like a birth date of 1/1/1900 resulting from the fact that 1/1/1900 was the fastest way around a mandatory data entry field.  Outer Joins inject NULLs violating closure.  Date and Darwen's approach I think highlights precisely the need for this to be a DBMS, not application, function.  And it is by no means trivial.

Two.  I'm disappointed over the lack of commentary on this topic.  I have a sick sense that most readers don't even grasp the implications of what is being discussed here.

Re: A Difficult Situation
  • 6/22/2015 5:58:57 PM

>I think by "R" Fabian meant "relational" not the R programming langauge.  


>Regarding the question as to a single table vs. a multi table solution, I would lean toward a multi-table solution with one table for each complete variable set.

In general there are practical advantages to a "one R-table per entity class and one entity class per R-table" design (can you figure out why?). This would apply in your solution if the relationship between the classes of entities represented by the R-tables were 1:M. However, they are 1:0/1 due to discontinuation/adding variables.

>Our data analysts would of course want a single table solution - actually a set of single tables for each area of analysis with much duplicate data.  They would call these "analytic records."  So they would want say an admissions analytic record, a deliveries analytic record, and so on, with all the variables there from other tables that relate.  

Yes indeed and, in fact worse: they may not insist even on that and want one big table biased to a particular analysis which, if generated in SQL, may lose guaranteed provably logically correct analysis results. e.g.

>My fundamentals aren't strong enough to make a clear argument why the multi-table solution that eliminates missing variables is correct.

Well, you have my papers, they should educate you. In particular "The Final NULL in the Coffin" which proposes a logically correct relational solution to missing data without NULL's. The other part of the solution to the problems in this case is the entity supertype-subptypes relationship, which the RDM supports, but SQL does not (there is a 4 part series on this subject

>Regarding the in database analytics, I do think using it with a single table with a lot of missing data would be difficult.  A good deal of time is spent "imputing" missing data for analytic models and in SAS for example that is done by creating a new data set with imputed values - say a default, a average or median, or using some algorithm.  I don't see how that can be done "in database" and so the tendency would be to still extract all the data to a SAS data set where records could either be dropped or missing values imputed.

Which is why we distinguish between database and application functions, which escapes so many data professionals.

Page 1 / 2   >   >>