How Bad Things Happen … in the Data Layer

Doug Wilson
15 min read · Dec 13, 2023


The unintended but entirely predictable consequences of quick & dirty data design and structures, overly simplistic demo code, ceding control to ORMs, unexamined assumptions, and the lack of standards and discipline

Or “Look, fuckers. You people ain’t thinkin’ about this stuff.”

Everything is not fine. Your code’s not great. Your customers aren’t happy. This ship is sinking. And it’s your fault. M’kay?

A man sits looking at a wrecked ship on the rocks
Photo by Walid Ahmad from Pexels

This is another in a series of articles about the sorry state of “modern” software development and how we got here. These articles will all be part of my Look, Fuckers publication here on Medium, which debuted with the general introduction How Bad Things Happen … in Software Design & Development.

So let’s start with the foundation: data.

Data Analysis & Design

Data is absolutely foundational to information systems. If the data ain’t right, ain’t none of it gonna be right. At the very least, teams will spend lots of unnecessary time coding around or in place of things that could and should have been established clearly in the data layer with just a little thought and discipline.

The data layer — database roles, schemas, tables, columns, data types, indexes, constraints, functions/stored procedures, Structured Query Language (SQL) scripts to create all this structure and to provide reference data, etc — is there to work for us, not the other way around. It’s important that we understand and use the right tools for a given job and that we wisely invest the time to design and create a structure that will pay dividends for years. Yes, this is not only possible but absolutely crucial to your long-term success and that of the organizations you serve. Use the data layer to model business entities and relationships and to persist their data. Design it to collect a complete, consistent data set from day one rather than only adding data elements when they’re asked for. I know: heresy, right?

But not insisting on a reasonably complete data set from day one means that we can only answer questions that the business has anticipated. Unanticipated questions are often urgent but must wait for:

  • Analysis,
  • Design,
  • Estimation, prioritization, and inclusion in an upcoming sprint,
  • Implementation,
  • Testing, and
  • A period of time to collect a statistically significant amount of data

Altogether, this can take months and is one of the leading reasons that proudly, purely “agile” organizations are not really agile at all, being unable to respond efficiently or effectively to new opportunities and challenges.

Database Roles

Database roles other than the built-in “super user”, e.g. the postgres role in PostgreSQL, the root account/role in MySQL, the db_owner role in SQL Server, etc, are necessary for the same reasons that accounts other than root are important in operating systems, cloud infrastructure, etc. Logging in regularly to the root account and using it to do everything is a huge security risk because the root account can do literally anything in the system. It’s easy to make mistakes, the effects of which could be catastrophic when operating as root, and if that account were to be compromised … game over. The structures and data you’ve worked so hard to create and collect could be stolen and leaked, ransomed, or destroyed.

So, at a minimum, our databases should have owner, administrator, writer, and reader roles, which we explicitly create and to which we intentionally grant specific permissions.

What does this look like (in Postgres)?

Note: Naming is vitally important and therefore should be intentional. We wisely invest time thinking carefully about names or adopt naming established by those who have thought carefully in the past. Ahem. We do NOT lazily cede vitally important aspects of system design like naming to our tools. I’m lookin’ at you, Database Management Systems (DBMSs), application frameworks, and Object Relational Mappers (ORMs). Instead, we instruct our tools to follow the naming we have established. We are human beings, who have worked hard to become highly skilled in our craft. They are tools, designed to help realize our designs. We decide, not them. Therefore, the double quotes in the code blocks below instruct Postgres to preserve our intentional Pascal case.

While logged in as the database server’s “super user” or root role, which has all permissions, create the ApplicationNameOwner and other roles, which have no permissions by default. Then log out.

CREATE ROLE "ApplicationNameOwner" WITH
LOGIN
NOSUPERUSER
NOINHERIT
CREATEDB
NOCREATEROLE
NOREPLICATION
PASSWORD '*TEMP-PWD*'; -- TODO: Change this temporary password (used only for initial database scripting) to the value generated by your secrets manager

While logged in as ApplicationNameOwner, create the application database, schemas, tables, functions, and other structures, and explicitly grant only the absolutely necessary permissions to each role. Then log out.

CREATE DATABASE "ApplicationName"
WITH
OWNER = "ApplicationNameOwner"
[Other Parameters] ...;
CREATE SCHEMA "BusinessEntity1" AUTHORIZATION "ApplicationNameWriter";
CREATE TABLE "BusinessEntity1"."BusinessEntity1"
(
"Id" bigint NOT NULL,
"Uuid" uuid NOT NULL,
"EntitySubtypeId" bigint NOT NULL,
"TextKey" character varying(100) NOT NULL,

"LocalizedName" character varying(100) NOT NULL,
"LocalizedDescription" character varying(2000) NOT NULL,
"LocalizedAbbreviation" character varying(15) NOT NULL,

"ResourceName" character varying(100) NOT NULL,
"Ordinal" bigint NOT NULL,
"IsActive" boolean NOT NULL,
"CorrelationUuid" uuid NOT NULL,
"Digest" character varying(500) NOT NULL,
"CreatedAtDateTimeUtc" timestamp without time zone NOT NULL,
"CreatedByInformationSystemUserId" bigint NOT NULL,
"UpdatedAtDateTimeUtc" timestamp without time zone NOT NULL,
"UpdatedByInformationSystemUserId" bigint NOT NULL,
"DeletedAtDateTimeUtc" timestamp without time zone NOT NULL,
"DeletedByInformationSystemUserId" bigint NOT NULL,

CONSTRAINT "BusinessEntity1_PK" PRIMARY KEY ("Id")
);

-- NOTE: This script makes a pretty reasonable template for any entity data table.
-- Copy, paste, slug in the entity name, and replace (or add to) the LocalizedName, LocalizedDescription, and LocalizedAbbreviation columns,
-- and you've got completeness and consistency covered with zero extra time and effort.
-- Why do you suppose people insist on starting from scratch every time when it's so easy to do things well (and even better) every time?
GRANT CREATE ON DATABASE "ApplicationName" TO "ApplicationNameOwner";

GRANT SELECT, INSERT, UPDATE ON "BusinessEntity1"."BusinessEntity1" TO "ApplicationNameWriter";

Note: We don’t grant DELETE permissions in order to enforce “soft” deletes, which are actually just SQL UPDATEs.
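Under this scheme, a “soft” delete is just an UPDATE that stamps the deletion columns. A minimal sketch against the template table above (the Id values are placeholders):

```sql
-- Soft delete: mark the row rather than removing it
UPDATE "BusinessEntity1"."BusinessEntity1"
SET "DeletedAtDateTimeUtc" = (now() AT TIME ZONE 'UTC'),
    "DeletedByInformationSystemUserId" = 456, -- authenticated user's Id
    "UpdatedAtDateTimeUtc" = (now() AT TIME ZONE 'UTC'),
    "UpdatedByInformationSystemUserId" = 456
WHERE "Id" = 123;
```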

Please also note the consistency, completeness, care, and specificity in the table column declarations:

  • No null values are allowed (default values will be supplied instead)
  • All integer data types are bigint
  • All varying-length character (varchar) types used for similar purposes, e.g. names or unique identifiers or “keys”, are a standard, 100-character length
  • The “Localized” prefix indicates columns that may contain the same information in other languages, currency types, date formats, etc
  • The timezone in which created, updated, and marked for deletion timestamps are stored is clear, letting “downstream” developers, report writers, analytics developers and data scientists know what to expect without the need to create and maintain a separate data dictionary

Now the application can safely and securely connect to the database to enable its interactive users to create, read, update, and mark data for deletion (CRUD) using the much more limited but capable ApplicationNameWriter role for the system’s day-to-day database interactions without the risk of creating or modifying existing structures.

The same approach can be taken with other appropriately-permissioned roles like the read-only ApplicationNameReader role, which is perfect for reporting, analytics, etc.
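What might that look like? A sketch, following the same pattern as the owner role above (the exact grants will vary by application):

```sql
-- Create the read-only role and grant only what it needs
CREATE ROLE "ApplicationNameReader" WITH
LOGIN
NOSUPERUSER
NOINHERIT
NOCREATEDB
NOCREATEROLE
NOREPLICATION
PASSWORD '*TEMP-PWD*'; -- TODO: replace with a secrets-manager value

GRANT CONNECT ON DATABASE "ApplicationName" TO "ApplicationNameReader";
GRANT USAGE ON SCHEMA "BusinessEntity1" TO "ApplicationNameReader";
GRANT SELECT ON ALL TABLES IN SCHEMA "BusinessEntity1" TO "ApplicationNameReader";
```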

Note: Many of today’s developers don’t know better, aren’t cautious, or don’t wisely invest the time in laying a sound foundation, leaving that work instead to their tools, which are not nearly as smart or as thorough as their alleged masters believe them to be. As a result, there is a good chance that your application(s) are running with a dangerously over-permissioned role, e.g. the “super user”, root, or database owner role. Eek!

Database Namespaces/Schemas

Many database management systems (DBMSs) support the concept of logically grouping database objects like tables, functions/stored procedures, etc for purposes of organizing large collections of these objects, managing permissions, etc. These logical grouping mechanisms are often called namespaces or “schemas”.

Schemas can also be used to logically partition microservice data objects rather than implementing an entire database per microservice as advocated in some approaches.

It’s important here to have a pragmatic and logically consistent microservice definition, i.e. one more compelling than just a “smaller” service, which immediately raises the questions “How much smaller?” and “Smaller how? Bytes or lines of source code? Size in memory?”

To me (after a lot of thought about and dissatisfaction with other definitions), a microservice is a thin service layer “wrapper” around the four basic database operations (Create, Read, Update, and Delete or “CRUD”) for a given business entity, plus the ability to publish event data to a queue when any of these operations occur, e.g. a new Order with Id = 123 was just created.

Entity Data Tables

Having tightened up our database security by granting only the necessary permissions to our database connection role, ensuring a complete, consistent data set by basing each database table creation script on a good template, and logically partitioning our database using schemas, we are now ready to turn our data analysis and design attention to the database tables themselves.

How can we decide what tables to create, what data each table should contain, and how they should relate to each other?

Central to my Metis approach to application technical design is the assertion that business information systems are essentially composed of three basic kinds of things: business entities, processes, and rules.

  • Entities represent the nouns — the persons, places, and things that these systems are intended to manage.
  • Processes are the verbs — the actions that operate on and orchestrate the interactions of these entities.
  • Rules (or “decisions”) are the logic and conditions that inform and constrain these processes.

Each business entity model consists of:

  • Name (PascalCase, no punctuation)
  • Attributes (including common data attributes like Id, Uuid, EntitySubtypeId, etc)
  • Data types
  • Relationships (cardinality, ordinality) and
  • Dependency data and metadata (e.g. minimum and maximum values, validation criteria, etc), as well as
  • Any entity-specific scripted lookup/reference data.

These business entity models can be used to create corresponding data structures, e.g.

  • Database tables in the data layer
  • Microservices and Application Programming Interface (API) endpoints in the service layer
  • API documentation and
  • User interface (UI) components in the presentation layer

They can also be grouped in the business domains, e.g. General, InformationTechnology, Product, etc, to which we decide they belong.

By basing our analysis (and naming) on things as they actually are, we can create (or adopt) business entity models that always work and can be easily extended, because they work the same way the things they represent actually work. When we don’t understand this or don’t invest the time to do it well, we get into trouble: we carelessly create poor-quality data structures that may seem to work initially and that get baked deeply into our systems, until some future date when their inaccuracies become problematic and are discovered, by which point they usually prove too difficult, time consuming, and expensive to change. In other words, by failing to model data well up front, we fail the organizations we are supposed to serve and those who depend on them.

Note: The Common Entity Attributes below are part of each business entity model and are the same set of database column definitions included in the example CREATE TABLE SQL script above.

A two-column entity diagram showing common entity attribute names and Postgres data types

Note: The entity-specific attributes below, e.g. the Person entity’s LegalGivenName, LegalSurname, and BornAtDateTimeUtc, are in addition to the Common Entity Attributes above, which are not included in the diagram in order to reduce its size and visual complexity due to duplication.

An Entity-Relationship Diagram (ERD) showing sixteen business entities (with entity attribute names and Postgres data types), e.g. EntityType, EntitySubtype, Person, Organization, Employee, etc, the relevant EntitySubtypes, and entity relationships in the General business domain. The set of Common Entity Attributes is present in each entity but not included in this diagram to reduce its visual complexity.

Note: These entity names should be used both for the schema and table associated with each business entity.

Note: As you can see from the EntityType diagram, there are three kinds or “subtypes” of entities:

  • Business Entity (e.g. Person, Organization, etc, which can contain scalar text-based and numeric data, as well as foreign key references)
  • Association (e.g. Employee, which associates a Person with an Organization using their Id values as foreign keys) and
  • Hierarchy (e.g. OrganizationalUnitHierarchy, which can represent hierarchical structures using OrganizationalUnit Id foreign key values as ParentOrganizationalUnitId and ChildOrganizationalUnitId).

Using these structures, we can

  • Build up complex business entities by composition, based on “Is A” and “Has A” Object-Oriented Analysis (OOA) criteria, i.e. an Employee is a Person, and a Person has a date of birth (BornAtDateTimeUtc).
  • Associate independent entities like PhoneNumbers and EmailAddresses with Persons, OrganizationalUnits, etc and
  • Model complex organizational structures.
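A Hierarchy entity of the kind described above can be sketched as a self-referencing association table. This is illustrative only; the schema, table, and constraint names assume the naming conventions from earlier, and the Common Entity Attributes from the template are omitted for brevity:

```sql
-- Sketch of a Hierarchy entity table (entity-specific columns only)
CREATE TABLE "OrganizationalUnitHierarchy"."OrganizationalUnitHierarchy"
(
"Id" bigint NOT NULL,
"ParentOrganizationalUnitId" bigint NOT NULL,
"ChildOrganizationalUnitId" bigint NOT NULL,
CONSTRAINT "OrganizationalUnitHierarchy_PK" PRIMARY KEY ("Id"),
CONSTRAINT "OrganizationalUnitHierarchy_Parent_FK"
    FOREIGN KEY ("ParentOrganizationalUnitId")
    REFERENCES "OrganizationalUnit"."OrganizationalUnit" ("Id"),
CONSTRAINT "OrganizationalUnitHierarchy_Child_FK"
    FOREIGN KEY ("ChildOrganizationalUnitId")
    REFERENCES "OrganizationalUnit"."OrganizationalUnit" ("Id")
);
```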

Anti-Pattern

Poor attribute grouping often results from the lack of a clear business entity concept, leading to data attributes that actually belong to several different entities being thrown together in a single database table like this actual nightmare I encountered.

The variation in attribute naming and data types often occurs over time, when new developers add attributes without paying attention to the patterns established before them. This happens so often that there actually seems to be an aversion to doing the same thing twice in the same way.

All of the following attributes were part of an account table:

  • first_name (U.S.-centric rather than positioned for localization)
  • last_name (U.S.-centric, many cultures have more than one family name)
  • born_year
  • user_id (character-based for absolutely abysmal SQL JOIN performance, and a “user” in the context of a system, which belongs to or is controlled by an organization)
  • email (people and organizations can have more than one email address)
  • password (no indication about whether the password values are hashed or in the clear)
  • is_active (might this be more valuable as a timestamp, e.g. deactivated_at?)
  • created_at (no indication of the timezone in which these values are captured and stored — local? GMT? UTC? — or who initiated this action)
  • updated_at (same concerns as above)
  • expiration_date (inconsistent naming — why not expires_at like created_at and updated_at? — and date only, no time?)
  • institution (we can guess about what this indicates, but we can’t be confident)
  • last_login (overwritten here every time? a session data table would capture a history, including first, last, timeouts, etc on each type of client application without frequent and risky updates in the account table)
  • unlimited_sessions (this really drags in the concept of authorization and permissions)
  • last_token_refresh_native (see the session data table suggestion above)
  • first_login_native (see the session data table suggestion above)
  • is_superuser (see permissions above)
  • is_staff (see permissions above and/or indicate with a subtype)
  • phone (people and organizations can have more than one phone number)
  • address_id (inconsistent implementation — phone is stored directly in this account table but address is not — and people and organizations can have more than one postal address)

After just a little thought and some basic clean-up based on the principles above, this is much clearer and more performant as:

Person

  • LegalGivenName (strongly indicates the expectation that this information should come from a legal document like a birth certificate)
  • LegalSurname (strongly indicates the expectation that this information should come from a legal document like a birth certificate)
  • BornAtDateTimeUtc (makes clear to developers and to downstream data consumers like report writers, etc what kind of data this column should contain)

InformationSystemUser

  • EmailAddressId
  • PasswordHash
  • IsActive
  • CreatedAtDateTimeUtc
  • UpdatedAtDateTimeUtc
  • ExpiresAtDateTimeUtc

Organization

  • LegalName (strongly indicates the expectation that this information should come from a legal document like articles of incorporation)

InformationSystemUserSessionEvent

  • EntitySubtypeId (indicating LoggedIn, LoggedOut, Expired, or Renewed)
  • SessionLoggedInAtDateTimeUtc
  • SessionLoggedOutAtDateTimeUtc
  • SessionExpiredAtDateTimeUtc
  • SessionRenewedAtDateTimeUtc
  • IsAllowedUnlimitedSessions

InformationSystemRole, e.g. Staff, Administrator, etc

InformationSystemPermission

PhoneNumber

  • InternationalCode
  • AreaCode
  • LocalPrefix
  • SubscriberNumber
  • Extension

PostalAddress

  • Id

Attributes from eight (8) different entities crammed into two: account and address.

If we saw this kind of “god” class or violation of SOLID principles in the service layer, it would be called out and refactored with extreme prejudice, but this unexamined, unchallenged garbage gets baked into the data layer all day every day and then impedes progress and costs the organization, its stakeholders, and its customers money forever, rarely if ever getting fixed because it’s in the foundation, and messing with the foundation can put the whole Jenga stack at risk.

SQL Scripts/Migrations

By automating the creation of database roles, schemas, tables, constraints, functions/stored procedures, etc as well as fixed, “canonical” sets of reference or “lookup” data, e.g. countries, states, languages, etc, we can ensure the consistency of data sets and even id values across our environments, making testing and troubleshooting easier, while making it quick, easy, and reliable to tear down and re-create any Infrastructure as Code (IaC) environment.
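For example, reference data can be scripted with explicit Id values so that every environment gets identical rows. The Country table and its columns here are hypothetical and trimmed for illustration; a real script would supply every NOT NULL column from the template:

```sql
-- Canonical lookup data: explicit Id values keep reference rows
-- identical across dev, test, and production
INSERT INTO "Country"."Country" ("Id", "TextKey", "LocalizedName", "IsActive")
VALUES
(1, 'country-united-states', 'United States', true),
(2, 'country-canada', 'Canada', true)
ON CONFLICT ("Id") DO NOTHING; -- idempotent: safe to re-run in migrations
```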

But some migration tools can’t do what we really want them to do because they are afflicted with the same short-sighted, pedestrian approach to data design.

Here again, I will state that we do NOT lazily cede vitally important aspects of system design to our tools. Instead, we instruct our tools to do things the way we want them to. We are human beings, who have worked hard to become highly skilled in our craft. They are tools, designed to help realize our designs. We decide, not them.

Database Functions/Stored Procedures

I read almost weekly that stored procedures are dead and/or a bad idea, but this is simply incorrect.

There’s a lot of value to being able to essentially conduct an application’s data operations through the same stored procedure data access layer called from the service layer, and there are certain things that can only be done server side in the database.

There are also simple but useful things that functions or stored procedures can do, like generating Type 4 UUID values (that interoperate with Microsoft GUIDs) and standardized “minimum” and “maximum” timestamp values that do not exceed the datetime data type ranges in the service or presentation layers. These standardized values enable us to avoid null values, which are EVIL because allowing them requires every nullable column value to be inspected for null before it is read or used, needlessly complicating the creation and maintenance of service and presentation layer code.
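Here’s a sketch of what such utility functions might look like in Postgres. The Utility schema is an assumption, gen_random_uuid() is built in from PostgreSQL 13 (earlier versions need the pgcrypto extension), and the “minimum” value shown happens to match .NET’s DateTime.MinValue:

```sql
-- Type 4 UUID generation, interoperable with Microsoft GUIDs
CREATE FUNCTION "Utility"."GenerateUuid"()
RETURNS uuid
LANGUAGE sql
AS $$ SELECT gen_random_uuid(); $$;

-- A standardized "minimum" timestamp that stays inside the ranges
-- of common service-layer datetime types
CREATE FUNCTION "Utility"."MinimumDateTimeUtc"()
RETURNS timestamp without time zone
LANGUAGE sql
AS $$ SELECT '0001-01-01 00:00:00'::timestamp; $$;
```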

A stored procedure data access layer can also provide:

  • Consistent default values regardless of the client application,
  • The foundation of layered data validation with thoughtful selection of data types and sizes, and
  • Record digest values that can be used for data tamper-proofing, non-repudiation, etc

The unique, integer-based Id values discussed above can only be supplied by the database server since the client has no idea what the next counter value is going to be.

Uuid values could be supplied client-side but should not be.

Values like the common EntitySubtypeId can be made required in the data layer by defining each such column as not nullable and providing no default value for it.

TextKey default values can be derived from the LocalizedName of the EntityType and the LocalizedName of the EntitySubtype using a pattern, e.g. organizationalunit-division-39jq2. These human-recognizable identifiers are hugely helpful in troubleshooting and debugging, but it should be possible to override these default values using an optional client-side parameter.
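A derivation like that might be sketched as a simple function. The Utility schema and the five-character random suffix are assumptions for illustration:

```sql
-- Derive a human-recognizable TextKey, e.g. organizationalunit-division-39jq2
CREATE FUNCTION "Utility"."GenerateTextKey"(
    "EntityTypeLocalizedName" character varying(100),
    "EntitySubtypeLocalizedName" character varying(100))
RETURNS character varying(100)
LANGUAGE sql
AS $$
SELECT lower("EntityTypeLocalizedName") || '-' ||
       lower("EntitySubtypeLocalizedName") || '-' ||
       substring(md5(random()::text) FROM 1 FOR 5); -- short random suffix
$$;
```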

Ordinal and IsActive should have default server-side values, e.g. -1 and true, but should also accept optional client-side values in order to alter default sort orders and active status, respectively.

CorrelationUuid should be required and must be supplied by the client and persisted in any database table affected by a given business process (as well as included in a standard log message format) to aid troubleshooting and debugging.

Meaningful, performant Digest values can only be generated server side because they must include server-generated values like Id, Uuid, CreatedAtDateTimeUtc, UpdatedAtDateTimeUtc, and DeletedAtDateTimeUtc timestamps. Digest values should never be accepted from the client.
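As a sketch, a server-side digest might be computed with pgcrypto’s digest() function (assuming CREATE EXTENSION pgcrypto has been run); the column list and separator here are illustrative:

```sql
-- Compute a SHA-256 digest over server-generated values
SELECT encode(
    digest(
        "Id"::text || '|' || "Uuid"::text || '|' ||
        "CreatedAtDateTimeUtc"::text || '|' || "UpdatedAtDateTimeUtc"::text,
        'sha256'),
    'hex') AS "Digest"
FROM "BusinessEntity1"."BusinessEntity1";
```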

CreatedByInformationSystemUserId, UpdatedByInformationSystemUserId, and DeletedByInformationSystemUserId should be the logged-in (authenticated) user’s Id, taken from the valid access token passed in with every request.

By thinking carefully about where each value should originate, what it signifies, and how it will be used, we can set ourselves and the organizations we serve up for long-term success and business agility rather than unnecessary expense, time, and suffering.

Commitment Time

My fellow sinners, we’ve all cut corners in the past.

Perhaps we were in darkness, hadn’t heard the Data Analysis & Design gospel, and didn’t know the right thing to do. Perhaps we gave into “gotta get to market”, “just the MVP”, or “not enough time to do it right” pressure from our leaders or peers. Perhaps we misjudged the severity of the long-term consequences of not doing what we now know is right. I hope that the word of truth that I’ve shared with you today has convicted and inspired you to go and sin no more and to carry the Data Analysis & Design gospel out into the world with you.

So now with every head bowed and every eye closed … the buses will wait, as the choir sings “Just As I Am” … I challenge you to rededicate yourself to good data analysis & design!

Coming Soon

I’m launching an Open Source product that encapsulates and simplifies all of these Data Analysis & Design best practices and more.

Up Next

Next time we’ll dig into How Bad Things Happen … in Application Frameworks, including their Object Relational Mappers (ORMs).

About the Author

Doug Wilson is a mission-focused technologist, software development leader, and trusted advisor with 25+ years of proven innovation and problem solving experience, putting technology to work in the service of business.

Today, he advises select organizations on how to increase business agility while avoiding serious business and technology risks. He believes strongly that our systems should encourage not penalize us for learning and improving.

Learn more about the services he provides through his Cygnus Technology Services consulting organization, or schedule a free consultation to learn how to put his experience and unique point of view to work for you.



Doug Wilson

Doug Wilson is an experienced software application architect, music lover, problem solver, former film/video editor, philologist, and father of four.