Doing real deletes on user accounts is a surprisingly challenging problem and I'd be willing to bet very few companies do real deletes where all of your data is wiped permanently from the company. For legal and financial reasons, companies often need to keep track of historical user activity. If a company states in their investor quarterly report that they had 1M active users, they better be able to prove it in an audit.
And in a naive relational database implementation, deleting a user would cascade and delete activity associated with that user.
The easiest way around this is to do soft deletes where the data stays in the db, but the flag deactivates the user's account. Looks like the NYT just did a poor implementation of a soft-delete.
I used to work where (not a service for the general public) there was an "is deleted" flag for everything, but every now and then a client would insist that data be really deleted, and depending on who it was and how they asked, we might go and do it, which was a huge hassle and would cause no end of problems down the line.
On the other hand, "is deleted" flags end up causing issues when you forget to put "where not is_deleted" in your queries.
Lately I've faced kind of an inverse situation - I have a system that I can't control where things are permanently deleted once in a while for multiple reasons (rogue users, aging out of old versions) and so as I accumulate information in a little data warehouse for reporting, I decided to implement an "is deleted" flag there. Eventually though, deleting from the source was turned off because it's really not necessary.
I also worked at a similar place, and fantasized about rewriting everything so that soft-deletion wasn't a per-row technical detail, but was instead an explicit modeled business-flow. Perhaps stored as a flag on an Aggregate Root (like "Customer" or "Project") at a much coarser level of detail. Sure, your queries still need to account for it, but at least you don't have a potential patchwork of inconsistent flags.
P.S.: Random advice to anybody working on enterprisey stuff:
1. "Deletion" is too vague and broad. I strongly suggest you call it "deactivation" and some other word like "purging."
2. Deactivation is typically what a company actually wants, even if they don't know to ask for it. By phrasing it that way you also encourage stakeholders to think about "reactivation" before it becomes an architectural problem.
3. True purging is rare, and tends to be related to either disk-space issues or legal requirements. In the latter case, you'll want an audit-trail or tombstone of some sort, meaning it's still a real workflow and not just an easy SQL DELETE statement or something.
The discard gem for rails has put a lot of thought into this and works pretty well, all things considered.
It's not perfect, but I find using a library (either directly for inspiration) is often a shortcut to learning about a lot of edge cases from doing your own things.
Eh, the situation that I've been in, if I recall correctly, is that one has read/write access to all data for reporting, but not the ability to create views (or stored procedures etc) to share.
Where I am now, (as far as Oracle goes) you can't even create your own tables under your own schema. A view requires a meeting with a DBA and their manager and really special, compelling arguments.
Can you just have a general policy where every table that has an is_deleted flag also automatically gets a view, and clients must use the view unless they have a particular need to access deleted data?
Not sure about drawbacks but another solution would be to change the point where you do the queries instead. So you have a "user" object that is largely saved the same as most other "crud" objects, so you have a layer for those, add the flag there.
as far as I understand this would only work for users who don't have access to "deleted" rows. Are access rules a good way to handle this? Serious question. My solution would be to have views as "guards" for every table.
I don't understand what you mean re "access". RLS is what removes access to those rows, that's the point of RLS.
In postgresql, anyways, superusers aren't subject to RLS and the table owner, by default, isn't either. But RLS can be enforced for the table owner by a single alter statement.
I've been on both sides of this insisting, if a company annoyed me too much (e.g. headhunters mailing too frequently) I'd drop the "data privacy laws" (nowadays GDPR) bomb and ask for my data to be deleted.
On the other side, a customer got really pissed off by an online shop we maintained for a client, and asked for his data to be annihilated, we thought "What a douche.".
And something I had to learn the hard way and then teach quite a few people is that hard deletes don’t just turn your tables into Swiss cheese, they also can cause table scans.
When you delete a row, every inbound foreign key constraint has to be checked to look for any rows that refer to the deleted row, and most likely you didn’t set up an index for the foreign key, so now you have a table scan. Possibly several.
It’s not that much more work these days to set up a partial index on the table instead and add another WHERE clause, you have smaller problems with people accidentally deleting the wrong thing, and you’ve started down the path to audit trails.
It's been standard practice at all the companies I've worked at to index all foreign key fields, no matter what they're used for, I've yet to run into a situation where it's been more harmful than helpful, but these companies all had <10TB data in SQL so idk if it's good general advice.
There is also a user-valuable reason to not do hard deletes. Doing a soft delete prevents another malicious user from immediately reclaiming your now-available ID and pretending to be you.
> keep track of deleted users so that their usernames can't be reused
This seems to violate GPDR, no? Attacker attempts to create an account (say: victim@gmail.com) on AshleyMadison and is prevented because the server tracked past users. Attacker could them demonstrate victim@gmail.com was at one point a user on AshleyMadison.com
As others have mentioned, that's an issue already. The solution is to never acknowledge if a user does or doesn't exist on register/sign-up/forgot-password pages and simply state that instructions have been emailed to you in all cases. The key is that you don't act differently if the user does or doesn't exist.
In this case, where you're probing for user names or emails, you don't own the email, so you wouldn't receive the verification yourself, and thus wouldn't know if the account exists.
This is exactly why most password reset emails say "if you didn't request this, please let us know, as someone may be attempting to access your account".
You shouldn't use usernames in that scenario, just emails. After Signup, you just show a general message that a confirmation Email has been sent. If the account already exists, some policy to notify the account owner can be put in place.
Verifying the email keeps someone from hijacking the account without leaking that an account formerly existed. At least so long as their email isn't also compromised - in which case they have bigger problems.
That's not much different than not being able to create an account with victim@gmail.com because victim@gmail.com already has an account. Both instance leak information
You don't have to track their emails unless you are reusing emails as usernames. Just tracking the username suffices.
This is also one of those situations where people often put too much shit in the user table. "we have to delete the user row" -- I mean, you have to delete some of the user row, yes.
I like to solve this by proper namespacing. Suppose you instead deliberately have an authUser table which just has what you need for auth -- a UUID to hook into the rest of the system, salts and passwords for direct logins, maybe a nullable date "banned_until" if you want banning; assuming you use crypto bearer tokens rather than an auth tokens table then you also want a column with a date date for "tokens last reset on"; etc. You can put the username in there just fine, that's needed for auth. Maybe you let people log in with email+password and thus you also put their email address in there, also fine.
As long as the authUser table does not grow to encompass all of your other business logic you are good. Other tables foreign key to authUser and you delete rows from them and that doesn't upset the foreign key. You leave the row in authUser to indicate that the username is taken.
An additional "deleted" field on authUser can be used to block logins and thus the username is taken but they can't log in. As for the email address, even if you insist on a UNIQUE and NOT NULL constraint for it (and I would find this surprising in an age where we log in a lot with social media) you can auto purge by setting it to CONCAT(id, "@purged.example") and then you have a valid email address which is nowhere else used in your auth flow, no personally-identifiable information at all. Heck then you don't even need the boolean flag if you would rather forbid the .example TLD from logging in.
So that has worked well for me in the past and it seems to solve those sorts of problems with only a little tweak. The key is that the PII need is to delete the "user row" but that does not have to be the authUser row -- if you separate the two rows out then you can leave the authUser row while still having a table appUser which lives in your application and contains all the cool stuff about this user using that app. It also naturally lends itself to you thinking about a sort of SSO for all of your different applications up-front.
the real GDPR problem is if the user has asked to delete data and you do this soft delete but keep all their old data as well, and then someone hacks your system and gets that data.
You're obligated by GDPR to disclose to affected parties that their data has been compromised, but you were also obligated to delete the data by GDPR.
Yeah, that is about the worst possible way to do it. If you can't do a hard delete for whatever reason, the right way to do it is to set a flag that prevents any activity on that account. They can keep the name in order to prevent anyone else from stealing it, but still delete all the profile data attached to the account.
Whatever for? Just delete it. Keep the account tombstoned so its name can't be reused. Keep what content you can and want. Delete the PII and any metadata you're contractually and/or legally required to.
Did a glance through that thread and it didn't seem like there was a strong consensus on how to respect GPDR while maintaining historical data for reporting purposes. Any best practices?
Having worked in a HIPAA regulated space, I can say that hashing the username, such as the email address, for login purposes can allow for account recovery if the credentials are retained. At the same time the cleartext username and other PII can be stored in an object that is both encrypted at rest for its lifetime, and on top of that has its sensitive fields overwritten upon logical deletion. Account recovery cannot recover non-credential derived PII but that is a small annoyance to the user in order to be compliant and trustworthy. The internal user ID should be used throughout downstream reporting rather than actual PII for the sake of continuity and privacy.
the easier way, assuming neither is a primary key, is to convert the field values to UUIDs, which has the added advantage of anonymizing the data. that's disadvantageous if you want to prevent re-signups though, unless you take other measures.
In general you don't want to delete absolutely everything. For example, usernames should not be reused, so you can't "delete" them -- you can tombstone them though, and you should. Besides tombstoning to prevent reuse, you can and should delete as much associated metadata as you're willing to / contractually or legally required, naturally.
Even what you can delete can (and will) survive in logs and backups, web archives, screenshots, etc. Deleting things on the Internet is just difficult.
>If a company states in their investor quarterly report that they had 1M active users, they better be able to prove it in an audit.
Is this even legal? I've never heard of a company letting an outside firm go through their database to confirm any sort of statistic like that. Who is doing this auditing?
It does if they take subscribers from the EU (or california) and apply this process to them. It's incredibly straightforward. If you do business in some jurisdiction, then that business is subject to the jurisdiction's laws.
Hmm, it seems I was wrong then. I had thought that they didn’t localize to any EU countries, but I guess they have more of a global market than the other papers I am more familiar with.
Some violators might successfully keep their assets out of the reach of EU enforcement, but that's going to be really tough to do for any large business with global operations.
Are soft deletes even legal in the context of privacy laws like GDPR? If I’m writing in to delete my data, I don’t really give a crap how hard it is. I want that permanently wiped, so that even if you wanted to you can’t find it again.
How that messes up your technical implementation is your problem
And in a naive relational database implementation, deleting a user would cascade and delete activity associated with that user.
The easiest way around this is to do soft deletes where the data stays in the db, but the flag deactivates the user's account. Looks like the NYT just did a poor implementation of a soft-delete.