Doing real deletes on user accounts is a surprisingly challenging problem and I'...

perl4ever · on April 28, 2020

I used to work where (not a service for the general public) there was an "is deleted" flag for everything, but every now and then a client would insist that data be really deleted, and depending on who it was and how they asked, we might go and do it, which was a huge hassle and would cause no end of problems down the line.

On the other hand, "is deleted" flags end up causing issues when you forget to put "where not is_deleted" in your queries.

Lately I've faced kind of an inverse situation - I have a system that I can't control where things are permanently deleted once in a while for multiple reasons (rogue users, aging out of old versions) and so as I accumulate information in a little data warehouse for reporting, I decided to implement an "is deleted" flag there. Eventually though, deleting from the source was turned off because it's really not necessary.

Terr_ · on April 28, 2020

I also worked at a similar place, and fantasized about rewriting everything so that soft-deletion wasn't a per-row technical detail, but was instead an explicit modeled business-flow. Perhaps stored as a flag on an Aggregate Root (like "Customer" or "Project") at a much coarser level of detail. Sure, your queries still need to account for it, but at least you don't have a potential patchwork of inconsistent flags.

P.S.: Random advice to anybody working on enterprisey stuff:

1. "Deletion" is too vague and broad. I strongly suggest you call it "deactivation" and some other word like "purging."

2. Deactivation is typically what a company actually wants, even if they don't know to ask for it. By phrasing it that way you also encourage stakeholders to think about "reactivation" before it becomes an architectural problem.

3. True purging is rare, and tends to be related to either disk-space issues or legal requirements. In the latter case, you'll want an audit-trail or tombstone of some sort, meaning it's still a real workflow and not just an easy SQL DELETE statement or something.

lostapathy · on April 28, 2020

The discard gem for rails has put a lot of thought into this and works pretty well, all things considered.

It's not perfect, but I find using a library (either directly for inspiration) is often a shortcut to learning about a lot of edge cases from doing your own things.

https://github.com/jhawthorn/discard

dorgo · on April 28, 2020

>On the other hand, "is deleted" flags end up causing issues when you forget to put "where not is_deleted" in your queries.

My solution would be a view for every table. Are there drawbacks? Other solutions?

perl4ever · on April 28, 2020

Eh, the situation that I've been in, if I recall correctly, is that one has read/write access to all data for reporting, but not the ability to create views (or stored procedures etc) to share.

Where I am now, (as far as Oracle goes) you can't even create your own tables under your own schema. A view requires a meeting with a DBA and their manager and really special, compelling arguments.

lilyball · on April 28, 2020

Can you just have a general policy where every table that has an is_deleted flag also automatically gets a view, and clients must use the view unless they have a particular need to access deleted data?

capableweb · on April 28, 2020

Not sure about drawbacks but another solution would be to change the point where you do the queries instead. So you have a "user" object that is largely saved the same as most other "crud" objects, so you have a layer for those, add the flag there.

dbbk · on April 29, 2020

Most ORMs have built-in support for this. For example Laravel's Eloquent automatically filters out `deleted_at` records

latch · on April 28, 2020

You can use row level security so that forgetting the filter isn't an issue.

dorgo · on April 28, 2020

as far as I understand this would only work for users who don't have access to "deleted" rows. Are access rules a good way to handle this? Serious question. My solution would be to have views as "guards" for every table.

latch · on April 28, 2020

Views also work.

I don't understand what you mean re "access". RLS is what removes access to those rows, that's the point of RLS.

In postgresql, anyways, superusers aren't subject to RLS and the table owner, by default, isn't either. But RLS can be enforced for the table owner by a single alter statement.

mekster · on April 29, 2020

> would cause no end of problems down the line

That's just how badly the system is designed. Write a script to delete all relevant data and the user and it would be a simple op.

> forget to put "where not is_deleted"

Use views.

netsharc · on April 28, 2020

I've been on both sides of this insisting, if a company annoyed me too much (e.g. headhunters mailing too frequently) I'd drop the "data privacy laws" (nowadays GDPR) bomb and ask for my data to be deleted.

On the other side, a customer got really pissed off by an online shop we maintained for a client, and asked for his data to be annihilated, we thought "What a douche.".

hinkley · on April 28, 2020

And something I had to learn the hard way and then teach quite a few people is that hard deletes don’t just turn your tables into Swiss cheese, they also can cause table scans.

When you delete a row, every inbound foreign key constraint has to be checked to look for any rows that refer to the deleted row, and most likely you didn’t set up an index for the foreign key, so now you have a table scan. Possibly several.

It’s not that much more work these days to set up a partial index on the table instead and add another WHERE clause, you have smaller problems with people accidentally deleting the wrong thing, and you’ve started down the path to audit trails.

nikisweeting · on April 28, 2020

It's been standard practice at all the companies I've worked at to index all foreign key fields, no matter what they're used for, I've yet to run into a situation where it's been more harmful than helpful, but these companies all had <10TB data in SQL so idk if it's good general advice.

munificent · on April 28, 2020

There is also a user-valuable reason to not do hard deletes. Doing a soft delete prevents another malicious user from immediately reclaiming your now-available ID and pretending to be you.

pinusc · on April 28, 2020

You can always hard delete all the data _and_ keep track of deleted users so that their usernames can't be reused.

Once you have hard delete, this solution is almost trivial and by far the most user-valuable.

DevX101 · on April 28, 2020

> keep track of deleted users so that their usernames can't be reused

This seems to violate GPDR, no? Attacker attempts to create an account (say: victim@gmail.com) on AshleyMadison and is prevented because the server tracked past users. Attacker could them demonstrate victim@gmail.com was at one point a user on AshleyMadison.com

ecnahc515 · on April 28, 2020

As others have mentioned, that's an issue already. The solution is to never acknowledge if a user does or doesn't exist on register/sign-up/forgot-password pages and simply state that instructions have been emailed to you in all cases. The key is that you don't act differently if the user does or doesn't exist.

cutemonster · on April 29, 2020

If they don't get a verification / instructions email, they'll know an account with the username they typed, existed?

ecnahc515 · on May 12, 2020

In this case, where you're probing for user names or emails, you don't own the email, so you wouldn't receive the verification yourself, and thus wouldn't know if the account exists.

This is exactly why most password reset emails say "if you didn't request this, please let us know, as someone may be attempting to access your account".

lytefm · on May 1, 2020

You shouldn't use usernames in that scenario, just emails. After Signup, you just show a general message that a confirmation Email has been sent. If the account already exists, some policy to notify the account owner can be put in place.

lostapathy · on April 28, 2020

Verifying the email keeps someone from hijacking the account without leaking that an account formerly existed. At least so long as their email isn't also compromised - in which case they have bigger problems.

RandomBacon · on April 28, 2020

That's not much different than not being able to create an account with victim@gmail.com because victim@gmail.com already has an account. Both instance leak information

crdrost · on April 28, 2020

You don't have to track their emails unless you are reusing emails as usernames. Just tracking the username suffices.

This is also one of those situations where people often put too much shit in the user table. "we have to delete the user row" -- I mean, you have to delete some of the user row, yes.

I like to solve this by proper namespacing. Suppose you instead deliberately have an authUser table which just has what you need for auth -- a UUID to hook into the rest of the system, salts and passwords for direct logins, maybe a nullable date "banned_until" if you want banning; assuming you use crypto bearer tokens rather than an auth tokens table then you also want a column with a date date for "tokens last reset on"; etc. You can put the username in there just fine, that's needed for auth. Maybe you let people log in with email+password and thus you also put their email address in there, also fine.

As long as the authUser table does not grow to encompass all of your other business logic you are good. Other tables foreign key to authUser and you delete rows from them and that doesn't upset the foreign key. You leave the row in authUser to indicate that the username is taken.

An additional "deleted" field on authUser can be used to block logins and thus the username is taken but they can't log in. As for the email address, even if you insist on a UNIQUE and NOT NULL constraint for it (and I would find this surprising in an age where we log in a lot with social media) you can auto purge by setting it to CONCAT(id, "@purged.example") and then you have a valid email address which is nowhere else used in your auth flow, no personally-identifiable information at all. Heck then you don't even need the boolean flag if you would rather forbid the .example TLD from logging in.

So that has worked well for me in the past and it seems to solve those sorts of problems with only a little tweak. The key is that the PII need is to delete the "user row" but that does not have to be the authUser row -- if you separate the two rows out then you can leave the authUser row while still having a table appUser which lives in your application and contains all the cool stuff about this user using that app. It also naturally lends itself to you thinking about a sort of SSO for all of your different applications up-front.

bryanrasmussen · on April 28, 2020

the real GDPR problem is if the user has asked to delete data and you do this soft delete but keep all their old data as well, and then someone hacks your system and gets that data.

You're obligated by GDPR to disclose to affected parties that their data has been compromised, but you were also obligated to delete the data by GDPR.

abiogenesis · on April 28, 2020

Renaming your username to username-1000 has the exact same side effect though.

mcv · on April 28, 2020

Yeah, that is about the worst possible way to do it. If you can't do a hard delete for whatever reason, the right way to do it is to set a flag that prevents any activity on that account. They can keep the name in order to prevent anyone else from stealing it, but still delete all the profile data attached to the account.

ars · on April 28, 2020

Another option is to add an "archive" table, where deleted records are moved to that table.

It can get complex (for example foreign keys need special handling).

One option is table partitioning: https://www.2ndquadrant.com/en/blog/postgresql-12-foreign-ke...

You partition the table based on deleted or not, and then query either both tables together, or just the active table.

jackcosgrove · on April 28, 2020

Partitioning sounds like a great solution, thank you for this.

vincvinc · on April 28, 2020

Soft deletion violates the GDPR.

Read article 17:

https://gdpr-info.eu/art-17-gdpr/

And previous HN discussions:

https://news.ycombinator.com/item?id=16366050

tomwojcik · on April 28, 2020

Yes and no. PII needs to be removed. The rest of the data needs to be anonymized. Right?

Dahoon · on April 28, 2020

But an email is PII so clearly this would breach GDPR.

ericd · on April 28, 2020

Keep a hash instead of the original email address?

cryptonector · on April 28, 2020

Whatever for? Just delete it. Keep the account tombstoned so its name can't be reused. Keep what content you can and want. Delete the PII and any metadata you're contractually and/or legally required to.

DevX101 · on April 28, 2020

Did a glance through that thread and it didn't seem like there was a strong consensus on how to respect GPDR while maintaining historical data for reporting purposes. Any best practices?

jackcosgrove · on April 28, 2020

Having worked in a HIPAA regulated space, I can say that hashing the username, such as the email address, for login purposes can allow for account recovery if the credentials are retained. At the same time the cleartext username and other PII can be stored in an object that is both encrypted at rest for its lifetime, and on top of that has its sensitive fields overwritten upon logical deletion. Account recovery cannot recover non-credential derived PII but that is a small annoyance to the user in order to be compliant and trustworthy. The internal user ID should be used throughout downstream reporting rather than actual PII for the sake of continuity and privacy.

ska · on April 28, 2020

Yes this is all quite do-able. It is often much easier to implement from the beginning, than after the fact.

clairity · on April 28, 2020

the easier way, assuming neither is a primary key, is to convert the field values to UUIDs, which has the added advantage of anonymizing the data. that's disadvantageous if you want to prevent re-signups though, unless you take other measures.

cryptonector · on April 28, 2020

In general you don't want to delete absolutely everything. For example, usernames should not be reused, so you can't "delete" them -- you can tombstone them though, and you should. Besides tombstoning to prevent reuse, you can and should delete as much associated metadata as you're willing to / contractually or legally required, naturally.

Even what you can delete can (and will) survive in logs and backups, web archives, screenshots, etc. Deleting things on the Internet is just difficult.

nexuist · on April 29, 2020

>If a company states in their investor quarterly report that they had 1M active users, they better be able to prove it in an audit.

Is this even legal? I've never heard of a company letting an outside firm go through their database to confirm any sort of statistic like that. Who is doing this auditing?

chrisshroba · on April 28, 2020

Doesn't doing soft deletes on user data breach GDPR laws regarding deletion of user data when requested?

greendestiny_re · on April 29, 2020

>investor quarterly reports

God forbid the needs of the user harm the interest of the investor.

mesozoic · on April 28, 2020

Pretty sure that breaches GDPR and maybe CCPA. Maybe they can get away with anonomyzing the data but that doesn't sound like what is being done.

true_religion · on April 28, 2020

Sure, but a local paper like the New York Times is hardly subject to the laws of the EU, so GDPR doesn’t apply.

6gvONxR4sf7o · on April 28, 2020

It does if they take subscribers from the EU (or california) and apply this process to them. It's incredibly straightforward. If you do business in some jurisdiction, then that business is subject to the jurisdiction's laws.

mopsi · on April 28, 2020

Not only do they take subscribers, they actively target the European market. When I open nytimes.com, a pop-up offers 0,50€/week digital access.

true_religion · on April 29, 2020

Hmm, it seems I was wrong then. I had thought that they didn’t localize to any EU countries, but I guess they have more of a global market than the other papers I am more familiar with.

wtfishackernews · on April 28, 2020

If their website is accessible from the EU then they are subject to GDPR.

kube-system · on April 28, 2020

The EU says that GDPR applies globally.

Some violators might successfully keep their assets out of the reach of EU enforcement, but that's going to be really tough to do for any large business with global operations.

cyberowl · on April 29, 2020

Are soft deletes even legal in the context of privacy laws like GDPR? If I’m writing in to delete my data, I don’t really give a crap how hard it is. I want that permanently wiped, so that even if you wanted to you can’t find it again.

How that messes up your technical implementation is your problem

antsar · on April 28, 2020

> poor implementation

Well that's an understatement. NYT has the tech resources to do this a million better ways.

> investor quarterly report [...] active users

Speaking of "active user" counts, it's convenient that "jsmith1000" is plausibly an active user, whereas "jsmith(state:deleted)" is not. Hm.