There's a wealth of videos on YouTube showing AIs finding unexpected ways to cheat at whatever task they're given. There's no reason to believe they wouldn't cheat at programming too.
Exactly. And once you’ve made your tests specific enough that they can’t be evaded? You don’t have a test suite anymore, you have a compiler!
Which, honestly, is a great argument for choosing languages with highly specific and derivable type systems: I’d rather deal with compile-time errors than write runtime tests.
If a case isn't properly tested, the AI is likely to exploit it.
For example, if the test always uses 1, 2, 3, 4, and 5 as inputs, the AI could just generate a lookup table that returns the right results for those five inputs and nothing else.
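To make that concrete, here's a minimal sketch in Python. The task (squaring a number) and every name in it (`square_cheat`, `square_honest`, `weak_test`, `stronger_test`) are made up for illustration; nothing here comes from a real test suite. It shows a lookup-table "solution" passing a test that only tries 1 through 5, then failing once the inputs are randomized.

```python
import random

# Hypothetical task: implement square(n). All names are illustrative.

def square_cheat(n: int) -> int:
    """Memorizes the five inputs the weak test uses and nothing else."""
    known = {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
    return known[n]  # raises KeyError for any other input


def square_honest(n: int) -> int:
    """Actually computes the square."""
    return n * n


def weak_test(fn) -> bool:
    """Under-specified test: only ever tries the inputs 1 through 5."""
    return all(fn(n) == n * n for n in (1, 2, 3, 4, 5))


def stronger_test(fn, trials: int = 100, seed: int = 0) -> bool:
    """Tries random inputs from a wide range, so memorizing a few is useless."""
    rng = random.Random(seed)
    try:
        return all(
            fn(n) == n * n
            for n in (rng.randint(-1000, 1000) for _ in range(trials))
        )
    except KeyError:
        # The lookup table has no entry for this input.
        return False


if __name__ == "__main__":
    print("weak test, cheat:   ", weak_test(square_cheat))
    print("weak test, honest:  ", weak_test(square_honest))
    print("strong test, cheat: ", stronger_test(square_cheat))
    print("strong test, honest:", stronger_test(square_honest))
```

Note that `stronger_test` needs `n * n` as its oracle, so the test ends up containing the implementation it's supposed to check, which is pretty much the "you have a compiler" point made above.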