Well, anyone who'd type people based on their height, or on whether they wear glasses, would not be using VI properly. "Thin" is not quite as bad, but it brings us to the issue of body type, which is in any case more complex than just whether you are thin or not.
The point I'm making is that what you are illustrating is a case of incompetent VI, not a demonstration of the weakness of VI as such.
VI from videos - where you can see and hear the person talking and moving - has as much evidence behind it as anything in socionics, imo.
As to VI from pictures: precisely in order to check its validity, now and then I post here (or send to some people in private) pictures of people I have known for a very long time and have typed through other methods, and ask others to VI them. Not everybody gets the correct type, but the percentage of correct or near-correct answers is far higher than mere chance would account for.
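To make "higher than mere chance" a bit more concrete, here is a rough sketch in Python. It assumes that a blind guess among the 16 types is right 1 time in 16 (i.e. all types equally likely to be guessed), and the counts used (20 guesses, 6 correct) are made up purely for illustration, not real results. The name `binomial_tail` is just a label chosen for this example.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Probability of getting at least k successes in n independent tries,
    each succeeding with probability p (upper tail of a binomial)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_TYPES = 16          # socionics has 16 types
CHANCE = 1 / N_TYPES  # assumption: a blind guess picks each type equally often

# Hypothetical numbers, for illustration only.
n_guesses = 20
n_correct = 6

p_value = binomial_tail(n_guesses, n_correct, CHANCE)
print(f"Expected correct by pure chance: {n_guesses * CHANCE:.2f}")
print(f"Chance of {n_correct}+ correct out of {n_guesses} by guessing: {p_value:.5f}")
```

A very small probability here would only say that the guessers are doing better than blind guessing. It would not show what they are picking up on, and it ignores "near-correct" answers, which would need their own definition of chance.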
Now, it's not an "exact science". Personally I see VI from pictures as a sort of educated guess. But, in the absence of any other evidence, it's correct often enough to be useful.
The best thing to do is perhaps to leave VI alone for a while, and concentrate on typing people you know through other methods. After you have typed enough people, your own VI method will emerge. Or not.