ChatGPT Vision: What It Can and Cannot Do Currently
ChatGPT can now process images: snap a picture of a complex serial number, and it will read out the text for you.
The OpenAI team has been hard at work. They’ve not only integrated DALL·E into ChatGPT, but they’ve also added a new Vision feature to it.
Vision enables interaction with ChatGPT through images and photos. You can upload a photo from your phone or, if you're on the desktop version, through your browser; you can also take a new picture and upload it on the spot. After selecting the photo, click 'Confirm,' then give ChatGPT your question or instruction.
ChatGPT will use your image as a reference, and you can ask it all sorts of things. I’ve tested it extensively, pushing it to its limits to discover its capabilities and limitations with vision. To find out more about what vision can do and assess its accuracy, continue reading.
✅ Recognizing Objects with Limited Info
First, I snapped a photo of a mobile game to see if ChatGPT could figure out what it was.
Result:
While it didn’t give the exact name of the game – since it wasn’t visible in the picture – it did correctly identify it as a Monopoly-like mobile game. To me, that’s a pretty accurate guess for an AI.
Prompt:
Output:
✅ Extracting Text from an Image
Then, I snapped a photo of an article on hongkiat.com to see if ChatGPT could read the text within the image.
Result:
It managed to read and reproduce the website’s name, article title, and body text flawlessly.
Prompt:
Output:
✅ Extracting Selected Text from an Image
I also tested if ChatGPT could read just a part of an image by circling the text I was interested in.
Result:
It followed the instruction and output only the circled text, just as accurately.
Prompt:
Output:
✅ Interpreting a Real-World Photo
Later, I took a photo of a restaurant menu that included text and pictures and asked ChatGPT to itemize all the dishes along with their prices.
Result:
It did this perfectly.
Prompt:
Output:
✅ Analyzing Data from a Real-World Photo
I gave it another menu and this time asked for the total cost of certain items.
Result:
It calculated the total correctly.
Prompt:
Output:
✅ More Complex Analysis of a Real-World Photo
To further test the vision feature, I took a picture of a bookshelf to see if it could estimate the number of books in the column.
Result:
It counted 42 book spines, which is close enough, considering I estimate the actual number to be between 40 and 50.
Prompt:
Output:
✅ Creating Content from a Product Photo
Then I snapped a photo of a mug to see if it could recognize the object and generate some content for it.
Result:
The output it gave was pretty good!
Prompt:
Output:
❎ Retrieving EXIF Info from a Photo
However, there were tasks ChatGPT’s Vision couldn’t handle. For instance, it was unable to extract the EXIF data from the uploaded image.
Prompt:
Output:
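This limitation makes sense: the model works from the decoded pixels, while EXIF lives in the file's metadata segments, which never reach ChatGPT. If you need that data, read it locally before uploading. Here's a minimal sketch, assuming a standard JPEG where EXIF sits in an APP1 segment:

```python
import struct

def has_exif(jpeg_bytes: bytes) -> bool:
    """Return True if a JPEG byte stream carries an EXIF (APP1) segment.

    ChatGPT Vision sees only the decoded pixels, so metadata like this
    has to be inspected locally.
    """
    if jpeg_bytes[:2] != b"\xff\xd8":            # SOI marker opens every JPEG
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            break                                 # malformed segment table
        marker = jpeg_bytes[pos + 1]
        if marker == 0xDA:                        # SOS: compressed image data follows
            break
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False
```

A full parser would go on to decode the embedded TIFF structure inside the APP1 payload; this check only tells you whether there is any EXIF to recover before you upload the photo.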
❎ Recognizing Unfamiliar Objects in a Photo
It also can’t browse the internet to look up information it doesn’t already have. For example, when I showed it a picture of a Pokémon and asked for its name, it guessed incorrectly.
Prompt:
Output:
❎ Recognizing Languages in a Photo
It struggled with foreign languages too. I showed it Chinese text, and it didn’t recognize the characters or their meaning.
Prompt:
Output:
Those were my tests of ChatGPT’s Vision feature. Overall, it’s a genuinely useful tool that can be employed creatively. It’s also worth noting that, at the time of writing, Vision is only available in desktop browsers and the iOS app.