[Insert Image: iPhone screen using the Vision framework with Core ML for real-time object detection, showing bounding boxes and confidence labels in a SwiftUI interface]

Introduction: You Know Core ML. Now, Meet Vision.

Welcome to Part 3 of our On-Device AI series! In Part 1, we integrated a Core ML model into our app. In Part 2, we learned how to test it properly. But here’s where things get interesting: how do we actually use that model on a live video feed? Or figure out where an object appears in a photo, not just what it is?

That’s where the Vision Framework comes in.

I’ll be honest—when I first started working with Core ML directly, I spent way too many hours wrestling with image preprocessing. Converting UIImage to CVPixelBuffer, handling EXIF orientation data, normalizing pixel values to match what the model expected. It was tedious, error-prone, and frankly frustrating. Then I discovered Vision, and everything just clicked.

Think of it this way: if Core ML is the powerful engine under the hood, Vision is the smart transmission and steering system that makes it actually drivable. Vision handles all the messy preprocessing work for you, so you can focus on building features instead of wrestling with pixel buffers.

Here’s what Vision gives you out of the box:

  • Automatic image format conversion (CGImage, CIImage, CVPixelBuffer, image data, or a file URL—it handles them all)
  • Correct orientation handling using EXIF data
  • Input normalization for your model
  • Built-in requests for common tasks (text recognition, face detection, barcode scanning)
  • Easy integration with both still images and video streams

By the end of this article, you’ll build a real object detection feature in SwiftUI. We’ll draw bounding boxes around detected objects, handle coordinate transformations, and do it all with surprisingly little code.

Let’s dive in.


1. The Core Concepts: How Vision Works

Before we jump into code, let’s understand Vision’s architecture. The framework is built around three core concepts that work together. Once you get these, everything else falls into place.

1.1 The Request (VNRequest)

A request tells Vision what you want to find or analyze in an image. Simple as that.

Vision provides many built-in request types:

  • VNRecognizeTextRequest for OCR (finding text in images)
  • VNDetectFaceLandmarksRequest for face detection and facial feature mapping
  • VNDetectBarcodesRequest for QR codes and barcodes
  • VNCoreMLRequest for running your custom Core ML models

Each request type is specialized for its task. You create a request once, configure it with any options you need, and then reuse it multiple times. This is way more efficient than creating new requests for each image.

The VNCoreMLRequest is your bridge between Vision and Core ML—this is what we’ll focus on for object detection.
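
To make that "create once, reuse many times" idea concrete, here’s a minimal sketch using the built-in face rectangle request. The FaceScanner class and its property name are illustrative scaffolding, not code from our project:

import Vision

final class FaceScanner {
    // Created once, configured once, reused for every image we analyze.
    private lazy var faceRequest: VNDetectFaceRectanglesRequest = {
        let request = VNDetectFaceRectanglesRequest { request, error in
            guard let faces = request.results as? [VNFaceObservation] else { return }
            print("Found \(faces.count) face(s)")
        }
        return request
    }()
}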

1.2 The Handler (VNImageRequestHandler)

The handler is the “worker” that performs your requests on an image. You initialize it with an image source—this can be a CGImage, CVPixelBuffer, CIImage, or even a URL to an image file. Then you call perform(_:) with an array of requests.

Here’s what makes the handler powerful: it can run multiple requests on the same image in a single pass. Need to detect faces and recognize text in the same photo? Just pass both requests to one handler. Vision optimizes the processing automatically.

The handler also takes care of orientation. Each of its initializers has a variant that accepts a CGImagePropertyOrientation, so if you pass along the image’s EXIF orientation, Vision rotates the image correctly before processing. This alone saves you from countless headaches.
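
Here’s a quick sketch tying both ideas together: one handler, one pass, two built-in requests, with the image’s orientation passed along. The cgImage and orientation parameters are assumed to come from wherever you loaded the image:

import Vision

func analyze(cgImage: CGImage, orientation: CGImagePropertyOrientation) {
    let faceRequest = VNDetectFaceRectanglesRequest()
    let barcodeRequest = VNDetectBarcodesRequest()

    // One handler, one pass over the pixels, both requests evaluated.
    // Passing the orientation lets Vision rotate the image upright first.
    let handler = VNImageRequestHandler(cgImage: cgImage,
                                        orientation: orientation,
                                        options: [:])
    do {
        try handler.perform([faceRequest, barcodeRequest])
        print("Faces: \(faceRequest.results?.count ?? 0), barcodes: \(barcodeRequest.results?.count ?? 0)")
    } catch {
        print("Vision analysis failed: \(error)")
    }
}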

1.3 The Observations (VNObservation)


After the handler performs your requests, you get results back as observations. Each request type returns a specific observation subclass:

  • VNRecognizedTextObservation from text recognition
  • VNFaceObservation from face detection
  • VNRecognizedObjectObservation from object detection (Core ML)
  • VNBarcodeObservation from barcode detection

Observations contain the actual results—bounding boxes, confidence scores, labels, and more. They’re the payload you’ve been waiting for.
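
For example, a recognized-object observation (the kind our Core ML request returns later in this article) exposes its top label, confidence, and normalized bounding box like this. A tiny sketch that assumes you already have an observation in hand:

import Vision

func describe(_ observation: VNRecognizedObjectObservation) {
    // labels is sorted by confidence, so .first is the best guess.
    if let best = observation.labels.first {
        print("\(best.identifier): \(best.confidence)")
    }
    // boundingBox is normalized (0.0–1.0) with a bottom-left origin.
    print("Box: \(observation.boundingBox)")
}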

Here’s the flow in a nutshell:

Image → VNImageRequestHandler → [VNRequest] → handler.perform() → [VNObservation]

[Insert Diagram: Vision Framework Pipeline (Request, Handler, Observation)]

That’s the entire Vision pipeline. Clean, simple, and repeatable.

Now let’s put it to work.


2. Practical Example: Object Detection in SwiftUI

Let’s build something tangible: an app that detects objects in a static image and draws bounding boxes around them in SwiftUI. This is the kind of feature that impresses users and demonstrates real AI capability.

We’ll need a Core ML model trained for object detection. YOLOv3-tiny works great for this—it’s fast and lightweight. You can download it from Apple’s Core ML model gallery or use one from Part 1 of this series.

2.1 Step 1: Load the VNCoreMLModel

First, we need to wrap our Core ML model in a Vision-specific container called VNCoreMLModel. This is a one-time setup step that prepares the model for use with Vision requests.

// Requires import Vision, import CoreML, and import os (for Logger).
private var visionModel: VNCoreMLModel? {
    do {
        let configuration = MLModelConfiguration()
        // YOLOv3Tiny is the class Xcode generates from the bundled .mlmodel file.
        let model = try YOLOv3Tiny(configuration: configuration)
        return try VNCoreMLModel(for: model.model)
    } catch {
        Logger().error("Failed to load Core ML model: \(error.localizedDescription)")
        return nil
    }
}

I learned this pattern late in my career: always return optionals for resource loading and use Logger instead of print for production apps. It makes debugging so much easier when things go wrong in the field.

Notice we’re using MLModelConfiguration() here. In a real app, you might configure this to use the Neural Engine or GPU specifically. For now, the default works fine.
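
If you do want to steer where the model runs, MLModelConfiguration exposes a computeUnits option. A quick sketch of the variations; the default, .all, already lets Core ML pick the best available hardware:

import CoreML

let configuration = MLModelConfiguration()
configuration.computeUnits = .all         // let Core ML choose CPU, GPU, or Neural Engine
// configuration.computeUnits = .cpuOnly     // deterministic; handy for unit tests
// configuration.computeUnits = .cpuAndGPU   // skip the Neural Engine entirely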

2.2 Step 2: Create the VNCoreMLRequest

Now we create the request that will run our model. We set this up as a lazy property so it’s only initialized when first accessed, and we can reuse it for multiple detections.

The key part here is the completion handler. This closure is called after Vision finishes processing the image. It’s where we’ll receive our observations.

private lazy var detectionRequest: VNCoreMLRequest? = {
    guard let model = visionModel else { return nil }
    
    let request = VNCoreMLRequest(model: model) { [weak self] request, error in
        guard let self = self else { return }
        
        if let error = error {
            Logger().error("Detection request failed: \(error.localizedDescription)")
            return
        }
        
        self.processObservations(request.results)
    }
    
    request.imageCropAndScaleOption = .centerCrop
    return request
}()

That imageCropAndScaleOption property is important. It tells Vision how to fit the image into the model’s expected input size. .centerCrop is usually the best choice for object detection because it preserves the aspect ratio and focuses on the center of the image.
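
For reference, here are the other options and the tradeoff each one makes, which is worth experimenting with if your objects tend to sit near the edges of the frame. A small sketch, with the hypothetical configureCropAndScale(for:) wrapper only there to keep it self-contained:

import Vision

func configureCropAndScale(for request: VNCoreMLRequest) {
    // .centerCrop : keeps the aspect ratio, but anything outside the central crop is never analyzed
    request.imageCropAndScaleOption = .centerCrop
    // .scaleFit   : the whole image is analyzed, padded to match the model's input size
    // request.imageCropAndScaleOption = .scaleFit
    // .scaleFill  : the whole image is analyzed, but stretched to fill the input size
    // request.imageCropAndScaleOption = .scaleFill
}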

2.3 Step 3: Perform the Request

Time to actually run the detection. We’ll create a function that takes a UIImage, extracts its CGImage, creates a handler, and performs our request.

func performDetection(on image: UIImage) {
    guard let cgImage = image.cgImage,
          let request = detectionRequest else {
        Logger().warning("Unable to perform detection: missing image or request")
        return
    }
    
    // For photos straight from the camera, you can also pass the image's
    // CGImagePropertyOrientation here so Vision analyzes it upright.
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try handler.perform([request])
        } catch {
            Logger().error("Failed to perform detection: \(error.localizedDescription)")
        }
    }
}

Notice we’re dispatching to a background queue. Vision processing can be heavy, and we never want to block the main thread. The completion handler will be called on the same queue where we called perform(_:), so we’ll need to dispatch back to the main queue when updating our UI.

2.4 Step 4: Handle and Cast the Observations

Inside our completion handler, we receive an array of generic VNObservation objects. For object detection, we need to cast them to VNRecognizedObjectObservation to access the bounding boxes and labels.

private func processObservations(_ results: [Any]?) {
    guard let results = results as? [VNRecognizedObjectObservation] else {
        Logger().warning("Unable to cast observations to VNRecognizedObjectObservation")
        return
    }
    
    // Filter out low-confidence detections
    let filteredResults = results.filter { $0.confidence > 0.5 }
    
    DispatchQueue.main.async { [weak self] in
        self?.detectedObjects = filteredResults
    }
}

That confidence threshold of 0.5 is a good starting point. Below that, you get too many false positives. Above 0.7, you might miss valid detections. Tune this based on your specific model and use case.

Here’s something I learned the hard way: I once forgot to dispatch back to the main queue when updating my @Published property. SwiftUI threw purple runtime warnings everywhere, and the UI froze. Always remember—Vision callbacks happen on background threads, but SwiftUI updates must happen on the main thread.
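
For completeness, here’s roughly how the pieces from Steps 1 through 4 hang together in a single ObservableObject. Treat it as a sketch: the class name matches what the SwiftUI view in the next section expects, but the exact structure is up to you.

import UIKit
import Vision
import Combine

final class ObjectDetector: ObservableObject {
    // The SwiftUI view observes this array and redraws its bounding
    // boxes whenever it changes (always updated on the main thread).
    @Published var detectedObjects: [VNRecognizedObjectObservation] = []

    // private var visionModel: VNCoreMLModel? { ... }             // Step 1
    // private lazy var detectionRequest: VNCoreMLRequest? = ...   // Step 2
    // func performDetection(on image: UIImage) { ... }            // Step 3
    // private func processObservations(_ results: [Any]?) { ... } // Step 4
}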


3. Drawing Bounding Boxes in SwiftUI

Now comes the visual payoff: displaying those bounding boxes on the image. This is where many developers hit a wall because of coordinate system mismatches.

Let’s solve that properly.

3.1 The Coordinate Space Challenge


Here’s the tricky part: Vision’s bounding boxes are in a normalized coordinate space. That means the values range from 0.0 to 1.0, regardless of the actual image size. A box at (0.5, 0.5, 0.2, 0.2) means: the origin sits halfway along each axis, and the box spans 20% of the image’s width and 20% of its height.

But there’s a second gotcha: Vision uses a coordinate system with the origin at the bottom-left corner (like Core Graphics). SwiftUI uses a coordinate system with the origin at the top-left corner.

If you don’t account for this, your bounding boxes will be flipped vertically. I’ve seen this mistake in countless Stack Overflow posts.

[Insert Diagram: Vision (bottom-left) vs. SwiftUI (top-left) Coordinate Systems]

func convertVisionRect(_ visionRect: CGRect, to size: CGSize) -> CGRect {
    // Vision coordinates: origin at bottom-left, normalized (0.0 to 1.0)
    // SwiftUI coordinates: origin at top-left, in points
    
    let x = visionRect.origin.x * size.width
    let width = visionRect.width * size.width
    
    // Flip the y-coordinate
    let y = (1 - visionRect.origin.y - visionRect.height) * size.height
    let height = visionRect.height * size.height
    
    return CGRect(x: x, y: y, width: width, height: height)
}

That (1 - origin.y - height) is the magic formula. It flips the y-axis and accounts for the box’s height. Memorize this pattern—you’ll use it in every Vision + SwiftUI project.
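
A quick sanity check helps internalize it. Suppose the displayed image area is 400 × 300 points and Vision hands back the box from the earlier example. Running it through convertVisionRect gives:

let visionBox = CGRect(x: 0.5, y: 0.5, width: 0.2, height: 0.2)
let converted = convertVisionRect(visionBox, to: CGSize(width: 400, height: 300))
// x      = 0.5 * 400             = 200
// width  = 0.2 * 400             = 80
// y      = (1 - 0.5 - 0.2) * 300 = 90   (flipped from the bottom-left origin)
// height = 0.2 * 300             = 60
// converted == CGRect(x: 200, y: 90, width: 80, height: 60)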

3.2 Building the SwiftUI View

Let’s put it all together. We’ll use a ZStack to overlay the bounding boxes on top of the image, and a GeometryReader to get the container size for our coordinate conversion.

struct ObjectDetectionView: View {
    @StateObject private var detector = ObjectDetector()
    let image: UIImage
    
    var body: some View {
        GeometryReader { geometry in
            ZStack(alignment: .topLeading) {
                Image(uiImage: image)
                    .resizable()
                    .aspectRatio(contentMode: .fit)
                
                ForEach(detector.detectedObjects.indices, id: \.self) { index in
                    let observation = detector.detectedObjects[index]
                    let boundingBox = convertVisionRect(
                        observation.boundingBox,
                        to: geometry.size
                    )
                    
                    Rectangle()
                        .stroke(Color.green, lineWidth: 3)
                        .frame(width: boundingBox.width, height: boundingBox.height)
                        .position(
                            x: boundingBox.midX,
                            y: boundingBox.midY
                        )
                    
                    if let label = observation.labels.first {
                        Text("\(label.identifier) \(Int(label.confidence * 100))%")
                            .font(.caption)
                            .padding(4)
                            .background(Color.green.opacity(0.8))
                            .foregroundColor(.white)
                            .position(x: boundingBox.midX, y: boundingBox.minY - 10)
                    }
                }
            }
        }
        .onAppear {
            detector.performDetection(on: image)
        }
    }
}

[Insert Image: Final app showing object detection with bounding boxes and labels]

A few things worth noting here:

First, we use ForEach with indices instead of the observations directly because VNRecognizedObjectObservation doesn’t conform to Identifiable. Using indices is the simplest workaround; if you want stable identity, every VNObservation also exposes a uuid property you can use as the id (id: \.uuid).

Second, notice how we position the boxes using .position() with midX and midY. SwiftUI’s position modifier places views by their center point, which is exactly what we need after our coordinate conversion.

Third, that label overlay showing the class name and confidence percentage? Users love that. It makes the AI feel transparent and trustworthy. I always add this in production apps.
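
One caveat I should flag: geometry.size is the size of the container, not necessarily the size of the displayed image. Because we use .aspectRatio(contentMode: .fit), the image can be letterboxed inside the GeometryReader, and boxes converted against the full container size will drift. Here’s a minimal sketch of a fix, assuming you know the original image size: compute the rect the fitted image actually occupies and convert against that instead.

import CoreGraphics

/// Returns the rect an image of `imageSize` occupies when aspect-fit
/// inside a container of `containerSize` (the same math as contentMode: .fit).
func aspectFitRect(for imageSize: CGSize, in containerSize: CGSize) -> CGRect {
    let scale = min(containerSize.width / imageSize.width,
                    containerSize.height / imageSize.height)
    let fittedSize = CGSize(width: imageSize.width * scale,
                            height: imageSize.height * scale)
    return CGRect(x: (containerSize.width - fittedSize.width) / 2,
                  y: (containerSize.height - fittedSize.height) / 2,
                  width: fittedSize.width,
                  height: fittedSize.height)
}

You’d then call convertVisionRect(observation.boundingBox, to: fitted.size) and offset the result by fitted.origin before positioning each rectangle.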


4. What Else Can Vision Do? (Without Core ML)

Vision is powerful even without a custom Core ML model. Apple has built in several high-quality requests that solve common problems out of the box.

I often try these built-in requests first before reaching for a custom model—they’re surprisingly capable.

4.1 Text Recognition (OCR)

VNRecognizeTextRequest is shockingly good at extracting text from images. It handles multiple languages, different fonts, curved text, and even handwriting (in supported languages). I’ve used this to build document scanners, receipt parsers, and business card readers. It’s production-ready and requires zero model training.
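
Here’s a minimal sketch of what using it looks like; the cgImage is assumed to come from whatever image you’re scanning:

import Vision

func recognizeText(in cgImage: CGImage) {
    let request = VNRecognizeTextRequest { request, error in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        // Each observation offers ranked candidate strings; take the best one.
        let lines = observations.compactMap { $0.topCandidates(1).first?.string }
        print(lines.joined(separator: "\n"))
    }
    request.recognitionLevel = .accurate   // trade a little speed for accuracy
    request.usesLanguageCorrection = true

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}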

4.2 Barcode and QR Code Detection

VNDetectBarcodesRequest identifies and extracts data from barcodes and QR codes. It supports dozens of barcode formats automatically. This is perfect for inventory apps, ticketing systems, or any scenario where you need to scan codes without third-party libraries.
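
Usage follows the same pattern. A short sketch that pulls the decoded payload out of each detected code:

import Vision

func scanCodes(in cgImage: CGImage) {
    let request = VNDetectBarcodesRequest { request, error in
        guard let barcodes = request.results as? [VNBarcodeObservation] else { return }
        for barcode in barcodes {
            // payloadStringValue holds the decoded contents (e.g. a QR code's URL).
            print(barcode.symbology, barcode.payloadStringValue ?? "<binary payload>")
        }
    }
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}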

4.3 Face Detection and Landmarks

VNDetectFaceLandmarksRequest not only finds faces but also identifies facial features—eyes, nose, mouth, jawline, eyebrows. Each landmark comes with precise coordinates. I’ve seen developers build AR face filters, emotion analysis, and face-swap effects with this. The data quality is excellent.
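
A hedged sketch of reading a couple of landmark regions; each region exposes points normalized to the face’s bounding box:

import Vision

func detectLandmarks(in cgImage: CGImage) {
    let request = VNDetectFaceLandmarksRequest { request, error in
        guard let faces = request.results as? [VNFaceObservation] else { return }
        for face in faces {
            // Landmark regions are optional; a profile shot may not expose both eyes.
            let eyePoints = face.landmarks?.leftEye?.normalizedPoints ?? []
            let lipPoints = face.landmarks?.outerLips?.normalizedPoints ?? []
            print("Face at \(face.boundingBox): \(eyePoints.count) eye points, \(lipPoints.count) lip points")
        }
    }
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}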

4.4 Image Saliency (What’s Interesting?)

VNGenerateAttentionBasedSaliencyImageRequest identifies the most visually interesting, attention-grabbing parts of an image: the regions where human eyes naturally focus. This is incredibly useful for automatic photo cropping, thumbnail generation, or smart content-aware scaling. Because the underlying model is grounded in eye-tracking research, its results genuinely reflect human perception.
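
And a short sketch of using the observation’s salient-object rectangles as crop candidates (the full heat map is also available through its pixelBuffer property):

import Vision

func findSalientRegions(in cgImage: CGImage) {
    let request = VNGenerateAttentionBasedSaliencyImageRequest { request, error in
        guard let observation = request.results?.first as? VNSaliencyImageObservation else { return }
        // salientObjects gives normalized bounding boxes (bottom-left origin)
        // around the regions the model considers most attention-grabbing.
        for region in observation.salientObjects ?? [] {
            print("Salient region: \(region.boundingBox)")
        }
    }
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}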


Conclusion: Vision is Your AI Co-Pilot

The Vision framework is the essential bridge between your Core ML models and your app. It handles the difficult image processing so you can focus on building features that users actually care about.

Let’s recap what we’ve covered in this three-part series:

Part 1: Core ML Integration — We loaded a Core ML model and integrated it into our app. We learned how to convert models, handle predictions, and structure our code for maintainability.

Part 2: Core ML Testing — We wrote comprehensive unit tests for our model integration. We learned how to test predictions, handle edge cases, and ensure model performance stays consistent.

Part 3: Vision Framework — We used Vision to build a complete object detection feature in SwiftUI. We handled coordinate transformations, drew bounding boxes, and explored Vision’s built-in capabilities.

Together, these three pieces give you everything you need to ship AI-powered features on iOS. You’re no longer dependent on cloud APIs or third-party SDKs. Everything runs on-device, protecting user privacy and working offline.

Here’s my challenge to you: this week, find one place in your current project where you’re manually converting images for processing. Replace that code with a VNImageRequestHandler. I guarantee you’ll delete more lines than you add, and your code will be more reliable.

The future of iOS apps is on-device AI. Vision makes that future accessible today.

Now go build something killer.

