Nothing more than injecting a custom global! We can create that same Apps SDK for any Agent/LLM, on any Chat-based Service (Web or Mobile)!
⭐⭐ Full Demo On GitHub! ⭐⭐
___________
Apps SDK? A framework to build apps for ChatGPT?
So many posts out there are bragging about how this new ChatGPT app ecosystem will put Apple and Android in trouble!
But just like what I shared with you in my Build An App for ChatGPT! Step By Step!,
all these apps for ChatGPT really do is display some HTML fetched from an MCP server and inject a custom global object to allow two-way communication!
OpenAI is trying to make it sound a lot more amazing than it actually is!
To a point that makes me wonder why OpenAI dares to call it an App!
We can support the same thing, ie: create our own Apps SDK to display “Apps”, in our own chat service in just three steps!
- Fetch the HTML Resource from the MCP Server if there is one embedded in the tool called
- Display the HTML with a WebView if mobile, an iframe if Web!
- Inject the Custom Global Object (JavaScript)
And obviously!
Those steps above are not limited to any Agent/LLM or any platform! We will be making a Swift App with Foundation Models, but you can use the same idea to create a Web App with Strands Agents!
Let’s go!
Basic Approach
First of all, let’s recap how the App for ChatGPT, or that OpenAI Apps SDK, works under the hood. Quickly! If you want a little more detail, please check out my previous article Build An App for ChatGPT! Step By Step!.
- We (as the MCP Server) have some bundled HTML (a single JavaScript/CSS module that the server can render inline, with an HTML as the entry point; this is basically the App rendered by ChatGPT) and we register those as MCP Server Resources.
- If we want ChatGPT to display our App (HTML) on an MCP Tool call, we add the resource URI above as an embedded resource to the tool using the _meta key while registering the tool.
- We start the MCP Server and register it for ChatGPT to use with a ChatGPT Connector.
Now, if ChatGPT decides to use our App with that HTML defined, it will display our HTML in an inline iframe, where our App (HTML) can talk to the host (ChatGPT) via the window.openai API: get the tool input, output, and metadata, call tools programmatically, send follow-up messages, and so on.
_________________
With the above in mind, to support Apps for Foundation Models, ie: create our own Apps SDK framework to allow MCP servers to display apps within our Chat App, here is all we have to do in addition to what we had in Swift: Power Foundation Models With MCP Servers (Tools), where we had already connected to the MCP Servers, discovered the tools available, and added those as Foundation Models Tools to our LanguageModelSession!
If the Foundation Models decides to call one of our MCP Tools, after getting a CallTool.Result from the MCP server, we will also
- Check if the tool has an embedded resource.
- If yes, retrieve the resource (HTML)
- Display the HTML with a WebView
- Inject that custom global object into the WebView so that the HTML gets to talk to our app the same way it does with ChatGPT!
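In code, the whole flow inside our MCPManager will boil down to roughly this. Just a sketch to set the stage: handleToolCall is an illustrative name, and every helper used here gets implemented later in this article!
extension MCPManager {
    // Sketch only: the real logic lives in MCPManager.callTool further down.
    func handleToolCall(client: MCPClient, toolName: String, arguments: [String: Value]?) async throws {
        // 1. Call the tool as usual and keep the raw CallTool.Result around
        let result: CallTool.Result = try await self.callTool(client: client, toolName: toolName, arguments: arguments)
        // 2. Check whether the tool has an embedded resource (openai/outputTemplate)
        guard let resourceURI = self.getAppResourceURI(client: client, toolName: toolName) else { return }
        // 3. Retrieve the HTML resource
        let html = try await self.getAppResource(client: client, uri: resourceURI)
        // 4. Display `html` in a WebView with window.openai injected — that is the MCPAppWebview we will build
        _ = (result, html)
    }
}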
Since I don’t want to rewrite the App or rebuild my MCP Server from what we had in my previous Build An App for ChatGPT! Step By Step!, we will be
- Using the same metadata keys as OpenAI does: openai/outputTemplate for the resource URI, openai/widgetAccessible to set whether the tool is callable from the widget (HTML, App, whatever you call it…), and so on.
- Calling our global object window.openai and providing the same interface.
But of course, you can change it if you like!
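For reference, here is what the _meta of such a tool could look like once decoded on the Swift side into the MCP SDK's Value type. The resource URI here is just a made-up example, not something our server actually registers!
import MCP

// Illustrative only: a tool's _meta decoded into Value, using OpenAI's metadata keys.
// "ui://widget/pokemon.html" is a placeholder URI; use whatever your server registers.
let sampleToolMeta: Value = .object([
    "openai/outputTemplate": .string("ui://widget/pokemon.html"),
    "openai/widgetAccessible": .bool(true)
])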
Set Up
Local MCP Server
If you have already built an App for ChatGPT, you can use that one if you like. If you have not yet, you can grab the one I made from my GitHub, which basically exposes a single get_pokemon tool that is linked to a resource (App, HTML, whatever you call it), and start it as follows.
- Run npm install under both the server and the web directories.
- Build the web by running npm run build. This will compile the React app into a single JS/CSS module, as well as generate an HTML to be used as the MCP Server resource.
- Start the MCP Server by running npm run start. This will start the MCP server listening on http://localhost:3000/mcp.
SwiftUI App
What we have in this article will be built on top of what we had from Swift: Power Foundation Models With MCP Servers (Tools) so please make sure to grab that as well if you want to code along as you read! (Or you can just use the final version here!)
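As a quick reminder, that starting point already connects to the MCP server and discovers its tools, roughly like this. (A simplified sketch, not the exact code from that article; the client name and the function wrapping it are arbitrary!)
import Foundation
import MCP

// Connect to the local MCP server we just started and list its tools.
func connectToLocalServer() async throws -> Client {
    let transport = HTTPClientTransport(endpoint: URL(string: "http://localhost:3000/mcp")!)
    let client = Client(name: "FoundationModelsChat", version: "1.0.0") // name/version are arbitrary
    try await client.connect(transport: transport)
    let (tools, _) = try await client.listTools()
    print("Available tools:", tools.map(\.name))
    return client
}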
Now, unfortunately, the official Swift SDK for MCP is falling like a thousand years behind the actual MCP specification and the SDKs for TypeScript, Python, etc. Specifically, when retrieving the tools, we don’t get the _meta field, and when calling a tool, we don’t get structuredContent or _meta for the result!
So!
I have forked it to add in the necessary ones! Let’s use this one here!
WebView to Display The App
It is basically the same thing as what we had in my previous SwiftUI: Webview ↔ JavaScript. Two-Way Communication., except for changing that window.itsuki to window.openai with the necessary properties!
Window.OpenAI Definition
Since window.openai is what we will be injecting, let’s first take a look at how it is defined. (I know, I have already done this in my previous Build An App for ChatGPT! Step By Step!, but! Just to make sure we are on the same page!)
export type OpenAiGlobals<
ToolInput = UnknownObject,
ToolOutput = UnknownObject,
ToolResponseMetadata = UnknownObject,
WidgetState = UnknownObject
> = {
// visuals
theme: Theme;
userAgent: UserAgent;
locale: string;
// layout
maxHeight: number;
displayMode: DisplayMode;
safeArea: SafeArea;
// state
toolInput: ToolInput;
toolOutput: ToolOutput | null;
toolResponseMetadata: ToolResponseMetadata | null;
widgetState: WidgetState | null;
setWidgetState: (state: WidgetState) => Promise<void>;
};
// currently copied from types.ts in chatgpt/web-sandbox.
// Will eventually use a public package.
type API = {
callTool: CallTool;
sendFollowUpMessage: (args: { prompt: string }) => Promise<void>;
openExternal(payload: { href: string }): void;
// Layout controls
requestDisplayMode: RequestDisplayMode;
};
export type UnknownObject = Record<string, unknown>;
export type Theme = "light" | "dark";
export type SafeAreaInsets = {
top: number;
bottom: number;
left: number;
right: number;
};
export type SafeArea = {
insets: SafeAreaInsets;
};
export type DeviceType = "mobile" | "tablet" | "desktop" | "unknown";
export type UserAgent = {
device: { type: DeviceType };
capabilities: {
hover: boolean;
touch: boolean;
};
};
/** Display mode */
export type DisplayMode = "pip" | "inline" | "fullscreen";
export type RequestDisplayMode = (args: { mode: DisplayMode }) => Promise<{
/**
* The granted display mode. The host may reject the request.
* For mobile, PiP is always coerced to fullscreen.
*/
mode: DisplayMode;
}>;
export type CallToolResponse = {
// result: the string (text) content return by the tool using { type: 'text', text: JSON.stringify(structuredContent) },
result: string;
};
/** Calling APIs */
export type CallTool = (
name: string,
args: Record<string, unknown>
) => Promise<CallToolResponse>;
/** Extra events */
export const SET_GLOBALS_EVENT_TYPE = "openai:set_globals";
export class SetGlobalsEvent extends CustomEvent<{
globals: Partial<OpenAiGlobals>;
}> {
readonly type = SET_GLOBALS_EVENT_TYPE;
}
/**
* Global oai object injected by the web sandbox for communicating with chatgpt host page.
*/
declare global {
interface Window {
openai: API & OpenAiGlobals;
}
interface WindowEventMap {
[SET_GLOBALS_EVENT_TYPE]: SetGlobalsEvent;
}
}
Now, for simplification, we will only be injecting the crucial ones here (judged by me!). Specifically, the properties related to the tool use that triggered the display.
toolInput: ToolInput;
toolOutput: ToolOutput | null;
toolResponseMetadata: ToolResponseMetadata | null;
And the functions that allow the app to interact with the main functionalities of the host (our SwiftUI App!).
callTool: CallTool;
sendFollowUpMessage: (args: { prompt: string }) => Promise<void>;
⭐ Script Manager ⭐
This is basically the key part of supporting “Apps”, allowing them to talk to the host (our SwiftUI App!) via the custom global object.
Again, same as what we had in SwiftUI: Webview ↔ JavaScript. Two-Way Communication., we will be using a separate manager class conforming to WKScriptMessageHandlerWithReply to respond to messages from the JavaScript code running in the webpage, in this case, to callTool and sendFollowUpMessage.
import SwiftUI
import MCP
import WebKit
class WebPageScriptManager: NSObject {
// Calls a tool on the tool's MCP. Returns the full response
var callSelfMCPTool: ((String, Dictionary<String, Value>) async throws -> CallTool.Result)?
// insert a message into the conversation as if the user asked it.
var sendUserMessage: ((String) async throws -> Void)?
// Messages we will be receiving from JavaScript code as well as responding to
enum MessageWithReplyName: String {
case callTool
case sendFollowUpMessage
}
// message keys for postMessage called on MessageWithReplyName.callTool
// postMessage({
// "\(nameKey)": "some name",
// "\(argumentKey)": { key: "value" }
// })
private let toolNameKey = "name"
private let toolArgumentKey = "args"
func createUserContentController(
toolInputJson: String,
toolOutputJson: String,
toolResponseMetadataJson: String
) -> WKUserContentController {
let contentController = WKUserContentController()
// script to be injected
let script =
"""
window.openai = {
"toolOutput": \(toolOutputJson),
"toolInput": \(toolInputJson),
"toolResponseMetadata": \(toolResponseMetadataJson),
"callTool": async (name, value) => {
return await window.webkit.messageHandlers.\(MessageWithReplyName.callTool.rawValue).postMessage({
"\(toolNameKey)": name,
"\(toolArgumentKey)": value
})
},
"sendFollowUpMessage": async (args) => {
return await window.webkit.messageHandlers.\(MessageWithReplyName.sendFollowUpMessage.rawValue).postMessage(args)
}
}
"""
let userScript = WKUserScript(source: script, injectionTime: .atDocumentStart, forMainFrameOnly: true)
contentController.addUserScript(userScript)
// Installs a message handler that returns a reply to your JavaScript code.
contentController.addScriptMessageHandler(self, contentWorld: .page, name: MessageWithReplyName.callTool.rawValue)
contentController.addScriptMessageHandler(self, contentWorld: .page, name: MessageWithReplyName.sendFollowUpMessage.rawValue)
return contentController
}
}
// MARK: WKScriptMessageHandlerWithReply
// An interface for *responding* to messages from JavaScript code running in a webpage.
extension WebPageScriptManager: WKScriptMessageHandlerWithReply {
// returning (Result, Error)
func userContentController(_ userContentController: WKUserContentController, didReceive message: WKScriptMessage) async -> (Any?, String?) {
print(#function, "WKScriptMessageHandlerWithReply")
print(message.name)
guard let name: MessageWithReplyName = .init(rawValue: message.name) else {
return (nil, "Message received from unknown message handler with name: \(message.name)")
}
let body = message.body
do {
switch name {
case .callTool:
let result = try await self.handleCallToolCalled(messageBody: body)
return (result, nil)
case .sendFollowUpMessage:
try await self.handleSendFollowUpMessageCalled(messageBody: body)
return (nil, nil)
}
} catch (let error) {
return (nil, "Error: \(error.localizedDescription)")
}
}
// body: { "name": string, args: Record }
private func handleCallToolCalled(messageBody: Any) async throws -> Any? {
print(messageBody)
guard let body = messageBody as? Dictionary<String, Any> else {
throw NSError(domain: "InvalidMessageBody", code: 400)
}
guard let name = body[toolNameKey] as? String, let arguments = body[toolArgumentKey] as? Dictionary<String, Any> else {
throw NSError(domain: "InvalidParameters", code: 400)
}
print("Call Tool. Name: \(name). Arguments. \(arguments)")
print(name, arguments)
let encodedArgs = try JSONSerialization.data(withJSONObject: arguments)
let argDict = try JSONDecoder().decode([String: Value].self, from: encodedArgs)
guard let callTool = self.callSelfMCPTool else {
throw NSError(domain: "CallToolUnavailable", code: 500)
}
let result: CallTool.Result = try await callTool(name, argDict)
return try result.jsonDictionary
}
// body: { "prompt": string }
private func handleSendFollowUpMessageCalled(messageBody: Any) async throws {
guard let body = messageBody as? Dictionary<String, Any> else {
throw NSError(domain: "InvalidMessageBody", code: 400)
}
guard let prompt = body["prompt"] as? String else {
throw NSError(domain: "InvalidPrompt", code: 400)
}
print("prompt: \(prompt)")
guard let sendUserMessage = self.sendUserMessage else {
throw NSError(domain: "SendUserMessgaeUnavailable", code: 500)
}
try await sendUserMessage(prompt)
return
}
}
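One thing to note: the result.jsonDictionary used above is not something the SDK gives us. It is a tiny helper along these lines (a simplified sketch, assuming we just round-trip the Codable result through JSONSerialization; grab the full version from the demo repo):
import Foundation
import MCP

extension CallTool.Result {
    // Encode the whole tool result, then re-decode it into a plain dictionary
    // that WebKit can serialize back to the JavaScript caller as the reply.
    var jsonDictionary: [String: Any] {
        get throws {
            let data = try JSONEncoder().encode(self)
            return try JSONSerialization.jsonObject(with: data) as? [String: Any] ?? [:]
        }
    }
}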
Now, what are these toolOutputJson, toolInputJson, and toolResponseMetadataJson, and what do they look like?
They are basically just the tool input arguments, structured output, and response metadata in their JSON form.
In the case of my get_pokemon tool, they will look something like the following in Swift.
let sampleToolInput = """
{ "name": "pikachu" }
"""
let sampleToolMetadata = """
{
  "sprites": {
    "back_default": "https://raw.githubusercontent.com/PokeAPI/sprites/master/sprites/pokemon/back/25.png",
    "back_female": "https://raw.githubusercontent.com/PokeAPI/sprites/master/sprites/pokemon/back/female/25.png",
    "back_shiny": "https://raw.githubusercontent.com/PokeAPI/sprites/master/sprites/pokemon/back/shiny/25.png",
    "back_shiny_female": "https://raw.githubusercontent.com/PokeAPI/sprites/master/sprites/pokemon/back/shiny/female/25.png",
    "front_default": "https://raw.githubusercontent.com/PokeAPI/sprites/master/sprites/pokemon/25.png",
    "front_female": "https://raw.githubusercontent.com/PokeAPI/sprites/master/sprites/pokemon/female/25.png",
    "front_shiny": "https://raw.githubusercontent.com/PokeAPI/sprites/master/sprites/pokemon/shiny/25.png",
    "front_shiny_female": "https://raw.githubusercontent.com/PokeAPI/sprites/master/sprites/pokemon/shiny/female/25.png"
  }
}
"""
let sampleToolOutput = """
{
"result": {
"name": "pikachu",
"id": 25,
"height": 4,
"weight": 60,
"types": [
{
"slot": 1,
"type": {
"name": "electric",
"url": "https://pokeapi.co/api/v2/type/13/"
}
}
]
}
}
"""
And of course, these are String values in Swift, but since we inject them like above without surrounding quotation marks, they automatically become JSON objects in JavaScript!
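And in case you are wondering how those JSON strings get produced from the MCP types in the first place, here is roughly what the little helpers look like. (A simplified sketch: I am assuming the forked SDK exposes structuredContent and _meta on the call result; the exact field types depend on the fork, so grab the real versions from the demo repo!)
import Foundation
import MCP

extension Value {
    // Encode an MCP Value into a JSON string we can splice straight into the injected script.
    var jsonString: String {
        get throws {
            let data = try JSONEncoder().encode(self)
            return String(data: data, encoding: .utf8) ?? "null"
        }
    }
}

extension CallTool.Result {
    // structuredContent and _meta are the fields the fork adds to the call result;
    // both are optional, so we fall back to "null" when the server sends nothing.
    var structuredContentJson: String {
        get throws {
            guard let structuredContent else { return "null" }
            let data = try JSONEncoder().encode(structuredContent)
            return String(data: data, encoding: .utf8) ?? "null"
        }
    }
    var metadataJson: String {
        get throws {
            guard let meta = self._meta else { return "null" }
            let data = try JSONEncoder().encode(meta)
            return String(data: data, encoding: .utf8) ?? "null"
        }
    }
}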
WebView
A simple view wrapping around the WebView!
Get the resource HTML and display it! That’s it!
import SwiftUI
import WebKit
import MCP
struct MCPAppWebviewParameters {
var mcpClient: MCPClient
var resourceURI: String
var toolInputJson: String
var toolOutputJson: String
var toolResponseMetadataJson: String
}
extension MCPAppWebview {
init(_ params: MCPAppWebviewParameters) {
self.parameters = params
}
}
struct MCPAppWebview: View {
@Environment(ChatManager.self) private var chatManager
private let parameters: MCPAppWebviewParameters
private let scriptManager: WebPageScriptManager = WebPageScriptManager()
@State private var webpage: WebPage?
@State private var error: String?
var body: some View {
VStack(alignment: .leading, spacing: 24) {
if let error = self.error {
ContentUnavailableView("Oops!", systemImage: "exclamationmark.octagon", description: Text(error))
} else {
if let webpage = self.webpage {
WebView(webpage)
.webViewContentBackground(.hidden)
.overlay(content: {
if webpage.isLoading {
ProgressView()
.controlSize(.large)
.frame(maxWidth: .infinity, maxHeight: .infinity)
.background(.yellow.opacity(0.1))
}
})
} else {
ProgressView()
.controlSize(.large)
.frame(maxWidth: .infinity, maxHeight: .infinity)
.background(.yellow.opacity(0.1))
}
}
}
.frame(maxWidth: .infinity, maxHeight: .infinity)
.task {
do {
try await self.initWebpage()
} catch(let error) {
print(error)
self.error = error.localizedDescription
}
}
}
private func initWebpage() async throws {
let html = try await self.chatManager.getAppResourceHTML(client: parameters.mcpClient, uri: parameters.resourceURI)
var configuration = WebPage.Configuration()
var navigationPreference = WebPage.NavigationPreferences()
navigationPreference.allowsContentJavaScript = true
navigationPreference.preferredHTTPSNavigationPolicy = .keepAsRequested
navigationPreference.preferredContentMode = .mobile
configuration.defaultNavigationPreferences = navigationPreference
// userContentController: An object for managing interactions between JavaScript code and your web view, and for filtering content in your web view.
configuration.userContentController = self.scriptManager.createUserContentController(toolInputJson: parameters.toolInputJson, toolOutputJson: parameters.toolOutputJson, toolResponseMetadataJson: parameters.toolResponseMetadataJson
)
self.scriptManager.callSelfMCPTool = { name, args in
let client = self.parameters.mcpClient
guard let tool = client.tools.first(where: {$0.name == name}) else {
throw NSError(domain: "ToolDoesNotExist", code: 400)
}
guard tool.widgetAccessible else {
throw NSError(domain: "ToolNotCallableFromWidget", code: 400)
}
return try await self.chatManager.callTool(client: self.parameters.mcpClient, toolName: name, arguments: args)
}
self.scriptManager.sendUserMessage = { prompt in
return try await self.chatManager.respond(to: prompt)
}
let page = WebPage(configuration: configuration)
self.webpage = page
page.load(html: html)
}
}
Update ChatManager & MCP Manager
Now, we want to display the MCPAppWebview when a tool that has an embedded resource is called, even before the Foundation Models provides a final response using the tool output. Of course, you can choose to show it at some other point, but let’s do it the way OpenAI/ChatGPT does it here!
A couple of little modifications to our ChatManager and MCPManager from our previous Swift: Power Foundation Models With MCP Servers (Tools)!
MCPManager
If you recall, when our Foundation Models decides to use a tool, our MCPManager.callTool will be triggered and is responsible for returning a generable ToolOutputType back.
Here is where we will want to populate the MCPAppWebviewParameters because
- we need the original CallTool.Result, not the one processed for the Foundation Models
- we want to show the App before the Foundation Models’ response
enum MetadataKey: String {
case outputTemplate = "openai/outputTemplate"
case widgetAccessible = "openai/widgetAccessible"
}
// ...
var onAppAvailable: ((MCPAppWebviewParameters) -> Void)?
// OpenAI uses Skybridge for its iframe sandbox, so we include text/html+skybridge as a target MIME type as well
private let appResourceMimeTypes = ["text/html+skybridge", "text/html"]
private func callTool(client: MCPClient, toolName: String, arguments: GeneratedContent) async throws -> ToolOutputType {
print(#function)
let json = arguments.jsonString
let arguments: [String: Value]? = try? self.jsonDecoder.decode([String: Value].self, from: Data(json.utf8))
let message = "[Using Tool] Name: \(toolName). Arguments: \(arguments, default: "(No args).")"
print(message)
self.onToolUse?(message)
defer {
self.onToolUse?(nil)
}
// Call a tool with arguments
let result = try await self.callTool(
client: client,
toolName: toolName,
arguments: arguments
)
var output = try ToolOutputType(contents: result.content)
if result.isError == true {
output.texts.insert("Error executing tool.", at: 0)
} else {
do {
if let parameters = try await self.createAppViewParams(client: client, toolName: toolName, inputArguments: arguments, toolResult: result) {
self.onAppAvailable?(parameters)
}
} catch(let error) {
print(error)
}
}
return output
}
private func createAppViewParams(
client: MCPClient,
toolName: String,
inputArguments: [String: Value]?,
toolResult: CallTool.Result
) async throws -> MCPAppWebviewParameters? {
guard let resourceURI = getAppResourceURI(client: client, toolName: toolName) else {
return nil
}
let inputValue = inputArguments == nil ? Value.null : Value.object(inputArguments!)
let parameter = await MCPAppWebviewParameters(
mcpClient: client,
resourceURI: resourceURI,
toolInputJson: try inputValue.jsonString,
toolOutputJson: try toolResult.structuredContentJson,
toolResponseMetadataJson: try toolResult.metadataJson
)
return parameter
}
private func getAppResourceURI(client: MCPClient, toolName: String) -> String? {
guard let tool = client.tools.first(where: {$0.name == toolName}) else {
return nil
}
guard let meta = tool._meta?.objectValue else {
return nil
}
guard let appResourceURI = meta[MetadataKey.outputTemplate.rawValue]?.stringValue else {
return nil
}
print("App Resource Available for tool: \(toolName). ResourceURI: \(appResourceURI)")
return appResourceURI
}
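Also, the tool.widgetAccessible check we used back in MCPAppWebview is just a small convenience property reading the same _meta. A minimal sketch, assuming the forked SDK exposes _meta on Tool as a Value:
extension Tool {
    // Whether the widget (HTML) is allowed to call this tool,
    // based on the openai/widgetAccessible metadata key. Defaults to false when missing.
    var widgetAccessible: Bool {
        self._meta?.objectValue?[MetadataKey.widgetAccessible.rawValue]?.boolValue ?? false
    }
}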
I have also added a helper function for retrieving App HTML specifically.
// Retrieve the HTML string for the App Resource
func getAppResource(client: MCPClient, uri: String) async throws -> String {
let contents = try await self.retrieveResource(client: client.client, uri: uri)
guard let targetContent = contents.first(where: { content in
guard let mimeType = content.mimeType, content.text != nil else {
return false
}
guard self.appResourceMimeTypes.contains(mimeType) else {
return false
}
return true
}) else {
throw NSError(domain: "ResourceNotExist", code: 400)
}
return targetContent.text!
}
ChatManager
First of all, an additional MessageType for the app,
enum MessageType: Identifiable, Equatable {
case userPrompt(UUID, String)
case mcpApp(UUID, MCPAppWebviewParameters)
case response(UUID, ResponseType)
// ...
}
And to allow it to be appended to our messages array,
init() {
self.mcpManager.onToolUse = { self.toolUseMessage = $0 }
self.mcpManager.onAppAvailable = { self.messages.append(.mcpApp(UUID(), $0)) }
// ...
}
I have also added a couple of helper functions just wrapping around those defined in MCPManager!
func getAppResourceHTML(client: MCPClient, uri: String) async throws -> String {
return try await self.mcpManager.getAppResource(client: client, uri: uri)
}
func callTool(client: MCPClient, toolName: String, arguments: [String: Value]? = nil) async throws -> CallTool.Result {
return try await self.mcpManager.callTool(client: client, toolName: toolName, arguments: arguments)
}
Update ContentView
Last bit!
We have added an additional case to our MessageType, so of course, we will need to handle that case in our view!
case .mcpApp(_, let params):
MCPAppWebview(params)
.environment(self.chatManager)
.listRowBackground(Color.clear)
.frame(height: 360)
.frame(maxWidth: .infinity, alignment: .leading)
.background(RoundedRectangle(cornerRadius: 24).fill(.clear).stroke(.yellow.opacity(0.5), lineWidth: 2))
.clipShape(RoundedRectangle(cornerRadius: 24))
Done!
Test It Out!
Make sure that our local MCP Server is started and let’s go!
Just like our App for ChatGPT from my previous Build An App for ChatGPT! Step By Step!
Without changing a single line of MCP Server code or the App HTML!
Final Thoughts
Above are all the important bits of this article!
Below are just some of my personal complaints!
If you don’t want to listen to me ranting about OpenAI, you can skip it and call it a day!
As we can see, the approach, the idea, is so simple, and there is nothing limiting what we did here (or what OpenAI did for ChatGPT) to any specific Agent/LLM or any particular Chat-based service! Web or Mobile!
Which means!
OpenAI is being really annoying by naming those metadata keys and that custom global object after themselves!
Wouldn’t it be so much better if those were defined as part of the MCP specification?!
We, as developers who make MCP Servers, could then reuse the same server, same app, for any Chat-based service! Not just ChatGPT!
Anyway!
Thank you for reading!
Again, feel free to grab this little demo from my GitHub!
I guess the fact that OpenAI calls this thing an App really bothers me, so I decided to write this article and get all the chat-based services out there to create their own Apps SDK!
Happy creating SDKs!