Inside the Pipeline: How a TikTok or YouTube URL Becomes a 7-Day Trip in Jettova

Jettova Travel Team · Travel Editors · June 3, 2026 (Updated Jun 11, 2026)

Key Takeaways

ffmpeg samples 5 frames evenly across the video duration. They go to our AI as a single multi-image vision call with an explicit instruction to trust frames over captions when they disagree.
The 7-day itinerary is generated in the same call, not a follow-up: two-step pipelines lost grounding on the frame content and produced more generic days.
Schema-level constraints in the tool definition (days 1-3 must reference frame content, no title may repeat, every slot must be a real-feeling named place) keep the model honest at our model's quality level.
Total latency is 4-10 seconds end-to-end; the multi-image vision call dominates. Rate limit is 3/hour per user (or per IP for anonymous) via Upstash sliding window.
The 6-up photo strip on the result page (cover + 5 extracted frames) is the user-facing audit trail. It's the single design decision that moves the feature from 'magical' to 'transparent and useful.'

The simplest demo of 'AI plans your trip from a TikTok' uses the cover frame and the caption. It looks great on launch. Then someone pastes a video of ziplining over a Costa Rican jungle, captioned 'endless summer vibes,' and the AI returns a beach vacation in Aruba because the algorithm-optimized thumbnail showed a beach and the caption named the wrong vibe. Or the reverse: a video walking through the salt flats of Bolivia gets misread as 'desert beach' because the cover happened to be the brightest frame in the video. The first failure mode of cover-frame-plus-caption is structural: TikTok's recommendation system optimizes the thumbnail for click-through, not destination accuracy, and captions describe feelings rather than activities.

This is the technical writeup of how Jettova's video-to-trip pipeline actually works — stage by stage, with the trade-offs we considered and the ones we didn't.

Stage zero: URL detection. The route accepts both TikTok and YouTube URLs because the user-facing affordance is 'paste a video' rather than 'pick a platform.' Two regexes test the URL on the way in; rejections happen before any network call so we don't spend money on the wrong shape of input. Instagram Reels are technically possible but operationally painful — Instagram aggressively blocks scrapers, and the reliable paths are paid services we'd rather wait to wire until demand justifies the cost.
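
A minimal sketch of that gate, assuming patterns of our own choosing: the write-up only says two regexes run before any network call, so TIKTOK_URL, YOUTUBE_URL, and detectPlatform below are illustrative names and shapes, not the production ones.

```typescript
// Hypothetical patterns: the article says "two regexes" but not what they match.
type Platform = "tiktok" | "youtube";

// Any path on tiktok.com, including the common short-link subdomains.
const TIKTOK_URL = /^https?:\/\/(?:www\.|m\.|vm\.|vt\.)?tiktok\.com\/\S+$/i;

// watch?v=, /shorts/, and youtu.be forms, each followed by an 11-character video id.
const YOUTUBE_URL =
  /^https?:\/\/(?:(?:www\.|m\.)?youtube\.com\/(?:watch\?\S*v=|shorts\/)|youtu\.be\/)[\w-]{11}/i;

// Returns the platform for a supported URL, or null so the route can reject
// the request before spending anything on a network call.
function detectPlatform(rawUrl: string): Platform | null {
  const url = rawUrl.trim();
  if (TIKTOK_URL.test(url)) return "tiktok";
  if (YOUTUBE_URL.test(url)) return "youtube";
  return null;
}
```

With these assumed patterns, an Instagram Reel URL returns null and never reaches the metadata stage.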

Stage one: metadata fetch. Platform-specific. For TikTok we use TikWM, a public unauthenticated API that returns the no-watermark mp4 URL, cover image, caption, author handle, and duration. It's free, which is great for a prototype and bad for a load-bearing dependency. We wrapped the call in a three-try retry with 500ms / 1s / 2s exponential backoff so transient blips resolve on the second attempt. The failure mode we'd hate but don't have yet is sustained TikWM rate-limiting against Vercel's IPs; the fix queued for that day is swapping to RapidAPI's TikTok endpoint, which returns the same response shape for about $0.001 per call.
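
The retry layer could be sketched roughly as follows. This is an illustration under assumptions, not the production wrapper: withRetry, backoffDelayMs, and the injectable sleep parameter are hypothetical names, and the sketch assumes the delay doubles from 500ms and that no wait happens after the final failure.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay before the retry that follows failed attempt `attempt` (0-based):
// 500, 1000, 2000, ... (assumed doubling schedule).
function backoffDelayMs(attempt: number, baseMs: number = 500): number {
  return baseMs * 2 ** attempt;
}

// Runs `task` up to `maxTries` times and rethrows the last error if all fail.
// `sleep` is injectable so the schedule can be tested without real waiting.
async function withRetry<T>(
  task: () => Promise<T>,
  maxTries: number = 3,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxTries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Sleeping after the final failure would only delay the error.
      if (attempt < maxTries - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

A single 502 followed by a 200 costs one 500ms wait under this schedule; sustained downtime still surfaces as an error after the third try.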

For YouTube we use @distube/ytdl-core, an actively maintained fork of ytdl-core that handles Shorts, regular videos, and the youtu.be short-URL form uniformly. It returns the same shape as the TikTok fetch (a play URL, a thumbnail, a title, a channel handle, a duration), which makes the rest of the pipeline platform-agnostic. We use ytdl.chooseFormat with quality 'lowest' and filter 'videoandaudio' to pull the smallest combined stream available; we only sample 5 frames from it, so HD bandwidth would be wasted spend.

Stage two: duration cap. Long videos balloon the mp4 download plus ffmpeg work past our 60-second function budget on the server, and trip-inspiration content in practice fits inside a Short, a Reel, or a clip from a vlog rather than a 45-minute documentary. We hard-cap at 10 minutes and return a friendly error message that names the actual length when a longer video comes in. The cap doesn't really limit which videos people import (almost everything users paste is under 90 seconds) but it bounds the worst-case server cost.
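
As a rough sketch of the cap check, assuming the duration arrives in seconds from the metadata stage: checkDuration, formatDuration, and the error wording are hypothetical, since the article only specifies the 10-minute limit and that the message names the actual length.

```typescript
const MAX_DURATION_SECONDS = 10 * 60;

// Formats seconds as m:ss, e.g. 2700 becomes "45:00".
function formatDuration(totalSeconds: number): string {
  const rounded = Math.round(totalSeconds);
  const minutes = Math.floor(rounded / 60);
  const seconds = rounded % 60;
  return minutes + ":" + (seconds < 10 ? "0" + seconds : String(seconds));
}

type DurationCheck = { ok: true } | { ok: false; message: string };

// Rejects over-long videos before any download or ffmpeg work starts.
function checkDuration(durationSeconds: number): DurationCheck {
  if (durationSeconds <= MAX_DURATION_SECONDS) return { ok: true };
  return {
    ok: false,
    message:
      "That video is " + formatDuration(durationSeconds) +
      " long. We can import videos up to " + formatDuration(MAX_DURATION_SECONDS) + ".",
  };
}
```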

Stage three: frame extraction. The mp4 is downloaded to /tmp on the serverless function (Vercel Functions explicitly support writes to /tmp despite the static-analysis tooling that occasionally flags it). ffmpeg runs with a single command: an fps filter set to count/duration (which evenly distributes the sample across the runtime), a scale filter to 512px wide (vision cost is proportional to image area, so wider is wasted), and -frames:v capped at the target count. For a 30-second video we get five frames at t=0, 6, 12, 18, 24. For a 5-second video, we get five frames at t=0, 1, 2, 3, 4. Edge case: very short videos with fps > 1 produce fewer frames than requested — ffmpeg handles this gracefully and we send whatever came back.
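
The sampling arithmetic above can be sketched as two small helpers. Both are assumptions for illustration: buildFfmpegArgs and sampleTimes are hypothetical names, the output pattern is made up, and scale=512:-2 (height derived from the aspect ratio, kept even) is one plausible way to express "512px wide". The flags themselves (-i, -vf with the fps and scale filters, -frames:v) are standard ffmpeg options.

```typescript
const FRAME_COUNT = 5;
const FRAME_WIDTH_PX = 512;

// Nominal sample timestamps for an fps filter set to count/duration.
function sampleTimes(durationSeconds: number, count: number = FRAME_COUNT): number[] {
  const times: number[] = [];
  for (let i = 0; i < count; i++) times.push((i * durationSeconds) / count);
  return times;
}

// Argument list for a single ffmpeg invocation that writes `count` JPEGs.
function buildFfmpegArgs(
  inputPath: string,
  outputPattern: string,
  durationSeconds: number,
  count: number = FRAME_COUNT,
): string[] {
  if (!(durationSeconds > 0)) throw new Error("duration must be positive");
  const fps = (count / durationSeconds).toFixed(6);
  return [
    "-i", inputPath,
    "-vf", "fps=" + fps + ",scale=" + FRAME_WIDTH_PX + ":-2",
    "-frames:v", String(count),
    outputPattern,
  ];
}
```

For example, buildFfmpegArgs("/tmp/in.mp4", "/tmp/frame-%02d.jpg", 30) yields a filter string of fps=0.166667,scale=512:-2, and sampleTimes(30) gives 0, 6, 12, 18, 24, matching the 30-second case described above.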

Why 5 frames specifically. Three is too few: a video of someone moving through a market and then sitting at a sushi counter needs at least one frame per setting. Seven is wasteful: vision tokens are billed per image area and the marginal information from the 6th and 7th frame is small. Five turns out to capture distinct scenes in nearly every travel video we tested while keeping the multi-image call cost around $0.005.

Stage four: vision call. All five frames go to our AI as base64-encoded image blocks in a single call, alongside a text block with the caption, the author handle, and an explicit instruction set: look at each frame individually before answering; build the activities list from the union of what you see; when the caption and frames disagree, trust the frames; pick a real-feeling destination as 'City, Country.' The output is constrained by a tool_use schema that requires destination, vibe (one-line sentence), activities (2-5 strings), vibes from a canonical 12-vibe list, duration_days, confidence (high/medium/low), reasoning (one sentence), and a 7-day itinerary in the same shape as the normal Jettova trip itinerary.
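
To make the output contract concrete, here is one way the schema could look as a JSON Schema object. The top-level field names come from the list above; everything else is an assumption: the day/slots/title itinerary shape is a guess at "the normal Jettova trip itinerary", the 12 canonical vibes are not listed in this write-up so no enum is shown for them, and whether bounds like minItems are enforced depends on the model provider.

```typescript
// Hypothetical schema sketch; only the top-level field names are from the article.
const tripToolSchema = {
  type: "object",
  properties: {
    destination: { type: "string", description: "Real-feeling destination as 'City, Country'" },
    vibe: { type: "string", description: "One-line sentence" },
    activities: { type: "array", items: { type: "string" }, minItems: 2, maxItems: 5 },
    // The canonical 12-vibe list is not given in the article, so no enum here.
    vibes: { type: "array", items: { type: "string" } },
    duration_days: { type: "integer", minimum: 1 },
    confidence: { type: "string", enum: ["high", "medium", "low"] },
    reasoning: { type: "string", description: "One sentence" },
    itinerary: {
      type: "array",
      minItems: 7,
      maxItems: 7,
      items: {
        type: "object",
        properties: {
          day: { type: "integer" },
          slots: {
            type: "array",
            items: {
              type: "object",
              properties: {
                title: { type: "string", description: "A real-feeling named place" },
              },
              required: ["title"],
            },
          },
        },
        required: ["day", "slots"],
      },
    },
  },
  required: [
    "destination", "vibe", "activities", "vibes",
    "duration_days", "confidence", "reasoning", "itinerary",
  ],
};
```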

The itinerary is generated in the same call rather than as a follow-up. We tried it as a follow-up (destination first, then itinerary in a second call) and the failure mode was a noticeable disconnect. The second call couldn't access the frame context as cheaply, and the itinerary stopped referencing what was actually in the video. Generating both in one pass keeps the model grounded on the same input and costs about the same as the destination-only call did, because the input frames dominate the token cost.

Schema-level rules in the itinerary tool definition: days 1-3 must reference specific things visible in the frames; days 4-7 round out a believable trip in the destination; every slot title must be a real-feeling named place (not 'a local café'); no title can repeat across days. These constraints sound aggressive, but the model honors them at our model's quality level. Under-constrained itineraries from the same model returned 'Day 3 lunch: local restaurant', which is the kind of generic that signals to users we didn't actually look at the video.
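
The pipeline enforces these rules through the tool definition itself. Purely as an illustration of what two of them mean, here is a separate post-hoc check, which is not something the pipeline is described as running. The Day and Slot shapes, findRuleViolations, and the "starts with 'local'" heuristic for generic titles are all assumptions.

```typescript
interface Slot { title: string }
interface Day { day: number; slots: Slot[] }

// Flags repeated titles and obviously generic ones ("a local café", "local restaurant").
function findRuleViolations(itinerary: Day[]): string[] {
  const violations: string[] = [];
  const firstSeenOnDay = new Map<string, number>();
  for (const day of itinerary) {
    for (const slot of day.slots) {
      const key = slot.title.trim().toLowerCase();
      // Crude heuristic: a named place rarely starts with "local" or "a local".
      if (/^(?:an? )?local\b/.test(key)) {
        violations.push("Day " + day.day + ": generic title '" + slot.title + "'");
      }
      const earlier = firstSeenOnDay.get(key);
      if (earlier !== undefined) {
        violations.push("Day " + day.day + ": '" + slot.title + "' already used on day " + earlier);
      } else {
        firstSeenOnDay.set(key, day.day);
      }
    }
  }
  return violations;
}
```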

Stage five: response payload. The route echoes the frames back to the client as data: URLs in the response so the page can render them as a photo strip. This adds about 250 KB to the response — acceptable for a one-shot page render, not a hot path. The booking links (Expedia flights/cars, Booking hotels, Airalo eSIM, Viator activities) are built server-side from the destination plus a default date window (today+30 days, lasting durationDays). Outbound search URLs only; we don't fake live inventory because the affiliate behavior at click time is what matters, not in-line pricing.
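
Two pieces of this stage are simple enough to sketch. The helper names are hypothetical, the frames are assumed to be JPEGs already base64-encoded, and the window is assumed to end durationDays after it starts, computed in UTC. The affiliate URL formats are left out because they are not described here.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Wraps an already base64-encoded JPEG frame as a data: URL for the photo strip.
function toDataUrl(base64Jpeg: string): string {
  return "data:image/jpeg;base64," + base64Jpeg;
}

// YYYY-MM-DD in UTC.
function toIsoDate(date: Date): string {
  return date.toISOString().slice(0, 10);
}

// Default booking window: starts 30 days after `today`, lasts `durationDays`.
function defaultDateWindow(today: Date, durationDays: number): { start: string; end: string } {
  const start = new Date(today.getTime() + 30 * MS_PER_DAY);
  const end = new Date(start.getTime() + durationDays * MS_PER_DAY);
  return { start: toIsoDate(start), end: toIsoDate(end) };
}
```

For a 7-day trip generated on 2026-06-03, this gives a window of 2026-07-03 to 2026-07-10.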

Rate limiting. Three imports per hour for non-admins, enforced via Upstash sliding-window rate-limit. Identifier is user_id from the bearer token if signed in, IP if anonymous, so a single user across devices shares the same quota and an anonymous abuser can't multiply by switching browsers from the same connection. Each import costs about $0.005 in vision tokens plus a few cents of bandwidth, so 3/hour caps the abuse case at a few dollars a day before we'd notice, while staying loose enough not to throttle legitimate use ('I have five candidate videos for next month's trip').
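
To show the semantics rather than the real integration, here is a simplified in-memory stand-in. It is not Upstash's implementation or API: SlidingWindowLimiter and rateLimitKey are hypothetical, state lives in a Map instead of Redis, and the window is tracked as an exact log of timestamps.

```typescript
const LIMIT = 3;
const WINDOW_MS = 60 * 60 * 1000;

// Signed-in users are keyed by user id (shared across devices), anonymous ones by IP.
function rateLimitKey(userId: string | null, ip: string): string {
  return userId !== null ? "user:" + userId : "ip:" + ip;
}

class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number = LIMIT, windowMs: number = WINDOW_MS) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the hit if `key` has capacity left in the window.
  allow(key: string, nowMs: number): boolean {
    const previous = this.hits.get(key) ?? [];
    const recent = previous.filter((t) => nowMs - t < this.windowMs);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(nowMs);
    this.hits.set(key, recent);
    return allowed;
  }
}
```

Under this model a fourth import inside the hour is refused, and capacity returns one slot at a time as the oldest hits age out of the window.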

Cold-start latency. Most of the wall time is the multi-image vision call (3-6 seconds for our AI with 5 images). Frame extraction is 1-2 seconds for a sub-minute video on the Vercel function. Metadata fetch is sub-second when the platform's downloader is healthy. Total typical latency is 4-10 seconds end-to-end; the rendering on the client adds maybe another second for the photo-strip layout. Most of the variability lives in the vision call; longer videos with more varied scenes also tend to produce longer itinerary output, which slightly extends the call.

What we don't do that we could. We don't transcribe the audio: most TikTok travel content has either no narration or generic music. Transcribing would add a transcription model call, and the benefit on travel-genre content is small enough that we haven't pulled the trigger. We don't yet expose 'edit the itinerary' on the result page. For now the result is read-only output, and editing routes back through the normal Plan flow with the destination prefilled. The two surfaces share the same itinerary schema, so unification is a small UI project and a planned upgrade.

Why the keyframes approach matters beyond accuracy. A user pasting a video has one overriding expectation: the result page should prove the model actually watched the video, not just read the caption. The photo strip in the result (the cover plus the five extracted keyframes, in a 6-up grid) is the audit trail that makes the rest of the page credible. Without it, every wrong destination feels like the AI hallucinated. With it, you can see exactly what the model saw, and if the destination is wrong you can usually tell why from the strip. That moves the entire feature from 'magical and untrustworthy' to 'transparent and useful.'

If you want to try it: paste any public TikTok URL or any YouTube URL under 10 minutes into /plan-from-tiktok. The five-section result page (hero, photo strip, vibes, booking row, 7-day itinerary) renders in 4-10 seconds.

Frequently Asked Questions

Why not transcribe the audio?

Most TikTok travel content has either no narration or generic music. Transcribing would add another transcription model call and the marginal benefit on travel-genre content is small. We may add it for vlog-genre YouTube videos where narration carries real information, but it's not on the immediate roadmap.

Why a 10-minute duration cap?

Longer videos balloon the mp4 download and ffmpeg work past our 60-second function budget, and 'trip inspiration' in practice fits inside a Short, a Reel, or a clip from a vlog rather than a 45-minute travel documentary. The cap doesn't really limit which videos people paste (almost everything users try is under 90 seconds) but it bounds the worst-case server cost.

What happens if TikWM is down?

TikTok imports fail with a clear error message. The retry layer covers transients (a single 502 followed by a 200), but sustained TikWM downtime breaks every TikTok import until they recover. The escape valve is RapidAPI's TikTok endpoint, which is a paid drop-in replacement (~$0.001/call). We haven't switched yet because TikWM's uptime has been good enough, but the failover is plumbed.

Why is the rate limit 3 per hour?

Each import costs about $0.005 in vision tokens plus a few cents of bandwidth. Three per hour caps abuse at a few dollars a day before we'd notice, while staying loose enough not to throttle legitimate use. The 'I have five candidate videos for next month's trip' workflow takes three imports in the first hour and carries the other two over to the next; the 'someone is scripting our endpoint' workflow hits the cap immediately.

Can the result page itinerary be edited?

Not on the result page itself. That's a planned upgrade. For now, if you want a fully editable itinerary, the destination from the result page can feed the normal Plan flow, which gives you the day-by-day storyboard with swap-this-activity and Viator integration. The two surfaces share the same itinerary schema, so they'll be unified in a future release.