
Really interesting direction. The node-based canvas feels like a more scalable abstraction for video automation than the usual chat-only interface. I’m curious how you’re handling long-form content where temporal context matters (e.g., emotional shifts, pacing, narrative cues).

Multimodal models are good at frame-level recognition, but editing requires understanding relationships between scenes. Have you found any methods that work reliably there?



Side note, for context, since the responses to the OP seem to be primarily from video hobbyists:

Node-based workflows are typical in NLE software; see the Fusion (compositing) and Color (grading) pages in DaVinci Resolve. Industry folks will take to this node-based canvas with ease.

Great question @danishSuri1994


hey, thanks for the comment!

we've actually found that multimodal models are surprisingly good at maintaining temporal context as well

that being said, we also do a bunch of additional processing with more traditional CV / audio analysis to extract this information (both frame-level and temporal) as part of our video understanding

for example, mean-motion analysis shows how subjects move over time, which helps determine where the important moments are happening in the video and ultimately leads to better placement of edits.
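
to make that concrete, here's a minimal sketch of what a frame-differencing flavor of mean-motion analysis could look like with OpenCV. this is my own illustration, not their actual pipeline: the function names, smoothing window, and threshold are all assumptions on my part.

    import cv2
    import numpy as np

    def mean_motion_curve(video_path):
        """per-frame mean absolute pixel difference: a crude proxy for motion"""
        cap = cv2.VideoCapture(video_path)
        ok, prev = cap.read()
        if not ok:
            return []
        prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        curve = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # mean absolute difference between consecutive frames
            # rises when subjects (or the camera) move
            curve.append(float(np.mean(cv2.absdiff(gray, prev))))
            prev = gray
        cap.release()
        return curve

    def motion_peaks(curve, window=30):
        """frames where smoothed motion exceeds mean + 1 std:
        candidate moments for placing edits (window and threshold
        are illustrative, not tuned values)"""
        smoothed = np.convolve(curve, np.ones(window) / window, mode="same")
        threshold = smoothed.mean() + smoothed.std()
        return [i for i, v in enumerate(smoothed) if v > threshold]

peaks in that curve are a rough "something is happening here" signal; a real pipeline would presumably combine it with audio energy, scene-cut detection, and the model's own understanding before committing to an edit point.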



