Speaker 1: This is an interview about Simon Says Assemble. The automatic transcription tool lets you now simply cut video by editing text.
Speaker 2: This is a CineD Gear News video.
Speaker 1: Welcome everybody to the CineD Gear News video. Today I am on a Zoom call with Shamir from Simon Says. How are you?
Speaker 2: Hi, I'm doing great. Thanks for having me.
Speaker 1: I'm catching you in the US, in California to be precise, and you guys just released something new. But before we talk about the new stuff, I think maybe you can just give a little introduction about yourself and Simon Says, the initial product, I would say.
Speaker 2: Yeah, absolutely. Well, Simon Says initially started as just doing transcription, transcription just in English, because I had a project that needed transcription work and I couldn't believe that I needed to do this either manually or with human transcriptionists and it was going to take a lot of time and money. And my original career was in documentary filmmaking before I moved into being a video tech entrepreneur. So I had this project and I thought, you know, we have Siri on our phone. Why can't we use speech recognition for this? And so that was the origin of Simon Says four years ago. Since then, we've added 100 languages, both in transcription and translation. So oftentimes video editors or post-production professionals are using us at the beginning of an edit when they have their rushes, their dailies, their interviews that they need transcribed or at the end when they need a subtitling captioning of their video edit. And so that was where Simon Says originated. And over this past week, what we've launched is Simon Says Assemble.
Speaker 1: Yeah, I mean, your technology is amazing. I mean, I think we've seen some similar ideas over the last few years, but I think, you know, like maybe the last 10, I remember seeing something like this 10, 15 years ago, but I never actually saw a working product from that because I myself am also a documentary filmmaker. I mean, when I remember my first big documentary, my master thesis film was like a one hour documentary, which was completely bootstrapped. And I think I had like 50 hours of interviews that I literally transcribed over a summer. And then, you know, like something like Simon Says would have been an enormous time saver. So why do you think there hasn't been anything like your product before? Because obviously, I mean, if it works, it adds so much usefulness to the whole process of especially documentary filmmaking.
Speaker 2: This is the bane of existence for anyone working in docs, especially indie docs. And you'd be surprised how just beyond just the indie community, how many other of those working in production have been so receptive and are now using Simon Says in their workflow. We have like huge reality TV programs on it. Even fiction, you know, they're using Simon Says for fiction content with Avid. And so it has grown a lot. The biggest challenge has been the technology. So AI's accuracy has been the issue up until over the last couple of years. So the writing has been on the wall that speech recognition was going to be a thing, and a thing specifically in transcription. We've seen speech recognition apply to your phone, as Siri or on the Android phone. So, you know, there's been a general awareness that speech recognition as a technology was going to have a number of use cases. But the transcription context, the accuracy hasn't been usable until, you know, in our view, like four or five years ago. And so that's been the biggest obstacle up until now.
Speaker 1: Well, actually, the funny thing is, you know, like with YouTube videos, they have this auto transcription for the subtitles. But I mean, Google is your neighbor in Mountain View, I guess. And they, for some reason, you know, like I just saw on our own YouTube videos, that transcription very often isn't really good. Why is that?
Speaker 2: Yeah, there's a number of reasons, actually. So, you know, over the last few years, the transcription accuracy, the speech recognition accuracy has become incredible. Like we have users telling us that, you know, this is perfect or close to perfection. And, you know, I no longer, you know, think that the accuracy is the challenge for us now. It's like, how do you now use this within the right workflows to help people, you know, move their story along faster? In terms of where there are limitations on accuracy, and in the example you gave, you know, if there is lots of music, if there is, you know, sound effects, those types of additional sounds reduce the accuracy of the speech engine. The speech engine has been trained on voice. And so for us, you know, this is the perfect environment, video production professionals, where they have high quality audio, low or no background noise, one person speaking at a time, oftentimes very clearly. And so for us, the accuracy is incredible. And for our users, it's incredible to get that accuracy. And only when transcription is accurate can it be usable. So part of the challenge until a couple of years ago was that, sure, you were getting, you know, some level of accuracy, but was it good enough to replace just like scrubbing through the audio, finding those sound bites? Or was it, basically, that you had to go through it again, so you might as well do it manually? And so that's been the biggest change: the accuracy hit a tipping point, where now this was significantly better, faster, and more cost effective than what you could do yourself. And so, you know, for this environment in video productions, you know, they love the accuracy of Simon Says and the environment in which they're recording is also very well suited to that.
Speaker 1: Interesting. Yeah, absolutely. So, I mean, your initial product was introduced four years ago, but now you just announced Simon Says Assemble. Can you just run us through what that is?
Speaker 2: Yeah, absolutely. So if you're at the beginning of post-production and you need to transcribe your interviews, you're transcribing your dailies and your recordings, no one actually really just wanted a transcript. The transcript was a stepping stone, a stepping stone to be able to find the key sound bites, to be able to order those sound bites in like a paper edit or an assembly edit or a rough cut. And so now that we have accurate AI and our AI is giving a reference point to the video per word, now you can just highlight those key sound bites, a sentence here, a sentence there, a sentence in another transcript of another interviewee. And so you can find these like key moments and then be able to drag and drop them in the order in which you like, creating the spine or the foundation of your story. And so it's really that transcription has been taken to the next level is that, yes, now you have this accurate transcript, but now you can build your story from it.
Speaker 1: So it's pretty much what I call a paper edit, right? I mean, like there are people who print out their transcriptions and they just cut out the paragraphs that they need and they just play puzzle with them. And you've kind of digitized it and automated it, amazing.
Speaker 2: I mean, most editors know that process of looking at the Word doc or looking at the paper, looking at the file name, the in point, the out point. It is so tedious. It is so frustrating. And with Simon Says Assemble, you no longer have to do that, right? You're just highlighting those sound bites. You're ordering them as you like. And then you hit, once everyone's aligned on the story, now you hit export and you get an XML straight into your video editing application. So Adobe Premiere Pro, Final Cut Pro, Avid or Resolve. And the timeline now seamlessly recreates and relinks to that original media at the correct in and out point. So you're no longer having to flip between NLE, Word doc, back and forth, in, out points. Yeah, that tedious, frustrating. All of us know we spent a lot of time and any of those of us in docs know that process very well.
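The highlight-and-reorder workflow described above can be sketched in a few lines of Python. This is only an illustration, assuming each transcribed word carries a start and end time in seconds. The names `Word`, `Clip`, `highlight`, and `assemble` are hypothetical and do not represent Simon Says' actual API or its XML export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One transcribed word with its position in the source media (seconds)."""
    text: str
    start: float
    end: float


@dataclass(frozen=True)
class Clip:
    """A highlighted sound bite: a source file plus in and out points."""
    source: str
    in_point: float
    out_point: float
    text: str


def highlight(source, words, first, last):
    """Turn an inclusive range of word indices into a clip.

    The in point is the start of the first word and the out point is
    the end of the last word, so no manual timecode lookup is needed.
    """
    if first < 0 or last >= len(words) or first > last:
        raise ValueError("invalid word range")
    selected = words[first:last + 1]
    return Clip(
        source=source,
        in_point=selected[0].start,
        out_point=selected[-1].end,
        text=" ".join(w.text for w in selected),
    )


def assemble(clips):
    """Lay clips end to end and return (timeline_start, clip) pairs."""
    timeline = []
    cursor = 0.0
    for clip in clips:
        timeline.append((cursor, clip))
        cursor += clip.out_point - clip.in_point
    return timeline


# Hypothetical transcripts from two interviews (made-up timings).
ceo_words = [
    Word("We", 0.0, 0.5), Word("launched", 0.5, 1.0),
    Word("Assemble", 1.0, 2.0), Word("today", 2.0, 2.5),
]
customer_words = [
    Word("It", 10.0, 10.25), Word("saves", 10.25, 11.0),
    Word("us", 11.0, 11.25), Word("days", 11.25, 12.0),
]

quote_a = highlight("ceo_interview.mov", ceo_words, 1, 2)
quote_b = highlight("customer_interview.mov", customer_words, 0, 3)

# Reordering the story is just reordering the list of clips.
timeline = assemble([quote_b, quote_a])
```

An exporter would then write each `(timeline_start, clip)` pair into whatever interchange format the editing application reads, so the timeline relinks to the original media at the stored in and out points.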
Speaker 1: Well, just listening to you, I mean, I truly get excited about trying it out because I just, you know, like I can't tell you how many hours I've wasted on this process. And that's the main reason why I don't like to do assembly edits because the initial assembly is exactly what that is, you know, like just creating the spine. I enjoy editing as soon as, you know, the basics are on the timeline, but this initial thing is horrible for me.
Speaker 2: And even the word, you know, the word assembly edit sounds so mechanical, you know, maybe we need to rebrand this now that we have Simon Says Assemble, but it's so mechanical and like it doesn't need to be, you know, and for us, it was really that, you know, that kind of core problem that we were trying to tackle here, but it's also like with a number of layers on top, right? Is that, you know, with the work that I've done and doing, say, a web video or a conference video or a corporate video, you know, where you have teammates and clients, every one of us knows that process where you first make that first cut, export, send it out for feedback, get some notes: actually quote number two, that's really sensitive. Can you use the CEO's other quote and quote number five as well, maybe chop it earlier. What if you can use the customer's other quote for this product launch video? And so you're going through this like endless loop of export, get feedback, implement those notes, export. And now the whole concept of that feedback loop is moot. You don't have that anymore because this is like a Google Doc. Those edits are live and real time and immediate. So, for example, you can create that spine of your story and I could share with you and you could say, hey, you know what, instead of quote number two, the CEO's quote number two being sensitive, you could just find an alternative quote, grab it from the transcript and move it over to the timeline on the web. And now you've done it yourself and you already have that edit there live and everyone can comment on it, collaborate on it. And you know that increasingly we are collaborating in a distributed manner. And so this is really, you know, that's the beauty of the web and the permissionless web is that Simon Says Assemble lives in that. You can invite anyone to it, be able to give them view access, comment access or edit access. 
And once you've locked on the story, again, go complete it in your wonderful NLE. We're not trying to replace the video editing application. What we're trying to do is supplement to really align on that kind of core, frustrating part of the process, which is locking story, doing that assembly edit.
Speaker 1: Amazing. Actually, we should try the tool with the interview we're doing right now just to see how well it performs. One question that came to my mind, you said you're supporting 100 languages. Do you think the accuracy is the same for all those languages? Or, you know, like, will English be better than some other languages?
Speaker 2: Absolutely. So the nature of AI is it is a march to getting better and better and better. The more training material it has, the more it learns, the higher the accuracy. Up until now, the dominant languages where there has been the largest body of training material have been Western European languages, including English, French, Spanish, German, Italian, and also Asian languages, Chinese, Japanese. And so those, I would say, are the ones with the highest level of accuracy. But I also think about, you know, when I used to do shoots in Afghanistan and I was using a fixer, interviewing someone in their local language, and I didn't understand other than what the fixer was translating to me in a kind of summary. And so I would have loved to be able to do this where you could first transcribe in Dari, then be able to translate into English. And even if it wasn't as perfect as English is in the first transcription step, you know, it would have been super, super helpful because currently you have to, you know, find someone to translate it or transcribe it, then to translate it into your language. And that multi-day, multi-week process, which is also very expensive, has now been brought down. But yes, you know, accuracy in Dari, in Urdu, in the other global languages continues to increase. So that's where this is all headed, that it will ultimately be very, very close to perfection one day.
Speaker 1: Great. Do you need to tell the system what language it's watching or will it recognize even if there are mixed languages?
Speaker 2: We one day want to have auto-detect of languages and be able to do, you know, have a video recording with, you know, say five or six interviewees speaking their own different language and be able to transcribe all of them. But yes, right now, today you have to set the language, a singular language and tell the system which one it is.
Speaker 1: Great. And so how does it work? So I have my video files. You said you upload them to the web and then it's automatically transcribed. So what kind of, do we need to convert the footage into a special format or does it understand different formats?
Speaker 2: Yeah. So you can actually ingest the footage from a number of different places. We have a Final Cut Pro X extension. We have an Adobe Premiere Pro extension. We have a DaVinci Resolve for Mac extension. We have a Simon Says Mac application. And so you don't just need to use the web. You can live wherever your existing workflow exists, be able to ingest that footage natively from within your NLE and it creates a proxy and uploads it to Simon Says. From there, it transcribes it. You set the language, it transcribes it. You can also transcribe it locally again in your NLE. And now you have the transcript. Once you have that transcribed media, you then go into assembly mode and you bring in that footage. So think about a video edit project. You want to bring in the relevant media. You don't want all your video files from your local computer. You just want the files specific to that project. So that's the same in Simon Says Assemble. You have a list of all transcribed media that you did on Simon Says. But you just bring in the ones you want. And then now they exist in your bin. And then you load them up. You see the transcript that you did for this product launch with your customer. And you want some key customer quotes. You just scan the transcript. You find those soundbites and then highlight them. And they go onto the timeline side where you order them.
Speaker 1: Let's talk about pricing and availability. I mean, I think you guys are not selling. You can't just buy Simon Says Assemble like that. It's basically a per minute cost, right?
Speaker 2: Simon Says Assemble is actually free to use. Anyone can use Simon Says Assemble. And so what we're excited by is like, you know, you can just jump into a project. You can invite other people to a project and they can start editing. The cost of Simon Says works like this: Assemble is a free tool, and what you have already paid for is the transcription part. For the transcription, you can use pay-as-you-go. There is no subscription fee. And you just pay based on the duration of the audio and video files that you upload. And we also have a number of subscription packages which include hours and a reduced rate to transcribe. But Simon Says Assemble is free to use for those who are on the pro tier. There's no restriction on the duration of the timeline edit.
Speaker 1: Okay, great. So it's an addition to your existing product in a way.
Speaker 2: Exactly, exactly. So we have tons of customers who are already transcribing, transcribing their media, transcribing their interviews. And it's really born out of that, that, you know, we were asking them, like, how does your workflow go? You know, what do you do from here? You know, they're already highlighting. They could already do that in Simon Says. They're already inviting their teams to comment on key parts of the interview. And we said, hey, you know, we have these reference points per word. Let's build Simon Says Assemble from here. But Simon Says Assemble is a free add-on to the Simon Says product.
Speaker 1: Okay, cool. But the basic product has a minute price, right? Because just for people who haven't worked with Simon Says before, and I think a lot of our audience will hear about it for the first time. Maybe you can just explain, what if I, you know, like, what if I shot a documentary with, I don't know, two or three hours of interviews, and then I want to use Simon Says? Or what is that going to cost if I want everything transcribed?
Speaker 2: Yeah, absolutely. So Simon Says is free to trial. Every new user gets 15 minutes free. If you had a two-hour documentary, you can jump on the starter plan. It's $20, includes two hours. On pay-as-you-go, it's $15 per hour, rounded to the closest minute. So it's pro rata. And so the subscription plans really give you included hours and a reduction in transcription costs. So if the documentary, you actually shot some additional footage and added 10 more hours, well, on the starter plan, the additional transcription cost is 50% less than the pay-as-you-go per minute rate.
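The pricing arithmetic above can be worked through with a short Python sketch. The figures ($15 per hour pay-as-you-go, a $20 starter plan with two included hours, and overage at 50% less than the pay-as-you-go per-minute rate) come from the interview. Everything else is an assumption made for illustration: that durations round half up to whole minutes, that overage is billed per minute beyond the included hours, and that the 15 free trial minutes and any taxes are ignored. The function names are hypothetical.

```python
PAYG_RATE_PER_MINUTE = 15.00 / 60      # $15 per hour, as quoted
STARTER_PRICE = 20.00                  # starter plan price, as quoted
STARTER_INCLUDED_MINUTES = 2 * 60      # starter plan includes two hours
STARTER_OVERAGE_DISCOUNT = 0.50        # overage is 50% less than pay-as-you-go


def billable_minutes(duration_seconds):
    """Round a duration in whole seconds to the closest minute (half rounds up)."""
    return (duration_seconds + 30) // 60


def payg_cost(duration_seconds):
    """Cost in dollars when paying per minute with no plan."""
    return billable_minutes(duration_seconds) * PAYG_RATE_PER_MINUTE


def starter_cost(duration_seconds):
    """Cost in dollars on the starter plan: flat price plus discounted overage."""
    extra = max(0, billable_minutes(duration_seconds) - STARTER_INCLUDED_MINUTES)
    overage_rate = PAYG_RATE_PER_MINUTE * (1 - STARTER_OVERAGE_DISCOUNT)
    return STARTER_PRICE + extra * overage_rate


two_hours = 2 * 3600
twelve_hours = 12 * 3600  # the two hours plus ten additional hours

print(payg_cost(two_hours), starter_cost(two_hours))        # 30.0 20.0
print(payg_cost(twelve_hours), starter_cost(twelve_hours))  # 180.0 95.0
```

Under these assumptions, two hours of interviews cost $30 on pay-as-you-go versus $20 on the starter plan, and twelve hours cost $180 versus $95.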
Speaker 1: Okay, great. Well, I'm really excited to try it out and see how it works. And if it works as well, as you say, I'm really looking forward to it. Thank you for your time, Shamir. And yeah, have a great day in California.
Speaker 2: Thank you for having me. It's been a pleasure.
Speaker 1: Of course, of course, anytime. And if there's a new version, please let us know and we'll check it out as well.
Speaker 2: Yeah, absolutely. And the kind of request goes to you. As you're using this, Simon Says has been, the way we've got, the way Simon Says has got here over the past four years has been the wonderful, candid feedback from our users. I had some ideas as to what I wanted to see in the product, but what we've consistently added, the frame accuracy, the timecode-based AI, that has really been the suggestions of users. So as you're using this, I'd love to hear how you want, where you want Assemble to go, what features and suggestions you have for Assemble. So don't hesitate to reach out to me.
Speaker 1: Absolutely. And I think from the feedback you already got, the timecode-based, for example, is absolutely essential. So if that's there, you're ready. It's critical. All right. Thanks also to our audience for watching and please stay tuned to CineD for more news, like more gear news videos like this one and very interesting interviews with people from our industry, manufacturers and filmmakers and directors of photography alike. And please don't forget to subscribe to our YouTube channel. Thanks.