Welcome back to part two of our miniseries looking at how to use AssemblyAI to generate a transcript from an audio file. The key bit we're doing here is separating out the different speakers.
Recap of AssemblyAI API in part 1
So in this part, I'm going to be showing how to take the JSON response that we generated in part one, which separates out the conversation or the speech into different utterances. And we're going to be running that through a backend workflow to iterate through all of the utterances and save them to our database, so that in the end we get a repeating group of our transcript broken up by speaker. So I'm going to click Save. This is one of the transcripts that I've generated, and I'm going to go into Backend Workflows.
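As a reminder of what we're working with, the processed transcript JSON includes an `utterances` array, where each utterance carries start and end timestamps (in milliseconds), the text, and a speaker label. A minimal sketch of that shape in Python (the field values here are illustrative, not from a real transcript):

```python
# Illustrative sketch of the relevant part of an AssemblyAI
# transcript response; the values are made up for demonstration.
response = {
    "id": "abc123",
    "status": "completed",
    "utterances": [
        {"start": 250, "end": 2150, "speaker": "A",
         "text": "Hi, thanks for joining the call."},
        {"start": 2300, "end": 4100, "speaker": "B",
         "text": "Happy to be here."},
    ],
}

# Each utterance has the four fields we want to save.
for u in response["utterances"]:
    print(u["speaker"], u["start"], u["end"], u["text"])
```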
Use Backend Workflows
I'm going to create a workflow and call it Save Single Utterance. It doesn't need to be public. And then the key parts of the utterance that I want to save are going to be start, end — I believe these are timestamps — text, and speaker. So I'm going to say start, end, text and speaker. And so start and end are going to be numbers.
And then I'm going to create a data type for handling all this. I'm going to call it utterance. And I also need a data type for grouping them together. So I'm going to call that transcript. Okay, and then utterance.
Iterating through AssemblyAI Utterances
So utterance is going to have those fields. We're going to have start, end. Is it end or finish? End. Speaker and text. So that when this workflow runs, this backend workflow, I'm going to create a new thing, create an utterance. Let's add in all the fields and connect it up to the data that goes into the workflow. And I need to pass one more bit of data in, which is going to be my transcript, because I need a way of grouping it all together, assuming my app is going to be generating more than one transcript at a time. So transcript. And then I'm going to group them by adding in the transcript field. Okay, let's build some really simple UI just to get us going with this. So speaker labels. And to keep this video short, I'm not going to go through the upload process and the initial call to AssemblyAI; that's covered in previous videos.
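The two Bubble data types above can be mirrored in code to make the relationship clear: each Utterance holds start, end, speaker, and text, plus a reference to its parent Transcript for grouping. A rough sketch (the names match the video; this is not any Bubble export format):

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    """Groups the utterances belonging to one audio file."""
    transcript_id: str

@dataclass
class Utterance:
    """One speaker turn, as saved by the Save Single Utterance workflow."""
    start: int              # start timestamp in milliseconds
    end: int                # end timestamp in milliseconds
    speaker: str            # speaker label, e.g. "A" or "B"
    text: str
    transcript: Transcript  # the grouping field passed into the workflow

t = Transcript(transcript_id="abc123")
u = Utterance(start=250, end=2150, speaker="A",
              text="Hi, thanks for joining the call.", transcript=t)
```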
I'm simply going to put in an input, and this is going to be my transcript ID. I'm going to add in a button. Now, I'm designing using fixed layouts. I always say that's a big no-no compared to columns and rows, but I'm just designing it quickly; that's why I'm taking shortcuts here. And then we'll have a repeating group. And in fact, I'll put this into a group which would contain a transcript. The repeating group is going to be of utterances: do a search for utterance where transcript equals parent group's transcript. That's my way of displaying only the relevant utterances. Let's get rid of number of rows. And then in utterance, I'm going to say speaker. Let's change this into a row. We'll make this bold. Copy and paste it. We'll make this just set to 80. And then this is not bold, and this is going to be our text, taking up the remainder of the space. One of the reasons for saving the additional data, start and end, is I think that those are going to be useful for ordering. So we'll order just by the start timestamp, and then I'll get rid of the min height, and I'll adjust it just so that's really clear for us.
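The ordering idea is easy to see in code: because each utterance carries its start timestamp, sorting by start reproduces the conversation order, which is what the repeating group's sort by start does. A small sketch with illustrative sample data:

```python
# Utterances may be saved or retrieved out of order; sorting by the
# start timestamp restores the original conversation order.
utterances = [
    {"start": 2300, "speaker": "B", "text": "Happy to be here."},
    {"start": 250, "speaker": "A", "text": "Hi, thanks for joining."},
]

ordered = sorted(utterances, key=lambda u: u["start"])
for u in ordered:
    print(f"{u['speaker']}: {u['text']}")
```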
So let's call this generate Speaker — generate Text by Speaker. So I'm going to add in the workflow. And what this is going to do is do the retrieve action for AssemblyAI. So how do I put that in? Or what do I call it? Get processed transcript. And the ID is going to be from my input. And for the purpose of this demo, I'm just going to take the same ID and make it the initial content for the demo input. And then this is where the backend workflow comes into play. So I'm going to schedule API workflow on a list. And this is one way in Bubble — in fact, I think it's the most robust way in Bubble — to iterate through an unknown amount of data. We don't know how many utterances it's going to return. So we can't just say, do this utterance one, do this utterance two, and so on. We need to be able to iterate through X number of utterances. So the type of thing we're going to go through is text. Or is it? It's not. It is going to be utterance. And then the list to go through is the result of the call.
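Outside Bubble, this retrieve-then-iterate pattern looks like the sketch below: fetch the processed transcript by ID from AssemblyAI's GET /v2/transcript endpoint, then call a save function once per utterance, however many come back — the same job that Schedule API Workflow on a list does. The `save_all_utterances` helper and the API key placeholder are stand-ins for your own backend, not anything from the video:

```python
import json
import urllib.request

API_KEY = "YOUR_ASSEMBLYAI_API_KEY"  # placeholder, not a real key

def fetch_processed_transcript(transcript_id):
    """Rough equivalent of the retrieve action: GET /v2/transcript/{id}."""
    req = urllib.request.Request(
        f"https://api.assemblyai.com/v2/transcript/{transcript_id}",
        headers={"authorization": API_KEY},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def save_all_utterances(data, transcript_id, save=print):
    """Rough equivalent of Schedule API Workflow on a list: call `save`
    once per utterance, whatever number the response contains."""
    saved = 0
    for u in data.get("utterances") or []:
        save({"start": u["start"], "end": u["end"],
              "speaker": u["speaker"], "text": u["text"],
              "transcript": transcript_id})
        saved += 1
    return saved
```

Note the `or []` guard: as the video runs into below, the utterance list can come back empty, and the loop should simply do nothing in that case rather than fail.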
I might just need to clear this up. Save Single Utterance. Okay, I just had to take a moment there to work out what wasn't working. And I found that in my initialised call to fetch the transcript, my utterance list was empty for some reason. So by making sure that that is set to get processed transcript's utterance, it means that I can successfully fill in the type of thing as the get processed transcript utterance, and I'm getting the data, the list, from step one. So this means that the… it's all blue, it's not red, so the data format is accepted. I'm going to say run it right away, and then I can begin to fill in the data for each utterance that we're saving.
So it's this. When we get this, it's referring to the single one, and it's basically going to loop through them. And we could say text, speaker, and then transcript. I need to create a transcript: Data, Create a new thing. And I'm going to display this transcript in the group Transcript. That's just so that I can get the right data into my repeating group, with Result of step 2. In fact, I'm going to move this up to the top, so it's Result of step 1 now, and then transcript is the result of step 1. Okay, I think that I've now got every step in place. Let's test it.
So if this works, this process is going to retrieve an already processed transcript from AssemblyAI, and then it's going to use the backend workflow to loop through all of the utterances and save them to our database. Let's give it a go. I think I forgot to update a label: speaker, text. Okay, and because I was displaying data in the group, I just have to run it again. Okay, and there you have it.
So I've taken a lot of shortcuts there, but I'm trying to demonstrate the technical side of how you would use a backend workflow to loop through all of the utterances we get back. And now we have something that looks closer to a script, because we've got each utterance split by the speaker who says it.