Efficiently Managing Projects with Multiple Data Files in MRDCL
Learn how to handle projects with multiple data files and versions in MRDCL. Discover techniques for efficient data file management and running multiple scripts.
MRDCL Understanding Concepts Multiple data files and multiple versions of a project
Added on 09/30/2024

Speaker 1: Hi, and welcome to this video in the MRDCL Understanding Concepts series. In this video we're going to look at how you handle projects with multiple data files, or where you've got a project that needs several versions or several runs. You'll find the files for this video in the folder PRJ102.
So let's start by looking at how you can handle different data files, and we'll begin with some code. This is a fairly straightforward project with just some extra things in it to handle multiple runs, and we'll work through all of these in turn. I'm going to start with the different ways you can specify data files.
The most common way is the first one here. You can see all these data file options are commented out at the moment. If there's no mention of a data file and you go to run, then when it gets to this point it will ask you what the data file is, and you can choose one. So it becomes interactive. It then asks whether you want another data file or not. If you say yes, you can pick another one. If you say no, it goes ahead and runs. Sure enough, it's found six records. I've actually got three records in each of four files here, so it's found six records because I picked file one and file two.
So let's now go into this control stage and change some things. If I put c=file1.asc, as we have in pretty much all the other videos, that's just going to run with three records, because there are three records in each of the four files. It's produced my table on three records and only opened file one. Let's comment that out.
Now the next option is to say c=<>, that is, less than, greater than, then a comma. That's the same as having nothing: it will prompt you for a data file just like it did in the previous example. If I say yes and pick file two, then yes and pick file three, I should get nine records because, as I say, there are three in each file. And it's got nine records. Let's comment that out.
Now in this case, I've just listed all the files that I want, separated by semicolons. It's worth saying here that if there's no path, it will assume that the files are in the same directory as the run file. But there's nothing to stop you giving a full path. Something like this is perfectly fine, but you would need to repeat it for each of the files. So if file one is in that path and file two is as well, you need to repeat the path for each file so it knows where to look. You can use relative paths as well. If you've got a folder that's one level up, you can say ..\data. That goes one level up and then looks in a folder called data, and that's absolutely fine as well. So you can give it paths and relative paths.
In my example here, it's just going to read file one, file two, file three, and file four. You'll note they're separated by semicolons, and there's not a semicolon on the end. It's a comma on the end. So let's run that now, and you can see it's passed straight through with 12 records. So that's another way of handling multiple data files.
Now the other way of working is to have a file that holds a list of all the data files you want. Here I've said that my list of data files is in a file called list.txt. This can be any name. It doesn't have to be called list. It's best to put it in a txt file. It'll assume it's a text file even if you give it a different file extension.
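The path rules for the semicolon-separated list described above can be sketched in Python. This is only an illustration of the resolution logic as the speaker describes it, not MRDCL syntax or MRDCL's actual implementation; the function name `resolve_data_files` and the folder `C:\proj\run` are made up for the example.

```python
from pathlib import PureWindowsPath


def resolve_data_files(spec: str, run_dir: str) -> list[str]:
    """Split a semicolon-separated file list and resolve each entry.

    Assumption for illustration: entries without an absolute path are
    looked up relative to the folder that holds the run file.
    """
    resolved = []
    # The list ends with a comma rather than a semicolon, so drop it first.
    for entry in spec.rstrip(",").split(";"):
        entry = entry.strip()
        if not entry:
            continue
        path = PureWindowsPath(entry)
        if not path.is_absolute():
            # No drive and root given: start from the run file's folder.
            path = PureWindowsPath(run_dir) / path
        resolved.append(str(path))
    return resolved


files = resolve_data_files(
    r"file1.asc;..\data\file2.asc;J:\data\file3.asc,",
    run_dir=r"C:\proj\run",
)
for name in files:
    print(name)
```

Each entry is resolved on its own, which mirrors the point that a path has to be repeated for every file that needs one.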
I tend to always give the list file a .txt name, though. If we open up list.txt, it's a very plain file. It literally just lists the files one per line. You must list them one per line in this case, with no commas on the end. It can be as many lines as you like. Again, if you've got paths or relative paths, that's absolutely fine. So you can put a path in here, and it would work. If you stored all your data on a drive called J, say, you could just do that, and it would find all those data files. I'm going to quit that, and I'm just going to save and run again now that I've got that list. And again, it's gone through and found 12 records because it's found those four files in my list.
Now there's one other thing you can do here that's quite useful, and this takes it one step further. Here I've got my list of files in a text file called listsettings.txt. Let's take a look at that and see what it does for me. You'll see this file is a little bit different. It's got a heading at the top, and the fields are separated by commas. The first field in the heading should always be the text data file, followed by a comma. What I've got in the next two fields are some settings. I'm setting a variable called year and a variable called country. What I'm doing this for is that file1.asc represents the 2018 data for country 1, file2 is the 2017 data for country 1, file3 is the 2018 data for country 2, and file4 is the 2017 data for country 2. So it's a way of saying that all the data in a given file has a value associated with it. That saves me, for example, putting in a whole long list of serial numbers and saying, if it's one of the serial numbers in file1, make the year 2018 and the country 1, which would be a very inefficient way of working.
So this just tells MRDCL that I want to set variables called year and country, where the values are in the second and third fields of each line. Again, these can have paths, the same rules apply, and you can have as many files as you want in this list. It certainly works very well for a multi-year project or a tracking study where you're getting data every quarter, every month, whatever it is. And you could set more than two fields here. I've just set year and country. I could have set 10 fields if I wanted to. There's no limit to the number of fields you can set.
Let's look at how you actually pick that up in MRDCL. If I go further down the script, you remember those fields were called country and year. I need to define an integer, and it must be an integer, which is just dummy defined somewhere in my script. So I'm dummy defining a variable name that matches that field in the txt file, remembering that the first field in that txt file must be called data file. That is fixed. Then I'm defining the year variable here. Again, it's dummy defined, and it must be an integer. Having done that, I can define a variable from that field, which is effectively being assigned to each data file. So I now have a variable called vcountry that picks up the integer country from the data file settings, with values 1 and 2, where 1 represents the USA and 2 represents China. And the same with the year. I'm defining a variable called vyear, where the value 2017 is the 2017 data and 2018 is the 2018 data.
So those are some tricks you can do with data files, and that last one is quite useful. I'm going to leave that in there, because that looks good to me as a way of working. I'm now going to move on to handling the multiple runs that you might want for a project.
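To make the settings file concrete, here is a small Python sketch that reads a file laid out the way the speaker describes listsettings.txt and attaches the year and country values to each data file. It illustrates the idea only and is not how MRDCL reads the file. The exact spelling of the first heading field is an assumption (written here as `datafile`), as are the function name `read_list_settings` and the sample contents.

```python
import csv
import io

# Sample contents modelled on the description in the video. The spelling of
# the first heading field ("datafile") is an assumption for illustration.
LIST_SETTINGS = """\
datafile,year,country
file1.asc,2018,1
file2.asc,2017,1
file3.asc,2018,2
file4.asc,2017,2
"""

COUNTRY_LABELS = {1: "USA", 2: "China"}  # codes as given in the video


def read_list_settings(text: str) -> list[dict]:
    """Return one record per data file, with its integer settings attached."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    rows = []
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        fields = [field.strip() for field in fields]
        record = {"datafile": fields[0]}
        # Every field after the first is a setting and must be an integer.
        for name, value in zip(header[1:], fields[1:]):
            record[name] = int(value)
        rows.append(record)
    return rows


settings = read_list_settings(LIST_SETTINGS)
for row in settings:
    print(row["datafile"], row["year"], COUNTRY_LABELS[row["country"]])
```

Because the setting names come from the heading line, adding a third or tenth setting only means adding columns to the file.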
Now, some of the worst things I've seen done with an MRDCL script happen when people copy them. Let's say you've got 10 countries in a survey, and you want 10 similar scripts. Inevitably, the questionnaires are slightly different for each country. It's very tempting to copy and paste the STP file so you have a run for each country. But that really does lead to some huge inefficiencies, because in future any changes you make or any errors you find probably have to be corrected 10 times. So really, the target is never to copy an STP file. If you find yourself copying an STP file, you're probably not making the most efficient use of MRDCL. I've also seen it where people have an unweighted run and a weighted run. They just copy their STP file, put the weighting in the weighted run, and then run the two separately. Again, that's not a good idea.
At the very least, and I still don't advise this as a way of working, you should copy all the common script to an insert file. Let's just pick this bit of code here. Let's say this was the common script between an unweighted and a weighted run, or a multi-country run. We could take this code out and put it in an insert file, which I'm going to call common.stp. The .stp is not necessary. Then open another file called common.stp and hold that code in there. So imagine we've got a weighted and an unweighted run, or a 10-country run, or something like that. All the code that's the same is then in a file called common.stp. If anything were to change in the common code, I only need to change it once, because it gets picked up by each of the 10 runs. Having said that, I don't think that's the best way to operate. I think there are better ways to work, and there are two techniques you can use.
Now, you may have noticed there's a line of code in here where I've set and forced the country. In this example, we're going to work with two countries, USA and China. It would apply even more if there were a list of 10 countries rather than just two, where the benefits would be even greater. In the examples I've been running so far, I've manually set the country equal to one. That was just to make this work. Otherwise, it would have crashed and not worked.
I'm doing that for two reasons, and for some of this you'll need to use your imagination. The example in here is a very simple one. In the USA, I've got a region variable that's got three codes with texts of West, Central, and East. In China, though, I want a completely different set of codes. I want five codes: Shanghai, Beijing, North Mainland, South Mainland, and Hong Kong. Now, if I had both of those definitions of region in the same run, it would not compile, because it would say I'm redefining a variable with five codes when it originally had three. So I need a way of missing this code, or only processing this bit of code when I'm handling China, with a different piece of code picked up when I'm handling the USA.
As I say, I've forced this to country one so far. So everything is going to work as though it's the USA, because you can see down here I've set USA to one and China to two. So let's see that in action. When I run this, you can see I'm getting all 12 records. I get a title saying USA. We'll look at how that's achieved in a moment. But you can see the region that's getting picked up is West, Central, and East rather than Shanghai and Beijing. So how did I achieve that? Let's look at the script in a little more detail. It's set to country equals one by fixing it here.
And what it's saying here is skip 99 on country not equal to one. So when we're not processing country one, it's going to skip these three lines of code and jump to 99, because that's what the preprocessor does. It preprocesses before it does any compilation work. So if the index country is not set to one, it's going to jump that bit of code and then see whether it is country two, so it can pick up the region. If there were 10 countries, you might have 10 blocks of code like this. As country is one, it's going to process this block, because it is equal to one so it won't skip it. But it's going to skip this piece of code, because when country has a value of one, it's not equal to two. It just jumps over it.
Now, let's see what happens when we change this to two and run our tables. Sure enough, it has still got 12 records in there. I'm not filtering the data on country. It's just letting all 12 records through. You can see this heading changes to Shanghai, Beijing, North Mainland, South Mainland, and Hong Kong. So it's picking up the right text, although I appreciate it's actually using the same data in this example. And the heading there also said China.
Now, how can we make this a little more efficient? You could just get into the habit of remembering to change that line of code so that it jumps around the parts of the code that are not required. But let's look at two better ways of doing that. I'm going to take this line out and enable this line of code. What this run control parameter says is that I want the PP, the preprocessor, to set the index country interactively. I could put one in here, and that would force it to one, just as when I used a set command. So that's the same as what we had before. But if I have less than and greater than, it's going to prompt me for a value.
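The skip behaviour described above can be modelled in a few lines of Python. This is a toy model of "skip to a label unless the index has a given value", run before anything is compiled. It does not use MRDCL syntax, and the step names (`skip_if_not`, `define_region`, `label`) and the labels 98 and 99 are invented for the illustration.

```python
# A toy "script": each guarded block defines region for one country.
SCRIPT = [
    ("skip_if_not", 1, 98),   # jump to label 98 unless country == 1
    ("define_region", ["West", "Central", "East"]),
    ("label", 98),
    ("skip_if_not", 2, 99),   # jump to label 99 unless country == 2
    ("define_region", ["Shanghai", "Beijing", "North Mainland",
                       "South Mainland", "Hong Kong"]),
    ("label", 99),
]


def preprocess(script, country):
    """Return only the definitions that survive the skips for this country."""
    kept = []
    skipping_to = None
    for step in script:
        kind = step[0]
        if skipping_to is not None:
            # Ignore everything until the target label is reached.
            if kind == "label" and step[1] == skipping_to:
                skipping_to = None
            continue
        if kind == "skip_if_not":
            if country != step[1]:
                skipping_to = step[2]
        elif kind == "define_region":
            kept.append(step)
    return kept


def compile_region(definitions):
    """Mimic the compile step: region may only be defined once."""
    if len(definitions) != 1:
        raise ValueError(f"region defined {len(definitions)} times")
    return definitions[0][1]


print(compile_region(preprocess(SCRIPT, country=1)))
print(compile_region(preprocess(SCRIPT, country=2)))
```

Without the skips, both definitions would reach `compile_region` and it would fail, which is the redefinition error the speaker is avoiding.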
So let's see what happens now. Sure enough, it prompts me for a value. I would need to know that if I type one here, it's going to do the USA, and if I type two, it's going to pick up China. Let's do one, and we should go back to the USA. Sure enough, we've got the title USA, and we've gone back to West, Central, and East, rather than Beijing, Shanghai, and the other codes we had there. So that's another way of working, and that's quite efficient. However, it requires us to remember what the country numbers are. If we had 10 countries, we might get confused.
So what is better is that I have a data statement here which has the list of the countries in it. At the moment, I'm skipping around this ask command, so where it asks what country I want is being skipped. But let's have a look at the syntax. What this is saying is I want to ask, and the answer I get I want to store in the index country. So it's another interactive way of setting that. The list of countries comes from the data statement 5000, which you can see two lines above. If I take these two lines out, it's now going to prompt me for my country. This is a lot easier to work with and is, I think, the better way of working. I can now see at a glance what I want. If I pick China and click OK, it's going to run for me. Sure enough, my run has gone back to the five region codes for China. So this is the recommended way of working, using the ask command. The other ways are possible ways of working.
How else can I use this? Well, let's go down to the tables. Here you can see that the title was pulled in by picking out the value of country from data statement 5000. So when we picked USA, or one, as the setting, it pulled out the text USA. And when we picked two, it picked out the text for China. The banner automatically got updated because region was defined dependent on country.
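The difference between typing a bare number and choosing from a list can be sketched in Python as follows. This shows the general pattern only: one list of names drives both the prompt and the table title. It is not the MRDCL ask command, and `ask_country` and `title_for` are hypothetical helper names.

```python
COUNTRIES = ["USA", "China"]  # plays the role of the list of countries


def ask_country(countries, read=input, write=print):
    """Show the numbered list and return the chosen 1-based index."""
    for number, name in enumerate(countries, start=1):
        write(f"{number}. {name}")
    while True:
        answer = read("Country number: ").strip()
        if answer.isdecimal() and 1 <= int(answer) <= len(countries):
            return int(answer)
        write("Please enter one of the listed numbers.")


def title_for(countries, index):
    """Look up the title text from the same list used for the prompt."""
    return countries[index - 1]


if __name__ == "__main__":
    chosen = ask_country(COUNTRIES)
    print("Title:", title_for(COUNTRIES, chosen))
```

The `read` and `write` parameters exist only so the prompt can be exercised without a keyboard; in normal use the defaults are enough.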
And the sort of thing that you might expect in here is something like: unless country is country, go to finish. So this would be a way of filtering the data. We're saying we now want to filter the data on the variable country, picking up the relevant preprocessor setting for country, whether that was one or two. If that confuses you, you can give it a different name. You can call it country ID or something like that. But that's using the preprocessor setting that we picked up as the code for country, and that would work perfectly well.
All right. So those are some different ways of working. Now, the final thing you might want to think about when you're doing multiple runs is that if we just keep running this, we will keep outputting our tables to the same table file name, which means we've manually got to go and rename them or save them in some way. Here's a good way of working. If you have an O equals with a prompt, it will prompt you for the output file name. So when I come to run this, I go into the interactive session again. I can say I'm doing the USA run, so I want to save these tables as USA. Click OK. Click the USA. And now my tables, as you can see down here, have come out in USATables.txt. And there's my run that I produced before.
Now, if you've got a batch of tables you want to run for 10 countries, it is possible to set up a batch run that you just click on, and it will run all 10 runs for you, one after the other, and output them all to different names. You'll need to get an example file from us to help you with that. But it's something you can copy, using a batch file process, where you want multiple runs or where a series of runs feeds other runs. So that's taken you through the basics of handling data files and handling multiple runs.
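Putting the last two ideas together, the sketch below filters records on the chosen country and writes each run to its own output name. It is a Python illustration of the workflow only, not an MRDCL batch file. The record layout, the function name `run_all`, and the naming pattern (country name followed by `Tables.txt`) are assumptions based on the example in the video.

```python
COUNTRIES = ["USA", "China"]

# Twelve made-up records: three per data file, files 1-2 for country 1
# and files 3-4 for country 2, matching the settings described earlier.
RECORDS = [
    {"serial": serial, "country": 1 if serial <= 6 else 2}
    for serial in range(1, 13)
]


def run_all(records, countries):
    """Run once per country; return {output file name: records tabulated}."""
    outputs = {}
    for index, name in enumerate(countries, start=1):
        # Filter on the same index that selects the country-specific script.
        selected = [r for r in records if r["country"] == index]
        outputs[f"{name}Tables.txt"] = len(selected)
    return outputs


for filename, count in run_all(RECORDS, COUNTRIES).items():
    print(filename, count)
```

Deriving the output name from the country list means no run overwrites another, which is the problem the output prompt solves interactively.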
Just to stress the real point, you really shouldn't be copying runs to multiple STP files, because if you find a mistake or have changes to make, you're going to find yourself repeating the work several times. At the very least, copy any common code to insert files as we have here. But using the preprocessor and its prompts can make your run much more efficient and less prone to error. Thank you.
