Best approach to import large texts for searching

Hi everyone,

I’m building an app to replace (and enhance) the functionality in an existing WordPress website.

We have ~15,000 ‘summary’ reports, which I will be importing into Glide Big Tables.

We also have copies of the ‘Full Text’ that we used to create our summaries. Some ‘Full Texts’ are small, i.e. the original PDF was only a few pages, while others could be a copy of a 50+ page PDF, so the ‘Full Text’ is large. Note that we don’t have the original PDFs, only the extracted text.

What is the best way to import and store this data, so we can allow our Users to search the complete set of ‘full texts’ and get results?

Any advice would be great, thanks!

I would look into exporting the content of the summary reports from WordPress into a CSV file. My only concern would be that 15,000 summary reports feels like a big number, and 50+ pages for some of these reports also sounds big. I have no idea if CSV files can contain data like that, but I would look into it.
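On the question of whether a CSV can hold texts like that: the format itself has no size cap, and long multi-line fields are fine as long as they are quoted. Here is a minimal sketch in Python showing a round trip of one long text. The column names `title` and `full_text` and the sample text are made up for illustration, and this says nothing about what Glide accepts on import.

```python
import csv
import io

# Python's csv reader rejects fields longer than 131,072 characters by
# default, so raise the limit before reading long texts.
csv.field_size_limit(10_000_000)


def write_reports(rows, fieldnames):
    """Serialize rows to CSV text. The writer quotes any field that
    contains commas, double quotes, or newlines."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_reports(csv_text):
    """Parse CSV text back into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text, newline="")))


# Hypothetical sample: one report of 820,000 characters with commas,
# quotes, and line breaks in it.
long_text = 'Page one, with a comma.\nA "quoted" line.\n' * 20_000
sample = [{"title": "Report A", "full_text": long_text}]

restored = read_reports(write_reports(sample, ["title", "full_text"]))
print(restored[0]["full_text"] == long_text)
```

So the CSV format should not be the bottleneck; the limits to watch are the ones on the importing side.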

Before running the operation for all 15,000 reports, I would try to set up the feature you want with 2, 5, or 10. To me, building the feature is one thing; getting the 15,000 summary reports into Glide is another.

I would focus on building a mockup of the feature and testing it first.


As long as your full text doesn’t contain more than 1 million characters, I think you’re good to go.
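One way to check this before importing is to scan the export and list the rows whose text is over the limit. This is a rough sketch, assuming the limit is counted in characters and the text column is called `full_text` (both are assumptions, not confirmed Glide behaviour).

```python
import csv
import io

CELL_LIMIT = 1_000_000  # assumed per-cell character limit

# Allow the csv reader to load very long fields (the default cap is 131,072).
csv.field_size_limit(100_000_000)


def find_oversized(csv_file, text_column, limit=CELL_LIMIT):
    """Return (row_number, length) for every data row whose text exceeds
    the limit. Row numbers start at 1 and do not count the header."""
    oversized = []
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        length = len(row[text_column])
        if length > limit:
            oversized.append((row_number, length))
    return oversized


# Tiny in-memory demo with a limit of 10 characters.
demo = io.StringIO("title,full_text\nA,short\nB,a much longer text\n")
print(find_oversized(demo, "full_text", limit=10))  # [(2, 18)]
```

For a real export you would pass an open file instead, e.g. `open("reports.csv", newline="", encoding="utf-8")`, where the filename is a placeholder.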


Great, thanks very much @ThinhDinh and @nathanaelb for your replies. I will check out the ‘Trebuchet Method’ with a small number of examples first, and build it up.

Will let you know if it works out.


Why do you need a “Trebuchet Method” here? Wouldn’t each document have its own row?


@ThinhDinh sorry, I misunderstood the quote from Robert Petito that I should check out the Trebuchet Method (I didn’t know what it was so I thought maybe it’s a way to import such data)! But I see you were referring to the 1M character limit per cell.

From checking sample files today, I now know that some ‘Full Texts’ will exceed the 1M characters limit.

Do you know if the rest of the ‘Full text’ for that report will be truncated during the import, or if it will just ‘drop’ that record? Hopefully it’ll be truncated and imported.

Also, is there any way to do a batch import, or an alternative approach to importing that I could consider? Otherwise, I may have to split the CSV file and do ~20 separate imports to keep the file size under 5MB, which I believe is Glide’s max. import size for a CSV.

Many thanks for your help.

I don’t have a sample here that I can use, but I think that would be truncated.

You can use the approach above to split your file.
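If you would rather script the split yourself, here is a minimal Python sketch of one way to do it (not necessarily the same approach as the one referenced). It groups rows into chunks that stay under a byte budget and repeats the header in each chunk. The 5 MB figure is the limit mentioned in this thread, and the function names are just for illustration.

```python
import csv
import io

MAX_BYTES = 5 * 1024 * 1024  # assumed per-file import limit

# Allow the csv reader to load very long fields (the default cap is 131,072).
csv.field_size_limit(100_000_000)


def _encode_row(row):
    """Render one row as UTF-8 CSV bytes so its size can be measured."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerow(row)
    return buffer.getvalue().encode("utf-8")


def split_csv(rows, max_bytes=MAX_BYTES):
    """Group rows into chunks of CSV bytes that fit within max_bytes.

    `rows` is an iterable whose first item is the header row; the header
    is repeated at the top of every chunk. A single row that is larger
    than the budget on its own still gets a chunk to itself.
    """
    rows = iter(rows)
    header = _encode_row(next(rows))
    chunks, current, size = [], [header], len(header)
    for row in rows:
        encoded = _encode_row(row)
        # Start a new chunk when adding this row would exceed the budget.
        if len(current) > 1 and size + len(encoded) > max_bytes:
            chunks.append(b"".join(current))
            current, size = [header], len(header)
        current.append(encoded)
        size += len(encoded)
    if len(current) > 1:
        chunks.append(b"".join(current))
    return chunks


# Small demo: three 44-byte rows with a 100-byte budget.
demo_rows = [["title", "full_text"], ["A", "x" * 40], ["B", "y" * 40], ["C", "z" * 40]]
demo_chunks = split_csv(demo_rows, max_bytes=100)
print(len(demo_chunks), [len(chunk) for chunk in demo_chunks])
```

To use it on a real file, pass `csv.reader(open("reports.csv", newline="", encoding="utf-8"))` as `rows` and write each chunk out in binary mode, e.g. `open(f"part_{i}.csv", "wb")`. Both filenames are placeholders.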


Amazing, so helpful, thank you @ThinhDinh.

It looks like after the split, I will still need to import each of the ~20 files manually, which is fine. I was wondering whether there might be a trick for that too, such as adding the files to a queue so they import one after the other.


That would require interacting with the API. I have only imported one file at a time and haven’t handled data that large before.
