perl - What are some options for keeping track of temporary results and re-using them after a restart, in case the program dies while running?


(Suggestions for improving the title of this question are welcome.)

A Perl script uses the web APIs of various sites (Tumblr, Reddit, etc.) to fetch a user's "liked" posts, then downloads a portion of each post (for example, an image that the post links to).

Right now, I have a JSON-encoded file that tracks the posts that have already been fetched (for Tumblr, it only records the total number of likes; for Reddit, it records the ID of the last post that was fetched), so that the script can pick up the next "new" liked items. This means that after the program finishes fetching a new batch of links, the new "stopping point" is written to the JSON file.

However, if the program dies for some reason (or is interrupted with Ctrl+C, say), that progress is never recorded, because progress is only recorded at the end of a "fetch". So the next time the program runs, it looks at the tracking file, finds the last recorded stopping point (the last time it successfully fetched and recorded progress), resumes from there, and re-downloads duplicates up to the point where it died last time.

My question is: what is the best way (easiest, most efficient, take your pick; I am open to options here) to record progress with each object as it is stored, so that if the program dies for some reason, it always knows exactly where to pick up where it left off? Adapting the current method (literally print-ing out the tracking file at the end of each batch) to run after every single object is definitely not the best solution, because it would be quite inefficient.
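For reference, here is roughly what the current approach would look like if it ran after every single object instead of at the end of the batch (simplified; %progress, @new_posts, and download_post are placeholders, not my actual code):

  use JSON::PP qw(encode_json);

  # Rewrite the whole tracking file from scratch.
  sub write_tracking_file {
      my ($path, $progress) = @_;
      open my $fh, '>', $path or die "Cannot write $path: $!";
      print $fh encode_json($progress);
      close $fh;
  }

  for my $post (@new_posts) {
      download_post($post);                  # hypothetical: fetch the image/content
      $progress{tumblr}{count}++;            # advance the stopping point
      write_tracking_file('progress.json', \%progress);   # re-serialize the file every time
  }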

Edit for clarity

Let me make clear that the file used to track downloaded posts is not large, and does not grow appreciably with each "fetch" operation. There is only one element per API (Tumblr, etc.): either the total number of likes for the account (in other words, the number we have already downloaded, so we query the API for the current total, subtract the number in the file, and know how many new items to fetch), or the ID of the last item fetched (Reddit uses this, so we can ask the API for everything after that item and get only the new stuff).
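To make that concrete, the tracking logic amounts to something like this (simplified; the key names and the tumblr_api_total / reddit_posts_after helpers are placeholders, not my actual code):

  use JSON::PP qw(decode_json);

  # Load the small tracking file: one scalar per API.
  open my $fh, '<', 'progress.json' or die "Cannot read progress.json: $!";
  my $tracking = decode_json(do { local $/; <$fh> });
  close $fh;

  # Tumblr-style: only the count of likes already downloaded is stored.
  my $new_items = tumblr_api_total() - $tracking->{tumblr}{downloaded_total};

  # Reddit-style: only the ID of the last fetched item is stored.
  my @new_posts = reddit_posts_after($tracking->{reddit}{last_id});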

My problem is not a growing list of fetched posts, but writing the tracking file every time a single post is downloaded (and a single run can download thousands of posts).

I would just use a hash, tied to an NDBM file so that it persists on disk.

When you start a new batch of URLs, you remove the NDBM file.
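For example (the exact files NDBM creates vary by platform, so treat the pattern as a sketch):

  # NDBM typically stores its data in 'visited.dir'/'visited.pag'
  # (or 'visited.db' on some systems), so clear those out first.
  unlink glob 'visited.*';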

Then, in your code, at the beginning of the program, you do:

  use NDBM_File; use Fcntl;
  tie(%visited, 'NDBM_File', 'visited', O_RDWR|O_CREAT, 0666);

(Don't worry about O_CREAT; the file will remain intact unless you also pass O_TRUNC.)

Suppose your main loop looks like this:

  while ($id = <INFILE>) {
      my $url = id_to_url($id);
      my $result = fetch($url);
      save_results($url, $result);
  }

You change it to:

  while ($id = <INFILE>) {
      my $url = id_to_url($id);
      my $result;
      if (exists $visited{$url}) {
          $result = $visited{$url};
      } else {
          $result = fetch($url);
          $visited{$url} = $result;
      }
      save_results($url, $result);
  }

Whenever you fetch a new URL, you write the result into the NDBM file, and whenever you restart your program, the results that have already been fetched will be in the NDBM file and won't be fetched again.

This assumes that $result is a scalar; otherwise you won't be able to store and retrieve it this way. But since you're producing JSON anyway, the "partial JSON" for each URL is probably what you want to store.
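For instance, a sketch using the core JSON::PP module; the only real requirement is that whatever goes into the tied hash is a plain string:

  use JSON::PP qw(encode_json decode_json);

  # NDBM values are plain strings, so serialize the result before storing...
  $visited{$url} = encode_json($result);

  # ...and deserialize it after reading it back (e.g. after a restart).
  my $restored = decode_json($visited{$url});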

