python - Efficient way to store millions of arrays, and perform IN check


There are approximately 3 million arrays - Python lists or tuples (it does not really make a difference). Each array looks like this:

  ['string1', 'string2', 'string3', ...]  # 10,000 elements in total

These arrays should be stored in some type of key-value storage.

So there are 3 million keys, and each key maps to a 10,000-element array.

Lists, tuples, or anything else custom - it really does not matter. What matters is that the arrays contain strings - UTF-8 or Unicode strings, each 5 to 50 characters long. There are about 3 million possible unique strings. I could map them to integers if that is really necessary and more efficient, but I would prefer to operate on strings.

Although it is difficult to give you a full description of the data (it is complex and weird), it is similar to synonyms: say we have 3 million words as dict keys, and each word has 10k synonyms, which are the elements of its list.

Something like this (not actual synonyms, but it will give you the idea):

  {'computer': ['pc', 'mac', 'laptop', ...],            # (10k in total)
   'home':     ['building', 'cottage', 'tavern', ...],  # (another 10k)
   ...}

The elements - the 'synonyms' - can be sorted, if sorting helps.

Later, after the arrays are populated, there is a loop: we go through all the keys and check whether some value is in the key's list. For example, the user inputs the words 'computer' and 'laptop', and we should immediately answer whether 'laptop' is a synonym of 'computer'. The point here is that we have to do this check millions of times, maybe 20 million or more - think of random pairs of user words: 'computer' and 'car', 'phone' and 'building', and so on. There are lots of them, and they may or may not 'match'.
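To make the check concrete, here is roughly what I mean, as a naive dict-of-sets baseline (made-up data; this exact approach obviously will not fit in memory at this scale, it only illustrates the check):

  # Naive baseline: a dict mapping each key to a set of its "synonyms".
  synonyms = {
      'computer': {'pc', 'mac', 'laptop'},        # really ~10k strings per key
      'home': {'building', 'cottage', 'tavern'},
  }

  # Millions of such (key, word) pairs in practice.
  queries = [('computer', 'laptop'), ('computer', 'car'), ('phone', 'building')]

  for key, word in queries:
      match = key in synonyms and word in synonyms[key]
      print(key, word, 'match' if match else 'no match')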

So, in short, what I need is:

  • to store these data structures efficiently, and
  • to be able to quickly check whether something is in a given array or not.

I should be able to keep memory usage under 30 GB, and I should be able to run all the iterations on a Xeon CPU in less than 10 hours.

About 0.1% incorrect answers - both false positives and false negatives - is acceptable, although it would be better to reduce them or not have them at all.

What's the best approach here? Algorithms, links to code, anything is really appreciated. Aside from that, a friend of mine suggested using Bloom filters or marisa tries - is he right? I have not worked with either of them.
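For reference, this is roughly what I imagine the marisa option would look like, as a sketch with made-up data using the marisa_trie package (one static trie per key; I have not benchmarked it):

  # pip install marisa-trie
  import marisa_trie

  data = {
      u'computer': [u'pc', u'mac', u'laptop'],        # really ~10k strings
      u'home': [u'building', u'cottage', u'tavern'],
  }

  # Build one compact, read-only trie per key.
  tries = {key: marisa_trie.Trie(values) for key, values in data.items()}

  def is_synonym(key, word):
      """Exact (non-probabilistic) membership check."""
      trie = tries.get(key)
      return trie is not None and word in trie

  print(is_synonym(u'computer', u'laptop'))   # True
  print(is_synonym(u'computer', u'cottage'))  # False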

I would map each unique string to a numeric ID. For a 0.1% error rate you need approximately 20 bits per element: 20 bits × 10,000 elements × 3,000,000 keys is 75 GB. If you are limited on space, store a slightly less accurate filter in memory and a more accurate filter on disk, and consult the disk filter only when the in-memory filter says there might be a match.
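A minimal sketch of that idea, with one Bloom filter per key (the class, its sizing formula and the double-hashing scheme below are illustrative choices of mine, not a particular library's API):

  import hashlib
  import math

  class BloomFilter(object):
      """Tiny illustrative Bloom filter: a bit array plus k derived hash positions."""
      def __init__(self, n_items, error_rate=0.001):
          # Standard sizing: m = -n*ln(eps)/ln(2)^2 bits, k = (m/n)*ln(2) hashes.
          self.m = int(math.ceil(-n_items * math.log(error_rate) / math.log(2) ** 2))
          self.k = max(1, int(round(self.m / float(n_items) * math.log(2))))
          self.bits = bytearray((self.m + 7) // 8)

      def _positions(self, item):
          # Double hashing derived from a single MD5 digest.
          # Strings are hashed directly here; mapping them to numeric IDs first,
          # as suggested above, works the same way.
          digest = hashlib.md5(item.encode('utf-8')).digest()
          h1 = int.from_bytes(digest[:8], 'big')
          h2 = int.from_bytes(digest[8:], 'big') | 1
          return [(h1 + i * h2) % self.m for i in range(self.k)]

      def add(self, item):
          for pos in self._positions(item):
              self.bits[pos // 8] |= 1 << (pos % 8)

      def __contains__(self, item):
          return all(self.bits[pos // 8] & (1 << (pos % 8))
                     for pos in self._positions(item))

  # One filter per key, ~14-20 bits per element at a 0.1% false-positive rate.
  data = {'computer': ['pc', 'mac', 'laptop'],        # really ~10k elements
          'home': ['building', 'cottage', 'tavern']}
  filters = {}
  for key, values in data.items():
      bf = BloomFilter(n_items=len(values), error_rate=0.001)
      for v in values:
          bf.add(v)
      filters[key] = bf

  print('laptop' in filters['computer'])    # True
  print('cottage' in filters['computer'])   # False (except for ~0.1% false positives)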

A Bloom filter needs at least 1.44 · n · log2(1/ε) bits, and no data structure can use fewer than n · log2(1/ε) bits. In your case ε = 0.001, so the theoretical limit is 10000 · log2(1/0.001) ≈ 99,658 bits per key, or about 10 bits per element, which is 298,974,000,000 bits in total, or roughly 38 GB.
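A quick arithmetic check of those numbers:

  import math

  n = 10000          # elements per key
  keys = 3000000     # number of keys
  eps = 0.001        # allowed error rate

  limit_bits_per_key = n * math.log2(1 / eps)      # information-theoretic minimum
  bloom_bits_per_key = 1.44 * limit_bits_per_key   # what a Bloom filter needs

  print(round(limit_bits_per_key))                 # 99658 bits per key
  print(round(limit_bits_per_key / n, 1))          # ~10.0 bits per element
  print(limit_bits_per_key * keys / 8 / 1e9)       # ~37.4 GB total (the ~38 GB above)
  print(20 * n * keys / 8 / 1e9)                   # 75.0 GB at 20 bits per element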

So 30 GB is below the theoretical limit for a data structure with the number of entries and the error performance you want.

