There are approximately 3 million arrays - or Python lists / tuples (it does not really make a difference). Each array looks like this:
['string1', 'string2', 'string3', ...]  # 10,000 elements in total
These arrays should be stored in some type of key-value storage.
So there are 3 million keys, and each key maps to a 10,000-element array.
Lists, tuples, or any other custom structure - it does not really matter. What matters is that the arrays contain strings (UTF-8 or Unicode), each 5 to 50 characters long, and that there are about 3 million distinct strings across all of them. They could be converted to integer IDs if that is really necessary for efficiency, but I would prefer to work with strings.
Although it is difficult to give a full description of the data (it is complex and weird), it is similar to synonyms: say we have 3 million words as dict keys, and each word has 10k synonyms, which are the elements of its list.
Something like this (not actual synonyms, but it gives the idea):
{'computer': ['pc', 'mac', 'laptop', ...],           # (10k in total)
 'home':     ['building', 'cottage', 'tavern', ...],  # (another 10k)
 ...}
The list elements ('synonyms') can be kept sorted if that helps.
Later, after the arrays are populated, there is a loop: we go through the keys and check whether a given word is in a key's list. For example, the user inputs the words 'computer' and 'laptop', and we should answer immediately that 'laptop' is a synonym of 'computer'. The point is that we have to run this check millions of times, maybe 20 million or more. Just think of it as pairs of more or less random words coming from users - 'computer' and 'car', 'phone' and 'building', and so on. There are lots of them, and each pair may 'match' (one word is in the other's list) or not.
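To be concrete, the naive version of this check - a plain dict of sets - would look something like the sketch below (names like 'synonyms' and 'is_synonym' are just illustrative). It is fast per lookup, but 3 million keys times 10,000 strings clearly will not fit in 30 GB of RAM this way:

# Naive baseline: everything in RAM as a dict of sets.
synonyms = {
    'computer': {'pc', 'mac', 'laptop'},
    'home': {'building', 'cottage', 'tavern'},
}

def is_synonym(word, candidate):
    # Average O(1) membership test against the key's set.
    return candidate in synonyms.get(word, set())

print(is_synonym('computer', 'laptop'))  # True
print(is_synonym('computer', 'car'))     # False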
So, in short, I need to:
- store these data structures efficiently, and
- be able to check quickly whether something is in a given array or not.
Memory usage should stay under 30 GB, and all the iterations should run on a Xeon CPU in less than 10 hours.
An error rate of about 0.1% - both false positives and false negatives - is acceptable, although it would be better to reduce it or avoid errors entirely.
What is the best approach here? Algorithms, links to code, anything is really appreciated. Also, a friend of mine suggested using Bloom filters or marisa tries - is he right? I have not worked with either of them.
I would map each unique string to a numeric ID and associate a Bloom filter with each key, at approximately 20 bits per element for your 0.1% error rate. 20 bits * 10,000 elements * 3 million keys is 75 GB, so if you are space-limited, store a smaller, less accurate filter in memory and the more accurate filter on disk, consulting the disk filter only when the in-memory filter says there might be a match.
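A minimal sketch of that per-key Bloom filter idea, assuming a hand-rolled filter rather than any particular library (the class and variable names are illustrative, and it hashes the strings directly instead of going through numeric IDs):

import hashlib
import math

class BloomFilter:
    def __init__(self, n_items, error_rate):
        # Standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hash functions.
        self.m = int(math.ceil(-n_items * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, int(round(self.m / n_items * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-1 digest via double hashing.
        digest = hashlib.sha1(item.encode('utf-8')).digest()
        h1 = int.from_bytes(digest[:8], 'big')
        h2 = int.from_bytes(digest[8:16], 'big') | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# One filter per key, sized for 10,000 elements at a 0.1% false-positive rate
# (about 14.4 bits per element with these formulas; 20 bits/element is a rounder,
# more conservative figure).
data = {'computer': ['pc', 'mac', 'laptop'], 'home': ['building', 'cottage']}
filters = {}
for word, values in data.items():
    bf = BloomFilter(n_items=10000, error_rate=0.001)
    for v in values:
        bf.add(v)
    filters[word] = bf

print('laptop' in filters['computer'])  # True
print('car' in filters['computer'])     # False, up to the false-positive rate

Note that a Bloom filter only produces false positives, never false negatives, which is actually stricter than the 0.1%-both-ways budget in the question.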
A Bloom filter needs only 1.44 · n · log2(1/ε) bits (equivalently, 1.44 · n · ln(1/ε) / ln 2), while the theoretical lower bound for any approximate-membership structure is n · log2(1/ε) bits. In your case ε = 0.001 per key, so the theoretical limit is a data structure of about 99,658 bits per key, or roughly 10 bits per element, which is 298,974,000,000 bits in total, or about 38 GB.
So 30 GB is below the theoretical limit for a data structure with the number of entries and the accuracy you want, but it is within the ballpark.
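For a quick sanity check of those figures, here is the arithmetic in Python (the variable names are mine):

import math

n = 10000         # elements per key
eps = 0.001       # allowed error rate
keys = 3_000_000  # number of keys

limit_per_key = n * math.log2(1 / eps)         # information-theoretic lower bound
bloom_per_key = 1.44 * n * math.log2(1 / eps)  # classic Bloom filter size

print(round(limit_per_key))            # 99658 bits per key (~10 bits per element)
print(limit_per_key * keys)            # ~2.99e11 bits in total
print(limit_per_key * keys / 8 / 1e9)  # ~37.4 GB, i.e. the ~38 GB figure above
print(bloom_per_key / n)               # ~14.4 bits per element for an actual Bloom filter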