Trialling search on message boards: technical details

Mark Neves | 15:15 UK time, Monday, 22 November 2010

Messageboard search (as mentioned in David's post) has been implemented using the full text search engine that ships with SQL Server 2005. We have found this engine to be very efficient and reliable, and a massive improvement over the offering in SQL Server 2000. We currently use this engine to provide article search on H2G2 and Memoryshare.

We were unable to support messageboard search across all services in the same way we support article search because of the huge number of posts in our system. We had to tackle the problem in a different way.

The messageboard search system has been architected to use a dedicated "Search" database. Resources dictate that this new database will initially live on the same server as the main messageboard database, but the architecture allows us to easily move the solution to one or more separate servers, as we bring search to more messageboards in the future.

One of the design goals was to make it easy to bring search to messageboards one by one, allowing us to provide effective and fast search without compromising the quality of existing services.

Diagram explaining how search works on the BBC message boards

The diagram above shows the two databases. The messageboard database stores all the posts across all messageboards in a single table. The search database fetches the latest posts belonging to the messageboards that have been configured to support search. The posts for each messageboard are stored in dedicated tables in the Search database. We create a separate full text index on each table that allows us to efficiently gather search results on a per-messageboard basis, without having to filter the search results that come back from the engine. We make use of the ranking values the engine provides to order the results by relevance.
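The per-messageboard layout described above can be illustrated with a small in-memory sketch. This is not the DNA team's code: the real system uses SQL Server tables and full text indexes, whereas the `SearchStore` class, the board names, the sample posts and the word-count ranking below are all assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical configuration: boards that have search switched on.
SEARCH_ENABLED_BOARDS = {"mbfood"}


class SearchStore:
    """In-memory stand-in for the Search database: one table and index per board."""

    def __init__(self):
        self.tables = defaultdict(dict)  # board -> {post_id: text}
        # board -> word -> Counter({post_id: occurrences})
        self.indexes = defaultdict(lambda: defaultdict(Counter))
        self.last_seen_id = 0

    def sync(self, all_posts):
        """Copy posts newer than the last sync, for search-enabled boards only."""
        for post_id, board, text in all_posts:
            if post_id <= self.last_seen_id or board not in SEARCH_ENABLED_BOARDS:
                continue
            self.tables[board][post_id] = text
            for word in text.lower().split():
                self.indexes[board][word][post_id] += 1
        self.last_seen_id = max([self.last_seen_id] + [p[0] for p in all_posts])

    def search(self, board, words):
        """Return ids of posts on one board that contain every word, best first."""
        index = self.indexes[board]
        words = [w.lower() for w in words]
        if not words:
            return []
        matches = set(index[words[0]])
        for word in words[1:]:
            matches &= set(index[word])
        # Crude relevance score: total occurrences of the search words.
        return sorted(matches, key=lambda pid: (-sum(index[w][pid] for w in words), pid))


# The single "all posts" table of the main database: (post_id, board, text).
posts = [
    (1, "mbfood", "Fish and chips with mushy peas"),
    (2, "mbarchers", "Fish supper in Ambridge"),
    (3, "mbfood", "Chips chips and more fish"),
]
store = SearchStore()
store.sync(posts)
print(store.search("mbfood", ["fish", "chips"]))  # [3, 1]
print(store.search("mbarchers", ["fish"]))        # [] (board not search-enabled)
```

Because each board has its own index, a query only ever touches posts from that board, so nothing needs filtering out afterwards. The score here is a simple word count, standing in for the rank value that the full text engine supplies.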

In order to provide the fastest and least resource-hungry searches, we've adopted a simple "AND" based search. We take the search term the user gives us, create a list of search words using the space character as the separator, remove any Stop words (i.e. common words that are no use in terms of search, such as "the", "and", etc.), and then ask the search engine to find all the posts that contain all the words in the term.

For example, if the term "Fish and chips" was passed in, we would first create a list of search words using the space character as the separator, yielding "Fish", "and" and "chips". We would then remove the Stop words, causing "and" to be removed, leaving "Fish" and "chips". We then ask the engine to find all posts that contain the words "Fish" and "chips". Incidentally, all searches are case insensitive.
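The steps in this example can be sketched in a few lines of Python. The stop word list below is illustrative only (the post does not give the real list), and the function names are hypothetical. The last helper builds an `AND` condition in the quoted-term style accepted by SQL Server's `CONTAINS` and `CONTAINSTABLE` predicates; whether the production query is assembled this way is an assumption.

```python
# Illustrative stop words only; the engine's real list is not given in the post.
STOP_WORDS = {"the", "and", "a", "of"}


def parse_search_term(term):
    """Split the term on spaces and drop stop words, keeping the user's word order."""
    words = [w for w in term.split(" ") if w]
    return [w for w in words if w.lower() not in STOP_WORDS]


def contains_condition(words):
    """Join the words into an AND condition, each word in double quotes."""
    quoted = ['"' + w.replace('"', "") + '"' for w in words]
    return " AND ".join(quoted)


words = parse_search_term("Fish and chips")
print(words)                      # ['Fish', 'chips']
print(contains_condition(words))  # "Fish" AND "chips"
```

If every word in the term is a stop word, the list comes back empty and the condition is an empty string, so a caller would need to skip the query in that case rather than send it to the engine.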

The number of months' worth of posts that are searchable is also configurable, on a per-messageboard basis. Again, this is to allow us to control the amount of server resource that messageboard search requires. The number of months of searchable content will be decided based on how busy the board is, and the nature of the board. Some boards may be more interested in current posts, whereas other boards may have content that's more historical and therefore still valuable as time passes. Ultimately it is our intention to allow all posts on all messageboards to be searchable.
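As a rough sketch of how a per-board window like this might be applied, the snippet below works out a cutoff date from a configured number of months. The board names, the month values and the choice to round the cutoff to the first day of a month are all assumptions; the post does not say how the window is actually calculated.

```python
from datetime import date

# Hypothetical per-board settings: how many months of posts stay searchable.
SEARCHABLE_MONTHS = {"mbfood": 6, "mbarchers": 24}


def search_cutoff(board, today):
    """Return the first day of the month that lies N months before 'today'."""
    months = SEARCHABLE_MONTHS[board]
    total = today.year * 12 + (today.month - 1) - months
    return date(total // 12, total % 12 + 1, 1)


def is_searchable(board, posted_on, today):
    """A post is searchable if it was made on or after the board's cutoff."""
    return posted_on >= search_cutoff(board, today)


today = date(2010, 11, 22)
print(search_cutoff("mbfood", today))                      # 2010-05-01
print(search_cutoff("mbarchers", today))                   # 2008-11-01
print(is_searchable("mbfood", date(2010, 4, 30), today))   # False
```

A busy board with mostly topical content could be given a small value to keep its index small, while a board with more historical content could be given a larger one.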

Mark Neves is lead database engineer, DNA team, Audience Publishing Services, Programmes and On-Demand, BBC Future Media & Technology

Comments

  • Comment number 1.

    __ Initial thoughts __
    Thanks, any search is an improvement on nothing.
    However one prediction is that your search will cause chaos on the new 'bogrolls'* as users decide to repeat whole posts so that their reply may be indexed & searched efficiently.

    __ Point of Information __
    Your RSS & URL contain the word 'Trailing' instead of 'Trialing'.
    I must say I sometimes find my own typos useful in searching as it enables me to find my own badly typed comments.

    __ Choice of trial date __
    Would it have been easier to launch this once the messageboards have been sorted out? The messageboards (mb) are buggy and I imagine that could make your job harder. Some mb links probably default rather than going to the intended location.

    __ Board By Board __
    bring search to messageboards one by one, I hope that you publicise the proposed search feature immediately on all mb. That would give users of the other boards a chance to watch the progress. The alternative being to repeat the fiasco of mb improvements whereby the same or similar problems were recreated in turn on each mb, with users getting more and more vociferous as busier boards had to suffer.
    Please remember many messageboarders may not read blogs, and the BBC has made that more difficult by recently segregating users' mb and blog profiles.

    __ Case Sensitive __
    I am not sure that is a good idea.
    I would have thought on many mb the users will have to change their ways drastically if case sensitive search is introduced and considered as posts are made.
    Example
    Maybe some will then start shouting FISH & CHIPS
    But many may simply be talking about: oil / lard, rice cones, batter and the fillet, if for instance discussing preparation, and not mention FISH & CHIPS within the post whatsoever.
    Many posts are replies to other posts; currently; with the original NOT being quoted.

    __ Thread Title __
    - Is any additional weight given to the thread title ?
    The reason I ask is because many posts will be in long threads; the thread title will, or may, contain highly relevant keywords, and these words may not be repeated in the post.
    - in fact an initial look suggests the title may not even be indexed or searched ?!

    - Search by Thread title
    Rather a simple and presumably easily implemented idea.
    I personally would like to be able to Search or List by thread title. Any chance of that ?

    - Are you ranking whole Threads or Single Posts ?
    A long thread may be highly relevant to a subject,
    Are you currently ranking isolated individual posts ?
    ( as an initial glance seems to suggest)
    - Long threads - remember some threads will be hundreds or even over 1000 posts long
    but how do you then display the result
    if for instance there is a 100 post thread on Fish & Chips do you display each post in that thread as a result - I hope not - so do you then indicate a relevant post within that thread
    maybe along with some indication of the thread length
    or maybe the thread as an entry, with an option to expand that entry for individual posts

    __ Comments __ & Food Search
    link is
    /dna/mbfood/NF2670471?thread=7897247
    The thread could be made a "sticky" so that it is easily seen, however I suppose it will get many comments, maybe the best policy would be a BBC informational thread as a closed 'sticky', and a separate open comments thread.
    Presumably you will accept comments both on this blog and on the mb.


    [ *'bogrolls' - affectionate term used on The Archers mb
    ( the busiest single mb of the BBC if I am not mistaken)
    Archers scripters are right now working on replacement de-bogroll scripts to apply to BBC improved messageboards, /dna/mbarchers/NF2693944?thread=7059736 ]

  • Comment number 2.

    A question, will the transfer of message board postings to search server and subsequent indexing be instant (less than a second) or will the transfer be delayed?

    My thinking being that some subjects in particular TV programmes become hot topics the moment they end and if well behaved posters check the search and find nothing they will run the risk of joining the somewhat derided posters that feel their thoughts need a new post rather than join an established thread.

    I point you at the plethora of Strictly Come Dancing posts that arrive every Saturday night on the Points of View board. Please see here /dna/mbpointsofview/NF1951574?thread=7694537&latest=1#p103343935

  • Comment number 3.

    Here is a bit of testing that seems to indicate a rising cockerel situation.

    Search for "bangers" check, found.
    Search for "mash" check, found.
    Search for "bangers and mash" No results found, please refine your search.
    Search for "bangers mash and" No results found, please refine your search.
    Search for "bangers mash the" No results found, please refine your search.

    Possibility is that if any 'Stop words' exist in the search it goes straight to the compost bin.

    Oh, and to head off any thought that there are no threads with bangers and mash, there are a number findable by searching for either term.

  • Comment number 4.

    John99 - blame me for the misspelling of "trialled".

    Guv-nor - this is a glitch which has now been fixed:

    Thanks

  • Comment number 5.

    @ Nick Reynolds - You call it a glitch.

    It made search totally unfit for purpose.

    It also shows that no in place testing even of the example "Fish and Chips" was carried out on an "as used by the public" computer.

    This shows that the system of in house testing is unfit for purpose.

    Which explains why so much released in the last six months has caused such irritation to people. (News site, font issues, lack of margin, Facebook snooping: Message boards, too many even to list what has been fixed after release: iPlayer both site and desktop, I lose the will to live.)

    However please rest assured that I complain because I care.

  • Comment number 6.



    "This shows that the system of in house testing is unfit for purpose."

    I think WE have all known this for some time!

  • Comment number 7.

    A search engine would be good to find threads in The Bull

  • Comment number 8.

    I knew it would end in tears.;)

    Egg waves at Squirrel, Guv and John 99

  • Comment number 9.

    The Archers Message Boards should be closed down because it's full of people used to getting their own way and they should realize that they can't get their own way in life.

  • Comment number 10.

    I think this is starting to drift a little off topic. Stay on topic please.

    Thanks.

  • Comment number 11.

    This comment was removed because the moderators found it broke the house rules.

  • Comment number 12.

    This does not seem to have received many comments either on this blogpost or on the Food site. Maybe publicising it on more boards would generate more interest, as I suggested in comment#1.

    Is there in fact a silent group starting to use it on the Food board ?
    any usage stats yet ?
    How often is it used on the food board 1 in 1000 visits for instance
    and how many searches are tried on an average day ?

    I did note the bangers and mash fix seems to have worked. (see #3 & #5 above)
    Any chance of feedback especially re searching thread titles ?

  • Comment number 13.

    Thank you for the feedback.

    We do indeed index the title and the post text. At the moment we don't add any additional weight to the title, but we think this is a great idea and plan to add additional weight to the title in the future.

    A "Thread title only" search is also a great idea. That would enable users to find relevant discussions quicker, rather than being swamped by many posts that could belong to the same discussion in the search results.

    We think case-insensitivity is very important. Most users would not understand the reasons why "Fish" might return different results to "fish" if we provided a case-sensitive search.

    We currently rank posts individually. We don't take into account the thread a post belongs to, or how many posts within the same thread contain the search term.

    The current search facility is basic, but we took the view that any search at all was better than nothing. The analogy I offer is this: The difference between not having a mobile phone and having a basic Nokia is far greater than the difference between a basic Nokia and an iPhone. We wanted to get basic search out as quickly as possible, with a view to improving the feature in the future.

    BTW The reason that the example in the text, "Fish and chips", and "bangers and mash" didn't work on the day the blog post was published was an unfortunate timing issue. We released v1 of messageboard search to the live servers on the 24th November, which is when the blog post should have been published. I can assure you it wasn't due to lack of testing.

    Mark

  • Comment number 14.



    we took the view that any search at all was better than nothing.


    I agree with that...


    We wanted to get basic search out as quickly as possible


    So how long was that...?

    I don't need to know to the month, just years will be fine!



  • Comment number 15.

    Squirrel - sarcasm is not particularly helpful. It has taken longer to get search on message boards than I personally would have liked but that's not Mark's fault. He and the other members of the DNA team have worked hard to get to this point and they deserve credit for that.

    So please take a more civil tone.

    Thanks

  • Comment number 16.

    @Mark Neves
    Thanks for replying.

    - case insensitive search - [ idiot smileys not available :-) ]
    My mistake.
    My original comments in #1 were based on me misreading your post.
    I thought the searches were case sensitive, which did not seem a good idea. I re-read what was written after reading your comment.

    As you say any search feature is better than nothing.
    But we can already search using an external search engine, unless the BBC plans to prevent that.
    It would therefore be an advantage if any internal BBC search facility had features that complement or are somehow an advantage over an external search. At present a BBC search may be slightly easier to do than using an external search engine but, I think, returns much the same results.

    I note you say the title is searched, and I have confirmed that*, but I note that matching text in the message is highlighted by being shown in red; that does not happen to text in titles.

    * For instance I searched for the term 'cod'
    and obtained posts where cod was in the title but not within the post body.

  • Comment number 17.

    I did notice it was mentioned that it is not possible to search for "[v]" which the poster thought would be useful on the food board as they use that to indicate vegetarian.

    Of course, unless there is a search-within-results feature or the search has an OR option, this will not be particularly useful as it will merely list all vegetarian-related posts. But at least it would produce a list that could then be searched using the browser's find feature.
