banner
Nanomoa

Nanomoa's Blog

"Bytes and Brackets: My Journey in Go and C++ Exploration."
github
zhihu
telegram
twitter

Jaccard相似係數

瞎扯一通#

最近在寫數據結構的課程設計,我們組的題目是校園導遊系統,因為太菜了想不到好的 idea😭,在查資料 (偷窺別人 idea) 的時候看到了一些推薦算法的介紹,遂加在了我們的課設裡

Jaccard 相似系數#

定義#

Jaccard(A,B)=ABAB=ABA+BABJaccard(A, B) = \frac{ | A \cap B | }{ | A \cup B | } = \frac{ | A \cap B | }{ | A | + | B | - | A \cap B | }, 若 A=A = \varnothingB=B = \varnothing, Jaccard(A,B)=1Jaccard(A, B) = 1

注: Jaccard(A,B)[0,1]Jaccard(A, B) \in [0, 1]

Jaccard 距離#

用於描述兩集合的不相似度

DistJaccard=1Jaccard(A,B)=AΔBABDist_{Jaccard} = 1 - Jaccard(A, B) = \frac{ A \Delta B }{ | A \cup B | }

簡單的推薦算法實現#

修改前的建築 / 景點節點定義:#

struct Location {
    double longitude;
    double latitude;

    [[nodiscard]] std::string toString() const {
        return std::to_string(longitude) + "," + std::to_string(latitude);
    }
};

template<typename IdType>
struct Spot {
    IdType id;
    std::string title;
    std::string description;
    Location location{};
};

實現思路#

要實現的效果是根據當前建築推薦其他相似的建築,所以我想到了給Spot結構體設置一個tags字段 (集合),給每個建築根據其使用性質設置一些標籤,比如: "教學樓" , "實驗樓", "公共課", "專業課", "理論課", "實驗課" 等等,使用Jaccard 相似系數作為其相似度

對於每個建築,推薦和它的相似度大於 similarityThreshold(相似度下限) 的建築,且優先推薦距離更小的建築,similarityThreshold的值需要綜合考慮全局的標籤數量決定

修改後的定義:#

struct Location {
    double longitude;
    double latitude;

    [[nodiscard]] std::string toString() const {
        return std::to_string(longitude) + "," + std::to_string(latitude);
    }
};

template<typename IdType>
struct Spot {
    IdType id;
    std::string title;
    std::unordered_set<std::string> tags;
    std::string description;
    Location location{};
};

代碼實現:#

template<typename IdType>
class Recommend {
private:
    // 計算Jaccard相似系數
    static double jaccardSimilarity(
            const std::unordered_set<std::string>& set1,
            const std::unordered_set<std::string>& set2
    ) {
        std::unordered_set<std::string> intersection;
        std::unordered_set<std::string> union_set;

        std::copy_if(set1.begin(), set1.end(),
                     std::inserter(union_set, union_set.end()),
                     [&](const std::string& s) {
                         return set2.find(s) != set2.end();
                     }
        );

        std::set_intersection(set1.begin(), set1.end(),
                              set2.begin(), set2.end(),
                              std::inserter(
                                      intersection,
                                      intersection.end()
                              )
        );

        std::set_union(set1.begin(), set1.end(),
                       set2.begin(), set2.end(),
                       std::inserter(
                               union_set,
                               union_set.end()
                       )
        );

        return static_cast<double>(intersection.size()) /
               static_cast<double>(union_set.size());
    }

public:
    static std::vector<std::pair<IdType, double>> recommendSimilarSpots(
            Graph<IdType> graph,
            const Spot<IdType>& referenceSpot,
            double similarityThreshold = 0.3
    ) {
        std::vector<std::pair<IdType, double>> similarityScores;

        for (const auto& spotPair : graph.getSpots()) {
            const auto& spot = spotPair.second;
            if (spot.id != referenceSpot.id) {
                double similarity = jaccardSimilarity(
                        referenceSpot.tags,
                        spot.tags
                );
                if (similarity >= similarityThreshold) {
                    similarityScores.push_back(
                            std::make_pair(spot.id, similarity)
                    );
                }
            }
        }

        Sort(similarityScores.begin(), similarityScores.end(),
             [&graph, &referenceSpot](
                     const std::pair<IdType, double>& a,
                     const std::pair<IdType, double>& b
             ) {
                 if (a.second == b.second) {
                     return distanceHaversine(referenceSpot.location, graph.getSpot(a.first).location) <
                            distanceHaversine(referenceSpot.location, graph.getSpot(b.first).location);
                 }
                 return a.second > b.second;
             }
        );

        return similarityScores;
    }
};
載入中......
此文章數據所有權由區塊鏈加密技術和智能合約保障僅歸創作者所有。